utsavjha opened a new issue #11447: URL: https://github.com/apache/druid/issues/11447
### Affected Version

0.21.1, deployed on Red Hat OpenShift 3.11. Using PostgreSQL over TLS as the metadata store, and S3 as deep storage.

### Description

Hello, I am trying to run a native parallel batch ingestion job in my cluster (2x Historicals, 5x MiddleManagers, 1 Coordinator, 1 Overlord, 1 Router and 2 Brokers), reading from an S3 input source. The cluster runs on OpenShift and uses the `druid-kubernetes-extension` to operate without any ZooKeeper dependency. I placed the appropriate configs to remove ZK and use HTTP-based leader election instead, added the necessary ServiceAccount roles, and the deployment comes up smoothly: no errors or exceptions are thrown, and no liveness/readiness probes fail, either during deployment or during ingestion.

However, after roughly a third of the ingestion job's elapsed time, the Coordinator and Overlord somehow lose state and stop being visible to the rest of the Druid cluster. The ingestion itself does not fail, but the logs fill with error messages like the ones attached.

<img width="1568" alt="problem-start-1" src="https://user-images.githubusercontent.com/11601144/125730068-06596ea5-845b-477f-9153-d11362a13072.png">

This problem continues for the rest of the ingestion, dirtying the logs everywhere, and it doesn't resolve itself even after a re-deployment, so the issue is really bugging me.

<img width="1496" alt="prob-continues-2" src="https://user-images.githubusercontent.com/11601144/125730059-1d3b5007-dd13-4abe-9d12-ddd2e2060857.png">
<img width="1588" alt="prob-continues-3" src="https://user-images.githubusercontent.com/11601144/125730066-d84f6080-db93-46b0-8878-9cf66571d5b0.png">

Ingestion metrics:
- 400 GB input data on S3
- 34 GB total segments on S3
- Number of segments per ingestion: 36
- Total segments from start of time: 108
- Used to run in ~12 hours with ZooKeeper.
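For context, the ZooKeeper-less setup described above relies on properties along these lines. This is only a sketch based on the Druid Kubernetes extension docs; the extension list and values here are illustrative, and my actual configs are in the attached files below:

```properties
# Load the Kubernetes discovery extension alongside the others in use
# (illustrative list; see the attached druid-configmap.txt for the real one).
druid.extensions.loadList=["druid-kubernetes-extensions", "postgresql-metadata-storage", "druid-s3-extensions"]

# Disable ZooKeeper and use the Kubernetes API for service discovery
# and leader election instead.
druid.zk.service.enabled=false
druid.discovery.type=k8s

# Use HTTP-based server views, segment loading, and task management
# rather than the ZK-based defaults.
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http
druid.indexer.runner.type=httpRemote
```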
Possible fixes I tried:
- Checking whether any resource contention occurs during ingestion: no, both the Coordinator and Overlord pods have sufficient memory, compute and disk space available (for both, Xms/Xmx = 2g/4g; each pod has on average ~1200 MiB free during ingestion).
- Previously, I was running both the Coordinator and Overlord within a single process (although with greater resources). I assumed the error was caused by this coupling and therefore split them. The split was successful: I could see the Overlord handling ingestion tasks while the Coordinator only dealt with the created segments. But the error persisted.
- Running a smaller ingestion job: THIS SUCCEEDED WITHOUT ANY ERRORS (tried with the Wikipedia source).
- Running the Overlord and Coordinator as StatefulSets, because I assumed that re-running a deployment could schedule the pods on a different node, causing some synchronisation issue with the unaffected pods.

I'm attaching my config files (they were YAMLs, but GitHub didn't allow me to upload YML) and error screenshots. Any help would be greatly appreciated.

[coordinator-ss.txt](https://github.com/apache/druid/files/6820419/coordinator-ss.txt)
[druid-configmap.txt](https://github.com/apache/druid/files/6820420/druid-configmap.txt)
[historical.txt](https://github.com/apache/druid/files/6820421/historical.txt)
[overlord.txt](https://github.com/apache/druid/files/6820422/overlord.txt)
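For anyone wanting to repeat the resource-contention check above, something like the following works (assuming a metrics-server is available; the `druid` namespace and the `app=druid-coordinator` label are hypothetical, substitute your own):

```
# Show live CPU/memory usage per pod; compare against pod requests/limits.
kubectl top pod -n druid

# Check recent restarts and events on the Coordinator/Overlord pods.
kubectl describe pod -n druid -l app=druid-coordinator
```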
