utsavjha opened a new issue #11447:
URL: https://github.com/apache/druid/issues/11447


   
   ### Affected Version: 0.21.1, deployed on Red Hat OpenShift 3.11
   Using PostgreSQL over TLS as the metadata store.
   Using S3 as deep storage.
   
   ### Description
   
   Hello, I am trying to run a native parallel batch ingestion job in my 
cluster (2x Historical, 5x MiddleManager, 1 Coordinator, 1 Overlord, 1 Router, 
and 2 Brokers), reading from an S3 input source. I am running the Druid 
cluster on OpenShift, utilising the `druid-kubernetes-extensions` extension to 
run a Druid cluster without any ZooKeeper dependency.
   
   I applied the appropriate configs to remove ZooKeeper and use HTTP-based 
management and Kubernetes-based leader election instead, and added the 
necessary ServiceAccount roles. The deployment goes smoothly: no errors or 
exceptions are thrown, and no liveness/readiness probes fail, either during 
deployment or during ingestion.
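   For context, here is roughly what my ZooKeeper-less setup looks like in the 
common runtime properties (a sketch of my configuration; the cluster identifier 
and the rest of the extension load list are placeholders):
   
   ```properties
   # Load the Kubernetes discovery extension (other extensions elided)
   druid.extensions.loadList=["druid-kubernetes-extensions", "postgresql-metadata-storage", "druid-s3-extensions"]
   
   # Disable ZooKeeper entirely
   druid.zk.service.enabled=false
   
   # Discover services and elect leaders via the Kubernetes API
   druid.discovery.type=k8s
   druid.discovery.k8s.clusterIdentifier=druid-cluster   # placeholder value
   
   # Use HTTP instead of ZooKeeper for server views, segment loading, and tasks
   druid.serverview.type=http
   druid.coordinator.loadqueuepeon.type=http
   druid.indexer.runner.type=httpRemote
   ```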
   
   However, roughly a third of the way through the ingestion job, the 
Coordinator and Overlord somehow lose state and stop being visible to the rest 
of the Druid cluster. The ingestion itself does not fail, but the logs fill 
with error messages like the ones attached.
   
   <img width="1568" alt="problem-start-1" 
src="https://user-images.githubusercontent.com/11601144/125730068-06596ea5-845b-477f-9153-d11362a13072.png">
   
   This problem continues for the rest of the ingestion, polluting the logs 
everywhere. It does not resolve itself even after a re-deployment, so the 
issue really has me stumped.
   
   <img width="1496" alt="prob-continues-2" 
src="https://user-images.githubusercontent.com/11601144/125730059-1d3b5007-dd13-4abe-9d12-ddd2e2060857.png">
   
   <img width="1588" alt="prob-continues-3" 
src="https://user-images.githubusercontent.com/11601144/125730066-d84f6080-db93-46b0-8878-9cf66571d5b0.png">
   
   Ingestion metrics:
   - 400 GB input data on S3
   - 34 GB total segments on S3
   - Number of segments per ingestion: 36
   - Total segments since the start of time: 108
   - The same job used to run for ~12 hours with ZooKeeper.
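   The job itself is a standard `index_parallel` spec reading from S3, along 
these lines (the bucket prefix, datasource name, and timestamp column are 
placeholders, not my real values):
   
   ```json
   {
     "type": "index_parallel",
     "spec": {
       "ioConfig": {
         "type": "index_parallel",
         "inputSource": {
           "type": "s3",
           "prefixes": ["s3://my-bucket/input/"]
         },
         "inputFormat": { "type": "json" }
       },
       "tuningConfig": {
         "type": "index_parallel",
         "maxNumConcurrentSubTasks": 5
       },
       "dataSchema": {
         "dataSource": "my-datasource",
         "timestampSpec": { "column": "timestamp", "format": "iso" },
         "dimensionsSpec": { "dimensions": [] },
         "granularitySpec": { "segmentGranularity": "day" }
       }
     }
   }
   ```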
   
   
   Possible fixes I tried:
   - Checking whether any resource contention occurs during the ingestion: no, 
both the Coordinator and Overlord pods have sufficient memory, compute, and 
disk space available. (For both, Xms/Xmx = 2g/4g, and the pod has ~1200 MiB 
available on average during ingestion.)
   - Previously, I was running both the Coordinator and the Overlord within a 
single process (although with greater resources). I assumed the error was 
caused by this coupling and therefore sought to remove it. The split was 
successful: I could see the Overlord handling ingestion tasks while the 
Coordinator dealt only with the created segments. But the error persisted.
   - Running a smaller ingestion job: THIS SUCCEEDED WITHOUT ANY ERRORS (tried 
with the Wikipedia source).
   - Running the Overlord and Coordinator as StatefulSets, because I assumed 
that re-running a deployment could schedule the pods on a different node, 
thereby causing a synchronisation issue with the unaffected pods.
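   For completeness, the ServiceAccount roles I mentioned earlier look roughly 
like this (names and namespace are placeholders; the extension needs read/write 
access to pods and configmaps for discovery and leader election):
   
   ```yaml
   kind: Role
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: druid-cluster        # placeholder name
     namespace: druid           # placeholder namespace
   rules:
     - apiGroups: [""]
       resources: ["pods", "configmaps"]
       verbs: ["get", "watch", "list", "create", "update", "delete", "patch"]
   ---
   kind: RoleBinding
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: druid-cluster
     namespace: druid
   subjects:
     - kind: ServiceAccount
       name: druid              # the ServiceAccount the Druid pods run as
       namespace: druid
   roleRef:
     kind: Role
     name: druid-cluster
     apiGroup: rbac.authorization.k8s.io
   ```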
   
   I'm attaching my config files (they were YAMLs, but GitHub didn't allow me 
to upload YAML) and error screenshots. Any help would be greatly appreciated.
   
   
[coordinator-ss.txt](https://github.com/apache/druid/files/6820419/coordinator-ss.txt)
   
[druid-configmap.txt](https://github.com/apache/druid/files/6820420/druid-configmap.txt)
   
[historical.txt](https://github.com/apache/druid/files/6820421/historical.txt)
   [overlord.txt](https://github.com/apache/druid/files/6820422/overlord.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


