utsavjha opened a new issue #11447: URL: https://github.com/apache/druid/issues/11447
### Affected Version

0.21.1, deployed on Red Hat OpenShift 3.11. Using PostgreSQL over TLS as the metadata store, and S3 as deep storage.

### Description

Hello, I am trying to run a native parallel batch ingestion job in my cluster (2x Historicals, 5x MiddleManagers, 1 Coordinator, 1 Overlord, 1 Router and 2 Brokers), reading from an S3 input source. The cluster runs on OpenShift and uses the `druid-kubernetes-extension` to operate without any ZooKeeper dependency. I placed the appropriate configs to remove ZK and use HTTP-based leader election instead, added the necessary ServiceAccount roles, and the deployment comes up smoothly: no errors or exceptions are thrown, and no liveness/readiness probes fail, either during deployment or during ingestion.

However, after roughly a third of the ingestion job's elapsed time, the Coordinator and Overlord somehow lose state and stop being visible to the rest of the Druid cluster. The ingestion itself does not fail, but the logs fill with error messages like the ones attached.

<img width="1568" alt="problem-start-1" src="https://user-images.githubusercontent.com/11601144/125730068-06596ea5-845b-477f-9153-d11362a13072.png">

This problem continues for the rest of the ingestion, dirtying the logs everywhere, and it doesn't resolve itself even after a re-deployment, so the issue is really bugging me.

<img width="1496" alt="prob-continues-2" src="https://user-images.githubusercontent.com/11601144/125730059-1d3b5007-dd13-4abe-9d12-ddd2e2060857.png">
<img width="1588" alt="prob-continues-3" src="https://user-images.githubusercontent.com/11601144/125730066-d84f6080-db93-46b0-8878-9cf66571d5b0.png">

Ingestion metrics:
- 400 GB input data on S3
- 34 GB total segments on S3
- Number of segments per ingestion: 36
- Total segments from start of time: 108
- Used to run in ~12 hours with ZooKeeper.
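For context, the ZooKeeper-less setup described above relies on properties along these lines. This is only a sketch based on the Druid Kubernetes extension docs; the extension list and values here are illustrative, and my actual configs are in the attached files below:

```properties
# Load the Kubernetes discovery extension alongside the others in use
# (illustrative list; see the attached druid-configmap.txt for the real one).
druid.extensions.loadList=["druid-kubernetes-extensions", "postgresql-metadata-storage", "druid-s3-extensions"]

# Disable ZooKeeper and use the Kubernetes API for service discovery
# and leader election instead.
druid.zk.service.enabled=false
druid.discovery.type=k8s

# Use HTTP-based server views, segment loading, and task management
# rather than the ZK-based defaults.
druid.serverview.type=http
druid.coordinator.loadqueuepeon.type=http
druid.indexer.runner.type=httpRemote
```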
Possible fixes I tried:
- Checking whether any resource contention occurs during ingestion: no, both the Coordinator and Overlord pods have sufficient memory, compute and disk space available (for both, Xms/Xmx = 2g/4g; each pod has on average ~1200 MiB free during ingestion).
- Previously, I was running both the Coordinator and Overlord within a single process (although with greater resources). I assumed the error was caused by this coupling and therefore split them. The split was successful: I could see the Overlord handling ingestion tasks while the Coordinator only dealt with the created segments. But the error persisted.
- Running a smaller ingestion job: THIS SUCCEEDED WITHOUT ANY ERRORS (tried with the Wikipedia source).
- Running the Overlord and Coordinator as StatefulSets, because I assumed that re-running a deployment could schedule the pods on a different node, causing some synchronisation issue with the unaffected pods.

I'm attaching my config files (they were YAMLs, but GitHub didn't allow me to upload YML) and error screenshots. Any help would be greatly appreciated.

[coordinator-ss.txt](https://github.com/apache/druid/files/6820419/coordinator-ss.txt)
[druid-configmap.txt](https://github.com/apache/druid/files/6820420/druid-configmap.txt)
[historical.txt](https://github.com/apache/druid/files/6820421/historical.txt)
[overlord.txt](https://github.com/apache/druid/files/6820422/overlord.txt)
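For anyone wanting to repeat the resource-contention check above, something like the following works (assuming a metrics-server is available; the `druid` namespace and the `app=druid-coordinator` label are hypothetical, substitute your own):

```
# Show live CPU/memory usage per pod; compare against pod requests/limits.
kubectl top pod -n druid

# Check recent restarts and events on the Coordinator/Overlord pods.
kubectl describe pod -n druid -l app=druid-coordinator
```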
