vpeack opened a new issue #11478: URL: https://github.com/apache/druid/issues/11478
Hi everyone,

Following a [post on ASF slack](https://the-asf.slack.com/archives/CJ8D1JTB8/p1626879127422700?thread_ts=1626868492.418800&cid=CJ8D1JTB8), I'm opening a new issue here on the advice of someone from Imply. We are running compaction tasks through indexers that randomly fail on phase 3 (`partial_index_generic_merge`) with the following error message (more details below): "error in opening zip file".

The reply we got on slack:

> As to the specific error, I'm not sure if it's exactly the same as what's going on in https://github.com/apache/druid/issues/9993, but that issue does point out an important thing, which is that if the shuffle server returns an error, the shuffle client will not actually log out that error; it will just log this sort of obtuse zip decompression error (because it's trying to unzip the error message). This isn't good error behavior, so we should adjust that to log the actual server error instead of trying to unzip the error message. Which is silly! This seems an indexer bug. Could you please create a BUG request in the druid github project with all the details.
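To make the failure mode described above concrete, here is a minimal standalone sketch (not Druid code; the class and method names are hypothetical): opening a file that actually contains an error-message body with `java.util.zip.ZipFile` throws exactly this kind of `ZipException`, while a cheap check of the ZIP local-file-header magic bytes (`PK\x03\x04`) would let the client surface the real server response instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// Illustrative sketch only: simulates a shuffle client that wrote an error
// body to its "zip" temp file, then tried to unzip it.
public class ZipErrorDemo {
    // A real ZIP archive starts with the local-file-header magic "PK\x03\x04".
    static boolean looksLikeZip(byte[] header) {
        return header.length >= 4
            && header[0] == 0x50 && header[1] == 0x4B
            && header[2] == 0x03 && header[3] == 0x04;
    }

    public static void main(String[] args) throws IOException {
        // Pretend the server replied with an error body instead of segment data.
        Path fake = Files.createTempFile("temp_partial_index", ".zip");
        Files.write(fake, "500 Internal Server Error: segment not found".getBytes());
        try {
            byte[] head = Files.readAllBytes(fake);
            if (!looksLikeZip(head)) {
                // Logging the body here would show the real server error...
                System.out.println("Not a zip; server said: " + new String(head));
            }
            // ...whereas unzipping it blindly only yields an opaque ZipException.
            try (ZipFile zf = new ZipFile(fake.toFile())) {
                System.out.println("unexpected: opened as zip");
            } catch (ZipException e) {
                System.out.println("ZipException: " + e.getMessage());
            }
        } finally {
            Files.delete(fake);
        }
    }
}
```

Under this assumption, the fix suggested on slack amounts to doing the magic-byte (or HTTP status) check before the unzip attempt and logging the body on mismatch.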
### Affected Version

0.21.0

### Description

- Cluster size:
  - 1 master (coordinator/overlord)
  - 2 routers/brokers
  - ~10 historicals
  - ~20 indexers (dedicated to these tasks) + ~5 indexers for realtime ingestion (kafka)
  - ~30TB data
- Configurations in use — spec we are using:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "events",
        "interval": "2021-07-13T00:00:00/2021-07-14T00:00:00"
      }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": {
        "type": "hashed",
        "maxRowsPerSegment": 800000
      },
      "forceGuaranteedRollup": true,
      "maxNumConcurrentSubTasks": 40,
      "totalNumMergeTasks": 20,
      "maxRetry": 10,
      "maxPendingPersists": 1,
      "maxRowsPerSegment": 800000
    },
    "dataSchema": {
      "dataSource": "events",
      "granularitySpec": {
        "type": "uniform",
        "queryGranularity": "HOUR",
        "segmentGranularity": "HOUR",
        "rollup": true
      },
      "timestampSpec": {
        "column": "__time",
        "format": "iso"
      },
      "dimensionsSpec": {},
      "metricsSpec": []
    }
  }
}
```

- Steps to reproduce the problem: happens randomly
- The error message or stack traces encountered:
```
{"severity": "INFO", "message": "[[partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z]-threading-task-runner-executor-0] org.apache.druid.utils.CompressionUtils - Unzipping file[/opt/druid-data/task/partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z/work/indexing-tmp/2021-07-20T08:00:00.000Z/2021-07-20T09:00:00.000Z/10/temp_partial_index_generate_events_ooikmkan_2021-07-21T11:00:25.016Z] to [/opt/druid-data/task/partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z/work/indexing-tmp/2021-07-20T08:00:00.000Z/2021-07-20T09:00:00.000Z/10/unzipped_partial_index_generate_events_ooikmkan_2021-07-21T11:00:25.016Z]"}
{"severity": "ERROR", "message": "[[partial_index_generic_merge_events_gpceoeme_2021-07-21T11:15:41.883Z]-threading-task-runner-executor-0] org.apache.druid.indexing.overlord.ThreadingTaskRunner - Exception caught while running the task."}
java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_292]
	at java.util.zip.ZipFile.<init>(ZipFile.java:225) ~[?:1.8.0_292]
	at java.util.zip.ZipFile.<init>(ZipFile.java:155) ~[?:1.8.0_292]
	at java.util.zip.ZipFile.<init>(ZipFile.java:169) ~[?:1.8.0_292]
	at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:235) ~[druid-core-0.21.0.jar:0.21.0]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.fetchSegmentFiles(PartialSegmentMergeTask.java:224) ~[druid-indexing-service-0.21.0.jar:0.21.0]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialSegmentMergeTask.runTask(PartialSegmentMergeTask.java:162) ~[druid-indexing-service-0.21.0.jar:0.21.0]
	at org.apache.druid.indexing.common.task.batch.parallel.PartialGenericSegmentMergeTask.runTask(PartialGenericSegmentMergeTask.java:41) ~[druid-indexing-service-0.21.0.jar:0.21.0]
	at org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152) ~[druid-indexing-service-0.21.0.jar:0.21.0]
	at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:211) [druid-indexing-service-0.21.0.jar:0.21.0]
	at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:151) [druid-indexing-service-0.21.0.jar:0.21.0]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
```

- Any debugging that you have already done: N/A

Any ideas on how we can resolve this? Feel free to ask if you need anything else. Thanks a lot!
