BartolomeD opened a new issue #10206:
URL: https://github.com/apache/druid/issues/10206
### Affected Version
0.19.0-rc1
### Description
Immediately after successfully completing both the
`partial_dimension_distribution` and `partial_range_index_generate` phases of
`index_parallel` ingestion all `partial_index_generic_merge` tasks fail without
producing logs. Below is the log for the main `index_parallel` task. As you can
see, the inputSplit is _massive_ and has almost 3 million characters. Could
this be a reason those tasks fail? In this log we see the tasks fail without a
reason being provided:
```
2020-07-21T14:23:54,024 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_hjlgleak_2020-07-21T14:23:51.446Z]
failed!
```
`druid/indexing-logs/index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z/log`:
```
2020-07-21T14:23:52,753 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner -
Submit a new task for
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_5] and
inputSplit[InputSplit{split=[PartitionLocation{host='ip-10-0-2-94.ec2.internal',
.......
< THIS LINE HAS 2920968 characters, I HAVE TRUNCATED THIS >
......, secondaryPartition=BuildingSingleDimensionShardSpec{bucketId=15,
dimension='H3_HEX_RES_5', start='851f40c3fffffff', end='851f8d47fffffff',
partitionNum=54}}]}]
2020-07-21T14:23:52,915 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_bpjmkhkj_2020-07-21T14:23:52.698Z]
for retrying
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_2]
2020-07-21T14:23:53,320 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_gkbimbgc_2020-07-21T14:23:51.045Z]
failed!
2020-07-21T14:23:53,320 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still
have more chances[2/3] to process the
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_0].
2020-07-21T14:23:53,334 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_lcohkdeb_2020-07-21T14:23:53.087Z]
for spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_5]
2020-07-21T14:23:53,511 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner -
Waiting for subTasks to be completed
2020-07-21T14:23:53,687 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_hgedboii_2020-07-21T14:23:53.511Z]
for retrying
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_0]
2020-07-21T14:23:54,024 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_hjlgleak_2020-07-21T14:23:51.446Z]
failed!
2020-07-21T14:23:54,024 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still
have more chances[1/3] to process the
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_3].
2020-07-21T14:23:54,195 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_ibkdpjng_2020-07-21T14:23:54.024Z]
for retrying
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_3]
2020-07-21T14:23:54,532 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_kjlmjglj_2020-07-21T14:23:52.261Z]
failed!
2020-07-21T14:23:54,532 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still
have more chances[1/3] to process the
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_4].
2020-07-21T14:23:54,726 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_dginpfip_2020-07-21T14:23:54.532Z]
for retrying
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_4]
2020-07-21T14:23:55,066 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_emlcfcoj_2020-07-21T14:23:51.895Z]
failed!
2020-07-21T14:23:55,066 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still
have more chances[2/3] to process the
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_1].
2020-07-21T14:23:55,235 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_omoemikj_2020-07-21T14:23:55.066Z]
for retrying
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_1]
2020-07-21T14:23:55,684 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_bpjmkhkj_2020-07-21T14:23:52.698Z]
failed!
2020-07-21T14:23:55,684 INFO [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - We still
have more chances[2/3] to process the
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_2].
2020-07-21T14:23:55,853 INFO [task-monport=8091, useHttps=false,
subTaskId='partial_range_index_generate_listing-stats_bpeglhfn_2020-07-20T07:43:00.320Z',
interval=2019-03-01T00:00:00.000Z/2019-04-01T00:00:00.000Z,
secondaryPartition=BuildingSingleDimensionShardSpec{bucketId=20,
dimension='H3_HEX_RES_5', start='8530a127fffffff', end='85390ca3fffffff',
partitionNum=3}}, PartitionLocation{host='ip-10-0-2-94.ec2.internal',
port=8091, useHttps=false,
subTaskId='partial_range_index_generate_listing-stats_cbpldlfd_2020-07-19T21:45:35.321Z',itor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Submitted a
new
task[partial_index_generic_merge_listing-stats_difjpham_2020-07-21T14:23:55.684Z]
for retrying
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_2]
2020-07-21T14:23:56,186 WARN [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
task[partial_index_generic_merge_listing-stats_hgedboii_2020-07-21T14:23:53.511Z]
failed!
2020-07-21T14:23:56,186 ERROR [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor -
spec[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z_0] failed
after [3] tries
2020-07-21T14:23:56,187 ERROR [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner -
Failed because of the failed sub
task[partial_index_generic_merge_listing-stats_hgedboii_2020-07-21T14:23:53.511Z]
2020-07-21T14:23:56,187 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner -
Cleaning up resources
2020-07-21T14:23:56,187 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to
cancel
subtask[partial_index_generic_merge_listing-stats_omoemikj_2020-07-21T14:23:55.066Z]
2020-07-21T14:23:56,188 ERROR [task-monitor-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Error while
monitoring
java.lang.RuntimeException: java.lang.InterruptedException
at
org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:268)
~[druid-server-0.19.0.jar:0.19.0]
at
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor.lambda$start$0(TaskMonitor.java:128)
~[druid-indexing-service-0.19.0.jar:0.19.0]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_252]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
[?:1.8.0_252]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
[?:1.8.0_252]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_252]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
~[?:1.8.0_252]
at
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
~[guava-16.0.1.jar:?]
at
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
~[guava-16.0.1.jar:?]
at
org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:173)
~[druid-server-0.19.0.jar:0.19.0]
at
org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:145)
~[druid-server-0.19.0.jar:0.19.0]
at
org.apache.druid.client.indexing.HttpIndexingServiceClient.getTaskStatus(HttpIndexingServiceClient.java:253)
~[druid-server-0.19.0.jar:0.19.0]
... 8 more
2020-07-21T14:23:56,347 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to
cancel
subtask[partial_index_generic_merge_listing-stats_difjpham_2020-07-21T14:23:55.684Z]
2020-07-21T14:23:56,475 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to
cancel
subtask[partial_index_generic_merge_listing-stats_lcohkdeb_2020-07-21T14:23:53.087Z]
2020-07-21T14:23:56,568 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to
cancel
subtask[partial_index_generic_merge_listing-stats_ibkdpjng_2020-07-21T14:23:54.024Z]
2020-07-21T14:23:56,662 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Request to
cancel
subtask[partial_index_generic_merge_listing-stats_dginpfip_2020-07-21T14:23:54.532Z]
2020-07-21T14:23:56,758 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.common.task.batch.parallel.TaskMonitor - Stopped
taskMonitor
2020-07-21T14:23:56,762 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed
with status: {
"id" : "index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z",
"status" : "FAILED",
"duration" : 221801819,
"errorMsg" : null,
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
}
2020-07-21T14:23:56,770 INFO [main]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [ANNOUNCEMENTS]
2020-07-21T14:23:56,774 INFO [main]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [SERVER]
2020-07-21T14:23:56,778 INFO [main]
org.eclipse.jetty.server.AbstractConnector - Stopped
ServerConnector@bb50de6{HTTP/1.1,[http/1.1]}{0.0.0.0:8100}
2020-07-21T14:23:56,778 INFO [main] org.eclipse.jetty.server.session - node0
Stopped scavenging
2020-07-21T14:23:56,780 INFO [main]
org.eclipse.jetty.server.handler.ContextHandler - Stopped
o.e.j.s.ServletContextHandler@2a54c92e{/,null,UNAVAILABLE}
2020-07-21T14:23:56,786 INFO [main]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [NORMAL]
2020-07-21T14:23:56,787 INFO [main]
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting
graceful shutdown of
task[index_parallel_listing-stats_ompohlkj_2020-07-19T00:47:06.225Z].
2020-07-21T14:23:56,787 INFO [main]
org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexPhaseRunner -
Cleaning up resources
2020-07-21T14:23:56,791 INFO [main]
org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager
- CoordinatorPollingBasicAuthenticatorCacheManager is stopping.
2020-07-21T14:23:56,791 INFO [main]
org.apache.druid.security.basic.authentication.db.cache.CoordinatorPollingBasicAuthenticatorCacheManager
- CoordinatorPollingBasicAuthenticatorCacheManager is stopped.
2020-07-21T14:23:56,791 INFO [main]
org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager
- CoordinatorPollingBasicAuthorizerCacheManager is stopping.
2020-07-21T14:23:56,791 INFO [main]
org.apache.druid.security.basic.authorization.db.cache.CoordinatorPollingBasicAuthorizerCacheManager
- CoordinatorPollingBasicAuthorizerCacheManager is stopped.
2020-07-21T14:23:56,794 INFO
[LookupExtractorFactoryContainerProvider-MainThread]
org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop
exited. Lookup notices are not handled anymore.
2020-07-21T14:23:56,829 INFO [Curator-Framework-0]
org.apache.curator.framework.imps.CuratorFrameworkImpl -
backgroundOperationsLoop exiting
2020-07-21T14:23:56,834 INFO [main] org.apache.zookeeper.ZooKeeper -
Session: 0x100000025e001c7 closed
2020-07-21T14:23:56,834 INFO [main]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [INIT]
Finished peon task
```
### Cluster Size
- Master server
- m5.large instance: 2 vCPU, 8 GiB Memory
- Data servers
- 2 i3.xlarge instances: 4 vCPU, 30.5 GiB Memory, 1 x 0.95TB SSD (each)
- Query server
- m5.large instace: 2 vCPU, 8GiB Memory
### Configurations in use
```
druid.indexer.runner.javaOpts=-server -Xms4g -Xmx4g
-XX:MaxDirectMemorySize=4g -Duser.timezone=UTC -Dfile.encoding=UTF-8
-XX:+UseG1GC -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/ec2-user/apache-druid-0.18.1/var/heap_dumps/druid-heap.hprof
-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
```
### IngestionSpec
```json
{
"type": "index_parallel",
"spec": {
"ioConfig": {
"type": "index_parallel",
"inputSource": {
"type": "s3",
"prefixes": [
"s3://atr-druid-data-sources/listing-stats/"
]
},
"inputFormat": {
"type": "csv",
"findColumnsFromHeader": true,
"listDelimiter": "|"
},
"appendToExisting": false
},
"tuningConfig": {
"type": "index_parallel",
"partitionsSpec": {
"type": "single_dim",
"partitionDimension": "H3_HEX_RES_5",
"targetRowsPerSegment": 5500000
},
"forceGuaranteedRollup": true,
"logParseExceptions": true,
"maxNumConcurrentSubTasks": 2,
"dataSchema": {
"dataSource": "listing-stats",
"granularitySpec": {
"type": "uniform",
"queryGranularity": "DAY",
"rollup": true,
"intervals": [
"2018-05-01T00:00:00.000Z/2020-06-01T00:00:00.000Z"
],
"segmentGranularity": "DAY"
},
"timestampSpec": {
"column": "CALENDAR_DATE",
"format": "auto"
},
"dimensionsSpec": {
"dimensions": [
{
"type": "string",
"name": "ARRANGEMENT_TYPE"
},
{
"type": "string",
"name": "H3_HEX_RES_5"
},
{
"type": "long",
"name": "BATHROOMS"
},
{
"type": "long",
"name": "INSTANT_BOOK"
},
{
"type": "long",
"name": "IS_HOTEL"
},
{
"type": "long",
"name": "IS_MANAGED"
},
{
"type": "string",
"name": "UID"
},
{
"type": "long",
"name": "SLEEPS"
},
{
"type": "long",
"name": "ROOMS"
},
{
"type": "long",
"name": "NUMBER_OF_REVIEWS"
},
{
"type": "long",
"name": "RATING"
}
]
},
"metricsSpec": [
{
"name": "sum_BLOCKED",
"type": "longSum",
"fieldName": "BLOCKED"
},
{
"name": "sum_BOOKED",
"type": "longSum",
"fieldName": "BOOKED"
},
{
"name": "sum_LEAD_TIME_PROP",
"type": "longSum",
"fieldName": "LEAD_TIME_PROP"
},
{
"name": "sum_LISTED_NIGHT_RATE",
"type": "longSum",
"fieldName": "LISTED_NIGHT_RATE"
},
{
"name": "sum_BOOKING_PROP",
"type": "floatSum",
"fieldName": "BOOKING_PROP"
},
{
"name": "sum_BOOKED_NIGHT_RATE",
"type": "longSum",
"fieldName": "BOOKED_NIGHT_RATE"
},
{
"name": "sum_ACTIVE",
"type": "longSum",
"fieldName": "ACTIVE"
}
]
}
}
}
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]