sascha-coenen opened a new issue #8922: ArrayIndexOutOfBoundsException on 
historical in class GroupByMergingQueryRunnerV2 when grouping on 
high-cardinality dimension
URL: https://github.com/apache/incubator-druid/issues/8922
 
 
   
   
   ### Affected Version
   Druid 0.16.0
   
   ### Description
   
   An ArrayIndexOutOfBoundsException is thrown by a historical when grouping on 
a high-cardinality dimension. 
   
   This happens reproducibly on a test cluster with two historicals (r4.4xl) and a dataset that was ingested with the new native index_parallel job running on the new Druid indexer process, not on a middlemanager.
   The dataset is 12 GB in size, as displayed in Druid's legacy coordinator console, and consists of 500 shards. Segment size is around 25MB with 100k records per segment, and rollup was disabled during ingestion (83 dimensions, 1 metric).
   
   The data covers a single hour, with no query granularity and hourly segment granularity.
   The data set was just for testing, so it is not a battle-hardened data model. The data is battle-tested insofar as it is authentic production data, which we ingest with a hadoop indexer into our production cluster. In that case the data is rolled up. In contrast, I took this data set and ingested it without rollup using the new experimental ingestion pipeline that was introduced in Druid 0.16.
   
   I was executing the following query
   
   ```
   SELECT
     "deviceId",
     COUNT(*) AS "Count"
   FROM "hackathon"
   WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' YEAR
   GROUP BY 1
   ORDER BY "Count" DESC
   LIMIT 100
   ```
   
   This always raises the following exception within one of the historicals:
   
   ```
        [java] 2019-11-21T18:17:57,132 ERROR [processing-3] org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2 - Exception with one of the sequences!
        [java] java.lang.ArrayIndexOutOfBoundsException
        [java] 2019-11-21T18:17:57,132 ERROR [processing-3] com.google.common.util.concurrent.Futures$CombinedFuture - input future failed.
        [java] java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException
        [java]  at org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:253) ~[druid-processing-0.16.0-incubating.jar:0.16.0-incubating]
        [java]  at org.apache.druid.query.groupby.epinephelinae.GroupByMergingQueryRunnerV2$1$1$1.call(GroupByMergingQueryRunnerV2.java:233) ~[druid-processing-0.16.0-incubating.jar:0.16.0-incubating]
        [java]  at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
        [java]  at org.apache.druid.query.PrioritizedListenableFutureTask.run(PrioritizedExecutorService.java:247) [druid-processing-0.16.0-incubating.jar:0.16.0-incubating]
        [java]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        [java]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        [java]  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
        [java] Caused by: java.lang.ArrayIndexOutOfBoundsException
   ```
   
   
   If I instead group on other dimensions, I do not receive any exception. So this specifically happens with a dimension that has a high cardinality because it contains device IDs.
   However, I tried to group on other high-cardinality dimensions, such as a session ID that contains a GUID, and this only resulted in a ResourceLimitExceeded exception, which is fine.
   
   I couldn't provoke another ArrayIndexOutOfBoundsException so far with any of 
the other columns.
   
   I then relaxed the above query by removing the ORDER BY clause, and a result set was returned.
   I was also able to retain the ORDER BY clause by adding the filter condition "deviceId IS NOT NULL". This no longer raised an ArrayIndexOutOfBoundsException but ran into a ResourceLimitExceeded exception instead.
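
For anyone trying to reproduce this, the three query variants described above can be generated with a small script like the one below. This is only a sketch: the broker address, the helper names `build_query` and `run_query`, and the `VARIANTS` labels are assumptions made for illustration, and the outcomes in the labels are simply what I observed on my cluster.

```python
import json
import urllib.request

# Assumption: broker host and port; adjust to your cluster.
BROKER_SQL_URL = "http://localhost:8082/druid/v2/sql"


def build_query(order_by_count=True, exclude_null_device_ids=False):
    """Build the GROUP BY query from this report, or one of its relaxed variants."""
    where = ["\"__time\" >= CURRENT_TIMESTAMP - INTERVAL '1' YEAR"]
    if exclude_null_device_ids:
        where.append('"deviceId" IS NOT NULL')
    lines = [
        'SELECT "deviceId", COUNT(*) AS "Count"',
        'FROM "hackathon"',
        "WHERE " + " AND ".join(where),
        "GROUP BY 1",
    ]
    if order_by_count:
        lines.append('ORDER BY "Count" DESC')
    lines.append("LIMIT 100")
    return "\n".join(lines)


def run_query(sql, url=BROKER_SQL_URL):
    """POST the SQL text to the broker's SQL endpoint and return the decoded JSON rows."""
    body = json.dumps({"query": sql}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Labels record the behaviour seen in this report, not guaranteed outcomes.
VARIANTS = {
    "original (ArrayIndexOutOfBoundsException)": build_query(),
    "no ORDER BY (rows returned)": build_query(order_by_count=False),
    "IS NOT NULL filter (ResourceLimitExceeded)": build_query(exclude_null_device_ids=True),
}
```

Calling `run_query(sql)` for each entry in `VARIANTS` against a broker should show which variant fails; the failing stack trace itself appears in the historical's log rather than in the HTTP response.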
   
   In summary, it looks to me as if the error might have to do with high-cardinality dimensions that can contain null entries.
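
One way to test this hypothesis would be to compare the null count and approximate cardinality of "deviceId" with those of a column that does not trigger the error. The snippet below only builds such a query; I have not run it, and the helper name `build_null_check_query` and the output column aliases are made up for illustration.

```python
def build_null_check_query(datasource="hackathon", dimension="deviceId"):
    """Build a Druid SQL query reporting null rows and approximate cardinality of one dimension."""
    return (
        "SELECT\n"
        f'  COUNT(*) FILTER (WHERE "{dimension}" IS NULL) AS "nullRows",\n'
        f'  APPROX_COUNT_DISTINCT("{dimension}") AS "approxCardinality"\n'
        f'FROM "{datasource}"\n'
        "WHERE \"__time\" >= CURRENT_TIMESTAMP - INTERVAL '1' YEAR"
    )


# Print the query for the failing column so it can be pasted into the SQL console.
print(build_null_check_query())
```

Running the same query with `dimension` set to the session ID column would show whether that column differs in having no null rows.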
   
   
