FrankChen021 opened a new issue, #17891:
URL: https://github.com/apache/druid/issues/17891

   ### Description
   
   After we upgraded one of our clusters from Druid 22 to Druid 27, the performance of a user query against a data set of over 2 billion rows degraded drastically, from about 5 seconds to more than 300 seconds.
   
   The SQL looks like this:
   
    ```sql
    SELECT
        id,
        sum(val) / 30 AS l30d_ado
    FROM mp
    WHERE country = 'xx'
      AND (is_cross_border = 1
           OR id IN (
               570092232,
               ... a lot of ids
           ))
      AND __time >= TIMESTAMP '2025-03-08 17:00:00'
      AND __time < TIMESTAMP '2025-04-07 17:00:00'
    GROUP BY id
    ORDER BY l30d_ado DESC
    ```
   
   After some investigation, we found that the EXPLAIN results differ. On Druid 22, the native query filters the `id` field with an `in` filter:
   
   ```json
   {
           "type": "or",
           "fields": [
             {
               "type": "bound",
               "dimension": "is_cross_border",
               "lower": "1",
               "upper": "1",
               "lowerStrict": false,
               "upperStrict": false,
               "extractionFn": null,
               "ordering": {
                 "type": "numeric"
               }
             },
             {
               "type": "in",
               "dimension": "id",
               "values": [
                 "788563396",
                 "769186459",
                 "1189559866",
                 "1210227126",
                 "998697532",
              ... a lot of ids
   ```
   
   However, on Druid 27, the EXPLAIN output for the `id` field looks like this:
   
   ```json
        {
           "type": "or",
           "fields": [
             {
               "type": "bound",
               "dimension": "is_cross_border",
               "lower": "1",
               "upper": "1",
               "ordering": {
                 "type": "numeric"
               }
             },
             {
               "type": "bound",
               "dimension": "id",
               "lower": "570092232",
               "upper": "570092232",
               "ordering": {
                 "type": "numeric"
               }
             },
             {
               "type": "bound",
               "dimension": "id",
               "lower": "664177432",
               "upper": "664177432",
               "ordering": {
                 "type": "numeric"
               }
             },
             {
               "type": "bound",
               "dimension": "id",
               "lower": "881910562",
               "upper": "881910562",
               "ordering": {
                 "type": "numeric"
               }
             },
             ... many more 'bound' filters on 'id', one per value
   ```
   
   Druid 27 generates a separate 'bound' filter for each `id` value. If we copy the native query generated on Druid 22 and run it directly on Druid 27, it returns the result in less than 1 second, as expected.
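
   For anyone reproducing this comparison, a native query JSON can be submitted directly to the Broker's `/druid/v2` endpoint; a minimal sketch, assuming the Broker listens on localhost:8082 and the Druid 22 plan is saved to `native-query-from-druid22.json` (both are assumptions for illustration):

   ```shell
   # Submit the Druid 22 native query JSON directly to the Druid 27 Broker.
   # Broker host/port (localhost:8082) and the file name are assumptions.
   curl -X POST 'http://localhost:8082/druid/v2?pretty' \
     -H 'Content-Type: application/json' \
     -d @native-query-from-druid22.json
   ```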
   
   If we look at the schema of the data source on the two clusters, we find that on the old one (Druid 22) the `id` field is shown as `BIGINT`, while on Druid 27 it is shown as `VARCHAR`.
   
   In the supervisor configuration of both clusters, the `id` field is declared as `long` in the dimension spec.
   However, daily batch ingestion tasks overwrite all realtime segments, and in the batch ingestion task spec of both clusters the `id` field is declared as `string` (it should be `long`) in the dimension spec.
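
   If the mixed types are indeed the cause, one possible fix is to align the batch ingestion spec with the supervisor so that newly built segments store `id` as a long; a minimal sketch of the corrected `dimensionsSpec` fragment (other dimensions elided):

   ```json
   "dimensionsSpec": {
     "dimensions": [
       { "type": "long", "name": "id" },
       ...
     ]
   }
   ```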
   
   So I think this type conflict between segments generated by batch ingestion and by realtime ingestion causes the problem.
   The following table summarizes the differences between the two versions:
   
    | version | type of `id` in supervisor | type of `id` in batch ingestion | type displayed in data source | native query plan |
    |---------|----------------------------|---------------------------------|-------------------------------|-------------------|
    | Druid 22 | long | string | BIGINT | all ids are converted to strings and combined into a single `IN` filter |
    | Druid 27 | long | string | VARCHAR | each id is converted to a separate `bound` filter |
   
   
   This raises several questions that are hard to explain:
   1. If the type of `id` in the data source is `BIGINT` on Druid 22, why does the native query convert the values to strings for use in the `in` filter?
   2. If the type in the data source is `VARCHAR` on Druid 27, why does the query use `bound` filters with numeric ordering?
   3. Why is the type displayed in the data source different between the two clusters? Which change between 22 and 27 might cause this?
   
   @gianm @clintropolis do you have any idea?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

