harnasz opened a new issue #9460: Issue with CONCAT expression when using Kafka 
streaming ingestion.
URL: https://github.com/apache/druid/issues/9460
 
 
   We are seeing an issue when using Kafka streaming ingestion with the `CONCAT` 
expression, whereby the result is wrapped in `["` and `"]`. We believe this could 
be because the rows are still aggregated in heap memory and have not yet been 
persisted to segments.
   
   See below for more detail.
   
   ## Affected Version
   
   `0.16` and `0.17`
   
   ## Description
   
   #### Cluster size
   
   Single server, using the quickstart `bin/start-single-server-small`
   
   #### Configurations in use
   
   Using the default configuration located here:
   
   `conf/druid/single-server/small`
   
   However, we are using MySQL for metadata storage and have enabled globally 
cached lookups.
   
   ## Steps to reproduce the problem
   
   **Setup a Kafka Supervisor**
   
   *Note the `"maxRowsPerSegment":3` in `tuningConfig`; we will refer to this 
later.* 
   
   ```
   {
     "type":"kafka",
     "dataSchema":{
       "dataSource":"items",
       "parser":{
         "type":"string",
         "parseSpec":{
           "format":"csv",
           "timestampSpec":{
             "column":"time",
             "format":"iso"
           },
           "columns":[
             "time",
             "currency",
             "value"
           ],
           "dimensionsSpec":{
             "dimensions":[
               "currency"
             ]
           }
         }
       },
       "metricsSpec":[
         {
           "name":"count",
           "type":"count"
         },
         {
           "name":"sum_value",
           "type":"doubleSum",
           "fieldName":"value"
         }
       ],
       "granularitySpec":{
         "type":"uniform",
         "segmentGranularity":"WEEK",
         "queryGranularity":"NONE"
       }
     },
     "tuningConfig":{
       "type":"kafka",
       "maxRowsPerSegment":3
     },
     "ioConfig":{
       "topic":"debugitems",
       "consumerProperties":{
         "bootstrap.servers":"localhost:9092"
       },
       "taskCount":1,
       "replicas":1,
       "taskDuration":"PT1H"
     }
   }
   ```
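
A quick sanity check of the two settings the reproduction hinges on; a minimal sketch with the spec abridged inline (field names exactly as in the spec above):

```python
import json

# Abridged copy of the supervisor spec above, keeping only the fields
# the reproduction depends on.
spec = json.loads("""
{
  "type": "kafka",
  "tuningConfig": {"type": "kafka", "maxRowsPerSegment": 3},
  "ioConfig": {"topic": "debugitems", "taskCount": 1, "replicas": 1}
}
""")

# The third ingested row should trigger a persist with this setting.
assert spec["tuningConfig"]["maxRowsPerSegment"] == 3
print(spec["ioConfig"]["topic"])  # -> debugitems
```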
   
   
   **Produce the Messages**
   
   Using kafkacat, execute the following *two* commands to produce the messages:
    
   ```
   echo "2020-01-14T11:11:00.000Z,GBP,30.12" | kafkacat -b  127.0.0.1:9092  -t 
debugitems
   echo "2020-01-15T11:11:00.000Z,EUR,30.12" | kafkacat -b  127.0.0.1:9092  -t 
debugitems
   ```
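
The payloads must match the `parseSpec` column order (`time`, `currency`, `value`) with an ISO timestamp; a small sketch verifying the shape of the two messages above:

```python
import csv
from datetime import datetime
from io import StringIO

# The two payloads produced above; columns must match the parseSpec
# order: time, currency, value.
payloads = [
    "2020-01-14T11:11:00.000Z,GBP,30.12",
    "2020-01-15T11:11:00.000Z,EUR,30.12",
]

rows = list(csv.reader(StringIO("\n".join(payloads))))
for time_str, currency, value in rows:
    # ISO timestamp with milliseconds and a literal 'Z', per the timestampSpec.
    datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%S.%fZ")
    float(value)  # the doubleSum metric field must be numeric

print([r[1] for r in rows])  # -> ['GBP', 'EUR']
```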
   
   Then run the following query:
   
   ```
   SELECT __time,  sum_value, CONCAT(TIME_FORMAT(__time, 'yyyy-MM-dd'), ':', 
currency) as "concat_expression" FROM items
   ```
   
   and you will see the results below:
   
   ```
   (Query 1 Result)
   +--------------------------+-----------+--------------------+
   |          __time          | sum_value | concat_expression  |
   +--------------------------+-----------+--------------------+
   | 2020-01-14T11:11:00.000Z |     30.12 | ["2020-01-14:GBP"] |
   | 2020-01-15T11:11:00.000Z |     30.12 | ["2020-01-15:EUR"] |
   +--------------------------+-----------+--------------------+
   ```
   
   If you then produce another message of:
   
   ```
   echo "2020-01-16T11:11:00.000Z,GBP,30.12" | kafkacat -b  127.0.0.1:9092  -t 
debugitems
   ```
   
   And then rerun the above query you will see:
   
   ```
   (Query 2 Result)
   +--------------------------+-----------+-------------------+
   |          __time          | sum_value | concat_expression |
   +--------------------------+-----------+-------------------+
   | 2020-01-14T11:11:00.000Z |     30.12 | 2020-01-14:GBP    |
   | 2020-01-15T11:11:00.000Z |     30.12 | 2020-01-15:EUR    |
   | 2020-01-16T11:11:00.000Z |     30.12 | 2020-01-16:GBP    |
   +--------------------------+-----------+-------------------+
   ```
   
   ## The Issue
   
   In the results of Query 1, the values in the `concat_expression` column are 
wrapped with `["` and `"]`. We tried using the `LTRIM` and `RTRIM` functions to 
trim the `["`, but this has no effect on the value returned by the expression. 
   
   In the results of Query 2, after 3 rows have been ingested, the values in 
the `concat_expression` column are *no* longer wrapped with `["` and `"]`.
   
   We believe this could be because, while only two rows have been ingested, they 
are held aggregated in heap memory; when the third row is ingested, the rows get 
persisted to a segment because `maxRowsPerSegment` is set to `3`.
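
The suspected behaviour can be illustrated with a toy model (this is our mental model only, not Druid's actual code): rows accumulate in an in-heap buffer and are flushed once the buffer reaches `maxRowsPerSegment`.

```python
# Toy model of the suspected persist behaviour; not Druid's actual code.
MAX_ROWS_PER_SEGMENT = 3

buffered, persisted = [], []

def ingest(row):
    buffered.append(row)
    if len(buffered) >= MAX_ROWS_PER_SEGMENT:
        # Persist the in-heap rows to a segment and clear the buffer.
        persisted.extend(buffered)
        buffered.clear()

for row in ["GBP", "EUR"]:
    ingest(row)
print(len(buffered), len(persisted))  # after 2 rows: 2 0 (Query 1: in-heap)

ingest("GBP")
print(len(buffered), len(persisted))  # after 3rd row: 0 3 (Query 2: persisted)
```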
   
   We see no issues when using batch ingestion, only with streaming ingestion. 
We use the result of the `CONCAT` expression to invoke a lookup, but because the 
result is wrapped in `["` and `"]`, the value cannot be found in the lookup.
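
As a client-side workaround for now (a sketch only; `unwrap` is a hypothetical helper, not a Druid function), the JSON-array wrapper can be detected and stripped before the value is used as a lookup key:

```python
import json

def unwrap(value):
    """Return the single element if value looks like a one-element JSON
    string array, otherwise return value unchanged."""
    if value.startswith('["') and value.endswith('"]'):
        parsed = json.loads(value)
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
    return value

print(unwrap('["2020-01-14:GBP"]'))  # -> 2020-01-14:GBP  (Query 1 shape)
print(unwrap('2020-01-16:GBP'))      # -> 2020-01-16:GBP  (Query 2 shape)
```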
   
   ## Any debugging that you have already done
   
   The only debugging carried out so far has been changing the tuning config 
values, reducing `intermediatePersistPeriod` or `maxRowsPerSegment` so that 
rows are persisted to segments sooner.
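
For example, lowering `intermediatePersistPeriod` in the `tuningConfig` (a sketch; `PT1M` is an arbitrary value chosen for debugging, not a recommendation) forces rows out of heap memory sooner:

```
{
  "tuningConfig":{
    "type":"kafka",
    "maxRowsPerSegment":3,
    "intermediatePersistPeriod":"PT1M"
  }
}
```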

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
