harnasz opened a new issue #9460: Issue with CONCAT expression when using Kafka streaming ingestion. URL: https://github.com/apache/druid/issues/9460

We are seeing issues when using Kafka streaming ingestion with the CONCAT expression, whereby it prepends `["` and appends `"]` to the result. We believe this could be because the rows are still in the aggregate heap memory and have not yet been persisted to segments. See below for more detail.

## Affected Version

`0.16` and `0.17`

## Description

#### Cluster size

Single server using the quickstart `bin/start-single-server-small`.

#### Configurations in use

Using the default configuration located here: `conf/druid/single-server/small`, but with MySQL for the metadata storage and globally cached lookups enabled.

## Steps to reproduce the problem

**Set up a Kafka supervisor**

*Note `"maxRowsPerSegment": 3` in `tuningConfig`; we will refer to this later.*

```
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "items",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "csv",
        "timestampSpec": {
          "column": "time",
          "format": "iso"
        },
        "columns": [
          "time",
          "currency",
          "value"
        ],
        "dimensionsSpec": {
          "dimensions": [
            "currency"
          ]
        }
      }
    },
    "metricsSpec": [
      {
        "name": "count",
        "type": "count"
      },
      {
        "name": "sum_value",
        "type": "doubleSum",
        "fieldName": "value"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "WEEK",
      "queryGranularity": "NONE"
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 3
  },
  "ioConfig": {
    "topic": "debugitems",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    },
    "taskCount": 1,
    "replicas": 1,
    "taskDuration": "PT1H"
  }
}
```

**Produce the messages**

Using kafkacat, execute the following *two* commands to produce the messages:

```
echo "2020-01-14T11:11:00.000Z,GBP,30.12" | kafkacat -b 127.0.0.1:9092 -t debugitems
echo "2020-01-15T11:11:00.000Z,EUR,30.12" | kafkacat -b 127.0.0.1:9092 -t debugitems
```

Then run the following query:

```
SELECT __time, sum_value, CONCAT(TIME_FORMAT(__time, 'yyyy-MM-dd'), ':',
currency) as "concat_expression" FROM items
```

and you will see the results below:

```
(Query 1 Result)
+--------------------------+-----------+--------------------+
| __time                   | sum_value | concat_expression  |
+--------------------------+-----------+--------------------+
| 2020-01-14T11:11:00.000Z | 30.12     | ["2020-01-14:GBP"] |
| 2020-01-15T11:11:00.000Z | 30.12     | ["2020-01-15:EUR"] |
+--------------------------+-----------+--------------------+
```

If you then produce another message:

```
echo "2020-01-16T11:11:00.000Z,GBP,30.12" | kafkacat -b 127.0.0.1:9092 -t debugitems
```

and rerun the above query, you will see:

```
(Query 2 Result)
+--------------------------+-----------+-------------------+
| __time                   | sum_value | concat_expression |
+--------------------------+-----------+-------------------+
| 2020-01-14T11:11:00.000Z | 30.12     | 2020-01-14:GBP    |
| 2020-01-15T11:11:00.000Z | 30.12     | 2020-01-15:EUR    |
| 2020-01-16T11:11:00.000Z | 30.12     | 2020-01-16:GBP    |
+--------------------------+-----------+-------------------+
```

## The Issue

In the results of Query 1, the values in the `concat_expression` column are wrapped with `["` and `"]`. We tried using the LTRIM and RTRIM functions to trim `["` and `"]`, but they have no effect on the value produced by the expression.

In the results of Query 2, after three rows have been ingested, the values in the `concat_expression` column are *no* longer wrapped with `["` and `"]`. We believe this is because while only two rows have been ingested they are held aggregated in heap memory, but when the third row is ingested they are persisted to a segment, due to `maxRowsPerSegment` being set to `3`.

We see no issues when using batch ingestion, only streaming ingestion.

We are using the result of the CONCAT expression to invoke a lookup, but because the expression prepends `["` and appends `"]`, the value cannot be found in the lookup.
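As a client-side workaround we have considered unwrapping the value before using it as a lookup key. This is only a minimal sketch in Python; the `unwrap_concat` helper is our own illustration, not part of Druid, and it assumes the wrapped value is rendered as a single-element JSON array:

```python
import json

def unwrap_concat(value: str) -> str:
    """Return the plain string if the value arrived wrapped as a
    single-element JSON array (e.g. '["2020-01-14:GBP"]'),
    otherwise return it unchanged."""
    if value.startswith('["') and value.endswith('"]'):
        parsed = json.loads(value)
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
    return value

# Rows still in heap memory come back wrapped; persisted rows do not.
print(unwrap_concat('["2020-01-14:GBP"]'))  # 2020-01-14:GBP
print(unwrap_concat('2020-01-16:GBP'))      # 2020-01-16:GBP
```

This only masks the inconsistency on the client side, of course; it does not explain why the in-memory rows are rendered differently.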
## Any debugging that you have already done

The only debugging carried out so far has been changing the tuning config values, reducing `intermediatePersistPeriod` or `maxRowsPerSegment` so that rows are persisted to segments sooner.
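For reference, the kind of tuning-config change described above looks like the following (the `PT1M` period here is illustrative, not the exact value we used):

```
{
  "tuningConfig": {
    "type": "kafka",
    "maxRowsPerSegment": 3,
    "intermediatePersistPeriod": "PT1M"
  }
}
```

Lowering either setting makes the in-memory rows persist sooner, which makes the wrapped values disappear earlier but does not fix the underlying behaviour.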
