[I] How to preserve multi-value column format in MSQ deduplication query in Apache Druid? (druid)

via GitHub Wed, 18 Jun 2025 00:43:08 -0700


alduleacristi opened a new issue, #18152:
URL: https://github.com/apache/druid/issues/18152


   I'm working with Apache Druid and have introduced a second timestamp column 
called ingestionTimestamp to support a deduplication job. Additionally, I have 
a column named tags, which is a multi-value VARCHAR column.
   
   The deduplication is performed using an MSQ (Multi-Stage Query) like the 
following:
   
       REPLACE INTO "target-datasource" 
       OVERWRITE 
   
       SELECT 
           __time,
           LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
           LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
           LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
           LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
           MAX("ingestionTimestamp") AS ingestionTimestamp
       FROM "target-datasource"
       GROUP BY 
           __time, 
           "entityUID"
       PARTITIONED BY 'P1M';
   
   **Problem:**
   After running this query, the tag columns (tagSetA, tagSetB) are no 
longer stored as multi-value columns. This breaks downstream queries that 
rely on these columns being multi-value.
   
   **My understanding:**
   MSQ might not support preserving multi-value columns directly, especially 
when using functions like LATEST_BY.
   
   **Question:**
   How can I run this kind of deduplication query while preserving the 
multi-value format of these columns? Is there a recommended approach or 
workaround in Druid to handle this scenario?
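   
   **Possible direction (untested sketch):**
   One workaround that may be worth evaluating is to flatten each multi-value 
column to a delimited string before aggregating, then split it back into a 
multi-value string afterwards. The sketch below shows only the SELECT portion 
of the query. It assumes that no tag value contains the chosen delimiter (a 
comma here), that the joined string fits within the size limit LATEST_BY 
applies to strings (its optional third argument, maxBytesPerValue), and that 
MSQ writes the STRING_TO_MV result as a multi-value string column. None of 
these assumptions has been confirmed against this datasource.
   
   ```sql
   -- Sketch only: round-trip the tag columns through a delimited string.
   -- Assumption: ',' never appears inside an individual tag value.
   SELECT
       __time,
       LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
       LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
       -- Join the values into one string, take the latest string, split it again.
       STRING_TO_MV(
           LATEST_BY(MV_TO_STRING("tagSetA", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp")),
           ','
       ) AS "tagSetA",
       STRING_TO_MV(
           LATEST_BY(MV_TO_STRING("tagSetB", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp")),
           ','
       ) AS "tagSetB",
       MAX("ingestionTimestamp") AS ingestionTimestamp
   FROM "target-datasource"
   GROUP BY
       __time,
       "entityUID"
   ```
   
   Running this SELECT on its own first and inspecting the column types of 
the result would show whether the tag columns come back as multi-value before 
it is wrapped in the REPLACE statement.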

