alduleacristi opened a new issue, #18152:
URL: https://github.com/apache/druid/issues/18152
I'm working with Apache Druid and have introduced a second timestamp column
called ingestionTimestamp to support a deduplication job. Additionally, I have
a column named tags, which is a multi-value VARCHAR column.
The deduplication is performed with an MSQ (multi-stage query) statement like
the following:
REPLACE INTO "target-datasource"
OVERWRITE
SELECT
  __time,
  LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
  LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
  LATEST_BY("tagSetA", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetA",
  LATEST_BY("tagSetB", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "tagSetB",
  MAX("ingestionTimestamp") AS ingestionTimestamp
FROM "target-datasource"
GROUP BY
  __time,
  "entityUID"
PARTITIONED BY 'P1M';
**Problem:**
After running this query, the tag columns (tagSetA, tagSetB) are no longer
multi-value. This breaks downstream queries that rely on the multi-value
nature of these columns.
**My understanding:**
MSQ might not support preserving multi-value columns directly, especially
when using functions like LATEST_BY.
**Question:**
How can I run this kind of deduplication query while preserving the
multi-value format of these columns? Is there a recommended approach or
workaround in Druid to handle this scenario?
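
One possible direction, offered as an untested sketch rather than a confirmed fix: serialize each multi-value column to a delimited string before aggregating, take the latest string, and split it back into a multi-value string afterwards. MV_TO_STRING, STRING_TO_MV, and LATEST_BY are existing Druid SQL functions, but whether this combination is written back as a multi-value column depends on the Druid version and query context, so that part is an assumption to verify. The sketch also assumes that no tag value contains a comma (the second argument of STRING_TO_MV is treated as a regular expression, so a plain comma is used), and it writes OVERWRITE ALL, which is a guess at the intended overwrite scope.

```sql
-- Sketch only: assumes tag values never contain ',' and that the whole
-- datasource is meant to be replaced (OVERWRITE ALL is an assumption).
REPLACE INTO "target-datasource"
OVERWRITE ALL
SELECT
  __time,
  LATEST_BY("entityId", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityId",
  LATEST_BY("entityName", MILLIS_TO_TIMESTAMP("ingestionTimestamp")) AS "entityName",
  -- Join the multi-value column into one string, pick the latest string,
  -- then split it back into a multi-value string.
  STRING_TO_MV(
    LATEST_BY(MV_TO_STRING("tagSetA", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp")),
    ','
  ) AS "tagSetA",
  STRING_TO_MV(
    LATEST_BY(MV_TO_STRING("tagSetB", ','), MILLIS_TO_TIMESTAMP("ingestionTimestamp")),
    ','
  ) AS "tagSetB",
  MAX("ingestionTimestamp") AS ingestionTimestamp
FROM "target-datasource"
GROUP BY
  __time,
  "entityUID"
PARTITIONED BY 'P1M';
```

If this behaves as hoped, rows that originally carried several tags should report MV_LENGTH("tagSetA") greater than 1 after the job completes. Long tag lists may also need the optional third argument of LATEST_BY (the maximum bytes per value) raised so the joined string is not truncated.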