gianm commented on issue #6066: Sorting rows when rollup is disabled
URL: https://github.com/apache/incubator-druid/issues/6066#issuecomment-410052884

> I'm guessing Druid does the rollup by sorting the dimensions in the order in the ingestion spec, which would indicate that dimension ordering would matter. Would ordering dims by based on cardinality help further? I'm guessing low->high card would be better than high->low, but idk enough about LZ(4?) compression to have a good intuition on this.

Yep, it does sort in the order that you provide them in the ingestion spec. LZ4 works by finding and eliminating redundancies in a sliding window, so it works best if you have a lot of local redundancy. CONCISE / Roaring also work better in that case (they can encode runs). And selecting values out of columns is faster if you are selecting rows that are close together.

If you mostly care about size, I think you should do best starting with a lower cardinality column, especially one that a lot of other columns will have some kind of dependency on (like if you have a 'country' column and each region, city, user, etc. will generally just be in one country). That way you're maximizing compression for as many columns as possible.

If you care about retrieval speed too, then it helps to think about locality: if you have a column you often filter on, you want it first/early in the sort order. Sometimes a column satisfies both of these, and in that case life is good (e.g. "tenant_id" for a multi-tenant-one-table design).

In the case I mentioned (where we saw dramatically better compression after sorting) it was an e-commerce dataset, and we sorted rows first on product brand, then on product id.

Btw, this line of thinking is most effective if your queryGranularity is coarse, since Druid always sorts time _first_ and then your other dimensions.
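To get a rough feel for the effect described above, here is a small self-contained sketch. It is not how Druid stores or compresses segments: the data is synthetic, the `country` and `city` columns and all helper names are made up for illustration, and `zlib` from the Python standard library stands in for LZ4 (both are LZ77-style sliding-window compressors, but the numbers will differ). It compares per-column compressed size and run counts for unsorted rows versus rows sorted with the low-cardinality column first.

```python
import random
import zlib


def make_rows(n, seed=0):
    """Build synthetic (country, city) rows where each city maps to one country."""
    rng = random.Random(seed)
    countries = [f"country_{i}" for i in range(5)]
    # Hypothetical dependency: every city belongs to exactly one country.
    city_to_country = {f"city_{i}": countries[i % 5] for i in range(200)}
    city_names = list(city_to_country)
    rows = []
    for _ in range(n):
        city = rng.choice(city_names)
        rows.append((city_to_country[city], city))
    return rows


def column_bytes(rows, idx):
    """Serialize one column as newline-separated values (a stand-in for a column file)."""
    return "\n".join(row[idx] for row in rows).encode()


def compressed_size(rows):
    """Total compressed size across columns, compressing each column on its own.

    zlib is used here only as a stand-in for LZ4.
    """
    return sum(len(zlib.compress(column_bytes(rows, i))) for i in range(len(rows[0])))


def count_runs(values):
    """Count maximal runs of equal adjacent values; fewer runs suit run-based bitmap encodings."""
    runs = 0
    prev = object()  # sentinel that never equals a real value
    for value in values:
        if value != prev:
            runs += 1
            prev = value
    return runs


unsorted_rows = make_rows(20000)
# Sort on the low-cardinality column first, then the dependent higher-cardinality one.
sorted_rows = sorted(unsorted_rows)

print("unsorted bytes:", compressed_size(unsorted_rows))
print("sorted bytes:  ", compressed_size(sorted_rows))
print("country runs unsorted:", count_runs(r[0] for r in unsorted_rows))
print("country runs sorted:  ", count_runs(r[0] for r in sorted_rows))
```

In this toy setup, sorting on `country` first also clusters `city`, because each city belongs to a single country, so both columns end up with long runs. That mirrors the 'country' example above, but only under these assumed data shapes.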
