gianm edited a comment on issue #11929:
URL: https://github.com/apache/druid/issues/11929#issuecomment-973174010


   > As for the partition, I think an independent PARTITION BY is more 
flexible. ClickHouse also provides optional PARTITION BY keyword. If it's not 
specified, data would not be partitioned.
   
   @FrankChen021 It sounds like PARTITION BY is similar to Druid's "segment 
granularity". I was suggesting we call that "BUCKET BY", but "PARTITION BY" does 
seem to be more common. To figure out what we should do, I did some 
research into what these concepts are usually called.
   
   One concept is "splitting the dataset into subsets, where each subset has a 
_single value_ of a key". This is often used to simplify data management, 
because it enables rewriting one partition without touching anything else. 
It's common for the key to be some time function like hour, day, or month. This 
is supported by a variety of databases, although not all of them. "PARTITION BY" 
or "PARTITIONED BY" seems to be the most common term:
   
   - Druid: segmentGranularity
   - BigQuery: [PARTITION 
BY](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#partition_expression)
   - Snowflake: does not appear to offer this
   - Redshift: does not appear to offer this
   - Hive: [PARTITIONED 
BY](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable)
   - Spark SQL: [PARTITIONED 
BY](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html)
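   
   For a concrete illustration of this first concept, here is a Hive-style 
sketch (table and column names are made up for illustration): each value of the 
partition key gets its own subset of the data, and one partition can be 
rewritten without touching the others.
   
   ```sql
   -- Hive-style DDL: one partition per value of `dt`
   CREATE TABLE page_views (
     channel STRING,
     views   BIGINT
   )
   PARTITIONED BY (dt STRING);
   
   -- Rewriting one day touches only that day's partition
   INSERT OVERWRITE TABLE page_views PARTITION (dt = '2021-11-18')
   SELECT channel, views FROM staging_page_views;
   ```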
   
   Another concept is "colocating data with the same or similar values of a key". 
This is used to improve compression and query performance. It's supported by 
every database I checked.
   
   - Druid: partitionsSpec
   - BigQuery: [CLUSTER 
BY](https://cloud.google.com/bigquery/docs/creating-clustered-tables)
   - Snowflake: [CLUSTER 
BY](https://docs.snowflake.com/en/sql-reference/sql/create-table.html)
   - Redshift: 
[SORTKEY](https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html)
   - Hive: [CLUSTERED 
BY](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable)
   - Spark SQL: [CLUSTERED 
BY](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html)
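   
   As a sketch of the second concept, BigQuery lets you combine both in one 
statement (again, the dataset and column names here are illustrative): rows are 
first split into daily partitions, then sorted/colocated by the clustering key 
within each partition.
   
   ```sql
   -- BigQuery-style DDL: daily partitions, clustered on `channel` within each
   CREATE TABLE mydataset.page_views
   PARTITION BY DATE(event_time)
   CLUSTER BY channel
   AS
   SELECT event_time, channel, views
   FROM mydataset.raw_events;
   ```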
   
   With all this in mind it seems like the most conventional language would be 
PARTITION BY for segment granularity and CLUSTER BY for secondary partitioning. 
Meaning the query would look like:
   
   ```sql
   INSERT INTO tbl
   SELECT ...
   FROM ...
   PARTITION BY FLOOR(__time TO DAY)
   CLUSTER BY channel
   ```
   
   I think there is some risk here of confusion with "PARTITION BY" vs. Druid's 
"partitionsSpec" ingestion config, which also uses the word "partition" but 
refers more to the "clustering" concept. But I could believe this is fine for 
the sake of having the SQL language be more aligned with other DBs.
   
   I'm ok with going with this language. What do people think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


