gianm edited a comment on issue #11929: URL: https://github.com/apache/druid/issues/11929#issuecomment-973174010
> As for the partition, I think an independent PARTITION BY is more flexible. ClickHouse also provides an optional PARTITION BY keyword. If it's not specified, data would not be partitioned.

@FrankChen021 It sounds like that PARTITION BY is similar to Druid's "segment granularity". I was suggesting we call that "BUCKET BY", but "PARTITION BY" does seem to be more common.

To figure out what we should do, I did some research into what these things are usually called.

One concept is "splitting the dataset into subsets, where each subset has a _single value_ of a key". This is often used to simplify data management, because it enables rewriting that one partition without touching anything else. It's common for the key to be some time function like hour, day, or month. This is supported by a variety of DBs, although not all of them. "PARTITION BY" or "PARTITIONED BY" seems to be the most common term:

- Druid: segmentGranularity
- BigQuery: [PARTITION BY](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#partition_expression)
- Snowflake: does not appear to offer this
- Redshift: does not appear to offer this
- Hive: [PARTITIONED BY](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable)
- Spark SQL: [PARTITIONED BY](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html)

Another concept is "colocating data with the same or similar values of a key". This is used to improve compression and query performance. It's supported by every DB I checked:
- Druid: partitionsSpec
- BigQuery: [CLUSTER BY](https://cloud.google.com/bigquery/docs/creating-clustered-tables)
- Snowflake: [CLUSTER BY](https://docs.snowflake.com/en/sql-reference/sql/create-table.html)
- Redshift: [SORTKEY](https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html)
- Hive: [CLUSTERED BY](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable)
- Spark SQL: [CLUSTERED BY](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html)

With all this in mind, the most conventional language would seem to be PARTITION BY for segment granularity and CLUSTER BY for secondary partitioning, meaning the query would look like:

```sql
INSERT INTO tbl
SELECT ... FROM ...
PARTITION BY FLOOR(__time TO DAY)
CLUSTER BY channel
```

I think there is some risk here of confusion between "PARTITION BY" and Druid's "partitionsSpec" ingestion config, which also uses the word "partition" but refers more to the "clustering" concept. But I could believe this is fine for the sake of having the SQL language be more aligned with other DBs. I'm OK with going with this language. What do people think?
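For readers less familiar with Druid's native ingestion config, the proposal roughly maps onto today's spec as follows. This is only an illustrative sketch: the field values are placeholders, `granularitySpec` and `partitionsSpec` sit in different parts of a real ingestion spec (`dataSchema` and `tuningConfig` respectively), and a `single_dim` spec is shown where `hashed` or `range` would also express the clustering concept.

```json
{
  "granularitySpec": {
    "segmentGranularity": "day"
  },
  "partitionsSpec": {
    "type": "single_dim",
    "partitionDimension": "channel"
  }
}
```

Here `segmentGranularity: "day"` corresponds to `PARTITION BY FLOOR(__time TO DAY)`, and the `single_dim` partitionsSpec on `channel` corresponds to `CLUSTER BY channel`, which is exactly where the naming collision between SQL "PARTITION BY" and the native "partitionsSpec" shows up.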
