gianm edited a comment on issue #11929:
URL: https://github.com/apache/druid/issues/11929#issuecomment-973174010


   > As for the partition, I think an independent PARTITION BY is more 
flexible. ClickHouse also provides optional PARTITION BY keyword. If it's not 
specified, data would not be partitioned.
   
   @FrankChen021 It sounds like PARTITION BY is similar to Druid's "segment 
granularity". I was suggesting we call that "BUCKET BY", but "PARTITION BY" does 
seem to be more common. To figure out what we should do, I did some 
research into what these concepts are usually called.
   
   One concept is "splitting the dataset into subsets, where each subset has a 
_single value_ of a key". This is often used to simplify data management, 
because it enables rewriting one partition without touching anything else. 
It's common for the key to be some time function like hour, day, or month. This 
is supported by a variety of databases, although not all of them. "PARTITION BY" 
or "PARTITIONED BY" seems to be the most common term:
   
   - Druid: segmentGranularity
   - BigQuery: [PARTITION 
BY](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#partition_expression)
   - Snowflake: does not appear to offer this
   - Redshift: does not appear to offer this
   - Hive: [PARTITIONED 
BY](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable)
   - Spark SQL: [PARTITIONED 
BY](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html)
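   
   For a concrete illustration of this first concept, here is a Hive-style 
sketch (table and column names are made up for illustration): each value of the 
partition key gets its own subset of the data, and one partition can be 
rewritten without touching the others.
   
   ```sql
   -- Hive-style DDL: one partition per value of `dt`
   CREATE TABLE page_views (
     channel STRING,
     views   BIGINT
   )
   PARTITIONED BY (dt STRING);
   
   -- Rewriting one day touches only that day's partition
   INSERT OVERWRITE TABLE page_views PARTITION (dt = '2021-11-18')
   SELECT channel, views FROM staging_page_views;
   ```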
   
   Another concept is "colocating data with the same or similar values of a key". 
This is used to improve compression and query performance. It's supported by 
every database I checked.
   
   - Druid: partitionsSpec
   - BigQuery: [CLUSTER 
BY](https://cloud.google.com/bigquery/docs/creating-clustered-tables)
   - Snowflake: [CLUSTER 
BY](https://docs.snowflake.com/en/sql-reference/sql/create-table.html)
   - Redshift: 
[SORTKEY](https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html)
   - Hive: [CLUSTERED 
BY](https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable)
   - Spark SQL: [CLUSTERED 
BY](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html)
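   
   As a sketch of the second concept, BigQuery lets you combine both in one 
statement (again, the dataset and column names here are illustrative): rows are 
first split into daily partitions, then sorted/colocated by the clustering key 
within each partition.
   
   ```sql
   -- BigQuery-style DDL: daily partitions, clustered on `channel` within each
   CREATE TABLE mydataset.page_views
   PARTITION BY DATE(event_time)
   CLUSTER BY channel
   AS
   SELECT event_time, channel, views
   FROM mydataset.raw_events;
   ```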
   
   With all this in mind it seems like the most conventional language would be 
PARTITION BY for segment granularity and CLUSTER BY for secondary partitioning. 
Meaning the query would look like:
   
   ```sql
   INSERT INTO tbl
   SELECT ...
   FROM ...
   PARTITION BY FLOOR(__time TO DAY)
   CLUSTER BY channel
   ```
   
   I think there is some risk here of confusion with "PARTITION BY" vs. Druid's 
"partitionsSpec" ingestion config, which also uses the word "partition" but 
refers more to the "clustering" concept. But I could believe this is fine for 
the sake of having the SQL language be more aligned with other DBs.
   
   I'm ok with going with this language. What do people think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


