[incubator-iceberg] 01/02: Add doc about hidden partitioning.

blue Mon, 10 Dec 2018 09:19:54 -0800

This is an automated email from the ASF dual-hosted git repository.

blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-iceberg.git


commit 06e30dd96eac4f2201065da4a8f84fe922941884
Author: Ryan Blue <[email protected]>
AuthorDate: Mon Dec 10 09:16:05 2018 -0800

    Add doc about hidden partitioning.
---
 site/docs/partitioning.md | 70 +++++++++++++++++++++++++++++++++++++++++++++++
 site/mkdocs.yml           |  2 ++
 2 files changed, 72 insertions(+)

diff --git a/site/docs/partitioning.md b/site/docs/partitioning.md
new file mode 100644
index 0000000..f5d0019
--- /dev/null
+++ b/site/docs/partitioning.md
@@ -0,0 +1,70 @@
+
+## What is partitioning?
+
+Partitioning is a way to make queries faster by grouping similar rows together 
when writing.
+
+For example, queries for log entries from a `logs` table would usually include 
a time range, like this query for logs between 10 and 12 AM:
+
+```sql
+SELECT level, message FROM logs
+WHERE event_time BETWEEN '2018-12-01 10:00:00' AND '2018-12-01 12:00:00'
+```
+
+Configuring the `logs` table to partition by the date of `event_time` will 
group log events into files with the same event date. Iceberg keeps track of 
that date and will use it to skip files for other dates that don't have useful 
data.
+
+Iceberg can partition timestamps by year, month, day, and hour granularity. It 
can also use a categorical column, like `level` in this logs example, to store 
rows together and speed up queries.
+
+
+## What does Iceberg do differently?
+
+Other tables formats like Hive support partitioning, but Iceberg supports 
*hidden partitioning*.
+
+* Iceberg handles the tedious and error-prone task of producing partition 
values for rows in a table.
+* Iceberg avoids reading unnecessary partitions automatically. Consumers don't 
need to know how the table is partitioned and add extra filters to their 
queries.
+* Iceberg partition layouts can evolve as needed.
+
+### Partitioning in Hive
+
+To demonstrate the difference, consider how Hive would handle a `logs` table.
+
+In Hive, partitions are explicit and appear as a column, so the `logs` table 
would have a column called `event_date`. When writing, an insert needs to 
supply the data for the `event_date` column:
+
+```sql
+INSERT INTO logs PARTITION (event_date)
+  SELECT level, message, event_time, format_time(event_time, 'YYYY-MM-dd')
+  FROM unstructured_log_source
+```
+
+Similarly, queries that search through the `logs` table must have an 
`event_date` filter in addition to an `event_time` filter.
+
+```sql
+SELECT level, count(1) as count FROM logs
+WHERE event_time BETWEEN '2018-12-01 10:00:00' AND '2018-12-01 12:00:00'
+  AND event_date = '2018-12-01'
+```
+
+If the `event_date` filter were missing, Hive would scan through every file in 
the table because it doesn't know that the `event_time` column is related to 
the `event_date` column.
+
+### Problems with Hive partitioning
+
+Hive must be given partition values. In the logs example, it doesn't know the 
relationship bewteen `event_time` and `event_date`.
+
+This leads to several problems:
+
+* Hive can't validate partition values -- it is up to the writer to produce 
the correct value
+    - Using the wrong format, `2018-12-01` instead of `20181201`, produces 
silently incorrect results, not query failures
+    - Using the wrong source column, like `processing_time`, or time zone also 
causes incorrect results, not failures
+* It is up to the user to write queries correctly
+    - Using the wrong format also leads to silently incorrect results
+    - Users that don't understand a table's physical layout get needlessly 
slow queries -- Hive can't translate filters automatically
+* Working queries are tied to the table's partitioning scheme, so partitioning 
configuration cannot be changed without breaking queries
+
+### Iceberg's hidden partitioning
+
+Iceberg produces partition values by taking a column value and optionally 
transforming it. Iceberg is responsible for converting `event_time` into 
`event_date`, and keeps track of the relationship.
+
+Table partitioning is configured using these relationships. The `logs` table 
would be partitioned by `date(event_time)` and `level`.
+
+Because Iceberg doesn't require user-maintained partition columns, it can hide 
partitioning. Partition values are produced correctly every time and always 
used to speed up queries, when possible. Producers and consumers wouldn't even 
see `event_date`.
+
+Most importantly, queries no longer depend on a table's physical layout. With 
a separation between physical and logical, Iceberg tables can evolve partition 
schemes over time as data volume changes. Misconfigured tables can be fixed 
without an expensive migration.
diff --git a/site/mkdocs.yml b/site/mkdocs.yml
index ab88811..f585536 100644
--- a/site/mkdocs.yml
+++ b/site/mkdocs.yml
@@ -33,3 +33,5 @@ markdown_extensions:
 nav:
   - About: index.md
   - Community: community.md
+  - Docs:
+    - partitioning.md

[incubator-iceberg] 01/02: Add doc about hidden partitioning.

Reply via email to