wjones127 commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770946718
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases
+the number of files and it creates a directory structure around the files.
+Both of these have benefits as well as costs. Depending on the configuration
+and the size of your dataset, the costs can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned
+datasets can be read and written with parallelism. However, each additional
+file adds a little overhead in processing for filesystem interaction. It also
+increases the overall dataset size since each file has some shared metadata.
+For example, each Parquet file contains the schema and row-group-level
+statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least
+365 files. If you further partition by another dimension with 1,000 unique
+values, you will have up to 365,000 files. Partitioning this finely often
+leads to small files that mostly consist of metadata.
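+
+For illustration, here is a minimal C++ sketch of writing such a partitioned
+dataset (the table, the ``/tmp/partitioned_dataset`` path, and the ``year``
+column are hypothetical, and error handling is abbreviated):
+
+.. code-block:: cpp
+
+   #include <arrow/api.h>
+   #include <arrow/dataset/api.h>
+   #include <arrow/filesystem/api.h>
+
+   namespace ds = arrow::dataset;
+
+   // Hypothetical: `table` is an existing arrow::Table with a "year" column.
+   // Each distinct year becomes a directory (year=2021/...) holding files.
+   arrow::Status WritePartitioned(const std::shared_ptr<arrow::Table>& table) {
+     auto filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
+     auto partitioning = std::make_shared<ds::HivePartitioning>(
+         arrow::schema({arrow::field("year", arrow::int32())}));
+
+     auto format = std::make_shared<ds::ParquetFileFormat>();
+     ds::FileSystemDatasetWriteOptions write_options;
+     write_options.file_write_options = format->DefaultWriteOptions();
+     write_options.filesystem = filesystem;
+     write_options.base_dir = "/tmp/partitioned_dataset";
+     write_options.partitioning = partitioning;
+     write_options.basename_template = "part{i}.parquet";
+
+     // Scan the in-memory table and stream it out to partitioned files.
+     auto dataset = std::make_shared<ds::InMemoryDataset>(table);
+     ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
+     ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
+     return ds::FileSystemDataset::Write(write_options, scanner);
+   }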
+
+Partitioned datasets create nested folder structures, and those allow us to
+prune which files are loaded in a scan. However, this adds overhead to
+discovering files in the dataset, as we'll need to recursively "list
+directory" to find the data files. These operations can be particularly
+expensive on an object store filesystem such as S3. Partitions that are too
+fine can cause problems here: partitioning a dataset by date for a year's
+worth of data will require 365 list calls to find all the files; adding
+another column with cardinality 1,000 will make that 365,365 calls.
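+
+As a sketch of the discovery side (again assuming the hypothetical layout
+above), the recursive listing happens when the dataset factory inspects the
+base directory, and a filter on the partition column lets the scan skip whole
+directories:
+
+.. code-block:: cpp
+
+   #include <arrow/api.h>
+   #include <arrow/compute/api.h>
+   #include <arrow/dataset/api.h>
+   #include <arrow/filesystem/api.h>
+
+   namespace cp = arrow::compute;
+   namespace ds = arrow::dataset;
+
+   arrow::Result<std::shared_ptr<arrow::Table>> ScanOneYear() {
+     auto filesystem = std::make_shared<arrow::fs::LocalFileSystem>();
+
+     // Recursive listing: on an object store, each nested directory here
+     // costs one of the "list" calls discussed above.
+     arrow::fs::FileSelector selector;
+     selector.base_dir = "/tmp/partitioned_dataset";  // hypothetical path
+     selector.recursive = true;
+
+     ds::FileSystemFactoryOptions options;
+     options.partitioning = ds::HivePartitioning::MakeFactory();
+
+     auto format = std::make_shared<ds::ParquetFileFormat>();
+     ARROW_ASSIGN_OR_RAISE(
+         auto factory,
+         ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
+                                            options));
+     ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
+
+     // Files whose year=... directory doesn't match the filter are pruned
+     // without being opened.
+     ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
+     ARROW_RETURN_NOT_OK(scanner_builder->Filter(
+         cp::equal(cp::field_ref("year"), cp::literal(2021))));
+     ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
+     return scanner->ToTable();
+   }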
+
+The optimal partitioning layout will depend on your data, access patterns,
+and which systems will be reading the data. Most systems, including Arrow,
+should work across a range of file sizes and partitioning layouts, but there
+are extremes you should avoid. To avoid pathological behavior, keep to these
+guidelines:
Review comment:
That's a fair point. I can soften the language.