wjones127 commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770946135
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each Parquet file contains the schema and
+row group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. Partitioning this finely often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too-fine
+partitioning can cause problems here: partitioning a dataset by date for a year's worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
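To make the file-count arithmetic in the excerpt concrete, here is a minimal sketch of writing a partitioned Parquet dataset with the C++ Datasets API, closely following the patterns already used in this doc. The `WritePartitioned` helper and the "year"/"month" column names are illustrative, not part of the patch; every distinct combination of partition values becomes its own directory with at least one file.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/dataset/scanner.h>
#include <arrow/filesystem/filesystem.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// Write `table` as a Parquet dataset partitioned by "year" and "month".
// Each distinct (year, month) pair becomes a directory containing at least
// one file, so the file count grows with the product of the cardinalities.
arrow::Status WritePartitioned(const std::shared_ptr<arrow::Table>& table,
                               const std::shared_ptr<fs::FileSystem>& filesystem,
                               const std::string& base_dir) {
  // Hive-style layout, e.g. base_dir/year=2021/month=12/part0.parquet
  auto partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("year", arrow::int16()),
                     arrow::field("month", arrow::int8())}));
  auto format = std::make_shared<ds::ParquetFileFormat>();

  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_dir;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  return ds::FileSystemDataset::Write(write_options, scanner);
}
```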
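The discovery side, where the listing cost under discussion shows up, might look like the following sketch. The recursive `FileSelector` is what drives the per-directory list calls, so each extra partition level multiplies round trips on an object store; the `DiscoverDataset` helper name is again illustrative.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/discovery.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/filesystem/filesystem.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// Discover a partitioned dataset; the recursive selector walks every
// partition directory, issuing one "list directory" call per directory.
arrow::Result<std::shared_ptr<ds::Dataset>> DiscoverDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;  // descend into year=.../month=... directories

  ds::FileSystemFactoryOptions options;
  // Parse directory names like "year=2021" back into partition columns.
  options.partitioning = ds::HivePartitioning::MakeFactory();

  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(
          filesystem, selector, std::make_shared<ds::ParquetFileFormat>(),
          options));
  return factory->Finish();
}
```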
Review comment:
Yeah, I'm probably wrong about that then. I've heard this idea in the Spark context, but it might not be the recursive part so much as pagination through a web API that makes this slow. I will remove this.