wjones127 commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770713896
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,25 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets can improve performance when reading them, but it has several
+potential costs when reading and writing:
+
+#. It can significantly increase the number of files to write. The number of partitions is a
+   floor for the number of files in a dataset. If you partition a dataset by date with a
+   year of data, you will have at least 365 files. If you further partition by another
+   dimension with 1,000 unique values, you will have 365,000 files. This can make it slower
+   to write and increase the size of the overall dataset, because each file has some fixed
+   overhead. For example, each file in a Parquet dataset contains the schema.
+#. Multiple partitioning columns can produce deeply nested folder structures, which are slow
+   to navigate because they require many recursive "list directory" calls to discover files.
+   These operations may be particularly expensive if you are using an object store
+   filesystem such as S3. One workaround is to combine multiple columns into one for
+   partitioning. For example, instead of a scheme like /year/month/day/, use /YYYY-MM-DD/.
+
+
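A minimal sketch of the combined-column workaround above, using the C++ Datasets API: the dataset, filesystem, base directory, and the precomputed `date` column are assumptions for illustration, not part of the proposed doc change.

```cpp
// Hedged sketch: write a dataset partitioned on a single precomputed
// "date" column (values like "2021-12-16"), producing /date=YYYY-MM-DD/
// directories instead of a deeper /year=.../month=.../day=.../ tree.
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WriteByDate(std::shared_ptr<ds::Dataset> dataset,
                          std::shared_ptr<fs::FileSystem> filesystem,
                          const std::string& base_dir) {
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options =
      std::make_shared<ds::ParquetFileFormat>()->DefaultWriteOptions();
  write_options.filesystem = std::move(filesystem);
  write_options.base_dir = base_dir;
  write_options.basename_template = "part{i}.parquet";
  // A single partition column keeps the directory tree shallow, so readers
  // need fewer recursive "list directory" calls to discover files.
  write_options.partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("date", arrow::utf8())}));

  return ds::FileSystemDataset::Write(write_options, std::move(scanner));
}
```

Hive-style partitioning is shown here only because it keeps the column name in the path; `DirectoryPartitioning` would keep the tree equally shallow.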
Review comment:
Good point on those two advantages. I will add that to the discussion.
Row group settings are (somewhat strangely) in the generic
[FileSystemDatasetWriteOptions](https://github.com/apache/arrow/blob/331d3864fb524d276a86a31cd16cecbb67052bff/cpp/src/arrow/dataset/file_base.h#L385),
so it actually makes sense to discuss them around here.
I feel somewhat reluctant to specify overly exact rules of thumb for file size:
(1) they might change for Arrow over time as performance improvements are made;
(2) they may be very different depending on the use case (and compression, as you
point out); and (3) they may vary depending on who the reader is (it might be
Spark or something else rather than Arrow C++). What do you think about just
pointing out cases that are pathological? For example, partitioning down to file
sizes of less than a few MB means the overhead of the filesystem and the metadata
outweighs any filtering speedups, and partitioning up to file sizes of 2 GB+ often
means too little parallelism or out-of-memory errors.
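To make the row-group point concrete, here is a hedged sketch of those knobs on the write options (field names as in recent Arrow C++ releases; the values are purely illustrative, not recommended thresholds):

```cpp
// Hedged sketch: row-group sizing knobs live on the same generic write
// options struct. Values below are illustrative only.
#include <arrow/dataset/api.h>

namespace ds = arrow::dataset;

ds::FileSystemDatasetWriteOptions MakeWriteOptions() {
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.min_rows_per_group = 64 * 1024;       // avoid many tiny row groups
  write_options.max_rows_per_group = 1024 * 1024;     // cap row-group size
  write_options.max_rows_per_file = 4 * 1024 * 1024;  // split very large outputs
  return write_options;
}
```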