wjones127 commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770713896
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,25 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets can improve performance when reading them, but it has several
+potential costs when reading and writing:
+
+#. It can significantly increase the number of files to write. The number of partitions is a
+   floor for the number of files in a dataset. If you partition a dataset by date with a
+   year of data, you will have at least 365 files. If you further partition by another
+   dimension with 1,000 unique values, you will have 365,000 files. This can make it slower
+   to write and increase the size of the overall dataset, because each file has some fixed
+   overhead. For example, each file in a Parquet dataset contains the schema.
+#. Multiple partitioning columns can produce deeply nested folder structures, which are slow
+   to navigate because they require many recursive "list directory" calls to discover files.
+   These operations may be particularly expensive if you are using an object store
+   filesystem such as S3. One workaround is to combine multiple columns into one for
+   partitioning. For example, instead of a scheme like /year/month/day/, use /YYYY-MM-DD/.
+
+
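A minimal sketch of the combined-column workaround above, using the C++ Datasets API: the dataset, filesystem, base directory, and the precomputed `date` column are assumptions for illustration, not part of the proposed doc change.

```cpp
// Hedged sketch: write a dataset partitioned on a single precomputed
// "date" column (values like "2021-12-16"), producing /date=YYYY-MM-DD/
// directories instead of a deeper /year=.../month=.../day=.../ tree.
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WriteByDate(std::shared_ptr<ds::Dataset> dataset,
                          std::shared_ptr<fs::FileSystem> filesystem,
                          const std::string& base_dir) {
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options =
      std::make_shared<ds::ParquetFileFormat>()->DefaultWriteOptions();
  write_options.filesystem = std::move(filesystem);
  write_options.base_dir = base_dir;
  write_options.basename_template = "part{i}.parquet";
  // A single partition column keeps the directory tree shallow, so readers
  // need fewer recursive "list directory" calls to discover files.
  write_options.partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("date", arrow::utf8())}));

  return ds::FileSystemDataset::Write(write_options, std::move(scanner));
}
```

Hive-style partitioning is shown here only because it keeps the column name in the path; `DirectoryPartitioning` would keep the tree equally shallow.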
Review comment:
Good point on those two advantages. I will add that to the discussion.
Row group settings are (somewhat strangely) in the generic
[FileSystemDatasetWriteOptions](https://github.com/apache/arrow/blob/331d3864fb524d276a86a31cd16cecbb67052bff/cpp/src/arrow/dataset/file_base.h#L385),
so it actually makes sense to discuss them around here.
I feel somewhat reluctant to specify overly exact rules of thumb for file size:
(1) they might change for Arrow over time as performance improvements are made;
(2) they may be very different depending on the use case (and compression, as you
point out); and (3) they may vary depending on who the reader is (it might be
Spark or something else rather than Arrow C++). What do you think about just
pointing out cases that are pathological? For example, partitioning down to file
sizes of less than a few MB means the overhead of the filesystem and the metadata
outweighs any filtering speedups, and partitioning up to file sizes of 2 GB+ often
means too little parallelism or out-of-memory errors.
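To make the row-group point concrete, here is a hedged sketch of those knobs on the write options (field names as in recent Arrow C++ releases; the values are purely illustrative, not recommended thresholds):

```cpp
// Hedged sketch: row-group sizing knobs live on the same generic write
// options struct. Values below are illustrative only.
#include <arrow/dataset/api.h>

namespace ds = arrow::dataset;

ds::FileSystemDatasetWriteOptions MakeWriteOptions() {
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.min_rows_per_group = 64 * 1024;       // avoid many tiny row groups
  write_options.max_rows_per_group = 1024 * 1024;     // cap row-group size
  write_options.max_rows_per_file = 4 * 1024 * 1024;  // split very large outputs
  return write_options;
}
```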