wjones127 commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770946135
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each Parquet file contains the schema and
+row group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. Partitioning this finely often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too-fine
+partitioning can cause problems here: partitioning a dataset by date for a year's worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
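To make the file-count arithmetic in the excerpt concrete, here is a minimal sketch of writing a partitioned Parquet dataset with the C++ Datasets API, closely following the patterns already used in this doc. The `WritePartitioned` helper and the "year"/"month" column names are illustrative, not part of the patch; every distinct combination of partition values becomes its own directory with at least one file.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/file_base.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/dataset/scanner.h>
#include <arrow/filesystem/filesystem.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// Write `table` as a Parquet dataset partitioned by "year" and "month".
// Each distinct (year, month) pair becomes a directory containing at least
// one file, so the file count grows with the product of the cardinalities.
arrow::Status WritePartitioned(const std::shared_ptr<arrow::Table>& table,
                               const std::shared_ptr<fs::FileSystem>& filesystem,
                               const std::string& base_dir) {
  // Hive-style layout, e.g. base_dir/year=2021/month=12/part0.parquet
  auto partitioning = std::make_shared<ds::HivePartitioning>(
      arrow::schema({arrow::field("year", arrow::int16()),
                     arrow::field("month", arrow::int8())}));
  auto format = std::make_shared<ds::ParquetFileFormat>();

  auto dataset = std::make_shared<ds::InMemoryDataset>(table);
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());

  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = base_dir;
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  return ds::FileSystemDataset::Write(write_options, scanner);
}
```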
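The discovery side, where the listing cost under discussion shows up, might look like the following sketch. The recursive `FileSelector` is what drives the per-directory list calls, so each extra partition level multiplies round trips on an object store; the `DiscoverDataset` helper name is again illustrative.

```cpp
#include <arrow/api.h>
#include <arrow/dataset/dataset.h>
#include <arrow/dataset/discovery.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/dataset/partition.h>
#include <arrow/filesystem/filesystem.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// Discover a partitioned dataset; the recursive selector walks every
// partition directory, issuing one "list directory" call per directory.
arrow::Result<std::shared_ptr<ds::Dataset>> DiscoverDataset(
    const std::shared_ptr<fs::FileSystem>& filesystem,
    const std::string& base_dir) {
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;  // descend into year=.../month=... directories

  ds::FileSystemFactoryOptions options;
  // Parse directory names like "year=2021" back into partition columns.
  options.partitioning = ds::HivePartitioning::MakeFactory();

  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(
          filesystem, selector, std::make_shared<ds::ParquetFileFormat>(),
          options));
  return factory->Finish();
}
```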
Review comment:
Yeah, I'm probably wrong about that then. I've heard this idea in the Spark context, but it might not be the recursive part so much as pagination through a web API that makes this slow. I will remove this.