westonpace commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770936440
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
+
+The most optimal partitioning layout will depend on your data, access patterns, and which
+systems will be reading the data. Most systems, including Arrow, should work across a
+range of file sizes and partitioning layouts, but there are extremes you should avoid. To
+avoid pathological behavior, keep to these guidelines:
Review comment:
I like this idea but the wording makes it sound like you will always avoid bad
behavior if you follow these rules. It's entirely possible to design a 50MB
file with poor row groupings (although you do discuss that in the next
paragraph). Maybe we could word it as "These guidelines can help avoid some
known worst case situations" or something like that.
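For concreteness, here is a minimal sketch of the write path this section describes, using the C++ dataset API; the scanner, filesystem, output path, and partition columns are assumptions for illustration, not taken from the documented example:
```cpp
// Minimal sketch: write a Hive-partitioned dataset with the C++ dataset API.
// The scanner, filesystem, output path, and partition columns are assumed
// purely for illustration.
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WritePartitionedDataset(std::shared_ptr<ds::Scanner> scanner,
                                      std::shared_ptr<fs::FileSystem> filesystem) {
  // Partition on "year" and "category": every distinct (year, category) pair
  // becomes its own directory, e.g. base_dir/year=2021/category=a/part0.parquet.
  auto partition_schema = arrow::schema({arrow::field("year", arrow::int16()),
                                         arrow::field("category", arrow::utf8())});
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = "/tmp/partitioned_dataset";  // hypothetical path
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  return ds::FileSystemDataset::Write(write_options, scanner);
}
```
Each distinct combination of partition values becomes its own directory with at least one file in it, which is where the 365 × 1,000 blow-up described in the text comes from.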
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
Review comment:
Ironically I think this is actually worse on local filesystems than it
is on S3. S3 supports a recursive query (and I'm pretty sure we use it), so we
only actually do a single list directory call. Maybe just drop the `These
operations may be...such as S3.` line.
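For reference, discovery goes through a `FileSelector` with `recursive = true`; a minimal sketch, assuming a Hive-style layout and an already-constructed filesystem (the base directory and partition columns are illustrative):
```cpp
// Minimal sketch: discover a partitioned dataset. The base directory, Hive
// layout, and filesystem are assumptions for illustration.
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Result<std::shared_ptr<ds::Dataset>> DiscoverDataset(
    std::shared_ptr<fs::FileSystem> filesystem, const std::string& base_dir) {
  // Discovery walks everything under base_dir; on a local filesystem that is
  // one directory listing per partition directory, while an object store like
  // S3 can answer the recursive selector with far fewer calls (per the comment
  // above, often just one).
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;

  ds::FileSystemFactoryOptions options;
  // Turn year=.../category=... directory names back into partition columns.
  options.partitioning = ds::HivePartitioning::MakeFactory();

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  return factory->Finish();
}
```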
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
+
+The most optimal partitioning layout will depend on your data, access patterns, and which
+systems will be reading the data. Most systems, including Arrow, should work across a
+range of file sizes and partitioning layouts, but there are extremes you should avoid. To
+avoid pathological behavior, keep to these guidelines:
+
+ * Avoid files smaller than 20MB and larger than 2GB
+ * Avoid partitioning layouts with more than 10,000 distinct partitions.
+
+For file formats that have a notion of groups within a file, such as Parquet, similar
+guidelines apply. Row groups can provide parallelism when reading and allow data skipping
+based on statistics, but very small groups can cause metadata to be a significant portion
+of file size. Arrow's file writer provide sensible defaults for group sizing in most cases.
Review comment:
```suggestion
of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
```
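Related to the defaults mentioned in that last line, a minimal sketch of overriding the Parquet row group size through the writer properties when writing a dataset; the specific cap is an arbitrary illustrative value, and the default is usually fine:
```cpp
// Minimal sketch: cap Parquet row group size when writing a dataset. The cap
// below is an arbitrary illustrative value; the defaults are usually sensible.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <parquet/properties.h>

namespace ds = arrow::dataset;

std::shared_ptr<ds::FileWriteOptions> MakeParquetWriteOptions() {
  auto format = std::make_shared<ds::ParquetFileFormat>();
  auto parquet_options = std::static_pointer_cast<ds::ParquetFileWriteOptions>(
      format->DefaultWriteOptions());
  // Keep row groups around one million rows: large enough that per-group
  // statistics stay a small fraction of the file, small enough to allow
  // skipping groups based on those statistics.
  parquet_options->writer_properties =
      parquet::WriterProperties::Builder().max_row_group_length(1 << 20)->build();
  return parquet_options;
}
```
The result would be assigned to `FileSystemDatasetWriteOptions::file_write_options` in place of the default write options.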
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
+
+The most optimal partitioning layout will depend on your data, access patterns, and which
+systems will be reading the data. Most systems, including Arrow, should work across a
+range of file sizes and partitioning layouts, but there are extremes you should avoid. To
+avoid pathological behavior, keep to these guidelines:
Review comment:
```suggestion
avoid worst case behavior, keep to these guidelines:
```