westonpace commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770936440
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
+
+The most optimal partitioning layout will depend on your data, access patterns, and which
+systems will be reading the data. Most systems, including Arrow, should work across a
+range of file sizes and partitioning layouts, but there are extremes you should avoid. To
+avoid pathological behavior, keep to these guidelines:
Review comment:
I like this idea but the wording makes it sound like you will always avoid bad
behavior if you follow these rules. It's entirely possible to design a 50MB
file with poor row groupings (although you do discuss that in the next
paragraph). Maybe we could word it as "These guidelines can help avoid some
known worst case situations" or something like that.
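For concreteness, here is a minimal sketch of the write path this section describes, using the C++ dataset API; the scanner, filesystem, output path, and partition columns are assumptions for illustration, not taken from the documented example:
```cpp
// Minimal sketch: write a Hive-partitioned dataset with the C++ dataset API.
// The scanner, filesystem, output path, and partition columns are assumed
// purely for illustration.
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status WritePartitionedDataset(std::shared_ptr<ds::Scanner> scanner,
                                      std::shared_ptr<fs::FileSystem> filesystem) {
  // Partition on "year" and "category": every distinct (year, category) pair
  // becomes its own directory, e.g. base_dir/year=2021/category=a/part0.parquet.
  auto partition_schema = arrow::schema({arrow::field("year", arrow::int16()),
                                         arrow::field("category", arrow::utf8())});
  auto partitioning = std::make_shared<ds::HivePartitioning>(partition_schema);

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemDatasetWriteOptions write_options;
  write_options.file_write_options = format->DefaultWriteOptions();
  write_options.filesystem = filesystem;
  write_options.base_dir = "/tmp/partitioned_dataset";  // hypothetical path
  write_options.partitioning = partitioning;
  write_options.basename_template = "part{i}.parquet";
  return ds::FileSystemDataset::Write(write_options, scanner);
}
```
Each distinct combination of partition values becomes its own directory with at least one file in it, which is where the 365 × 1,000 blow-up described in the text comes from.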
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
Review comment:
Ironically I think this is actually worse on local filesystems than it
is on S3. S3 supports a recursive query (and I'm pretty sure we use it), so we
only actually do a single list directory call. Maybe just drop the `These
operations may be...such as S3.` line.
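For reference, discovery goes through a `FileSelector` with `recursive = true`; a minimal sketch, assuming a Hive-style layout and an already-constructed filesystem (the base directory and partition columns are illustrative):
```cpp
// Minimal sketch: discover a partitioned dataset. The base directory, Hive
// layout, and filesystem are assumptions for illustration.
#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Result<std::shared_ptr<ds::Dataset>> DiscoverDataset(
    std::shared_ptr<fs::FileSystem> filesystem, const std::string& base_dir) {
  // Discovery walks everything under base_dir; on a local filesystem that is
  // one directory listing per partition directory, while an object store like
  // S3 can answer the recursive selector with far fewer calls (per the comment
  // above, often just one).
  fs::FileSelector selector;
  selector.base_dir = base_dir;
  selector.recursive = true;

  ds::FileSystemFactoryOptions options;
  // Turn year=.../category=... directory names back into partition columns.
  options.partitioning = ds::HivePartitioning::MakeFactory();

  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(auto factory, ds::FileSystemDatasetFactory::Make(
                                          filesystem, selector, format, options));
  return factory->Finish();
}
```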
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
+
+The most optimal partitioning layout will depend on your data, access patterns, and which
+systems will be reading the data. Most systems, including Arrow, should work across a
+range of file sizes and partitioning layouts, but there are extremes you should avoid. To
+avoid pathological behavior, keep to these guidelines:
+
+ * Avoid files smaller than 20MB and larger than 2GB
+ * Avoid partitioning layouts with more than 10,000 distinct partitions.
+
+For file formats that have a notion of groups within a file, such as Parquet, similar
+guidelines apply. Row groups can provide parallelism when reading and allow data skipping
+based on statistics, but very small groups can cause metadata to be a significant portion
+of file size. Arrow's file writer provide sensible defaults for group sizing in most cases.
Review comment:
```suggestion
of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
```
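Related to the defaults mentioned in that last line, a minimal sketch of overriding the Parquet row group size through the writer properties when writing a dataset; the specific cap is an arbitrary illustrative value, and the default is usually fine:
```cpp
// Minimal sketch: cap Parquet row group size when writing a dataset. The cap
// below is an arbitrary illustrative value; the defaults are usually sensible.
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <parquet/properties.h>

namespace ds = arrow::dataset;

std::shared_ptr<ds::FileWriteOptions> MakeParquetWriteOptions() {
  auto format = std::make_shared<ds::ParquetFileFormat>();
  auto parquet_options = std::static_pointer_cast<ds::ParquetFileWriteOptions>(
      format->DefaultWriteOptions());
  // Keep row groups around one million rows: large enough that per-group
  // statistics stay a small fraction of the file, small enough to allow
  // skipping groups based on those statistics.
  parquet_options->writer_properties =
      parquet::WriterProperties::Builder().max_row_group_length(1 << 20)->build();
  return parquet_options;
}
```
The result would be assigned to `FileSystemDatasetWriteOptions::file_write_options` in place of the default write options.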
##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,46 @@ altogether if they do not match the filter:
:linenos:
:lineno-match:
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets has two aspects that affect performance: it increases the number of
+files and it creates a directory structure around the files. Both of these have benefits
+as well as costs. Depending on the configuration and the size of your dataset, the costs
+can outweigh the benefits.
+
+Because partitions split up the dataset into multiple files, partitioned datasets can be
+read and written with parallelism. However, each additional file adds a little overhead in
+processing for filesystem interaction. It also increases the overall dataset size since
+each file has some shared metadata. For example, each parquet file contains the schema and
+group-level statistics. The number of partitions is a floor for the number of files. If
+you partition a dataset by date with a year of data, you will have at least 365 files. If
+you further partition by another dimension with 1,000 unique values, you will have up to
+365,000 files. This fine of partitioning often leads to small files that mostly consist of
+metadata.
+
+Partitioned datasets create nested folder structures, and those allow us to prune which
+files are loaded in a scan. However, this adds overhead to discovering files in the dataset,
+as we'll need to recursively "list directory" to find the data files. These operations may
+be particularly expensive if you are using an object store filesystem such as S3. Too fine
+partitions can cause problems here: Partitioning a dataset by date for a years worth
+of data will require 365 list calls to find all the files; adding another column with
+cardinality 1,000 will make that 365,365 calls.
+
+The most optimal partitioning layout will depend on your data, access patterns, and which
+systems will be reading the data. Most systems, including Arrow, should work across a
+range of file sizes and partitioning layouts, but there are extremes you should avoid. To
+avoid pathological behavior, keep to these guidelines:
Review comment:
```suggestion
avoid worst case behavior, keep to these guidelines:
```