Weston Pace created ARROW-17593:
-----------------------------------
Summary: [C++] Try and maintain input shape in Acero
Key: ARROW-17593
URL: https://issues.apache.org/jira/browse/ARROW-17593
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Weston Pace
Data is scanned in large chunks based on the format. For example, CSV scans
chunks based on a chunk_size, while Parquet scans entire row groups.
Then, upon entry into Acero, these chunks are sliced into morsels (roughly L3
cache size) for parallelism and into batches (roughly L1/L2 cache size) for
cache-efficient processing.
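The chunk -> morsel -> batch slicing described above can be sketched in Python
(the row counts here are made-up illustrations, not Acero's actual constants):

```python
# Hypothetical sizes; in Acero, morsels and batches are sized to fit
# the L3 and L1/L2 CPU caches respectively.
MORSEL_ROWS = 4096   # assumption: rows per morsel (unit of parallelism)
BATCH_ROWS = 512     # assumption: rows per batch (cache-friendly unit)

def slice_ranges(total_rows, step):
    """Yield (offset, length) slices covering total_rows."""
    for start in range(0, total_rows, step):
        yield start, min(step, total_rows - start)

# A scanned chunk (e.g. one Parquet row group) is split into morsels,
# and each morsel is further split into batches.
chunk_rows = 10_000
morsels = list(slice_ranges(chunk_rows, MORSEL_ROWS))
batches = [
    (m_off + b_off, b_len)
    for m_off, m_len in morsels
    for b_off, b_len in slice_ranges(m_len, BATCH_ROWS)
]
```

Note that the slicing only produces (offset, length) pairs; like Acero's
slices, it implies views into the scanned chunk rather than copies.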
However, as currently implemented, the output of Acero is a stream of these
tiny batches, which is undesirable in many cases.
For example, a pyarrow user calling pq.read_table might expect to get one
batch per row group. If they then turn around and write that table out to a
new Parquet file, they either end up with a non-ideal Parquet file (tiny
row groups) or are forced to concatenate the batches (an allocation + copy).
Even if the user is doing their own streaming processing (e.g. in pyarrow),
these small batch sizes are undesirable: per-batch Python overhead means that
streaming processing should be done in larger batches.
Instead, there should be a configurable max_batch_size, independent of row
group size and morsel size, that is quite large by default (e.g. 1Mi or 64Mi
rows). This control exists for users that want to do their own streaming
processing and need to be able to tune for RAM usage.
Acero will read in data based on the format, as it does today (e.g. CSV chunk
size, row group size). If the source data is very large (bigger than
max_batch_size), it will be sliced. From that point on, any morsels or batches
should simply be views into this larger output batch. For example, when doing
a projection to add a new column, we should allocate a max_batch_size array and
then populate it over many runs of the project node.
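A rough Python sketch of the proposed approach, assuming a hypothetical
max_batch_size and a trivial x + 1 projection; the real implementation would
live in C++ inside Acero:

```python
from array import array

MAX_BATCH_ROWS = 1_000   # hypothetical max_batch_size
BATCH_ROWS = 128         # small cache-friendly execution-batch size

source = array("q", range(MAX_BATCH_ROWS))

# One large allocation up front for the projected column, sized to
# max_batch_size rather than to the tiny execution batches.
output = array("q", bytes(8 * MAX_BATCH_ROWS))
out_view = memoryview(output)

# Each run of the "project node" writes through a zero-copy view
# (memoryview slice) into the large output batch; no per-batch allocation.
for start in range(0, MAX_BATCH_ROWS, BATCH_ROWS):
    stop = min(start + BATCH_ROWS, MAX_BATCH_ROWS)
    dest = out_view[start:stop]
    for i in range(start, stop):
        dest[i - start] = source[i] + 1
```

The key point is that the small batches are views (offset + length) into one
large output allocation, so the node's output keeps the input's shape instead
of emitting many tiny independent allocations.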
--
This message was sent by Atlassian Jira
(v8.20.10#820010)