theogaraj opened a new issue, #45018:
URL: https://github.com/apache/arrow/issues/45018
### Describe the usage question you have. Please include as many useful details as possible.
NOTE: I posted this to StackOverflow a few days ago, reposting here for
more focused attention of the Arrow team. Apologies if this breaches protocol.
My code and observations are based on `pyarrow` version `18.0.0`.
**Question**: is there some internally imposed upper limit on the
`min_rows_per_group` parameter of `pyarrow.dataset.write_dataset`?
I have a number of parquet files that I am trying to coalesce into larger
files with larger row-groups.
#### With write_dataset (doesn't work as expected)
To do so, I am iterating over the files and using
`pyarrow.parquet.ParquetFile.iter_batches` to yield record batches, which form
the input to `pyarrow.dataset.write_dataset`. I would like to write row-groups
of ~9 million rows.
```python
import pyarrow.parquet as pq
import pyarrow.dataset as ds

def batch_generator():
    source = UPath(INPATH)
    for f in source.glob('*.parquet'):
        with pq.ParquetFile(f) as pf:
            yield from pf.iter_batches(batch_size=READ_BATCH_SIZE)

batches = batch_generator()

ds.write_dataset(
    batches,
    OUTPATH,
    format='parquet',
    schema=SCHEMA,
    min_rows_per_group=WRITE_BATCH_SIZE,
    max_rows_per_group=WRITE_BATCH_SIZE,
    existing_data_behavior='overwrite_or_ignore'
)
```
**Notes:**
- `UPath` is from `universal-pathlib` and is essentially the equivalent of `pathlib.Path`, but works for both local and cloud storage.
- `WRITE_BATCH_SIZE` is set to ~9.1M
- `READ_BATCH_SIZE` is a smaller value, ~12k, to keep the memory footprint low while reading. (A sketch of these definitions follows below.)
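
For context, here is roughly what the definitions referenced above look like; the paths are placeholders and the exact values are approximate:

```python
from upath import UPath          # universal-pathlib
import pyarrow.parquet as pq

INPATH = 'data/in'               # placeholder: directory containing the source parquet files
OUTPATH = 'data/out'             # placeholder: output location
READ_BATCH_SIZE = 12_000         # small read batches to keep memory usage low
WRITE_BATCH_SIZE = 9_100_000     # desired row-group size, ~9.1M rows

# Schema taken from the first input file
with next(UPath(INPATH).glob('*.parquet')).open('rb') as f:
    SCHEMA = pq.read_schema(f)
```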
When I check my output (using `pyarrow.parquet.read_metadata`) I see that I
have a single file (as desired) but row-groups are no larger than ~2.1M.
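
For reference, the check is something like the following (the file name is whatever `write_dataset` produced; `part-0.parquet` is the default):

```python
import pyarrow.parquet as pq

md = pq.read_metadata(f'{OUTPATH}/part-0.parquet')  # default basename from write_dataset
print(md.num_row_groups)
print([md.row_group(i).num_rows for i in range(md.num_row_groups)])
```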
#### Alternative approach (works as expected)
I then tried this alternative approach to coalescing the row-groups (adapted
from [this
gist](https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c)):
```python
import pyarrow as pa
import pyarrow.parquet as pq

def batch_generator():
    source = UPath(INPATH)
    for f in source.glob('*.parquet'):
        with pq.ParquetFile(f) as pf:
            yield from pf.iter_batches(batch_size=READ_BATCH_SIZE)

def write_parquet(outpath: UPath, tables, schema):
    with pq.ParquetWriter(outpath, schema) as writer:
        for table in tables:
            writer.write_table(table, row_group_size=WRITE_BATCH_SIZE)

def aggregate_tables():
    # Accumulate small tables until adding the next one would exceed
    # WRITE_BATCH_SIZE, then emit them as a single concatenated table.
    tables = []
    current_size = 0
    for batch in batch_generator():
        table = pa.Table.from_batches([batch])
        this_size = table.num_rows
        if current_size + this_size > WRITE_BATCH_SIZE:
            yield pa.concat_tables(tables)
            tables = []
            current_size = 0
        tables.append(table)
        current_size += this_size
    if tables:
        yield pa.concat_tables(tables)

write_parquet(OUTPATH, aggregate_tables(), SCHEMA)
```
`batch_generator` remains the same as before. The key difference here is that I am manually combining the record batches it yields into larger tables with `pa.concat_tables()` before writing them out.
My output file has one row-group of ~9M, and a second row-group of ~3M,
which is what I expect and want.
So my question is: why does `pyarrow.dataset.write_dataset` not write row-groups up to the requested `min_rows_per_group`, but instead seems to cap them at ~2.1M rows?
### Component(s)
Parquet, Python