[GitHub] [iceberg] maxdebayser commented on pull request #7831: Compute parquet stats

via GitHub Tue, 13 Jun 2023 16:04:45 -0700


maxdebayser commented on PR #7831:
URL: https://github.com/apache/iceberg/pull/7831#issuecomment-1590005818

@Fokko, I understand your concern. I think it comes from us having different
use cases in mind.

If I understand correctly, you want to write a pyarrow.Table to a partitioned
dataset with write_dataset. Computing min/max on the whole Table is therefore
not what you need, because you actually need the min/max of each column for the
individual files. (Just pointing out that the metadata collector gives you the
stats per row group, so you'll still have to compute the file-level stats
from those.)

I'm coming from a different use case. I would like to write from Ray using
something like
https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html#ray.data.Dataset.write_parquet
. In this case there is no global pyarrow.Table that represents the dataset;
pyarrow tables are the blocks of the dataset that each individual Ray task
sees, for example in `map_batches`. In this scenario
pyarrow.dataset.write_dataset cannot be used because the full dataset is never
entirely loaded into the memory of any single compute node. The GIL is also not
a big concern here because Ray uses multiple worker processes.

I think we have to see whether there is a way to have a single API for both use
cases, or whether we'll need a separate API call for each. If we end up with
separate calls, it would be better to share part of the implementation so that
the behavior stays consistent, although that could perhaps hurt performance.

Regarding efficiency, the pyarrow.compute.min function is implemented in
C++, so I think performance is probably not a huge concern here. But I can
try to compare both approaches on a large enough data set to measure it.
