maxdebayser commented on PR #7831:
URL: https://github.com/apache/iceberg/pull/7831#issuecomment-1590005818

   @Fokko, I understand your concern. I think we see this differently because 
we have different use cases in mind.
   
   If I understand correctly, you want to write a pyarrow.Table to a partitioned 
dataset with write_dataset. Computing min/max on the whole Table is therefore 
not what you need, because you need the min/max of each column for each 
individual file. (Just pointing out that the metadata collector gives you the 
stats per row group, so you would still have to compute the file-level stats 
from those.)
   
   I'm coming from a different use case. I would like to write from Ray using 
something like 
https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.write_parquet.html#ray.data.Dataset.write_parquet
 . In this case there is no global pyarrow.Table that represents the dataset. 
The pyarrow Tables are the blocks of the dataset that each individual Ray task 
sees, for example in `map_batches`. In this scenario 
pyarrow.dataset.write_dataset cannot be used because the full dataset is never 
loaded entirely into the memory of any single compute node. The GIL is also 
not a big concern here because Ray uses multiple worker processes.
   
   I think we have to see whether a single API can serve both use cases or 
whether we need separate API calls for each. In the latter case it would be 
better to share part of the implementation so that the behavior stays 
consistent, although that could perhaps lead to bad performance.
   
   Regarding efficiency, the pyarrow.compute.min function is implemented in 
C++, so I think performance is probably not a huge concern here. But I can 
compare both approaches on a large enough dataset to measure it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

