rok commented on issue #34510: URL: https://github.com/apache/arrow/issues/34510#issuecomment-2109768275
> What if applications would use custom metadata to hold the schema and tensor type while writing only the storage values (floats, for example) in Parquet files? It would need some custom logic to reconstruct the tensor when reading, but it might be a good alternative (the buffers should still be the same after the read, not copied).

I think `FixedShapeTensor` gets stored as `FixedSizeList` plus some metadata, so the overhead comes from storing `FixedSizeList`. I'm not sure, but maybe there's a clean way to cast `FixedSizeList` to `FixedSizeBinary` (or similar) when writing `FixedShapeTensor`, and to do the inverse on reading. I don't think we have a clean option here, though. Given the current activity in the Parquet community, it might be worth proposing that `FixedSizeList` be added to Parquet.

I also wonder whether optimized take (https://github.com/apache/arrow/issues/39798) would improve the performance somewhat once all the PRs land.
