zeroshade commented on pull request #9862:
URL: https://github.com/apache/arrow/pull/9862#issuecomment-813062545


   Rebased as per suggestion to hopefully get a successful run there.
   
   As for the requested changes:
   
   1) Updated the `MaxCompressLen` function for ZSTD to explicitly handle 
signed vs unsigned to ensure no issues there.
   2) Is there a test case that exists in the integration archery tests which 
has the metadata say to use compression but passes uncompressed buffers? The 
code currently correctly handles 0 length and nil buffers, but I didn't see 
anything in the spec or config (or the C++ implementation) for having non-zero 
length buffers which are uncompressed being sent along side compressed ones. I 
guess I'm missing something?
   
   As far as the decoupling of compression/decompression and IO calls:
   For Decompression: The current code structure only decompresses as it loads 
the records but doesn't provide any decoupling between reading the record in 
with the IO calls and the loading of the flatbuffer data that I could put the 
decompression into. Currently it's just going to decompress in the same place 
that it would be loading the raw flatbuffer data into an `array.Record`.
   
   For Compression: 
   The compression happens during the payload encoding. So while the current 
code does not provide any hook that could be used to decouple encoding the 
payload and the IO to send it (the encoding happens right before it writes the 
record) if that decoupling were to be added, it is easy to do it as the 
compression is part of the payload encoding, not part of the IO to write the 
record out.
   
   The one optimization I can see here is that I could implement 
parallelization for the compression to attempt to compress the body buffers in 
parallel rather than serially like I'm doing right now.
   
   Thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to