shanjixi commented on issue #36608: URL: https://github.com/apache/arrow/issues/36608#issuecomment-1645368364
> Supported. A TFRecord file is just a file containing multiple rows. In practice, we use a converter to transform a (compressed) TFRecord file into an arrow::Table. We then apply the Arrow filter/take/list_flatten functions before sending the data to a TensorFlow worker. Arrow is used to read the data from HDFS, do the further processing, and avoid the memory-copy cost between those two steps. What's more, within our company both Hadoop-ZSTD and Hadoop-SNAPPY compressed files can be read with Arrow directly (we call these readers zstd_decompress_inputstream and snappy_decompress_inputstream).
