[GitHub] [arrow-rs] tustvold opened a new issue, #1764: Optimized Writing of Arrow Byte Array to Parquet

GitBox Mon, 30 May 2022 03:27:05 -0700


tustvold opened a new issue, #1764:
URL: https://github.com/apache/arrow-rs/issues/1764


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   A significant amount of effort has been put into making the reading of byte 
arrays from parquet fast:
   
   * https://github.com/apache/arrow-rs/pull/1041
   * https://github.com/apache/arrow-rs/pull/1082
   * https://github.com/apache/arrow-rs/pull/1180
   
   We should invest some effort in making the writer performance comparable.
   
   **Describe the solution you'd like**
   
   Currently, writing byte array types from arrow involves three steps:
   
   * Any dictionaries are hydrated, i.e. expanded into their full values
   * Each value from a string array is separately allocated into a 
`Vec<ByteArray>`
   * These values are then written using the ColumnWriter
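
To make the cost concrete, here is a minimal std-only sketch contrasting the per-value allocation with encoding directly from borrowed slices. `StringBuffer`, `copy_out` and `encode_plain` are hypothetical names invented for illustration; they are not the arrow-rs or parquet APIs. The only format detail assumed is that Parquet PLAIN encoding of a byte array is a 4-byte little-endian length followed by the bytes.

```rust
/// Hypothetical stand-in for an Arrow string array: one contiguous value
/// buffer plus offsets, so value `i` is `values[offsets[i]..offsets[i + 1]]`.
struct StringBuffer {
    offsets: Vec<usize>,
    values: Vec<u8>,
}

impl StringBuffer {
    fn from_strs(strs: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut values = Vec::new();
        for s in strs {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len());
        }
        Self { offsets, values }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow value `i` without copying it.
    fn value(&self, i: usize) -> &[u8] {
        &self.values[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Models the second step above: every value gets its own heap allocation.
fn copy_out(buf: &StringBuffer) -> Vec<Vec<u8>> {
    (0..buf.len()).map(|i| buf.value(i).to_vec()).collect()
}

/// Alternative: encode straight from the borrowed slices.
/// Each value is written as a 4-byte little-endian length plus its bytes.
fn encode_plain(buf: &StringBuffer, out: &mut Vec<u8>) {
    for i in 0..buf.len() {
        let v = buf.value(i);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
}
```

With `copy_out` the number of allocations grows with the number of values, whereas `encode_plain` only appends to the single output buffer.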
   
   It would be a significant performance win to be able to elide these first 
two steps. This would likely involve much the same process as was followed for 
the reader:
   
   * Generify ColumnWriter to allow writing from different buffers
   * Add the ability to write from an arrow ByteArray directly
   * Add the ability to write from an arrow dictionary array directly
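
The three steps above could be sketched roughly as follows, using only the standard library. The `ByteValues` trait, `OffsetBuffer`, `DictBuffer` and `write_plain` are hypothetical names chosen for this example and do not reflect the actual ColumnWriter design; the sketch only shows how a writer that is generic over its input buffer can serve both plain and dictionary-encoded arrays without first materializing a `Vec` of owned values.

```rust
/// Hypothetical abstraction over "a buffer that yields byte-array values".
trait ByteValues {
    fn len(&self) -> usize;
    fn value(&self, i: usize) -> &[u8];
}

/// Offsets plus one contiguous buffer, in the style of an Arrow byte array.
struct OffsetBuffer {
    offsets: Vec<usize>,
    data: Vec<u8>,
}

impl OffsetBuffer {
    fn from_strs(strs: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for s in strs {
            data.extend_from_slice(s.as_bytes());
            offsets.push(data.len());
        }
        Self { offsets, data }
    }
}

impl ByteValues for OffsetBuffer {
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
    fn value(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }
}

/// Dictionary-encoded values: `keys[i]` indexes into `dict`.
struct DictBuffer {
    keys: Vec<usize>,
    dict: OffsetBuffer,
}

impl ByteValues for DictBuffer {
    fn len(&self) -> usize {
        self.keys.len()
    }
    /// Resolve the key on the fly instead of hydrating the whole array.
    fn value(&self, i: usize) -> &[u8] {
        self.dict.value(self.keys[i])
    }
}

/// Writer generic over the input buffer. Each value is emitted as a
/// 4-byte little-endian length followed by its bytes (PLAIN-style).
fn write_plain<V: ByteValues>(values: &V, out: &mut Vec<u8>) {
    for i in 0..values.len() {
        let v = values.value(i);
        out.extend_from_slice(&(v.len() as u32).to_le_bytes());
        out.extend_from_slice(v);
    }
}
```

In this sketch the dictionary path still emits plain-encoded values; a real implementation might go further and carry the dictionary and its keys through to the page encoder, which is outside what this example attempts.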
   
   **Describe alternatives you've considered**
   
   We could choose not to do this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

