[PR] GH-3080: HadoopStreams to support ByteBufferPositionedReadable [parquet-java]

via GitHub Tue, 03 Dec 2024 12:46:08 -0800


steveloughran opened a new pull request, #3096:
URL: https://github.com/apache/parquet-java/pull/3096


   ### Rationale for this change
   
   If a stream declares in its StreamCapabilities that it supports
   `ByteBufferPositionedReadable`, then use that API for
   `readFully(ByteBuffer)`
   
   ```
   ByteBufferPositionedReadable.readFully(long position, ByteBuffer buf)
   ```
   
   Adding support for Hadoop `ByteBufferPositionedReadable` streams may improve 
performance
   by pushing retry/recovery logic into the filesystem client library.
   
   This interface is implemented by the HDFS input stream; we are considering 
adding
   it elsewhere.
   
   ### What changes are included in this PR?
   
   * New SeekableInputStream implementation: `H3ByteBufferInputStream`
   * Instantiated in `HadoopStreams` if the `FSDataInputStream` is considered 
suitable.
   * Tests for the new behavior and that no regressions are caused.
   
   #### Class `H3ByteBufferInputStream`
   
   The reading is done in a new class, `H3ByteBufferInputStream`, which 
subclasses `H2ByteBufferInputStream`. This reduces the amount of duplicate 
code, it just makes it a bit unclean.
   
   The purist way to do it would be to create an abstract superclass 
`HadoopInputStream` to hold all commonality between the the three input streams.
   
   I'm happy to do this, just didn't want to doing some larger refactoring 
without (a) showing the core design worked and (b) getting permission to do 
this. Should I do this?
   
   #### `HadoopStreams` changes
   
   Selection of the new input stream is done if and only if the stream declares 
the capability `in:preadbytebuffer`.
   There is no equivalent of `isWrappedStreamByteBufferReadable()` which 
recurses through
   a chain of wrapped streams looking for the API.
   If a stream doesn't declare its support for the API, it won't get picked up.
   This is done knowing that the sole production implemenation which currently 
exists, 
   the HDFS input stream, does declare this capability.
   
   ### Are these changes tested?
   
   There is new test suite, for new behavior and ensuring that the integration 
with
   HadoopStreams still retains the correct behavior for existing streams.
   Suite is parameterized on heap and direct buffers.
   
   ### Are there any user-facing changes?
   
   No
   
   Closes GH-3080


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] GH-3080: HadoopStreams to support ByteBufferPositionedReadable [parquet-java]

Reply via email to