ariel-miculas opened a new pull request, #20823:
URL: https://github.com/apache/datafusion/pull/20823

   ## Which issue does this PR close?
   
   
   - Closes #.
   
   ## Rationale for this change
   
   This is an alternative approach to
   https://github.com/apache/datafusion/pull/19687
   
   Instead of reading the entire range in the JSON FileOpener, implement an
   AlignedBoundaryStream that wraps the original stream returned by the
   ObjectStore and scans the range for newlines as the FileStream requests
   data from it.
   
   This eliminates the overhead of the two extra get_opts requests needed by
   calculate_range and, more importantly, allows efficient read-ahead
   implementations in the underlying ObjectStore. Previously this was
   inefficient because calculate_range opened one stream from (start - 1) to
   file_size and another from (end - 1) to end_of_file, just to find the two
   relevant newlines.
   
   
   ## What changes are included in this PR?
   Added the AlignedBoundaryStream, which wraps a stream returned by the object
   store and finds the delimiting newlines for a particular file range. Notably,
   it doesn't do any standalone reads (unlike the calculate_range function),
   eliminating two calls to get_opts.
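
The boundary-alignment idea can be sketched with a synchronous, std-only simplification. This is not the code from the PR: `aligned_bytes` and `chunks_from` are hypothetical names, the real AlignedBoundaryStream wraps an asynchronous stream from the object store, and the sketch assumes that the wrapped stream starts at byte `start - 1` (or at 0 for the first range) and that lines end with `\n`.

```rust
/// Hypothetical helper (not part of DataFusion): returns only the bytes of
/// lines whose first byte lies in `[start, end)`.
/// Assumption: `chunks` begins at file offset `start.saturating_sub(1)` and
/// may continue past `end`; it is consumed lazily, in a single pass.
fn aligned_bytes<I>(chunks: I, start: usize, end: usize) -> Vec<u8>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut out = Vec::new();
    // Absolute file offset of the next byte to examine.
    let mut pos = start.saturating_sub(1);
    // Unless the range starts at offset 0, skip up to and including the first
    // newline: those bytes belong to a line owned by the previous range.
    let mut skipping = start > 0;
    let mut at_line_start = start == 0;

    for chunk in chunks {
        for &byte in &chunk {
            let at = pos;
            pos += 1;
            if skipping {
                if byte == b'\n' {
                    skipping = false;
                    at_line_start = true;
                }
                continue;
            }
            // A line that begins at or after `end` belongs to the next range,
            // so stop pulling data from the underlying stream here.
            if at_line_start && at >= end {
                return out;
            }
            out.push(byte);
            at_line_start = byte == b'\n';
        }
    }
    out
}

/// Hypothetical stand-in for an object-store stream: yields `data[offset..]`
/// in chunks of at most `size` bytes.
fn chunks_from(data: &[u8], offset: usize, size: usize) -> Vec<Vec<u8>> {
    data[offset..].chunks(size).map(|c| c.to_vec()).collect()
}
```

In this sketch, splitting the 16-byte input `aaa\nbbb\nccc\nddd\n` into the ranges [0, 6), [6, 10) and [10, 16) assigns every line to exactly one range, and each range is served by a single forward pass over one stream rather than by separate boundary-probing reads.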
   
   ## Are these changes tested?
   Yes, added unit tests.
   
   ## Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

