marvinlanhenke opened a new issue, #8723:
URL: https://github.com/apache/arrow-datafusion/issues/8723

   ### Is your feature request related to a problem or challenge?
   
   As originally stated in #6922 (I'm not sure why the issue was closed) and 
discussed in #6801, the current `FileOpener` implementations for both CSV and 
JSON issue multiple GetRequests to adjust the byte range prior to 
parsing / reading the file itself. 
   
   This is suboptimal and can be improved: every additional request is a 
remote network round trip, so fewer requests mean lower latency.
   
   ### Describe the solution you'd like
   
   I would like to reduce the number of GetRequests from 3 to 1.
   
   This can be done by "overfetching" the original partition byte range and 
then adjusting the range by finding the newline delimiter, similar to the 
solution already implemented.
   
   The approach is outlined here: 
https://github.com/apache/arrow-datafusion/pull/6801#discussion_r1257465786 by 
@alamb
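
A minimal sketch of the overfetch-then-trim idea, using only the standard library. It is not the code from the linked discussion or from DataFusion. The names `overfetch_range`, `trim_to_lines` and `Trimmed` are hypothetical, and the ownership rule (a line belongs to the partition that contains its first byte) is an assumption made for illustration. The buffer is assumed to come from a single request that starts one byte before the partition, so the trimmer can tell whether `start` already sits on a line boundary.

```rust
use std::ops::Range;

/// Result of trimming an overfetched buffer (hypothetical type).
#[derive(Debug, PartialEq)]
enum Trimmed<'a> {
    /// The complete lines owned by the partition (possibly empty).
    Lines(&'a [u8]),
    /// No terminating newline in the buffer and EOF not reached.
    NeedMoreData,
}

/// Byte range to request once: one byte before `start`, `overfetch` bytes past `end`.
fn overfetch_range(start: usize, end: usize, overfetch: usize, file_size: usize) -> Range<usize> {
    start.saturating_sub(1)..(end + overfetch).min(file_size)
}

/// Index of the first b'\n' at or after `from`, relative to `buf`.
fn find_newline(buf: &[u8], from: usize) -> Option<usize> {
    buf.get(from..)?.iter().position(|&b| b == b'\n').map(|i| from + i)
}

/// Trim `buf` (fetched with `overfetch_range`) to the lines whose first byte
/// lies in `start..end`. Assumes `start < end <= file_size`.
fn trim_to_lines<'a>(buf: &'a [u8], start: usize, end: usize, file_size: usize) -> Trimmed<'a> {
    assert!(start < end && end <= file_size);
    let base = start.saturating_sub(1); // file offset of buf[0]
    let at_eof = base + buf.len() >= file_size;

    // First owned line begins right after the first newline at or after `start - 1`.
    let data_start = if start == 0 {
        0
    } else {
        match find_newline(buf, 0) {
            Some(i) => i + 1,
            // The line began in an earlier partition and runs to EOF: nothing owned.
            None if at_eof => return Trimmed::Lines(&buf[buf.len()..]),
            None => return Trimmed::NeedMoreData,
        }
    };

    // Owned data ends right after the first newline at or after `end - 1`.
    let data_end = if end >= file_size {
        buf.len()
    } else {
        match find_newline(buf, end - 1 - base) {
            Some(i) => i + 1,
            None if at_eof => buf.len(), // last line has no trailing newline
            None => return Trimmed::NeedMoreData,
        }
    };

    Trimmed::Lines(&buf[data_start..data_end])
}
```

For the 9-byte file `a\nbb\nccc\n` split into partitions `0..3`, `3..4` and `4..9` with an overfetch of 4 bytes, this sketch yields `a\nbb\n`, an empty slice, and `ccc\n` respectively, so each line is produced exactly once even though two partitions fall inside the same line.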
   
   There are some edge cases that need consideration, such as "heterogeneous 
object sizes" within a CSV row or JSON object, which lead to partition ranges 
overlapping on the same line and can result in the same line being read twice. 
Error handling / retry when no newline can be found (the "overfetching" range 
was too small) has to be handled as well.
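
One possible shape for the retry case, sketched with the standard library only. This is an assumption about how a retry could look, not what the POC does: `fetch_through_newline` is a hypothetical helper, and the `get_range` closure stands in for a remote range request. When the overfetched tail contains no newline, the sketch requests only the missing tail, doubling the chunk size each time, until a newline or the end of the file is reached. It handles the end boundary only.

```rust
use std::ops::Range;

/// Fetch `start..end` plus an overfetched tail, extending the tail until the
/// line that covers byte `end - 1` is complete. Returns the bytes from `start`
/// through that line's newline and the number of requests issued.
/// Assumes `start < end <= file_size`.
fn fetch_through_newline<F>(
    mut get_range: F,
    start: usize,
    end: usize,
    file_size: usize,
    overfetch: usize,
) -> (Vec<u8>, usize)
where
    F: FnMut(Range<usize>) -> Vec<u8>,
{
    let mut chunk = overfetch.max(1);
    let mut fetched_end = (end + chunk).min(file_size);
    let mut buf = get_range(start..fetched_end);
    let mut requests = 1;
    // Buffer-relative position from which to look for the newline.
    let mut search_from = end - 1 - start;

    loop {
        if let Some(i) = buf[search_from..].iter().position(|&b| b == b'\n') {
            buf.truncate(search_from + i + 1); // keep the newline, drop the rest
            return (buf, requests);
        }
        if fetched_end >= file_size {
            return (buf, requests); // last line has no trailing newline
        }
        // Overfetch was too small: fetch only the missing tail, twice as large.
        search_from = buf.len();
        chunk *= 2;
        let next_end = (fetched_end + chunk).min(file_size);
        buf.extend(get_range(fetched_end..next_end));
        fetched_end = next_end;
        requests += 1;
    }
}
```

With a sufficiently large `overfetch` the common case stays at a single request; the retry path only costs extra requests for unusually long lines.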
   
   **POC**:
   ---
   I already went ahead and implemented a POC which works and can handle some 
edge-cases like overlapping partition ranges; appropriate error handling / 
retry is still missing.
   
   However, I definitely **need help** to improve upon this: 
https://github.com/marvinlanhenke/arrow-datafusion/blob/poc_optimize_get_req/datafusion/core/src/datasource/physical_plan/json.rs#L232-L380
   
   The solution is inefficient due to line-by-line operations and buffer 
cloning / copying.
   I tried different ways to handle the `GetResultPayload::Stream` by using 
`BytesMut::new()` and `buffer.extend_from_slice`, but I was not able to handle 
all the edge cases correctly.
   
   I'd greatly appreciate it if someone could give some pointers or take it 
from here to improve upon the POC.
   
   ### Describe alternatives you've considered
   
   Leave as is.
   
   ### Additional context
   
   None.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.