marvinlanhenke opened a new issue, #8723: URL: https://github.com/apache/arrow-datafusion/issues/8723
### Is your feature request related to a problem or challenge?

As originally stated in #6922 (I'm not sure why the issue was closed) and discussed in #6801, the current `FileOpener` implementations for both CSV and JSON issue multiple GET requests to adjust the byte range before parsing / reading the file itself. This is suboptimal and can be improved by reducing the latency caused by multiple remote network requests.

### Describe the solution you'd like

I would like to reduce the number of GET requests from 3 to 1. This can be done by "overfetching" the original partition byte range and then adjusting the range by finding the newline delimiter, similar to the solution already implemented. The approach is outlined by @alamb here: https://github.com/apache/arrow-datafusion/pull/6801#discussion_r1257465786

There are some edge cases that need consideration:

- "Heterogeneous object sizes" within a CSV row or JSON object can cause several partition ranges to fall on the same line, which can lead to reading the same line twice.
- Error handling / retry is needed when no newline can be found (the "overfetching" range was too small).

**POC**:
---
I already went ahead and implemented a POC which works and can handle some edge cases like overlapping partition ranges; appropriate error handling / retry is still missing. However, I definitely **need help** to improve upon this: https://github.com/marvinlanhenke/arrow-datafusion/blob/poc_optimize_get_req/datafusion/core/src/datasource/physical_plan/json.rs#L232-L380

The solution is inefficient due to line-by-line operations and buffer cloning / copying. I tried different ways to handle the `GetResultPayload::Stream` by using `BytesMut::new()` and `buffer.extend_from_slice`, but I was not able to handle all the edge cases correctly. I'd greatly appreciate it if someone could give some pointers, or take it from here to improve upon the POC.

### Describe alternatives you've considered

Leave as is.

### Additional context

None.
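
For illustration, the single-request idea from "Describe the solution you'd like" could be sketched roughly as below. This is not the POC code and not DataFusion's API: `fetch_range`, `trim_to_lines`, and `Trimmed` are hypothetical names, the buffer is assumed to be fully materialized in memory (no streaming), and the ownership rule (a line belongs to the partition that contains its first byte) is an assumption chosen so that no line is read twice.

```rust
use std::ops::Range;

/// Outcome of trimming an overfetched buffer to whole lines (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Trimmed {
    /// Absolute file offsets of the complete lines owned by the partition.
    Lines(Range<usize>),
    /// No newline found before the end of the fetched bytes: fetch more and retry.
    NeedMore,
}

/// Byte range to request in ONE GET: one byte before `part.start`
/// (to see whether a line starts exactly there) plus `overfetch` bytes past `part.end`.
pub fn fetch_range(part: &Range<usize>, overfetch: usize, file_size: usize) -> Range<usize> {
    let start = part.start.saturating_sub(1);
    let end = part.end.saturating_add(overfetch).min(file_size);
    start..end
}

/// Trim `buf` (the bytes of `fetched`) to the lines whose first byte lies in `part`.
/// Assumes `fetched` came from `fetch_range` and `buf.len() == fetched.len()`.
pub fn trim_to_lines(
    buf: &[u8],
    fetched: &Range<usize>,
    part: &Range<usize>,
    file_size: usize,
) -> Trimmed {
    if part.start >= part.end {
        return Trimmed::Lines(part.start..part.start);
    }
    // Offset just past the first newline at or after absolute offset `from`.
    // If the fetch already reached end of file, the file end acts as the boundary.
    let boundary = |from: usize| -> Option<usize> {
        let rel = from.saturating_sub(fetched.start);
        let tail = buf.get(rel..).unwrap_or(&[]);
        match tail.iter().position(|&b| b == b'\n') {
            Some(i) => Some(from + i + 1),
            None if fetched.end == file_size => Some(file_size),
            None => None,
        }
    };
    let start = if part.start == 0 { Some(0) } else { boundary(part.start - 1) };
    let end = if part.end >= file_size { Some(file_size) } else { boundary(part.end - 1) };
    match (start, end) {
        (Some(s), Some(e)) => Trimmed::Lines(s..e.max(s)),
        _ => Trimmed::NeedMore,
    }
}
```

Under these assumptions, a partition that falls entirely inside one line yields an empty range instead of re-reading that line, and `Trimmed::NeedMore` is the point where a retry with a larger overfetch (or a fallback to the current multi-request path) would hook in.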
