ariel-miculas commented on code in PR #21478:
URL: https://github.com/apache/datafusion/pull/21478#discussion_r3078385857


##########
datafusion/datasource-json/src/boundary_stream.rs:
##########
@@ -90,10 +90,52 @@ async fn get_stream(
     range: std::ops::Range<u64>,
 ) -> object_store::Result<BoxStream<'static, object_store::Result<Bytes>>> {
     let opts = GetOptions {
-        range: Some(GetRange::Bounded(range)),
+        range: Some(GetRange::Bounded(range.clone())),
         ..Default::default()
     };
     let result = store.get_opts(&location, opts).await?;
+
+    #[cfg(not(target_arch = "wasm32"))]
+    if let GetResultPayload::File(mut file, _path) = result.payload {
+        use std::io::{Read, Seek, SeekFrom};
+        const CHUNK_SIZE: u64 = 8 * 1024;

Review Comment:
   I've picked this value because that's the default for both
   *  `BufReader` (which is used when reading the entire json file as shown 
below)
   ```
                   GetResultPayload::File(file, _) => {
                       let bytes = file_compression_type.convert_read(file)?;
   
                       if newline_delimited {
                           // NDJSON: use BufReader directly
                           let reader = BufReader::new(bytes);
                           let arrow_reader = ReaderBuilder::new(schema)
                               .with_batch_size(batch_size)
                               .build(reader)?;
   
                           Ok(futures::stream::iter(arrow_reader)
                               .map(|r| r.map_err(Into::into))
                               .boxed())
                       }
   ```
   * object_store's `GetResult.into_stream`:
   ```
       pub fn into_stream(self) -> BoxStream<'static, Result<Bytes>> {
           match self.payload {
               #[cfg(all(feature = "fs", not(target_arch = "wasm32")))]
               GetResultPayload::File(file, path) => {
                   const CHUNK_SIZE: usize = 8 * 1024;
                   local::chunked_stream(file, path, self.range, CHUNK_SIZE)
               }
               GetResultPayload::Stream(s) => s,
           }
       }
   ```
   
   I think this value should be kept in sync with the first bullet point (where 
we read the entire json file).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to