vigimite opened a new issue, #19605:
URL: https://github.com/apache/datafusion/issues/19605

   ### Describe the bug
   
   The `calculate_range` function in `datafusion/datasource/src/mod.rs` creates 
invalid byte ranges (where `start > end`) when reading single-line JSON files 
that exceed 10MB total size with `target_partitions >= 2`.
   
   This causes an error from object_store:
   ```
    Error: ObjectStore(Generic { store: "S3", source: Inconsistent { start: 1149247, end: 1149246 } })
   ```
   
   ### Root Cause
   
   In `calculate_range`, the function calculates adjusted byte ranges by 
finding newline boundaries:
   
   ```rust
   let start_delta = if start != 0 {
       find_first_newline(store, location, start - 1, file_size, newline).await?
   } else {
       0
   };
   
   let end_delta = if end != file_size {
       find_first_newline(store, location, end - 1, file_size, newline).await?
   } else {
       0
   };
   
   let range = start + start_delta..end + end_delta;
   
   if range.start == range.end {
       return Ok(RangeCalculation::TerminateEarly);
   }
   ```
   
   When `find_first_newline` doesn't find a newline (as with single-line JSON), it 
returns the number of bytes remaining to the end of the file. This makes 
`start + start_delta` exceed `end + end_delta`, producing an inverted range.
   
   The current check only handles `range.start == range.end`, not `range.start 
> range.end`.
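   The off-by-one is easy to reproduce with a toy model. This sketch assumes, per the 
description above, that `find_first_newline` returns the byte count from the probe 
offset to the next newline, or to end-of-file when there is none (the names here are 
illustrative, not DataFusion's actual API):
   
   ```python
   def find_first_newline(data: bytes, offset: int) -> int:
       """Bytes from `offset` to the first newline, or to EOF if none."""
       i = data.find(b"\n", offset)
       return (i - offset) if i != -1 else len(data) - offset

   data = b"x" * 100          # single-line "file", no newline anywhere
   file_size = len(data)
   start, end = 50, 100       # second of two partitions

   start_delta = find_first_newline(data, start - 1) if start != 0 else 0
   end_delta = find_first_newline(data, end - 1) if end != file_size else 0

   # start + start_delta = 50 + 51 = 101, but end + end_delta = 100:
   # the adjusted range is inverted by one byte, mirroring the
   # Inconsistent { start, end } error above.
   rng = (start + start_delta, end + end_delta)
   ```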
   
   ### Trigger Conditions
   
   All must be true:
   1. Total file size > 10MB (triggers `FileGroupPartitioner` repartitioning)
   2. `target_partitions >= 2`
   3. JSON files are single-line (no internal newlines, e.g., from `json.dump()`)
   
   ### To Reproduce
   
   ### 1. Create test data (single-line JSON, >10MB total)
   
   ```python
   import json, random, string
   for i in range(1, 11):
        data = {'id': i, 'padding': ''.join(random.choices(string.ascii_letters, k=1100000))}
       with open(f'data/file_{i}.json', 'w') as f:
           json.dump(data, f)  # Single line, no newlines
   ```
   
   ### 2. Upload to S3/MinIO
   
   ```bash
   mc mb minio/test-bucket
   mc cp --recursive data/ minio/test-bucket/data/
   ```
   
   ### 3. Run DataFusion query
   
   ```rust
   use datafusion::prelude::*;
   use object_store::aws::AmazonS3Builder;
   use std::sync::Arc;
   use url::Url;
   
   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       let store = Arc::new(AmazonS3Builder::new()
        .with_endpoint("http://localhost:9000")
           .with_bucket_name("test-bucket")
           .with_access_key_id("minioadmin")
           .with_secret_access_key("minioadmin")
           .with_region("us-east-1")
           .with_allow_http(true)
           .build()?);
   
    let config = SessionConfig::new().with_target_partitions(2);  // fails with >= 2
       let ctx = SessionContext::new_with_config(config);
   
       ctx.register_object_store(&Url::parse("s3://test-bucket")?, store);
    ctx.sql("CREATE EXTERNAL TABLE test STORED AS JSON LOCATION 's3://test-bucket/data/'").await?;
   
       let df = ctx.sql("SELECT * FROM test").await?;
       let results = df.collect().await?;  // FAILS with invalid range error
       Ok(())
   }
   ```
   
   
   ### Expected behavior
   
   DataFusion should handle single-line JSON files gracefully when 
partitioning. When a partition contains no complete records (because the entire 
file is a single line), that partition should be skipped via 
`RangeCalculation::TerminateEarly`.
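   One possible fix is to widen the guard from `==` to `>=`, so an inverted range is 
treated the same as an empty one. A minimal sketch, using simplified stand-ins 
(`adjust_range`, a trimmed-down `RangeCalculation`) rather than the real 
`calculate_range` internals:
   
   ```rust
   // Simplified model of the range adjustment; the real function also
   // performs the newline probes against the object store.
   enum RangeCalculation {
       Range(std::ops::Range<u64>),
       TerminateEarly,
   }

   fn adjust_range(start: u64, start_delta: u64, end: u64, end_delta: u64) -> RangeCalculation {
       let range = start + start_delta..end + end_delta;
       // `>=` instead of `==`: an inverted range means no complete record
       // starts in this partition, so skip it instead of requesting
       // invalid bytes from the object store.
       if range.start >= range.end {
           return RangeCalculation::TerminateEarly;
       }
       RangeCalculation::Range(range)
   }

   fn main() {
       // A partition whose adjusted start overshoots its adjusted end is skipped:
       assert!(matches!(adjust_range(50, 51, 100, 0), RangeCalculation::TerminateEarly));
       // A normal partition still yields a byte range:
       assert!(matches!(adjust_range(0, 0, 100, 0), RangeCalculation::Range(_)));
   }
   ```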
   
   
   ### Additional context
   
   ### Affected versions
   
   - datafusion 51.0.0 (also tested with 50.x, 49.x, 45.x)
   - object_store 0.12.4
   - Tested with MinIO and RustFS (both fail identically)
   
   ### Workarounds
   
   1. Set partitions to 1: `SessionConfig::new().with_target_partitions(1)`
   2. Always terminate each JSON record with a newline when writing NDJSON
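   
   Workaround 2 can be sketched as follows (the records here are hypothetical 
placeholders; the point is the `"\n"` written after each record, which gives 
`find_first_newline` a boundary inside every partition):
   
   ```python
   import json

   records = [{"id": i, "padding": "x" * 8} for i in range(3)]

   with open("data.ndjson", "w") as f:
       for rec in records:
           f.write(json.dumps(rec) + "\n")  # one record per line
   ```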

