jiayuasu opened a new issue, #824:
URL: https://github.com/apache/sedona-db/issues/824

   ### Description
   
   `LasSource` reports projection pushdown to DataFusion, but the resulting 
record batches still carry the full file schema. When the requested projection 
is anything other than a prefix of the file schema, one of two things happens:
   
   1. The query errors out at the `DataSourceExec` boundary with
      `column types must match schema types, expected <T> but found <U> at 
column index N`.
   2. The query "succeeds" but silently reads the wrong columns — the 
downstream operator indexes into a batch with the file schema, not the 
projected schema.
   
   Case (2) is the more dangerous one: aggregates and filters over a projected 
column will run against whatever column happens to sit at the same physical 
index in the unprojected batch.
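
The difference between the two cases can be illustrated with a small std-only model of a positional type check. This is only a sketch: `DataType` and `check_batch` are hypothetical stand-ins rather than Arrow or DataFusion types, and the column list is abbreviated to `x, y, z, intensity, classification`.

```rust
// Std-only model of a positional schema check. These types are hypothetical
// stand-ins, not the Arrow or DataFusion API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Float64,
    UInt16,
    UInt8,
}

/// Compare the declared (projected) types against the types actually present
/// in the batch, position by position.
fn check_batch(expected: &[DataType], found: &[DataType]) -> Result<(), String> {
    for (i, (e, f)) in expected.iter().zip(found.iter()).enumerate() {
        if e != f {
            return Err(format!(
                "expected {:?} but found {:?} at column index {}",
                e, f, i
            ));
        }
    }
    Ok(())
}

fn main() {
    // Abbreviated file schema: x, y, z, intensity, classification.
    let file_batch = [
        DataType::Float64,
        DataType::Float64,
        DataType::Float64,
        DataType::UInt16,
        DataType::UInt8,
    ];

    // Projection [classification]: the declared type is UInt8, but the
    // unprojected batch has x (Float64) at index 0, so the check fails.
    let err = check_batch(&[DataType::UInt8], &file_batch).unwrap_err();
    println!("{err}");

    // Projection [x, z]: both are Float64, so a type-only check passes even
    // though index 1 of the unprojected batch is y, not z.
    assert!(check_batch(&[DataType::Float64, DataType::Float64], &file_batch).is_ok());
}
```

A type-only check catches the mismatch only when the column at the same position happens to have a different type, which is why same-typed columns slip through silently.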
   
   ### Repro (uses files already in `rust/sedona-pointcloud/tests/data/`)
   
   `extra.las` is a 1-point LAS 1.4 file whose schema (under default 
`las.geometry_encoding = 'plain'` and `las.extra_bytes = 'ignore'`) starts with 
`x, y, z, intensity, return_number, ..., classification, ...`.
   
   ```sql
   -- works (prefix of file schema):
   SELECT x, y, z FROM 'tests/data/extra.las';

   -- fails: expected UInt8 (classification) but found Float64 (x) at column index 0
   SELECT classification FROM 'tests/data/extra.las';

   -- fails: expected UInt8 (classification) but found UInt16 (intensity) at column index 1
   SELECT geometry, classification FROM 'tests/data/extra.las';
   -- (same with default WKB encoding)

   -- fails the same way (DataFusion pushes a single-column projection):
   SELECT count(*) FROM 'tests/data/large.laz' WHERE classification = 0;
   ```
   
   Silent-wrong-result version, on `large.laz`, where the generator sets
`x = y = z` so the bug is hidden by coincidence:
   
   ```sql
   -- returns (n=100000, min_x=0.5, max_x=1.0, min_z=0.5, max_z=1.0)
   -- but the engine is actually computing min/max of x and y, not x and z.
   SELECT count(*), min(x), max(x), min(z), max(z) FROM 'tests/data/large.laz';
   ```
   
   Re-running with a file where `y != z` will return wrong numbers without any 
error.
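
To make the `y != z` case concrete, here is a std-only sketch with made-up values. The "batch" is just a list of `f64` columns and `min_max` is a hypothetical helper, so nothing here is the actual reader or engine code.

```rust
// Hypothetical std-only model: a "batch" is just a list of f64 columns.
fn min_max(col: &[f64]) -> (f64, f64) {
    let min = col.iter().copied().fold(f64::INFINITY, f64::min);
    let max = col.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    (min, max)
}

fn main() {
    // File order is x, y, z. Values are made up, with y != z.
    let x = vec![0.5, 1.0];
    let y = vec![10.0, 20.0];
    let z = vec![100.0, 200.0];
    let unprojected = vec![x, y, z.clone()];

    // The plan asks for [x, z], so the consumer reads z at index 1.
    let z_index_in_projection = 1;
    let got = min_max(&unprojected[z_index_in_projection]);
    let want = min_max(&z);

    // Index 1 of the unprojected batch is y, so the result is silently wrong.
    assert_eq!(got, (10.0, 20.0));
    assert_eq!(want, (100.0, 200.0));
    println!("min/max(z): got {got:?}, want {want:?}");
}
```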
   
   ### Hypothesis
   
   `LasOpener::open` (`rust/sedona-pointcloud/src/las/opener.rs`) yields the 
record batch returned by `file_reader.get_batch(chunk_meta)` unchanged — it 
never sees the projection.
   
   `LasSource::create_file_opener` 
(`rust/sedona-pointcloud/src/las/source.rs:98-117`) wraps the opener with 
`ProjectionOpener::try_new(split_projection, inner_opener, 
table_schema.file_schema())` when `split_projection` is set, but the output 
stream coming out of that wrapper still has the file schema, so the upstream 
`DataSourceExec` schema check fails.
   
   Either the inner opener needs to apply `split_projection.file_indices` 
itself (slice columns before yielding), or the `ProjectionOpener` arrangement 
needs to be corrected (e.g. wrong arg order / wrong projected schema being 
declared on `FileScanConfig`).
   
   ### Fix proposal (sketch)
   
   Apply the file-index projection inside `LasOpener` before yielding:
   
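   ```rust
   let record_batch = file_reader.get_batch(chunk_meta).await?;
   // `file_indices` would be a new optional field on `LasOpener`.
   let record_batch = if let Some(file_indices) = &file_indices {
       // Keep only the requested file columns, in projection order.
       record_batch.project(file_indices)?
   } else {
       record_batch
   };
   ```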
   
   …and pass `split_projection.file_indices.clone()` into `LasOpener` from
`LasSource::create_file_opener`.
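
For reference, the select-by-index behavior the fix relies on can be modeled without any dependencies. This is a sketch under stated assumptions: `Batch`, its `project` method, and `apply_projection` are hypothetical stand-ins that only mirror what the fix expects from Arrow's `RecordBatch::project`, namely that the output holds exactly the columns at the given indices, in the order given.

```rust
// Hypothetical std-only stand-in for a record batch: named columns in order.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<f64>)>,
}

impl Batch {
    /// Keep only the columns at `indices`, in the order given; error on an
    /// out-of-range index.
    fn project(&self, indices: &[usize]) -> Result<Batch, String> {
        let columns = indices
            .iter()
            .map(|&i| {
                self.columns
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("column index {i} out of bounds"))
            })
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Batch { columns })
    }
}

/// Mirrors the proposed branch: project when indices are present, otherwise
/// pass the batch through unchanged.
fn apply_projection(batch: Batch, file_indices: Option<&[usize]>) -> Result<Batch, String> {
    match file_indices {
        Some(indices) => batch.project(indices),
        None => Ok(batch),
    }
}

fn main() {
    let batch = Batch {
        columns: vec![
            ("x".to_string(), vec![0.5]),
            ("y".to_string(), vec![10.0]),
            ("z".to_string(), vec![100.0]),
        ],
    };

    // Projection [x, z] corresponds to file indices [0, 2].
    let indices: &[usize] = &[0, 2];
    let projected = apply_projection(batch.clone(), Some(indices)).unwrap();
    let names: Vec<&str> = projected.columns.iter().map(|(n, _)| n.as_str()).collect();
    assert_eq!(names, ["x", "z"]);

    // Without a projection the batch passes through unchanged.
    assert_eq!(apply_projection(batch.clone(), None).unwrap(), batch);
}
```

After projection, index 1 of the yielded batch is `z`, which matches what a consumer of the projected schema expects.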
   
   ### Environment
   
   - Branch: `main` at commit 5edd7cf0d (`feat(python/sedonadb): add Expr
foundation #807`)
   - `cargo run -p sedona --features pointcloud` (default workspace toolchain)
   
   ### Suggested regression test
   
   Add to `rust/sedona-pointcloud/src/las/format.rs` `mod test`:
   
   ```rust
   #[tokio::test]
   async fn projection_actually_executes() {
       let ctx = setup_context();
       let df = ctx
           .sql("SELECT classification FROM 'tests/data/extra.las'")
           .await
           .unwrap();
       // Execute the plan; checking only the logical schema would not hit the bug.
       let batches = df.collect().await.unwrap();
       assert_eq!(batches[0].num_columns(), 1);
       assert_eq!(
           batches[0].schema().field(0).data_type(),
           &arrow_schema::DataType::UInt8
       );
   }
   ```
   
   The existing `format::test::projection` test only inspects the logical
schema (`df.schema().fields().len()`); it never executes the plan, so the bug
doesn't surface.

