schenksj opened a new pull request, #4525:
URL: https://github.com/apache/datafusion-comet/pull/4525

   ## Which issue does this PR close?
   
   Closes #4520.
   
   ## Rationale for this change
   
   Comet's native readers go through `object_store`, which only understands a 
fixed set of URL schemes. When a scan's path uses a custom Hadoop `FileSystem` 
scheme (e.g. one registered via `spark.hadoop.fs.<scheme>.impl`), the native 
reader fails at *execution* with `Generic URL error: Unable to recognise URL 
"..."`, and there is no graceful recovery once native execution has started. 
This was surfaced by Delta tables opened with custom filesystem options 
(`DeltaTable.forPath(spark, path, fsOptions)`): Delta reads its internal 
`_delta_log/*.checkpoint.parquet` files via ordinary V1 parquet scans, which 
Comet claimed and then crashed on. The failure reproduces for *any* V1 parquet 
scan on such a scheme.
   
   ## What changes are included in this PR?
   
   `CometScanRule` now declines a V1 native scan when its root-path scheme 
isn't natively readable, so Spark's Hadoop-FS-aware reader handles it instead. 
Rather than hardcoding the `object_store`-supported scheme set in the planner 
(a mirror that would drift), the answer comes from the native layer itself: a 
new `NativeBase.isObjectStoreSchemeSupported` JNI method backed by 
`object_store`'s own `ObjectStoreScheme::parse`, the same path that 
`prepare_object_store_with_configs` dispatches through. The user's libhdfs 
scheme config (`spark.hadoop.fs.comet.libhdfs.schemes`) is unioned in on the 
JVM side, and results are cached per scheme. If the native layer can't be 
consulted, the scheme is assumed supported rather than over-restricting.
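
   For illustration only, the decision flow described above might be sketched 
as follows. This is not the code in this PR: `SchemeGate`, `isNativelyReadable`, 
and the `nativeCheck` predicate are hypothetical names, with the predicate 
standing in for the `NativeBase.isObjectStoreSchemeSupported` JNI call. The 
scheme set used in `main` is a placeholder rather than `object_store`'s actual 
list, and treating scheme-less paths as readable and not caching the fail-open 
answer are assumptions of this sketch.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch of the scheme gate; names are illustrative, not Comet's. */
final class SchemeGate {
    // Stand-in for the native lookup (NativeBase.isObjectStoreSchemeSupported in the PR).
    private final Predicate<String> nativeCheck;
    // Schemes the user routed through libhdfs (spark.hadoop.fs.comet.libhdfs.schemes).
    private final Set<String> libhdfsSchemes;
    // Per-scheme cache of answers obtained from the native layer.
    private final ConcurrentHashMap<String, Boolean> cache = new ConcurrentHashMap<>();

    SchemeGate(Predicate<String> nativeCheck, Set<String> libhdfsSchemes) {
        this.nativeCheck = nativeCheck;
        this.libhdfsSchemes = libhdfsSchemes;
    }

    /** Returns false only when the scheme is known to be unreadable natively. */
    boolean isNativelyReadable(String rootPath) {
        String scheme = URI.create(rootPath).getScheme();
        if (scheme == null) {
            return true; // assumption: a scheme-less path is a plain local path
        }
        String key = scheme.toLowerCase(Locale.ROOT);
        if (libhdfsSchemes.contains(key)) {
            return true; // union with the user's libhdfs scheme config
        }
        Boolean answer = cache.computeIfAbsent(key, s -> {
            try {
                return nativeCheck.test(s);
            } catch (RuntimeException | LinkageError e) {
                // Native layer unavailable: returning null records no mapping,
                // so the next call retries the lookup.
                return null;
            }
        });
        // Fail open: assume supported rather than over-restricting.
        return answer == null || answer;
    }

    public static void main(String[] args) {
        // Placeholder scheme set, NOT object_store's real list.
        Set<String> pretendNative = Set.of("file", "s3", "gs");
        SchemeGate gate = new SchemeGate(pretendNative::contains, Set.of("hdfs"));
        System.out.println(gate.isNativelyReadable("s3://bucket/table"));
        System.out.println(gate.isNativelyReadable("fake://host/table"));
    }
}
```

   A planner rule would call something like `isNativelyReadable` on the scan's 
root path and leave the scan to Spark when it returns `false`.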
   
   ## How are these changes tested?
   
   `CometScanSchemeFallbackSuite` registers `FakeHDFSFileSystem` for a 
`fake://` scheme (not routed through libhdfs) and applies `CometScanRule` to 
the scan's physical plan. It asserts the scan falls back to Spark (no 
`CometScanExec`). The test **fails without the gate** (Comet claims the 
`fake://` scan) and **passes with it**. The libhdfs-scheme regression guard 
(`ParquetReadFromFakeHadoopFsSuite`) continues to engage Comet for configured 
libhdfs schemes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

