andygrove commented on PR #4525:
URL:
https://github.com/apache/datafusion-comet/pull/4525#issuecomment-4586905171
Nice approach overall. Sourcing "can the native reader handle this scheme?"
from `object_store`'s own `ObjectStoreScheme::parse` instead of a hardcoded
list is the right call, and for every scheme that flows through `parse_url`
natively the gate is exact by construction.
I think there is one gap worth a look before merge: plain `hdfs://`.
On the native side, `is_hdfs_scheme` in `parquet_support.rs` returns `true`
for `scheme == "hdfs"` whenever `fs.comet.libhdfs.schemes` is unset, and
`create_hdfs_object_store` is compiled into the default build (`default =
["hdfs-opendal"]` in `native/core/Cargo.toml`). So the native reader handles
`hdfs://` out of the box.
On the JVM side, `COMET_LIBHDFS_SCHEMES` has no default, so when it is unset
`libhdfsSchemes` is empty. For `hdfs`, the decline condition then reduces to
`!isNativelyReadableScheme(uri)`, and `object_store` has no `hdfs` scheme, so
that helper returns `false`. The net effect is that a plain `hdfs://` V1 scan
gets declined and falls back to Spark, even though native could read it.
I built the branch with a default native library and probed the gate's
helper directly to confirm:
```
isNativelyReadableScheme(hdfs://namenode:8020/...) = false // declined
isNativelyReadableScheme(s3a://bucket/key) = true // ok
isNativelyReadableScheme(file:/tmp/data) = true // ok
```
So `s3a` and `file` stay consistent (they bypass `parse_url` natively but
`object_store` recognizes them anyway). Only `hdfs` diverges, and it diverges
in the default HDFS configuration rather than an exotic one, so this looks like
a silent fallback regression for HDFS users.
Would it make sense to mirror the native default on the JVM, so the two stay
in lockstep?
```scala
val libhdfsSchemes: Set[String] = COMET_LIBHDFS_SCHEMES.get() match {
case Some(s) =>
s.split(",").map(_.trim.toLowerCase(Locale.ROOT)).filter(_.nonEmpty).toSet
case None => Set("hdfs") // native is_hdfs_scheme defaults to `scheme
== "hdfs"` when unset
}
```
A test asserting that an `hdfs://` root path with the config unset is still
claimed by Comet would lock this in, alongside the existing `fake://` decline
case.
One smaller thing: is the V2 `BatchScanExec` path susceptible to the same
`Unable to recognise URL` failure on custom schemes, or is this intentionally
V1-only? A note or follow-up issue would help.
*Disclosure: I used Claude Code to help review this PR, including building
the branch and running the scheme-gate probe above.*
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]