comphead opened a new issue, #1336:
URL: https://github.com/apache/datafusion-comet/issues/1336
### What is the problem the feature request solves?
Currently, Apache DataFusion Comet reads data from underlying sources
using the built-in Comet reader, which lacks support for processing nested
types. There is an experimental feature
```
val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] =
  conf("spark.comet.scan.impl")
    .doc(
      s"The implementation of Comet Native Scan to use. Available modes are '$SCAN_NATIVE_COMET'," +
        s"'$SCAN_NATIVE_DATAFUSION', and '$SCAN_NATIVE_ICEBERG_COMPAT'. " +
        s"'$SCAN_NATIVE_COMET' is for the original Comet native scan which uses a jvm based " +
        "parquet file reader and native column decoding. Supports simple types only " +
        s"'$SCAN_NATIVE_DATAFUSION' is a fully native implementation of scan based on DataFusion" +
        s"'$SCAN_NATIVE_ICEBERG_COMPAT' is a native implementation that exposes apis to read " +
        "parquet columns natively.")
    .internal()
    .stringConf
    .transform(_.toLowerCase(Locale.ROOT))
    .checkValues(Set(SCAN_NATIVE_COMET, SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT))
    .createWithDefault(sys.env
      .getOrElse("COMET_PARQUET_SCAN_IMPL", SCAN_NATIVE_COMET)
      .toLowerCase(Locale.ROOT))
```
to scan the data using the DataFusion native reader, which supports Arrow
nested types. However, the reader must also be able to read data from a
remote HDFS filesystem.
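
For illustration, the way the config entry above arrives at its effective value (explicit setting, else the `COMET_PARQUET_SCAN_IMPL` environment variable, else the original Comet scan, then lower-casing and validation) can be sketched in plain Scala. This is only a sketch: the object name, the `resolve` helper, and the three string values of the constants are assumptions made for the example and are not taken from the snippet above.

```scala
import java.util.Locale

object ScanImplResolution {
  // Assumed string values for SCAN_NATIVE_COMET, SCAN_NATIVE_DATAFUSION and
  // SCAN_NATIVE_ICEBERG_COMPAT; the issue text does not show them.
  val ScanNativeComet = "native_comet"
  val ScanNativeDataFusion = "native_datafusion"
  val ScanNativeIcebergCompat = "native_iceberg_compat"

  private val allowed =
    Set(ScanNativeComet, ScanNativeDataFusion, ScanNativeIcebergCompat)

  // Hypothetical helper: picks the explicit value if present, otherwise the
  // environment default, otherwise the original Comet scan, then normalizes
  // the case and checks the result against the allowed set.
  def resolve(explicit: Option[String], env: Map[String, String]): Either[String, String] = {
    val raw = explicit.getOrElse(env.getOrElse("COMET_PARQUET_SCAN_IMPL", ScanNativeComet))
    val normalized = raw.toLowerCase(Locale.ROOT)
    if (allowed.contains(normalized)) Right(normalized)
    else Left(s"Unsupported scan implementation: $raw")
  }
}
```

With these assumed values, a call such as `ScanImplResolution.resolve(Some("NATIVE_DATAFUSION"), sys.env)` would select the DataFusion-based scan, while an unknown value is rejected instead of silently falling back.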
Some object store implementations are available for working with HDFS:
- https://github.com/datafusion-contrib/datafusion-objectstore-hdfs (a native
object store on top of libhdfs and the JVM; it uses more memory but has
richer client setting support, such as retry and network options)
- https://github.com/datafusion-contrib/hdfs-native-object-store (fewer
client options, but no JVM dependency)
Subtasks
- [ ] Create optional HDFS feature for Comet
- [ ] Use HDFS file system based on some parameter (URL scheme or
`spark.defaultFS`)
- [ ] Remote HDFS tests with Minikube
- [ ] Documentation
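
As a rough sketch of the second subtask, the choice of file system could be driven by the URL scheme of the path being scanned, falling back to a configured default file system when the path has no scheme. Everything named here (`FileSystemSelection`, `StoreKind`, `storeKindFor`, and the `defaultFs` parameter) is hypothetical and only illustrates the idea; it is not Comet's API, and paths containing characters that `java.net.URI` rejects would need extra handling.

```scala
import java.net.URI
import java.util.Locale

object FileSystemSelection {
  // Hypothetical result type describing which object store to use.
  sealed trait StoreKind
  case object Hdfs extends StoreKind
  case object LocalFile extends StoreKind
  final case class Other(scheme: String) extends StoreKind

  // Uses the scheme of the path if it has one; otherwise falls back to the
  // scheme of the default file system (for example a value such as
  // "hdfs://namenode:9000"), and finally to the local file system.
  def storeKindFor(path: String, defaultFs: Option[String]): StoreKind = {
    val scheme = Option(new URI(path).getScheme)
      .orElse(defaultFs.flatMap(fs => Option(new URI(fs).getScheme)))
      .map(_.toLowerCase(Locale.ROOT))
    scheme match {
      case Some("hdfs")        => Hdfs
      case Some("file") | None => LocalFile
      case Some(other)         => Other(other)
    }
  }
}
```

Keeping the decision in one small function would let the optional HDFS feature be switched on or off without touching the scan code that consumes the result.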
### Describe the potential solution
_No response_
### Additional context
_No response_