comphead opened a new issue, #1336: URL: https://github.com/apache/datafusion-comet/issues/1336
### What is the problem the feature request solves?

Currently, Apache DataFusion Comet reads data from the underlying sources using the built-in Comet reader, which lacks support for processing nested types. There is an experimental feature

```
val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] = conf("spark.comet.scan.impl")
  .doc(
    s"The implementation of Comet Native Scan to use. Available modes are '$SCAN_NATIVE_COMET'," +
      s"'$SCAN_NATIVE_DATAFUSION', and '$SCAN_NATIVE_ICEBERG_COMPAT'. " +
      s"'$SCAN_NATIVE_COMET' is for the original Comet native scan which uses a jvm based " +
      "parquet file reader and native column decoding. Supports simple types only " +
      s"'$SCAN_NATIVE_DATAFUSION' is a fully native implementation of scan based on DataFusion" +
      s"'$SCAN_NATIVE_ICEBERG_COMPAT' is a native implementation that exposes apis to read " +
      "parquet columns natively.")
  .internal()
  .stringConf
  .transform(_.toLowerCase(Locale.ROOT))
  .checkValues(Set(SCAN_NATIVE_COMET, SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT))
  .createWithDefault(sys.env
    .getOrElse("COMET_PARQUET_SCAN_IMPL", SCAN_NATIVE_COMET)
    .toLowerCase(Locale.ROOT))
```

to scan the data using the DataFusion native reader, which supports Arrow nested types. However, the reader has to be able to read data from a remote HDFS filesystem.

There are some object store implementations available to work with HDFS:

- https://github.com/datafusion-contrib/datafusion-objectstore-hdfs (native object store on top of libhdfs and the JVM). It uses more memory but has richer support for client settings such as retry and network options.
- https://github.com/datafusion-contrib/hdfs-native-object-store has fewer client options but no JVM dependency.

Subtasks
- [ ] Create optional HDFS feature for Comet
- [ ] Use HDFS file system based on some parameter (scheme or `spark.defaultFS`)
- [ ] Remote HDFS tests with Minikube
- [ ] Documentation

### Describe the potential solution

_No response_

### Additional context

_No response_
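
For the subtask that selects the HDFS file system from a parameter, one possible approach is to look at the URI scheme of the scanned path and fall back to the default filesystem when the path has no scheme. The sketch below only illustrates that idea: `HdfsSchemeCheck`, `usesHdfs`, and the `defaultFs` argument are hypothetical names, not existing Comet or DataFusion APIs, and how the default filesystem value is obtained is left open.

```scala
import java.net.URI

object HdfsSchemeCheck {
  // Hypothetical helper, not part of Comet. Returns true when `path` should be
  // read through an HDFS object store. `defaultFs` stands in for whatever
  // default-filesystem setting the implementation ends up consulting.
  // Note: java.net.URI throws URISyntaxException on unescaped characters
  // such as spaces, so real paths may need escaping first.
  def usesHdfs(path: String, defaultFs: String): Boolean = {
    // Prefer the scheme on the path itself; scheme-less paths such as
    // "/warehouse/t" inherit the scheme of the default filesystem.
    val scheme = Option(new URI(path).getScheme)
      .orElse(Option(new URI(defaultFs).getScheme))
    scheme.exists(_.equalsIgnoreCase("hdfs"))
  }
}
```

With this helper, a path such as `hdfs://namenode:8020/warehouse/t` would take the HDFS route regardless of the default, while `/warehouse/t` would do so only when the default filesystem itself uses the `hdfs` scheme.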