[I] [Experimental] Integrate Comet native reader with remote HDFS [datafusion-comet]

via GitHub Fri, 24 Jan 2025 12:35:22 -0800


comphead opened a new issue, #1336:
URL: https://github.com/apache/datafusion-comet/issues/1336


   ### What is the problem the feature request solves?
   
   Currently, Apache DataFusion Comet reads data from underlying sources using the built-in Comet reader, which lacks support for processing nested types.
   
   There is an experimental feature 
   ```
     val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] = conf("spark.comet.scan.impl")
       .doc(
         s"The implementation of Comet Native Scan to use. Available modes are '$SCAN_NATIVE_COMET'," +
           s"'$SCAN_NATIVE_DATAFUSION', and '$SCAN_NATIVE_ICEBERG_COMPAT'. " +
           s"'$SCAN_NATIVE_COMET' is for the original Comet native scan which uses a jvm based " +
           "parquet file reader and native column decoding. Supports simple types only " +
           s"'$SCAN_NATIVE_DATAFUSION' is a fully native implementation of scan based on DataFusion" +
           s"'$SCAN_NATIVE_ICEBERG_COMPAT' is a native implementation that exposes apis to read " +
           "parquet columns natively.")
       .internal()
       .stringConf
       .transform(_.toLowerCase(Locale.ROOT))
       .checkValues(Set(SCAN_NATIVE_COMET, SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT))
       .createWithDefault(sys.env
         .getOrElse("COMET_PARQUET_SCAN_IMPL", SCAN_NATIVE_COMET)
         .toLowerCase(Locale.ROOT))
   ```
   
   to scan the data using the DataFusion native reader, which supports Arrow nested types. However, the reader must also be able to read data from a remote HDFS filesystem.
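   
   To illustrate how the scan mode is resolved, here is a minimal, Spark-free sketch of the same logic: an explicit setting wins, otherwise the `COMET_PARQUET_SCAN_IMPL` environment variable is used, otherwise the original Comet scan; the value is lowercased and checked against the allowed set. The object name, the `resolveScanImpl` helper, and the literal mode strings are assumptions made for illustration and are not taken from the Comet code base.

   ```scala
   import java.util.Locale

   object ScanImplSketch {
     // Assumed string values for the SCAN_NATIVE_* constants referenced above.
     val ScanNativeComet = "native_comet"
     val ScanNativeDataFusion = "native_datafusion"
     val ScanNativeIcebergCompat = "native_iceberg_compat"

     private val validModes =
       Set(ScanNativeComet, ScanNativeDataFusion, ScanNativeIcebergCompat)

     /** Hypothetical helper mirroring transform / checkValues / createWithDefault. */
     def resolveScanImpl(
         explicit: Option[String],
         env: Map[String, String] = sys.env): String = {
       // Explicit setting first, then the environment variable, then the default.
       val raw = explicit.getOrElse(
         env.getOrElse("COMET_PARQUET_SCAN_IMPL", ScanNativeComet))
       // Normalize case before validating, as the config entry does.
       val mode = raw.toLowerCase(Locale.ROOT)
       require(validModes.contains(mode), s"Unsupported scan implementation: $raw")
       mode
     }
   }
   ```

   Under these assumptions, `ScanImplSketch.resolveScanImpl(Some("NATIVE_DATAFUSION"), Map.empty)` yields the DataFusion mode regardless of the case in which the value was written.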
   
   Some object store implementations are available for working with HDFS:
   - https://github.com/datafusion-contrib/datafusion-objectstore-hdfs (a native object store on top of libhdfs and the JVM; it uses more memory but has richer client settings, such as retries and network options)
   - https://github.com/datafusion-contrib/hdfs-native-object-store (fewer client options, but no JVM dependency)
   
   
   Subtasks
   - [ ] Create optional HDFS feature for Comet
   - [ ] Use the HDFS file system based on some parameter (URL scheme or `spark.defaultFS`)
   - [ ] Remote HDFS tests with Minikube
   - [ ] Documentation
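   
   As a rough sketch of the second subtask, the decision to route a file to an HDFS object store could look at the path's URL scheme first and fall back to a configured default filesystem when the path has none. The object name, the `usesHdfs` helper, and the way the default filesystem is passed in are hypothetical; the sketch does not reflect how Comet actually wires this up.

   ```scala
   import java.net.URI

   object HdfsSelectionSketch {
     /**
      * Hypothetical helper: returns true when `path` should be read through an
      * HDFS object store. `defaultFs` stands in for a configured default
      * filesystem URI and is consulted only when the path has no scheme.
      * Paths that are not valid URIs (for example, containing spaces) make
      * `new URI` throw URISyntaxException; this sketch does not handle that.
      */
     def usesHdfs(path: String, defaultFs: Option[String]): Boolean = {
       // getScheme returns null for scheme-less paths, hence the Option wrapper.
       val scheme = Option(new URI(path).getScheme)
         .orElse(defaultFs.flatMap(fs => Option(new URI(fs).getScheme)))
       scheme.exists(_.equalsIgnoreCase("hdfs"))
     }
   }
   ```

   An explicit scheme on the path takes precedence here, so an `s3a://` path is not sent to HDFS even when the default filesystem is an HDFS URI.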
   
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


