rahul-ghiware opened a new issue, #404: URL: https://github.com/apache/incubator-xtable/issues/404
Using Spark 3.4.0, Scala 2.12, and the Iceberg Spark runtime 1.4.2:

- Created an Iceberg table in the `/tmp` folder (a sketch of the table creation follows the listing below):

```
rghiware ~ $ cd /tmp
rghiware /tmp $ cd iceberg-warehouse/people
rghiware iceberg-warehouse/people $ ls
data      metadata
rghiware iceberg-warehouse/people $ cd data
rghiware people/data $ ls
00000-3-4117ce4f-ff56-410b-a248-c9ed512903c8-00001.parquet
```
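For context, the table was created along these lines (a minimal sketch; the `local` catalog name, launch confs, and the sample rows reflect my setup and may differ):

```python
# Launched pyspark with the Iceberg runtime and a local Hadoop catalog
# pointed at /tmp/iceberg-warehouse:
#
#   pyspark \
#     --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2 \
#     --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
#     --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
#     --conf "spark.sql.catalog.local.type=hadoop" \
#     --conf "spark.sql.catalog.local.warehouse=/tmp/iceberg-warehouse"

records = [
    (1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
    (2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
    # ...remaining rows as shown in the df.show() output further down
]
columns = ["id", "name", "age", "city", "create_ts"]
df = spark.createDataFrame(records, columns)

# Writes the table files to /tmp/iceberg-warehouse/people via the Hadoop catalog
df.writeTo("local.people").using("iceberg").create()
```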
- Created a YAML config file (`my_dataset_config.yaml`) with Iceberg as the source format and Hudi and Delta as the targets:

```
sourceFormat: ICEBERG
targetFormats:
  - HUDI
  - DELTA
datasets:
  - tableBasePath: file:///tmp/iceberg-warehouse/people
    tableDataPath: file:///tmp/iceberg-warehouse/people/data
    tableName: people
```

- Ran the one-table sync jar locally:

```
java -jar ./utilities-0.1.0-beta1-bundled.jar -d my_dataset_config.yaml
```

- Confirmed that `.hoodie` and `_delta_log` folders now exist under the table's data folder:

```
rghiware /tmp $ cd iceberg-warehouse/people/data
rghiware people/data $ ls -altr
total 16
-rw-r--r--@  1 rghiware  wheel    24 Mar 29 09:49 .00000-3-4117ce4f-ff56-410b-a248-c9ed512903c8-00001.parquet.crc
-rw-r--r--@  1 rghiware  wheel  1618 Mar 29 09:49 00000-3-4117ce4f-ff56-410b-a248-c9ed512903c8-00001.parquet
drwxr-xr-x@  4 rghiware  wheel   128 Mar 29 09:49 ..
drwxr-xr-x@  6 rghiware  wheel   192 Mar 29 09:54 .
drwxr-xr-x@ 15 rghiware  wheel   480 Mar 29 09:55 .hoodie
drwxr-xr-x@  4 rghiware  wheel   128 Mar 29 09:55 _delta_log
```

- Able to load the data with pySpark using the Delta format:

```
pyspark \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
```

```
>>> df = spark.read.format("delta").load("/tmp/iceberg-warehouse/people/data")
>>> df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- create_ts: string (nullable = true)

>>> df.show()
24/03/29 10:27:32 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+---+-------+---+----+-------------------+
| id|   name|age|city|          create_ts|
+---+-------+---+----+-------------------+
|  6|Charlie| 31| DFW|2023-08-29 00:00:00|
|  1|   John| 25| NYC|2023-09-28 00:00:00|
|  4| Andrew| 40| NYC|2023-10-28 00:00:00|
|  3|Michael| 35| ORD|2023-09-28 00:00:00|
|  5|    Bob| 28| SEA|2023-09-23 00:00:00|
|  2|  Emily| 30| SFO|2023-09-28 00:00:00|
+---+-------+---+----+-------------------+
>>>
```

- However, unable to load the same data with pySpark using the Hudi format; the read succeeds and the schema is correct, but the result is empty:

```
pyspark \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```

```
>>> df = spark.read.format("hudi").load("/tmp/iceberg-warehouse/people/data")
24/03/29 10:30:30 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/03/29 10:30:30 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
>>> df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- create_ts: string (nullable = true)

>>> df.show()
+---+----+---+----+---------+
| id|name|age|city|create_ts|
+---+----+---+----+---------+
+---+----+---+----+---------+
>>>
```

I can't figure out whether I'm missing something here or whether this is an issue with the xtable jar.
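One variant I have not tried yet (an assumption on my part, not a confirmed fix) is enabling the Hudi metadata table on read, since XTable keeps the original Iceberg parquet file name rather than rewriting the data into Hudi's own file-naming scheme:

```python
# Assumption (unverified): because the data file keeps its Iceberg name, the
# Hudi reader may need the metadata table under .hoodie/metadata to list the
# files, instead of relying on Hudi's name-convention-based file discovery.
df = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")  # enable metadata-table-based listing
    .load("/tmp/iceberg-warehouse/people/data")
)
df.show()
```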