rahul-ghiware opened a new issue, #404: URL: https://github.com/apache/incubator-xtable/issues/404
Using Spark 3.4.0, Scala 2.12, and the Iceberg Spark runtime 1.4.2:

- Created an Iceberg table in the `/tmp` folder (a sketch of the table creation follows the listing below):

```
rghiware ~ $ cd /tmp
rghiware /tmp $ cd iceberg-warehouse/people
rghiware iceberg-warehouse/people $ ls
data      metadata
rghiware iceberg-warehouse/people $ cd data
rghiware people/data $ ls
00000-3-4117ce4f-ff56-410b-a248-c9ed512903c8-00001.parquet
```
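For context, the table was created along these lines (a minimal sketch; the `local` catalog name, launch confs, and the sample rows reflect my setup and may differ):

```python
# Launched pyspark with the Iceberg runtime and a local Hadoop catalog
# pointed at /tmp/iceberg-warehouse:
#
#   pyspark \
#     --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2 \
#     --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
#     --conf "spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog" \
#     --conf "spark.sql.catalog.local.type=hadoop" \
#     --conf "spark.sql.catalog.local.warehouse=/tmp/iceberg-warehouse"

records = [
    (1, 'John', 25, 'NYC', '2023-09-28 00:00:00'),
    (2, 'Emily', 30, 'SFO', '2023-09-28 00:00:00'),
    # ...remaining rows as shown in the df.show() output further down
]
columns = ["id", "name", "age", "city", "create_ts"]
df = spark.createDataFrame(records, columns)

# Writes the table files to /tmp/iceberg-warehouse/people via the Hadoop catalog
df.writeTo("local.people").using("iceberg").create()
```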
- Created a YAML config file (`my_dataset_config.yaml`) with Iceberg as the source format and Hudi and Delta as the targets:

```
sourceFormat: ICEBERG
targetFormats:
  - HUDI
  - DELTA
datasets:
  - tableBasePath: file:///tmp/iceberg-warehouse/people
    tableDataPath: file:///tmp/iceberg-warehouse/people/data
    tableName: people
```

- Ran the one-table sync jar locally:

```
java -jar ./utilities-0.1.0-beta1-bundled.jar -d my_dataset_config.yaml
```

- Confirmed that `.hoodie` and `_delta_log` folders now exist under the table's data folder:

```
rghiware /tmp $ cd iceberg-warehouse/people/data
rghiware people/data $ ls -altr
total 16
-rw-r--r--@  1 rghiware  wheel    24 Mar 29 09:49 .00000-3-4117ce4f-ff56-410b-a248-c9ed512903c8-00001.parquet.crc
-rw-r--r--@  1 rghiware  wheel  1618 Mar 29 09:49 00000-3-4117ce4f-ff56-410b-a248-c9ed512903c8-00001.parquet
drwxr-xr-x@  4 rghiware  wheel   128 Mar 29 09:49 ..
drwxr-xr-x@  6 rghiware  wheel   192 Mar 29 09:54 .
drwxr-xr-x@ 15 rghiware  wheel   480 Mar 29 09:55 .hoodie
drwxr-xr-x@  4 rghiware  wheel   128 Mar 29 09:55 _delta_log
```

- Able to load the data with pySpark using the Delta format:

```
pyspark \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"
```

```
>>> df = spark.read.format("delta").load("/tmp/iceberg-warehouse/people/data")
>>> df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- create_ts: string (nullable = true)

>>> df.show()
24/03/29 10:27:32 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+---+-------+---+----+-------------------+
| id|   name|age|city|          create_ts|
+---+-------+---+----+-------------------+
|  6|Charlie| 31| DFW|2023-08-29 00:00:00|
|  1|   John| 25| NYC|2023-09-28 00:00:00|
|  4| Andrew| 40| NYC|2023-10-28 00:00:00|
|  3|Michael| 35| ORD|2023-09-28 00:00:00|
|  5|    Bob| 28| SEA|2023-09-23 00:00:00|
|  2|  Emily| 30| SFO|2023-09-28 00:00:00|
+---+-------+---+----+-------------------+
>>>
```

- However, unable to load the same data with pySpark using the Hudi format; the read succeeds and the schema is correct, but the result is empty:

```
pyspark \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```

```
>>> df = spark.read.format("hudi").load("/tmp/iceberg-warehouse/people/data")
24/03/29 10:30:30 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
24/03/29 10:30:30 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
>>> df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- city: string (nullable = true)
 |-- create_ts: string (nullable = true)

>>> df.show()
+---+----+---+----+---------+
| id|name|age|city|create_ts|
+---+----+---+----+---------+
+---+----+---+----+---------+
>>>
```

I can't figure out whether I'm missing something here or whether this is an issue with the xtable jar.
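One variant I have not tried yet (an assumption on my part, not a confirmed fix) is enabling the Hudi metadata table on read, since XTable keeps the original Iceberg parquet file name rather than rewriting the data into Hudi's own file-naming scheme:

```python
# Assumption (unverified): because the data file keeps its Iceberg name, the
# Hudi reader may need the metadata table under .hoodie/metadata to list the
# files, instead of relying on Hudi's name-convention-based file discovery.
df = (
    spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")  # enable metadata-table-based listing
    .load("/tmp/iceberg-warehouse/people/data")
)
df.show()
```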