kbendick opened a new issue #3453:
URL: https://github.com/apache/iceberg/issues/3453


   While testing the 0.12.1 release candidate, I tried running the `add_files` 
procedure with an ORC table.
   
   I ran it once, so the files were imported. I then dropped the table, created 
a new table, and tried to reimport from the same path.
   
   I got the following `NumberFormatException` when trying to parse the number 
of imported files.
   
   ```
   scala> spark.sql("CALL hive.system.add_files(table => 'hive.default.test2', 
source_table => '`orc`.`hdfs://hdfs-box:8020/user/hive/warehouse/orc`')").show
   java.lang.NumberFormatException: null
     at java.lang.Long.parseLong(Long.java:552)
     at java.lang.Long.parseLong(Long.java:631)
     at 
org.apache.iceberg.spark.procedures.AddFilesProcedure.lambda$importToIceberg$1(AddFilesProcedure.java:135)
     at 
org.apache.iceberg.spark.procedures.BaseProcedure.execute(BaseProcedure.java:85)
     at 
org.apache.iceberg.spark.procedures.BaseProcedure.modifyIcebergTable(BaseProcedure.java:74)
     at 
org.apache.iceberg.spark.procedures.AddFilesProcedure.importToIceberg(AddFilesProcedure.java:121)
     at 
org.apache.iceberg.spark.procedures.AddFilesProcedure.call(AddFilesProcedure.java:108)
     at 
org.apache.spark.sql.execution.datasources.v2.CallExec.run(CallExec.scala:33)
     at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
     at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
     at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:46)
     at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
     at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
     at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
     at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
     at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
     at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
     at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
     ... 47 elided
     ```
     
     The path existed in HDFS, but it contained no files, so `null` appears 
to have been returned for the file count.
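
   The failure mode can be reproduced without Iceberg at all, since 
`Long.parseLong` rejects a `null` argument with exactly this exception (on 
the JDK 8 build in the trace above, the message is `null`):

   ```java
   // Minimal standalone reproduction: Long.parseLong(null) throws
   // NumberFormatException, matching the top of the stack trace above.
   public class Repro {
     public static void main(String[] args) {
       try {
         Long.parseLong((String) null);
       } catch (NumberFormatException e) {
         // On JDK 8 the message is "null", as seen in the reported trace.
         System.out.println("NumberFormatException: " + e.getMessage());
       }
     }
   }
   ```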
   
     This issue still seems to exist in current Iceberg: 
https://github.com/apache/iceberg/blob/1c158df94bf004d43867939841e75e4bfb941c16/spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/procedures/AddFilesProcedure.java#L144
     
     We should fail early if there are no files present in the path-based 
directory (or at least report that the number of files is zero).
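
   A hedged sketch of what that fail-early check could look like. The 
`parseFileCount` helper and the `numFiles` key are illustrative only, not the 
actual Iceberg internals; the point is to replace the bare 
`NumberFormatException: null` with a descriptive error (or a zero count) 
before calling `Long.parseLong`:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class AddFilesGuard {
     // Hypothetical guard: validate the stats map before parsing, so an empty
     // source directory produces a clear error instead of
     // Long.parseLong(null) -> NumberFormatException.
     static long parseFileCount(Map<String, String> stats) {
       String numFiles = stats.get("numFiles"); // may be null if no files found
       if (numFiles == null) {
         throw new IllegalArgumentException(
             "Cannot add files: source location contains no data files");
       }
       return Long.parseLong(numFiles);
     }

     public static void main(String[] args) {
       Map<String, String> stats = new HashMap<>();
       stats.put("numFiles", "3");
       System.out.println(parseFileCount(stats)); // prints 3

       try {
         // Empty stats simulate an empty source directory: fails early
         // with a descriptive message rather than NumberFormatException.
         parseFileCount(new HashMap<>());
       } catch (IllegalArgumentException e) {
         System.out.println(e.getMessage());
       }
     }
   }
   ```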


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


