[I] Java api appendFile leads to null value of array/struct column [iceberg]

via GitHub Thu, 17 Apr 2025 00:17:58 -0700


EstherLCode opened a new issue, #12825:
URL: https://github.com/apache/iceberg/issues/12825


   ### Apache Iceberg version
   
   1.4.3
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Hi there,
   I have a Parquet file called `a.parquet` that was generated from JSON. The schema of `a.parquet` is:
   `
   message schema {
     optional binary stringField (STRING);
     optional group objectField {
       optional binary innerString (STRING);
       optional group innerArray (LIST) {
         repeated group list {
           optional binary element (STRING);
         }
       }
       optional group innerObject {
         optional binary nestedString (STRING);
         optional binary nestedBoolean (STRING);
         optional group nestedArray (LIST) {
           repeated group list {
             optional group element {
               optional binary f1 (STRING);
               optional binary f2 (STRING);
               optional binary f4 (STRING);
             }
           }
         }
       }
     }
     optional group arrayField (LIST) {
       repeated group list {
         optional group element {
           optional binary nestedString (STRING);
           optional binary nestedNumber (STRING);
           optional binary nestedBoolean (STRING);
         }
       }
     }
   }
   `
   I used this schema to create an Iceberg table, `db1.tb1`.
   The content of `a.parquet` is:
   `
   scala> spark.read.parquet("/baseDir/a.parquet").show(false)
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   |stringField|objectField                                                                                    |arrayField                                          |
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   |value      |{innerValue, [1, 2, 3], {v1, true, [{v1, null, null}, {null, 1.0, true}, {null, null, false}]}}|[{nestedValue, 42, null}, {nestedValue, null, true}]|
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   `
   
   I tried to add this Parquet file to the Iceberg table `db1.tb1` using the **appendFile** function:
   `
   val table = catalog.loadTable(TableIdentifier.of("db1", "tb1"))
   val dataFile = DataFiles.builder(table.spec())
       .withPath("/baseDir/a.parquet")
       .withFormat(FileFormat.PARQUET)
       .withFileSizeInBytes(size)
       .withRecordCount(spark.read.parquet("/baseDir/a.parquet").count)
       .build()
   table.newAppend()
       .appendFile(dataFile)
       .commit()
   `
   But when I queried the table, I got **null** for the columns `objectField` and `arrayField`:
   `
   scala> spark.sql("select * from db1.tb1").show(false)
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   |stringField|objectField                                                                                    |arrayField                                          |
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   |value      |null|null|
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   `
   The odd thing is that when I then tried the SQL procedure `add_files`, it worked as expected and also fixed the old record.
   `
   spark.sql("""call spark_catalog.system.add_files(
     table => 'spark_catalog.db1.tb1',
     source_table => '`parquet`.`hdfs:///baseDir/a.parquet`')""")
   `
   After this command, I got two complete and correct records:
   `
   scala> spark.sql("select * from db1.tb1").show(false)
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   |stringField|objectField                                                                                    |arrayField                                          |
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   |value      |{innerValue, [1, 2, 3], {v1, true, [{v1, null, null}, {null, 1.0, true}, {null, null, false}]}}|[{nestedValue, 42, null}, {nestedValue, null, true}]|
   |value      |{innerValue, [1, 2, 3], {v1, true, [{v1, null, null}, {null, 1.0, true}, {null, null, false}]}}|[{nestedValue, 42, null}, {nestedValue, null, true}]|
   +-----------+-----------------------------------------------------------------------------------------------+----------------------------------------------------+
   `
   
   Is my call to `appendFile` correct? And why do `appendFile` and `add_files` produce different results?
   
   Thanks a lot.
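
   One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: a Parquet file written outside Iceberg carries no Iceberg field IDs, and `appendFile` only registers the file's metadata. The `add_files` procedure may additionally set the `schema.name-mapping.default` table property, which would also explain why the earlier record started reading correctly. A minimal sketch for inspecting that property and, if it is absent, deriving a mapping from the current table schema (`catalog` is assumed to be the same instance used in the `appendFile` snippet above):

   ```scala
   import org.apache.iceberg.TableProperties
   import org.apache.iceberg.catalog.TableIdentifier
   import org.apache.iceberg.mapping.{MappingUtil, NameMappingParser}

   // Assumption: `catalog` is the Catalog already used for the appendFile call.
   val table = catalog.loadTable(TableIdentifier.of("db1", "tb1"))

   // Look up the default name mapping; the Java map returns null when the key is missing.
   val existing = Option(table.properties().get(TableProperties.DEFAULT_NAME_MAPPING))
   println(s"schema.name-mapping.default = ${existing.getOrElse("<not set>")}")

   // If no mapping is set, build one from the table schema and store it as a table property.
   if (existing.isEmpty) {
     val mapping = MappingUtil.create(table.schema())
     table.updateProperties()
       .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
       .commit()
   }
   ```

   If the property is missing right after `appendFile` and present after `add_files`, that difference would be consistent with the results shown above.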
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

