szehon-ho opened a new pull request #3760:
URL: https://github.com/apache/iceberg/pull/3760


   A certain string in the input data, with a prefix of more than 16 characters that cannot be parsed into valid code points (for example unpaired high/low surrogates), triggers a NullPointerException when the Parquet writer flushes and computes its footer metrics. I reproduced this in the accompanying unit test.
   
   ```
   java.lang.NullPointerException
        at org.apache.iceberg.parquet.ParquetUtil.toBufferMap(ParquetUtil.java:307)
        at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:166)
        at org.apache.iceberg.parquet.ParquetUtil.footerMetrics(ParquetUtil.java:88)
        at org.apache.iceberg.parquet.ParquetWriter.metrics(ParquetWriter.java:126)
        at org.apache.iceberg.io.DataWriter.close(DataWriter.java:89)
        at org.apache.iceberg.parquet.TestParquetDataWriter.testCorruptString(TestParquetDataWriter.java:158)
   ```
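
   For reference, here is a standalone sketch of this kind of reproduction. The schema, field names, and builder calls below are illustrative and are not necessarily what the test added in TestParquetDataWriter looks like.

   ```java
   import java.io.File;
   import java.io.IOException;
   import org.apache.iceberg.Files;
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.data.GenericRecord;
   import org.apache.iceberg.data.parquet.GenericParquetWriter;
   import org.apache.iceberg.io.DataWriter;
   import org.apache.iceberg.parquet.Parquet;
   import org.apache.iceberg.types.Types;
   import org.junit.Test;

   public class TestCorruptStringSketch {
     private static final Schema SCHEMA = new Schema(
         Types.NestedField.required(1, "id", Types.LongType.get()),
         Types.NestedField.optional(2, "data", Types.StringType.get()));

     @Test
     public void writeStringWithUnparseablePrefix() throws IOException {
       // Build a string whose first 20 chars are unpaired high surrogates, so the
       // default truncate(16) metrics window contains no valid code point and the
       // truncated upper bound for "data" comes back null.
       StringBuilder sb = new StringBuilder();
       for (int i = 0; i < 20; i++) {
         sb.append((char) 0xD800);
       }
       sb.append("abc");

       GenericRecord record = GenericRecord.create(SCHEMA);
       record.setField("id", 1L);
       record.setField("data", sb.toString());

       File file = File.createTempFile("corrupt-string", ".parquet");
       file.delete(); // the writer creates the output file itself

       DataWriter<GenericRecord> writer = Parquet.writeData(Files.localOutput(file))
           .schema(SCHEMA)
           .createWriterFunc(GenericParquetWriter::buildWriter)
           .withSpec(PartitionSpec.unpartitioned())
           .build();
       try {
         writer.write(record);
       } finally {
         writer.close(); // before the fix, this threw the NPE from footerMetrics/toBufferMap
       }
     }
   }
   ```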
   
   The problem is that UnicodeUtil and BinaryUtil return null when they fail to compute a truncated upper bound for the string/binary value. That null ends up in the upperBounds map and triggers an NPE in the ParquetUtil.toBufferMap method, which calls .value() on the literal returned by entry.getValue() without checking for null.
   
   ```
     private static Map<Integer, ByteBuffer> toBufferMap(Schema schema, Map<Integer, Literal<?>> map) {
       Map<Integer, ByteBuffer> bufferMap = Maps.newHashMap();
       for (Map.Entry<Integer, Literal<?>> entry : map.entrySet()) {
         // NPE here: entry.getValue() is null when no truncated bound could be produced
         bufferMap.put(entry.getKey(),
             Conversions.toByteBuffer(schema.findType(entry.getKey()), entry.getValue().value()));
       }
       return bufferMap;
     }
   ```
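
   One defensive option is to skip entries whose literal is null when building the buffer map, as in the sketch below. This is illustrative only and not necessarily the approach taken in this patch; the null could equally be filtered out earlier, where footerMetrics populates the lowerBounds/upperBounds maps, so that those maps never contain null literals.

   ```java
     // Null-tolerant variant of the method above (sketch only): columns whose
     // truncated bound could not be computed are simply left out of the buffer map.
     private static Map<Integer, ByteBuffer> toBufferMap(Schema schema, Map<Integer, Literal<?>> map) {
       Map<Integer, ByteBuffer> bufferMap = Maps.newHashMap();
       for (Map.Entry<Integer, Literal<?>> entry : map.entrySet()) {
         if (entry.getValue() != null) {
           bufferMap.put(entry.getKey(),
               Conversions.toByteBuffer(schema.findType(entry.getKey()), entry.getValue().value()));
         }
       }
       return bufferMap;
     }
   ```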

