rdblue commented on pull request #828:
URL: https://github.com/apache/iceberg/pull/828#issuecomment-638486083


   Running the current tests with coverage shows a few places that are not getting tested:
   * Code paths for missing columns, because there are no projection tests.
   * Struct code paths, because there are no tests for nested structs, only top-level columns.
   * `VectorizedParquetDefinitionLevelReader.setNulls` -- it looks like the random data doesn't produce enough consecutive null values for this to get used.
   * `DictionaryDecimalBinaryAccessor` and `VectorizedDictionaryEncodedParquetValuesReader.readBatchOfDictionaryEncodedFixedWidthBinary` -- I'm not sure why, but it looks like dictionary-encoded decimals stored as fixed are not getting tested. I would start by adding assertions to the tests that the Parquet files are written as you expect (all dictionary-encoded, or fallback).
   * All code paths where `setArrowValidityVector` is true. I think we should have tests for these as well.
   * Code paths for timestamp-millis -- this is probably okay.
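   For the `setNulls` gap, one option is to generate test data with guaranteed runs of consecutive nulls instead of relying on random values. This is just a sketch; the class and method names are hypothetical, and a real test would feed these values through the Parquet writer used by the existing tests:

   ```java
   import java.util.Arrays;

   public class NullRunData {
     /**
      * Hypothetical generator: alternates fixed-length runs of nulls and
      * non-null values so a vectorized reader is forced through its
      * consecutive-null path. Shape and names are illustrative only.
      */
     static Integer[] withNullRuns(int size, int runLength) {
       Integer[] values = new Integer[size];
       for (int i = 0; i < size; i++) {
         // even-numbered runs are null, odd-numbered runs hold values
         values[i] = ((i / runLength) % 2 == 0) ? null : i;
       }
       return values;
     }

     public static void main(String[] args) {
       System.out.println(Arrays.toString(withNullRuns(8, 4)));
       // -> [null, null, null, null, 4, 5, 6, 7]
     }
   }
   ```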
   
   I wrote a test for nested structs to close that coverage gap, but it currently fails. Here's the test case:
   
   ```java
     @Test
     public void testNestedStruct() throws IOException {
       writeAndValidate(TypeUtil.assignIncreasingFreshIds(
           new Schema(required(1, "struct", SUPPORTED_PRIMITIVES))));
     }
   ```
   ```
   java.lang.ClassCastException: Cannot cast org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader to org.apache.iceberg.arrow.vectorized.VectorizedArrowReader
        at java.lang.Class.cast(Class.java:3369)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:545)
        at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
        at java.util.stream.ReferencePipeline.toArray(ReferencePipeline.java:438)
        at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.<init>(ColumnarBatchReader.java:45)
   ```

