mayursrivastava opened a new pull request #2286:
URL: https://github.com/apache/iceberg/pull/2286


   This is my first Iceberg pull request - I'm not fully aware of the 
contributing conventions of this repo, so let me know if any changes are needed 
in the PR. 
   
   I've refactored some code from the Spark vectorized reader and added an 
ArrowReader, which is a vectorized reader in Iceberg core. This is a follow-up 
to this discussion on the mailing list: 
https://lists.apache.org/thread.html/r4162810433a81a751ce15d5d82b98c5be698790176dc686c93fc3b81%40%3Cdev.iceberg.apache.org%3E
 
   
   About the ArrowReader:
   
   1. I’ve put the ArrowReader in the iceberg-data module because it needs 
access to the Iceberg table scan. Let me know if the reader needs to be moved.
   
   2. I had to add a dependency on ‘iceberg-arrow’ to the iceberg-data module. 
Specifically for the ArrowReaderTest, I had to add the following. Let me know 
if there is a better way to do this.
       compileOnly("org.apache.arrow:arrow-vector") {
         exclude group: 'io.netty', module: 'netty-buffer'
         exclude group: 'com.google.code.findbugs', module: 'jsr305'
       }
   
   3. Most of the code in ArrowReader is taken from the Spark vectorized 
reader. I think there is potential to share ‘BaseArrowFileScanTaskReader’ 
between the two versions, but I have not attempted that yet.
   
   4. ArrowReader returns an iterator of VectorSchemaRoot; the behavior is 
explained in the Javadoc. See the usage sketch after this list.
   
   5. Some small changes were needed in IcebergGenerics, to expose the table 
scan object, and in VectorizedArrowReader, to allocate different Arrow 
timestamp vectors for types with and without a timezone (see the timestamp 
sketch after this list).
   
   6. All prepush Gradle tests pass except one that is still running 
(TestFlinkIcebergSink, which seems very slow).
   
   7. I haven't run any performance tests on the implementation yet; I plan 
to do so this week.
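
   To make point 4 concrete, here is a minimal usage sketch of the reader. The 
constructor arguments and the iteration style are illustrative assumptions, 
not the exact signatures from this PR:

       import java.io.IOException;
       import java.util.Iterator;
       import org.apache.arrow.vector.VectorSchemaRoot;
       import org.apache.iceberg.Table;
       import org.apache.iceberg.TableScan;

       public class ArrowReaderExample {
         // ArrowReader is the class added by this PR (iceberg-data module).
         static void readAsArrow(Table table) throws IOException {
           TableScan scan = table.newScan().select("id", "data");
           // Assumed: a batch-size constructor parameter, and that the reader
           // is AutoCloseable and exposes an Iterator of VectorSchemaRoot.
           try (ArrowReader reader = new ArrowReader(scan, 1024)) {
             Iterator<VectorSchemaRoot> batches = reader.iterator();
             while (batches.hasNext()) {
               VectorSchemaRoot root = batches.next();
               // Each batch is columnar; vectors are accessed by name or
               // position, e.g. root.getVector("id").
               System.out.println("rows in batch: " + root.getRowCount());
             }
           }
         }
       }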
   
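   Regarding point 5, this is the distinction in question, shown with the 
standard Arrow vector classes (a sketch of mine, not code from this PR):

       import org.apache.arrow.memory.RootAllocator;
       import org.apache.arrow.vector.TimeStampMicroTZVector;
       import org.apache.arrow.vector.TimeStampMicroVector;

       public class TimestampVectorSketch {
         public static void main(String[] args) {
           try (RootAllocator allocator = new RootAllocator();
                // Iceberg 'timestamptz' (with zone): zone-aware microseconds.
                TimeStampMicroTZVector withZone =
                    new TimeStampMicroTZVector("ts_tz", allocator, "UTC");
                // Iceberg 'timestamp' (without zone): zone-less microseconds.
                TimeStampMicroVector withoutZone =
                    new TimeStampMicroVector("ts", allocator)) {
             withZone.allocateNew(1);
             withoutZone.allocateNew(1);
           }
         }
       }
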
   The following are some limitations of and open questions about this 
implementation:
   
   1. The Arrow vector type is coupled to the physical data type in the 
Parquet file: when a column contains a constant value, the column is 
dictionary-encoded and the returned Arrow type is int32, irrespective of the 
Iceberg data type. I think the Arrow vector type should be consistent with 
the logical Iceberg data type (and not change with the physical data type). 
The test ArrowReaderTest.testReadAllWithConstantRecords() is currently 
ignored because of this.
   
   2. Type promotion does not work: in the ArrowReaderTest, the data for 
column ‘int_promotion’ was written as int and the type was then promoted from 
int to long, but the Arrow reader still returns an IntVector. I think the 
Arrow vector type should be consistent with the promoted logical Iceberg data 
type (see the sketch after this list).
   
   3. Data type limitations:
     a. Types not tested: UUID, FixedType, DecimalType. In the ArrowReaderTest, 
the Parquet write was failing for these data types (due to a 
NullPointerException in ParquetMetadataConverter.addRowGroup: 
columnMetadata.getStatistics() was null). Are there existing unit tests that 
write these types to Parquet?
     b. Types not supported: TimeType, ListType, MapType, StructType. What is 
the path to adding Arrow support for these data types?
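
   To make limitations 1 and 2 concrete, here is an illustrative check. The 
expectation encoded below is the behavior I'd like to see, not the current 
behavior, and the column name comes from the test described above:

       import org.apache.arrow.vector.BigIntVector;
       import org.apache.arrow.vector.FieldVector;
       import org.apache.arrow.vector.VectorSchemaRoot;

       public class VectorTypeCheck {
         // After 'int_promotion' is promoted from int to long, the logical
         // Iceberg type is long, so the expected Arrow vector is BigIntVector.
         // Today the reader still returns an IntVector (and, for constant
         // dictionary-encoded columns, an int32 vector).
         static boolean followsLogicalType(VectorSchemaRoot root) {
           FieldVector vector = root.getVector("int_promotion");
           return vector instanceof BigIntVector;
         }
       }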
   

