*Vectorization notes (Nov 14)*
Attendees:
- Anjali
- Samarth
- Ryan
- Gautam
Overall things covered:
- Current state of performance
- How to start getting things from vectorized-read branch into master
- Next steps for complex types
Current performance:
- Reads for dictionary-encoded string columns, including fallback to
plain encoding, are around 30% faster than vectorized Spark reads
- Other primitive types are 5-7% slower
- Currently using Arrow version 0.14.1
- Upgrading to this improved performance
- Shade this version within Iceberg so it doesn't conflict with
Spark's dependency
*Things to do:*
- Merge Reader and ArrowVectorizedReaders into one and handle enabling
vectorization based on config and projection schema
(https://github.com/apache/incubator-iceberg/issues/520) - *Gautam*
- Separate glue code for Spark ColumnVector from Iceberg Arrow (added
new issue: https://github.com/apache/incubator-iceberg/issues/648) - *Samarth/Anjali*?
- Separate out Iceberg Arrow code into its own module
(https://github.com/apache/incubator-iceberg/issues/522)
- Unit tests for current work.
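
For the first item above, a minimal sketch of how the enable/disable
decision might look. All names here (VectorizationCheck, shouldVectorize,
FieldKind) are hypothetical and not taken from the vectorized-read branch.
It assumes vectorization is attempted only when the config flag is on and
every projected column is a primitive, since complex types are not yet
supported.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from the vectorized-read branch.
class VectorizationCheck {
  // Simplified stand-in for the type of a column in the projection schema.
  enum FieldKind { PRIMITIVE, STRUCT, LIST, MAP }

  // Assumption: vectorized reads are used only when the config flag is on
  // and every projected column is a primitive; otherwise the caller falls
  // back to the row-based reader.
  static boolean shouldVectorize(boolean enabledInConfig, List<FieldKind> projection) {
    if (!enabledInConfig) {
      return false;
    }
    for (FieldKind kind : projection) {
      if (kind != FieldKind.PRIMITIVE) {
        return false;
      }
    }
    return true;
  }
}
```

A merged reader could call a check like this once per scan and pick the
vectorized or row-based path from the result.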
*Discussion:*
What are next steps?
*Ryan*:
- Aim for vectorization work to make it into master. Work on separating
out code into PRs to master
- ColumnVector implementations
- Breaking up Type-wise decode implementations
- Separate out glue code for Iceberg Arrow and Spark ColumnVector
- Make sure licensing of code is honored (e.g. if code was copied from
Spark, attribute that contribution accordingly)
Question: Is the smallest unit of task planning a row group?
*Ryan*: Yes. Having said that, there's a provision in Spark to read partial
batches. Can use row counts in ColumnVector to express how much valid
data is in a batch.
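
To illustrate the row-count idea: a minimal sketch in which a
fixed-capacity batch carries a count of valid rows, so the last batch of a
row group can be only partly filled. PartialBatch and its methods are
hypothetical stand-ins, not Spark's or Iceberg's classes.

```java
// Hypothetical sketch: a stand-in for a columnar batch, not Spark's class.
class PartialBatch {
  private final int[] values; // backing storage, sized to the batch capacity
  private int numRows;        // number of leading entries that are valid

  PartialBatch(int capacity) {
    this.values = new int[capacity];
    this.numRows = 0;
  }

  // Copies up to `capacity` values from src, starting at offset, and
  // records how many rows were actually filled.
  int fill(int[] src, int offset) {
    int remaining = Math.max(0, src.length - offset);
    int n = Math.min(values.length, remaining);
    System.arraycopy(src, offset, values, 0, n);
    numRows = n;
    return n;
  }

  int numRows() {
    return numRows;
  }

  // Only rows below numRows are readable; the rest of the storage may hold
  // stale values from an earlier, fuller batch.
  int get(int row) {
    if (row < 0 || row >= numRows) {
      throw new IndexOutOfBoundsException("row " + row + " is not valid");
    }
    return values[row];
  }
}
```

With a capacity of 4 and a row group of 5 values, the first fill yields a
full batch and the second yields a batch whose row count is 1.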
Question: Can we start on complex types?
*Ryan*: Yes, shouldn't be blocked on anything major. Can start with
top-level structs right now (struct with 1 level of nesting).
Added a new issue https://github.com/apache/incubator-iceberg/issues/648 ,
please add this to the milestone
https://github.com/apache/incubator-iceberg/milestone/2
Lemme know if there was anything I missed or misquoted.
Regards,
-Gautam.