Replies inline.

On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:

> Thanks for responding Ryan,
>
> A couple of follow-up questions on ParquetValueReader for Arrow:
>
> I'd like to start by testing Arrow with readers for primitive types and
> incrementally add struct/array support; ArrowWriter [1] also doesn't
> currently have converters for the map type. How can I default these
> types to regular materialization while supporting Arrow-based reads for
> primitives?
>

We should look at what Spark does to handle maps.

I think we should get the prototype working with test cases that don't have
maps, structs, or lists. Just getting primitives working is a good start and
won't hit these problems.
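
For example, a primitives-only test schema could be as simple as this (just a
sketch; the class name and field choices here are made up):

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

public class ArrowReadTestSchemas {
  // No structs, lists, or maps, so nothing needs a row-based fallback
  public static final Schema PRIMITIVES_ONLY = new Schema(
      Types.NestedField.required(1, "id", Types.LongType.get()),
      Types.NestedField.optional(2, "data", Types.StringType.get()),
      Types.NestedField.optional(3, "ts", Types.TimestampType.withZone()),
      Types.NestedField.optional(4, "value", Types.DoubleType.get()));
}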


> Let me know if this makes sense...
>
> - I extend PrimitiveReader (for Arrow) to load primitive values into
> ArrowColumnVectors of the corresponding column types by iterating over the
> underlying ColumnIterator *n times*, where n is the batch size.
>

Sounds good to me. I'm not sure about extending vs wrapping because I'm not
too familiar with the Arrow APIs.
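
Roughly, I'd picture the batched read loop for a single long column looking
something like this (untested sketch; I'm using a plain PrimitiveIterator.OfLong
to stand in for the ColumnIterator, and the class name is made up):

import java.util.PrimitiveIterator;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;

class ArrowLongColumnReader {
  private final BufferAllocator allocator;

  ArrowLongColumnReader(BufferAllocator allocator) {
    this.allocator = allocator;
  }

  // Read up to batchSize values from the column into one Arrow vector
  ColumnVector readBatch(PrimitiveIterator.OfLong column, int batchSize) {
    BigIntVector vector = new BigIntVector("col", allocator);
    vector.allocateNew(batchSize);

    int rows = 0;
    while (rows < batchSize && column.hasNext()) {
      vector.set(rows, column.nextLong());  // one value per row; nulls not handled here
      rows += 1;
    }
    vector.setValueCount(rows);

    // Wrap the Arrow vector so Spark can consume it as a ColumnVector
    return new ArrowColumnVector(vector);
  }
}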


> - Reader.newParquetIterable() maps primitive column types to the newly
> added ArrowParquetValueReader, but for other types (nested types, etc.) it
> uses the current *InternalRow*-based ValueReaders
>

Sounds good for primitives, but I would just leave the nested types
unimplemented for now.
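
For the dispatch, something like this is probably enough to start
(illustrative only; ArrowPrimitiveReader is a placeholder for whatever the new
reader ends up being called):

import org.apache.iceberg.parquet.ParquetValueReader;
import org.apache.iceberg.types.Type;
import org.apache.parquet.column.ColumnDescriptor;

class ArrowReaderBuilder {
  // Route primitive columns to the new Arrow reader and fail loudly on
  // nested types until they are supported
  static ParquetValueReader<?> readerFor(Type icebergType, ColumnDescriptor desc) {
    if (icebergType.isPrimitiveType()) {
      return new ArrowPrimitiveReader(desc);  // placeholder name
    }
    throw new UnsupportedOperationException(
        "Arrow read path doesn't support nested type: " + icebergType);
  }
}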


> - Stitch the column vectors together to create a ColumnarBatch (since the
> *SupportsScanColumnarBatch* mixin currently expects this) .. *although I'm
> a bit lost on how the stitching of columns happens currently*, and how
> the ArrowColumnVectors could be stitched alongside regular columns that
> don't have Arrow-based support?
>

I don't think that you can mix regular columns and Arrow columns. It has to
be all one or the other. That's why it's easier to start with primitives,
then add structs, then lists, and finally maps.
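
For what it's worth, the "stitching" isn't much more than handing Spark one
ColumnVector per projected column, in schema order, plus a row count. A
minimal sketch:

import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class BatchStitcher {
  // All vectors must cover the same set of rows, which is why mixing
  // Arrow-backed and row-based columns in one batch doesn't work
  static ColumnarBatch stitch(ArrowColumnVector[] columns, int numRows) {
    ColumnarBatch batch = new ColumnarBatch(columns);
    batch.setNumRows(numRows);
    return batch;
  }
}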


> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that
> DataSourceV2ScanExec starts using ColumnarBatch scans
>

We will probably need two paths: one for columnar batches and one for
row-based reads. That doesn't need to be done right away; what you already
have in your working copy makes sense as a start.
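
Roughly, the two paths could hang off the same reader like this (a sketch
against the Spark 2.4 DataSourceV2 API; the planning bodies are stand-ins):

import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class ArrowCapableReader implements SupportsScanColumnarBatch {
  private final StructType readSchema;
  private final boolean allPrimitives;

  ArrowCapableReader(StructType readSchema, boolean allPrimitives) {
    this.readSchema = readSchema;
    this.allPrimitives = allPrimitives;
  }

  @Override
  public StructType readSchema() {
    return readSchema;
  }

  @Override
  public boolean enableBatchRead() {
    // Use the columnar path only when every projected column has an Arrow reader
    return allPrimitives;
  }

  @Override
  public List<InputPartition<ColumnarBatch>> planBatchInputPartitions() {
    return Collections.emptyList();  // would return Arrow-backed read tasks
  }

  @Override
  public List<InputPartition<InternalRow>> planInputPartitions() {
    return Collections.emptyList();  // row-based fallback for nested types
  }
}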


> That's a lot of questions! :-) But I hope I'm making sense.
>
> -Gautam.
>
>
>
> [1] -
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>

-- 
Ryan Blue
Software Engineer
Netflix
