The vectorized reader in Spark is only used if the schema is flat.
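For context, here is a minimal user-side sketch of what "flat" means in this
check: no top-level field may be a struct, array, or map. This is an
approximation for illustration (the object and method names are made up), not
Spark's actual implementation.

import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

object FlatSchemaCheck {
  // Approximation of the "flat schema" condition: the Parquet vectorized
  // reader is only considered when no top-level field is a nested type.
  def isFlat(schema: StructType): Boolean =
    schema.fields.forall { f =>
      !f.dataType.isInstanceOf[StructType] &&
        !f.dataType.isInstanceOf[ArrayType] &&
        !f.dataType.isInstanceOf[MapType]
    }
}
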
On Fri, Jun 14, 2019 at 5:45 PM Gautam <gautamkows...@gmail.com> wrote:

> Agree with the approach of getting this working for primitive types only.
> I'll work on a prototype assuming just primitive types for now.
>
>> I don't think that you can mix regular columns and Arrow columns. It has
>> to be all one or the other.
>
> I was just curious about this because the vanilla Spark reader (with
> vectorization) doesn't support batching on nested fields today, but it's
> still able to do vectorization on data with nested and non-nested fields.
> This is not needed for my PoC, but it would be good to know whether we can
> leverage it for our implementation. Either way, I'll get to it when this
> step is done.
>
> On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Replies inline.
>>
>> On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Thanks for responding, Ryan.
>>>
>>> A couple of follow-up questions on ParquetValueReader for Arrow..
>>>
>>> I'd like to start by testing Arrow out with readers for primitive types
>>> and incrementally add in Struct/Array support; also, ArrowWriter [1]
>>> currently doesn't have converters for the map type. How can I default
>>> these types to regular materialization while supporting Arrow-based
>>> reads for primitives?
>>
>> We should look at what Spark does to handle maps.
>>
>> I think we should get the prototype working with test cases that don't
>> have maps, structs, or lists. Just getting primitives working is a good
>> start and just won't hit these problems.
>>
>>> Let me know if this makes sense...
>>>
>>> - I extend PrimitiveReader (for Arrow) so that it loads primitive types
>>> into ArrowColumnVectors of the corresponding column types by iterating
>>> over the underlying ColumnIterator *n times*, where n is the batch size.
>>
>> Sounds good to me. I'm not sure about extending vs. wrapping because I'm
>> not too familiar with the Arrow APIs.
>>
>>> - Reader.newParquetIterable() maps primitive column types to the newly
>>> added ArrowParquetValueReader, but for other types (nested types, etc.)
>>> uses the current *InternalRow*-based ValueReaders.
>>
>> Sounds good for primitives, but I would just leave the nested types
>> un-implemented for now.
>>
>>> - Stitch the column vectors together to create a ColumnarBatch (since
>>> the *SupportsScanColumnarBatch* mixin currently expects this), *although
>>> I'm a bit lost on how the stitching of columns happens currently* .. and
>>> how could the ArrowColumnVectors be stitched alongside regular columns
>>> that don't have Arrow-based support?
>>
>> I don't think that you can mix regular columns and Arrow columns. It has
>> to be all one or the other. That's why it's easier to start with
>> primitives, then add structs, then lists, and finally maps.
>>
>>> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that
>>> DataSourceV2ScanExec starts using ColumnarBatch scans.
>>
>> We will probably need two paths: one for columnar batches and one for
>> row-based reads. That doesn't need to be done right away, and what you
>> already have in your working copy makes sense as a start.
>>
>>> That's a lot of questions! :-) But I hope I'm making sense.
>>>
>>> -Gautam.
>>>
>>> [1] -
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
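To make the batching and stitching steps outlined above concrete, here is a
minimal sketch (illustrative names only, not Iceberg's actual reader code) of
loading one batch of primitive values into an Arrow vector, wrapping it as a
Spark ArrowColumnVector, and stitching the per-column vectors into a
ColumnarBatch. The input array stands in for the values produced by iterating
the underlying ColumnIterator batch-size times.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

object ArrowBatchSketch {
  // Build a single-column ColumnarBatch from one batch of int values.
  def intBatch(values: Array[Int]): ColumnarBatch = {
    val allocator = new RootAllocator(Long.MaxValue)

    // Load the batch into an Arrow vector of the matching type.
    val vector = new IntVector("id", allocator)
    vector.allocateNew(values.length)
    values.zipWithIndex.foreach { case (v, i) => vector.setSafe(i, v) }
    vector.setValueCount(values.length)

    // Wrap the Arrow vector for Spark, then stitch all column vectors
    // (here just one) into a ColumnarBatch.
    val columns: Array[ColumnVector] = Array(new ArrowColumnVector(vector))
    val batch = new ColumnarBatch(columns)
    batch.setNumRows(values.length)
    batch
  }
}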
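And a rough skeleton, assuming the Spark 2.4 DataSourceV2 interfaces and using
made-up class names, of how a reader could expose such batches so that
DataSourceV2ScanExec switches to ColumnarBatch scans:

import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsScanColumnarBatch}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

// Reader skeleton that advertises columnar scans: each read task is an
// InputPartition[ColumnarBatch], and Spark calls planBatchInputPartitions()
// instead of planInputPartitions() while enableBatchRead() (default true)
// holds.
class ArrowBatchReader(schema: StructType, tasks: Seq[InputPartition[ColumnarBatch]])
    extends SupportsScanColumnarBatch {

  override def readSchema(): StructType = schema

  override def planBatchInputPartitions(): JList[InputPartition[ColumnarBatch]] =
    tasks.asJava
}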