The vectorized reader in Spark is only used if the schema is flat.
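For context, here is a minimal user-side sketch of what "flat" means in this
check: no top-level field may be a struct, array, or map. This is an
approximation for illustration (the object and method names are made up), not
Spark's actual implementation.

import org.apache.spark.sql.types.{ArrayType, MapType, StructType}

object FlatSchemaCheck {
  // Approximation of the "flat schema" condition: the Parquet vectorized
  // reader is only considered when no top-level field is a nested type.
  def isFlat(schema: StructType): Boolean =
    schema.fields.forall { f =>
      !f.dataType.isInstanceOf[StructType] &&
        !f.dataType.isInstanceOf[ArrayType] &&
        !f.dataType.isInstanceOf[MapType]
    }
}
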
On Fri, Jun 14, 2019 at 5:45 PM Gautam <gautamkows...@gmail.com> wrote:

> Agree with the approach of getting this working for primitive types only.
> I'll work on a prototype assuming just primitive types for now.
>
>> I don't think that you can mix regular columns and Arrow columns. It has
>> to be all one or the other.
>
> I was just curious about this because the vanilla Spark reader (with
> vectorization) doesn't support batching on nested fields today, but it's
> still able to do vectorization on data with nested and non-nested fields.
> This is not needed for my PoC, but it would be good to know whether we can
> leverage it for our implementation. Either way, I'll get to it when this
> step is done.
>
> On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:
>
>> Replies inline.
>>
>> On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Thanks for responding, Ryan.
>>>
>>> A couple of follow-up questions on ParquetValueReader for Arrow..
>>>
>>> I'd like to start by testing Arrow out with readers for primitive types
>>> and incrementally add in Struct/Array support; also, ArrowWriter [1]
>>> currently doesn't have converters for the map type. How can I default
>>> these types to regular materialization while supporting Arrow-based
>>> reads for primitives?
>>
>> We should look at what Spark does to handle maps.
>>
>> I think we should get the prototype working with test cases that don't
>> have maps, structs, or lists. Just getting primitives working is a good
>> start and just won't hit these problems.
>>
>>> Let me know if this makes sense...
>>>
>>> - I extend PrimitiveReader (for Arrow) so that it loads primitive types
>>> into ArrowColumnVectors of the corresponding column types by iterating
>>> over the underlying ColumnIterator *n times*, where n is the batch size.
>>
>> Sounds good to me. I'm not sure about extending vs. wrapping because I'm
>> not too familiar with the Arrow APIs.
>>
>>> - Reader.newParquetIterable() maps primitive column types to the newly
>>> added ArrowParquetValueReader, but for other types (nested types, etc.)
>>> uses the current *InternalRow*-based ValueReaders.
>>
>> Sounds good for primitives, but I would just leave the nested types
>> un-implemented for now.
>>
>>> - Stitch the column vectors together to create a ColumnarBatch (since
>>> the *SupportsScanColumnarBatch* mixin currently expects this), *although
>>> I'm a bit lost on how the stitching of columns happens currently* .. and
>>> how could the ArrowColumnVectors be stitched alongside regular columns
>>> that don't have Arrow-based support?
>>
>> I don't think that you can mix regular columns and Arrow columns. It has
>> to be all one or the other. That's why it's easier to start with
>> primitives, then add structs, then lists, and finally maps.
>>
>>> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that
>>> DataSourceV2ScanExec starts using ColumnarBatch scans.
>>
>> We will probably need two paths: one for columnar batches and one for
>> row-based reads. That doesn't need to be done right away, and what you
>> already have in your working copy makes sense as a start.
>>
>>> That's a lot of questions! :-) But I hope I'm making sense.
>>>
>>> -Gautam.
>>>
>>> [1] -
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix

--
Ryan Blue
Software Engineer
Netflix
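To make the batching and stitching steps outlined above concrete, here is a
minimal sketch (illustrative names only, not Iceberg's actual reader code) of
loading one batch of primitive values into an Arrow vector, wrapping it as a
Spark ArrowColumnVector, and stitching the per-column vectors into a
ColumnarBatch. The input array stands in for the values produced by iterating
the underlying ColumnIterator batch-size times.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.IntVector
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

object ArrowBatchSketch {
  // Build a single-column ColumnarBatch from one batch of int values.
  def intBatch(values: Array[Int]): ColumnarBatch = {
    val allocator = new RootAllocator(Long.MaxValue)

    // Load the batch into an Arrow vector of the matching type.
    val vector = new IntVector("id", allocator)
    vector.allocateNew(values.length)
    values.zipWithIndex.foreach { case (v, i) => vector.setSafe(i, v) }
    vector.setValueCount(values.length)

    // Wrap the Arrow vector for Spark, then stitch all column vectors
    // (here just one) into a ColumnarBatch.
    val columns: Array[ColumnVector] = Array(new ArrowColumnVector(vector))
    val batch = new ColumnarBatch(columns)
    batch.setNumRows(values.length)
    batch
  }
}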
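And a rough skeleton, assuming the Spark 2.4 DataSourceV2 interfaces and using
made-up class names, of how a reader could expose such batches so that
DataSourceV2ScanExec switches to ColumnarBatch scans:

import java.util.{List => JList}

import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.v2.reader.{InputPartition, SupportsScanColumnarBatch}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

// Reader skeleton that advertises columnar scans: each read task is an
// InputPartition[ColumnarBatch], and Spark calls planBatchInputPartitions()
// instead of planInputPartitions() while enableBatchRead() (default true)
// holds.
class ArrowBatchReader(schema: StructType, tasks: Seq[InputPartition[ColumnarBatch]])
    extends SupportsScanColumnarBatch {

  override def readSchema(): StructType = schema

  override def planBatchInputPartitions(): JList[InputPartition[ColumnarBatch]] =
    tasks.asJava
}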