I've been trying to run the JMH benchmarks bundled with the project, but I've been running into issues with that.. have others hit this? Am I running these incorrectly?
bash-3.2$ ./gradlew :iceberg-spark:jmh -PjmhIncludeRegex=IcebergSourceFlatParquetDataFilterBenchmark -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-filter-benchmark-result.txt
...
> Task :iceberg-spark:jmhCompileGeneratedClasses FAILED
error: plug-in not found: ErrorProne

FAILURE: Build failed with an exception.

Is there a config/plugin I need to add to build.gradle?

On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue <rb...@netflix.com> wrote:

> Thanks Gautam!
>
> We'll start taking a look at your code. What do you think about creating a
> branch in the Iceberg repository where we can work on improving it
> together, before merging it into master?
>
> Also, you mentioned performance comparisons. Do you have any early results
> to share?
>
> rb
>
> On Tue, Jul 23, 2019 at 3:40 PM Gautam <gautamkows...@gmail.com> wrote:
>
>> Hello Folks,
>>
>> I have checked in a WIP branch [1] with a working version of vectorized
>> reads for the Iceberg reader. Here's the diff [2].
>>
>> *Implementation Notes:*
>> - Iceberg's Reader adds the `SupportsScanColumnarBatch` mixin to instruct
>> DataSourceV2ScanExec to use `planBatchInputPartitions()` instead of the
>> usual `planInputPartitions()`, so the scan returns instances of
>> `ColumnarBatch` on each iteration (see the sketch below).
>> - `ArrowSchemaUtil` contains the Iceberg-to-Arrow schema conversion. This
>> was copied from [3], added by @Daniel Weeks <dwe...@netflix.com>. Thanks
>> for that!
>> - `VectorizedParquetValueReaders` contains the ParquetValueReaders used
>> for reading/decoding the Parquet row groups (aka page stores, as they are
>> referred to in the code).
>> - `VectorizedSparkParquetReaders` contains the visitor implementations
>> that map Parquet types to the appropriate value readers. I implemented the
>> struct visitor so that the root schema can be mapped properly. This has
>> the added benefit of vectorization support for structs, so yay!
>> - For this initial version the value readers read an entire row group
>> into a single Arrow FieldVector. I'd imagine this will need tuning to get
>> the batch sizing right, but I've gone with one batch per row group for now.
>> - Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is
>> Spark's ColumnVector implementation backed by Arrow. This is the first
>> point of contact between the Spark and Arrow interfaces.
>> - ArrowColumnVectors are stitched together into a `ColumnarBatch` by
>> `ColumnarBatchReader`. This is my replacement for `InternalRowReader`,
>> which maps structs to columnar batches. It allows us to have nested
>> structs, where each level of nesting is a nested columnar batch. Lemme
>> know what you think of this approach.
>> - I've added value readers for all the supported primitive types listed
>> in `AvroDataTest`. There's a corresponding test for the vectorized reader
>> under `TestSparkParquetVectorizedReader`.
>> - I haven't fixed all the Checkstyle errors, so you will have to turn
>> checkstyle off in build.gradle. Also skip tests while building.. sorry! :-(
>>
>> *P.S.* There's some unused code under ArrowReader.java. Ignore it; it's
>> left over from my previous vectorization impl, and I've kept it around to
>> compare performance.
>>
>> Lemme know what folks think of the approach. I'm getting this working for
>> our scale test benchmark and will report back with numbers. Feel free to
>> run your own benchmarks and share.
>>
>> Cheers,
>> -Gautam.
>>
>> [1] - https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
>> [2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
>> [3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
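>>
>> (For reference, a minimal sketch of the mixin wiring described above,
>> assuming Spark 2.4's DataSourceV2 interfaces; the two abstract helpers
>> here are hypothetical stand-ins, not the actual WIP code:)
>>
>> import java.util.List;
>> import org.apache.spark.sql.sources.v2.reader.InputPartition;
>> import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
>> import org.apache.spark.sql.types.StructType;
>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>
>> // SupportsScanColumnarBatch extends DataSourceReader and supplies a
>> // default planInputPartitions() that throws, so a reader with this mixin
>> // is scanned through the batch path only.
>> abstract class VectorizedReaderSketch implements SupportsScanColumnarBatch {
>>
>>   // Hypothetical helpers; the real reader would derive these from the
>>   // table scan and its combined scan tasks.
>>   abstract StructType projectedSchema();
>>   abstract List<InputPartition<ColumnarBatch>> planBatchTasks();
>>
>>   @Override
>>   public StructType readSchema() {
>>     // Spark reads the returned batches using this schema, so it must
>>     // match the vectors the tasks produce.
>>     return projectedSchema();
>>   }
>>
>>   // With the mixin present, DataSourceV2ScanExec calls this instead of
>>   // planInputPartitions() and consumes ColumnarBatch instances directly.
>>   @Override
>>   public List<InputPartition<ColumnarBatch>> planBatchInputPartitions() {
>>     return planBatchTasks();
>>   }
>> }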
>>
>> On Mon, Jul 22, 2019 at 2:33 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Will do. Doing a bit of housekeeping on the code and also adding more
>>> primitive type support.
>>>
>>> On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah <mch...@palantir.com> wrote:
>>>
>>>> Would it be possible to put the work-in-progress code in open source?
>>>>
>>>> *From:* Gautam <gautamkows...@gmail.com>
>>>> *Reply-To:* "dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>> *Date:* Monday, July 22, 2019 at 9:46 AM
>>>> *To:* Daniel Weeks <dwe...@netflix.com>
>>>> *Cc:* Ryan Blue <rb...@netflix.com>, Iceberg Dev List <dev@iceberg.apache.org>
>>>> *Subject:* Re: Approaching Vectorized Reading in Iceberg ..
>>>>
>>>> That would be great!
>>>>
>>>> On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks <dwe...@netflix.com> wrote:
>>>>
>>>> Hey Gautam,
>>>>
>>>> We also have a couple of people looking into vectorized reading (into
>>>> Arrow memory). I think it would be good for us to get together and see if
>>>> we can collaborate on a common approach for this.
>>>>
>>>> I'll reach out directly and see if we can get together.
>>>>
>>>> -Dan
>>>>
>>>> On Sun, Jul 21, 2019 at 10:35 PM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>> Figured this out. I'm returning the ColumnarBatch iterator directly,
>>>> without the projection, with the schema set appropriately in
>>>> `readSchema()`. The empty result was due to valuesRead not being set
>>>> correctly on FileIterator. Did that and things are working. Will circle
>>>> back with numbers soon.
>>>>
>>>> On Fri, Jul 19, 2019 at 5:22 PM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>> Hey Guys,
>>>>
>>>> Sorry about the delay on this. I just got back to getting a basic
>>>> working implementation in Iceberg of vectorization on primitive types.
>>>>
>>>> *Here's what I have so far:*
>>>>
>>>> I have added `ParquetValueReader` implementations for some basic
>>>> primitive types that build the respective Arrow vector (`ValueVector`),
>>>> viz. `IntVector` for int, `VarCharVector` for strings, and so on.
>>>> Underneath each value vector reader there are column iterators that read
>>>> from the Parquet page stores (row groups) in chunks. These `ValueVector`-s
>>>> are lined up as `ArrowColumnVector`-s (Spark's ColumnVector wrapper backed
>>>> by Arrow) and stitched together using a `ColumnarBatchReader` (which, as
>>>> the name suggests, wraps them into ColumnarBatches in the iterator). I've
>>>> verified that these pieces work properly with the underlying interfaces.
>>>> I've also made changes to Iceberg's `Reader` to implement
>>>> `planBatchInputPartitions()` (to add the `SupportsScanColumnarBatch` mixin
>>>> to the reader), so the reader now expects ColumnarBatch instances (instead
>>>> of InternalRow). The query planning runtime works fine with these changes.
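>>>>
>>>> (For illustration, a minimal sketch of what one of these primitive
>>>> readers does, using a plain int iterator as a stand-in for the Parquet
>>>> column iterator; the class and names are hypothetical, not the actual
>>>> WIP code, and null handling is omitted:)
>>>>
>>>> import java.util.PrimitiveIterator;
>>>> import org.apache.arrow.memory.BufferAllocator;
>>>> import org.apache.arrow.vector.IntVector;
>>>>
>>>> class IntVectorReader {
>>>>   private final BufferAllocator allocator;
>>>>   private final int batchSize;
>>>>
>>>>   IntVectorReader(BufferAllocator allocator, int batchSize) {
>>>>     this.allocator = allocator;
>>>>     this.batchSize = batchSize;
>>>>   }
>>>>
>>>>   // Drain up to batchSize values from the column iterator into a single
>>>>   // Arrow IntVector (one batch per row group in the WIP version).
>>>>   IntVector read(PrimitiveIterator.OfInt columnIterator) {
>>>>     IntVector vector = new IntVector("col", allocator);
>>>>     vector.allocateNew(batchSize);
>>>>     int rows = 0;
>>>>     while (rows < batchSize && columnIterator.hasNext()) {
>>>>       vector.set(rows++, columnIterator.nextInt());
>>>>     }
>>>>     vector.setValueCount(rows);
>>>>     return vector;
>>>>   }
>>>> }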
>>>>
>>>> It fails during query execution, though. The bit it's currently failing
>>>> at is this line of code:
>>>> https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L414
>>>>
>>>> This code, I think, tries to apply the iterator's schema projection to
>>>> the InternalRow instances. It seems to be tightly coupled to InternalRow,
>>>> as Spark's catalyst expressions implement UnsafeProjection for InternalRow
>>>> only. If I take this out and just return the `Iterator<ColumnarBatch>`
>>>> iterator I built, it returns an empty result on the client. I'm guessing
>>>> this is because Spark is unaware of the iterator's schema? There's a TODO
>>>> in the code that says "*remove the projection by reporting the iterator's
>>>> schema back to Spark*". Is there a simple way to communicate that to Spark
>>>> for my new iterator? Any pointers on how to get around this?
>>>>
>>>> Thanks and Regards,
>>>> -Gautam.
>>>>
>>>> On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> Replies inline.
>>>>
>>>> On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>> Thanks for responding, Ryan.
>>>>
>>>> A couple of follow-up questions on ParquetValueReader for Arrow..
>>>>
>>>> I'd like to start with testing Arrow out with readers for primitive
>>>> types and incrementally add in struct/array support; also, ArrowWriter [1]
>>>> currently doesn't have converters for the map type. How can I default
>>>> these types to regular materialization while supporting Arrow-based
>>>> reading for primitives?
>>>>
>>>> We should look at what Spark does to handle maps.
>>>>
>>>> I think we should get the prototype working with test cases that don't
>>>> have maps, structs, or lists. Just getting primitives working is a good
>>>> start and just won't hit these problems.
>>>>
>>>> Lemme know if this makes sense...
>>>>
>>>> - I extend PrimitiveReader (for Arrow) to load primitive types into
>>>> ArrowColumnVectors of the corresponding column types by iterating over the
>>>> underlying ColumnIterator *n times*, where n is the batch size.
>>>>
>>>> Sounds good to me. I'm not sure about extending vs. wrapping because I'm
>>>> not too familiar with the Arrow APIs.
>>>>
>>>> - Reader.newParquetIterable() maps primitive column types to the newly
>>>> added ArrowParquetValueReader, but for other types (nested types, etc.)
>>>> uses the current *InternalRow*-based ValueReaders.
>>>>
>>>> Sounds good for primitives, but I would just leave the nested types
>>>> unimplemented for now.
>>>>
>>>> - Stitch the column vectors together to create a ColumnarBatch (since
>>>> the *SupportsScanColumnarBatch* mixin currently expects this) .. *although
>>>> I'm a bit lost on how the stitching of columns happens currently*? .. and
>>>> how could the ArrowColumnVectors be stitched alongside regular columns
>>>> that don't have Arrow-based support?
>>>>
>>>> I don't think that you can mix regular columns and Arrow columns. It has
>>>> to be all one or the other. That's why it's easier to start with
>>>> primitives, then add structs, then lists, and finally maps.
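>>>>
>>>> (For reference, a rough sketch of the stitching step itself, using
>>>> Spark's real ArrowColumnVector and ColumnarBatch classes; the vectors
>>>> array and row count are assumed to come from the value readers:)
>>>>
>>>> import org.apache.arrow.vector.FieldVector;
>>>> import org.apache.spark.sql.vectorized.ArrowColumnVector;
>>>> import org.apache.spark.sql.vectorized.ColumnVector;
>>>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>>>
>>>> class BatchStitcher {
>>>>   // Wrap each Arrow vector in Spark's ArrowColumnVector and combine
>>>>   // them into one ColumnarBatch; every vector must hold the same
>>>>   // number of rows.
>>>>   static ColumnarBatch stitch(FieldVector[] vectors, int rowCount) {
>>>>     ColumnVector[] columns = new ColumnVector[vectors.length];
>>>>     for (int i = 0; i < vectors.length; i++) {
>>>>       columns[i] = new ArrowColumnVector(vectors[i]);
>>>>     }
>>>>     ColumnarBatch batch = new ColumnarBatch(columns);
>>>>     batch.setNumRows(rowCount);
>>>>     return batch;
>>>>   }
>>>> }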
>>>>
>>>> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that
>>>> DataSourceV2ScanExec starts using ColumnarBatch scans.
>>>>
>>>> We will probably need two paths: one for columnar batches and one for
>>>> row-based reads. That doesn't need to be done right away, and what you
>>>> already have in your working copy makes sense as a start.
>>>>
>>>> That's a lot of questions! :-) But I hope I'm making sense.
>>>>
>>>> -Gautam.
>>>>
>>>> [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix