I've been trying to run the JMH benchmarks bundled with the project, but I've been running into issues with that.. have others hit this? Am I running these incorrectly?
bash-3.2$ ./gradlew :iceberg-spark:jmh -PjmhIncludeRegex=IcebergSourceFlatParquetDataFilterBenchmark -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-filter-benchmark-result.txt
...
> Task :iceberg-spark:jmhCompileGeneratedClasses FAILED
error: plug-in not found: ErrorProne

FAILURE: Build failed with an exception.

Is there a config/plugin I need to add to build.gradle?

On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue <rb...@netflix.com> wrote:

> Thanks Gautam!
>
> We'll start taking a look at your code. What do you think about creating a
> branch in the Iceberg repository where we can work on improving it
> together, before merging it into master?
>
> Also, you mentioned performance comparisons. Do you have any early results
> to share?
>
> rb
>
> On Tue, Jul 23, 2019 at 3:40 PM Gautam <gautamkows...@gmail.com> wrote:
>
>> Hello Folks,
>>
>> I have checked in a WIP branch [1] with a working version of vectorized
>> reads for the Iceberg reader. Here's the diff [2].
>>
>> *Implementation Notes:*
>> - Iceberg's Reader adds the `SupportsScanColumnarBatch` mixin to instruct
>> DataSourceV2ScanExec to use `planBatchInputPartitions()` instead of the
>> usual `planInputPartitions()`, so the scan returns instances of
>> `ColumnarBatch` on each iteration (see the sketch below).
>> - `ArrowSchemaUtil` contains the Iceberg-to-Arrow schema conversion. This
>> was copied from [3], added by @Daniel Weeks <dwe...@netflix.com>. Thanks
>> for that!
>> - `VectorizedParquetValueReaders` contains the ParquetValueReaders used
>> for reading/decoding the Parquet row groups (aka page stores, as they are
>> referred to in the code).
>> - `VectorizedSparkParquetReaders` contains the visitor implementations
>> that map Parquet types to the appropriate value readers. I implemented the
>> struct visitor so that the root schema can be mapped properly. This has
>> the added benefit of vectorization support for structs, so yay!
>> - For this initial version the value readers read an entire row group
>> into a single Arrow FieldVector. I'd imagine this will need tuning to get
>> the batch sizing right, but I've gone with one batch per row group for now.
>> - Arrow FieldVectors are wrapped using `ArrowColumnVector`, which is
>> Spark's ColumnVector implementation backed by Arrow. This is the first
>> point of contact between the Spark and Arrow interfaces.
>> - ArrowColumnVectors are stitched together into a `ColumnarBatch` by
>> `ColumnarBatchReader`. This is my replacement for `InternalRowReader`,
>> which maps structs to columnar batches. It allows us to have nested
>> structs, where each level of nesting is a nested columnar batch. Lemme
>> know what you think of this approach.
>> - I've added value readers for all the supported primitive types listed
>> in `AvroDataTest`. There's a corresponding test for the vectorized reader
>> under `TestSparkParquetVectorizedReader`.
>> - I haven't fixed all the Checkstyle errors, so you will have to turn
>> checkstyle off in build.gradle. Also skip tests while building.. sorry! :-(
>>
>> *P.S.* There's some unused code under ArrowReader.java. Ignore it; it's
>> left over from my previous vectorization impl, and I've kept it around to
>> compare performance.
>>
>> Lemme know what folks think of the approach. I'm getting this working for
>> our scale test benchmark and will report back with numbers. Feel free to
>> run your own benchmarks and share.
>>
>> Cheers,
>> -Gautam.
>>
>> [1] - https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
>> [2] - https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
>> [3] - https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
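>>
>> (For reference, a minimal sketch of the mixin wiring described above,
>> assuming Spark 2.4's DataSourceV2 interfaces; the two abstract helpers
>> here are hypothetical stand-ins, not the actual WIP code:)
>>
>> import java.util.List;
>> import org.apache.spark.sql.sources.v2.reader.InputPartition;
>> import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
>> import org.apache.spark.sql.types.StructType;
>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>
>> // SupportsScanColumnarBatch extends DataSourceReader and supplies a
>> // default planInputPartitions() that throws, so a reader with this mixin
>> // is scanned through the batch path only.
>> abstract class VectorizedReaderSketch implements SupportsScanColumnarBatch {
>>
>>   // Hypothetical helpers; the real reader would derive these from the
>>   // table scan and its combined scan tasks.
>>   abstract StructType projectedSchema();
>>   abstract List<InputPartition<ColumnarBatch>> planBatchTasks();
>>
>>   @Override
>>   public StructType readSchema() {
>>     // Spark reads the returned batches using this schema, so it must
>>     // match the vectors the tasks produce.
>>     return projectedSchema();
>>   }
>>
>>   // With the mixin present, DataSourceV2ScanExec calls this instead of
>>   // planInputPartitions() and consumes ColumnarBatch instances directly.
>>   @Override
>>   public List<InputPartition<ColumnarBatch>> planBatchInputPartitions() {
>>     return planBatchTasks();
>>   }
>> }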
>>
>> On Mon, Jul 22, 2019 at 2:33 PM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Will do. Doing a bit of housekeeping on the code and also adding more
>>> primitive type support.
>>>
>>> On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah <mch...@palantir.com> wrote:
>>>
>>>> Would it be possible to put the work-in-progress code in open source?
>>>>
>>>> *From:* Gautam <gautamkows...@gmail.com>
>>>> *Reply-To:* "dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>> *Date:* Monday, July 22, 2019 at 9:46 AM
>>>> *To:* Daniel Weeks <dwe...@netflix.com>
>>>> *Cc:* Ryan Blue <rb...@netflix.com>, Iceberg Dev List <dev@iceberg.apache.org>
>>>> *Subject:* Re: Approaching Vectorized Reading in Iceberg ..
>>>>
>>>> That would be great!
>>>>
>>>> On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks <dwe...@netflix.com> wrote:
>>>>
>>>> Hey Gautam,
>>>>
>>>> We also have a couple of people looking into vectorized reading (into
>>>> Arrow memory). I think it would be good for us to get together and see if
>>>> we can collaborate on a common approach for this.
>>>>
>>>> I'll reach out directly and see if we can get together.
>>>>
>>>> -Dan
>>>>
>>>> On Sun, Jul 21, 2019 at 10:35 PM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>> Figured this out. I'm returning the ColumnarBatch iterator directly,
>>>> without the projection, with the schema set appropriately in
>>>> `readSchema()`. The empty result was due to valuesRead not being set
>>>> correctly on FileIterator. Did that and things are working. Will circle
>>>> back with numbers soon.
>>>>
>>>> On Fri, Jul 19, 2019 at 5:22 PM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>> Hey Guys,
>>>>
>>>> Sorry about the delay on this. I just got back to getting a basic
>>>> working implementation in Iceberg of vectorization on primitive types.
>>>>
>>>> *Here's what I have so far:*
>>>>
>>>> I have added `ParquetValueReader` implementations for some basic
>>>> primitive types that build the respective Arrow vector (`ValueVector`),
>>>> viz. `IntVector` for int, `VarCharVector` for strings, and so on.
>>>> Underneath each value vector reader there are column iterators that read
>>>> from the Parquet page stores (row groups) in chunks. These `ValueVector`-s
>>>> are lined up as `ArrowColumnVector`-s (Spark's ColumnVector wrapper backed
>>>> by Arrow) and stitched together using a `ColumnarBatchReader` (which, as
>>>> the name suggests, wraps them into ColumnarBatches in the iterator). I've
>>>> verified that these pieces work properly with the underlying interfaces.
>>>> I've also made changes to Iceberg's `Reader` to implement
>>>> `planBatchInputPartitions()` (to add the `SupportsScanColumnarBatch` mixin
>>>> to the reader), so the reader now expects ColumnarBatch instances (instead
>>>> of InternalRow). The query planning runtime works fine with these changes.
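>>>>
>>>> (For illustration, a minimal sketch of what one of these primitive
>>>> readers does, using a plain int iterator as a stand-in for the Parquet
>>>> column iterator; the class and names are hypothetical, not the actual
>>>> WIP code, and null handling is omitted:)
>>>>
>>>> import java.util.PrimitiveIterator;
>>>> import org.apache.arrow.memory.BufferAllocator;
>>>> import org.apache.arrow.vector.IntVector;
>>>>
>>>> class IntVectorReader {
>>>>   private final BufferAllocator allocator;
>>>>   private final int batchSize;
>>>>
>>>>   IntVectorReader(BufferAllocator allocator, int batchSize) {
>>>>     this.allocator = allocator;
>>>>     this.batchSize = batchSize;
>>>>   }
>>>>
>>>>   // Drain up to batchSize values from the column iterator into a single
>>>>   // Arrow IntVector (one batch per row group in the WIP version).
>>>>   IntVector read(PrimitiveIterator.OfInt columnIterator) {
>>>>     IntVector vector = new IntVector("col", allocator);
>>>>     vector.allocateNew(batchSize);
>>>>     int rows = 0;
>>>>     while (rows < batchSize && columnIterator.hasNext()) {
>>>>       vector.set(rows++, columnIterator.nextInt());
>>>>     }
>>>>     vector.setValueCount(rows);
>>>>     return vector;
>>>>   }
>>>> }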
>>>>
>>>> It fails during query execution, though. The bit it's currently failing
>>>> at is this line of code:
>>>> https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L414
>>>>
>>>> This code, I think, tries to apply the iterator's schema projection to
>>>> the InternalRow instances. It seems to be tightly coupled to InternalRow,
>>>> as Spark's catalyst expressions implement UnsafeProjection for InternalRow
>>>> only. If I take this out and just return the `Iterator<ColumnarBatch>`
>>>> iterator I built, it returns an empty result on the client. I'm guessing
>>>> this is because Spark is unaware of the iterator's schema? There's a TODO
>>>> in the code that says "*remove the projection by reporting the iterator's
>>>> schema back to Spark*". Is there a simple way to communicate that to Spark
>>>> for my new iterator? Any pointers on how to get around this?
>>>>
>>>> Thanks and Regards,
>>>> -Gautam.
>>>>
>>>> On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> Replies inline.
>>>>
>>>> On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>> Thanks for responding, Ryan.
>>>>
>>>> A couple of follow-up questions on ParquetValueReader for Arrow..
>>>>
>>>> I'd like to start with testing Arrow out with readers for primitive
>>>> types and incrementally add in struct/array support; also, ArrowWriter [1]
>>>> currently doesn't have converters for the map type. How can I default
>>>> these types to regular materialization while supporting Arrow-based
>>>> reading for primitives?
>>>>
>>>> We should look at what Spark does to handle maps.
>>>>
>>>> I think we should get the prototype working with test cases that don't
>>>> have maps, structs, or lists. Just getting primitives working is a good
>>>> start and just won't hit these problems.
>>>>
>>>> Lemme know if this makes sense...
>>>>
>>>> - I extend PrimitiveReader (for Arrow) to load primitive types into
>>>> ArrowColumnVectors of the corresponding column types by iterating over the
>>>> underlying ColumnIterator *n times*, where n is the batch size.
>>>>
>>>> Sounds good to me. I'm not sure about extending vs. wrapping because I'm
>>>> not too familiar with the Arrow APIs.
>>>>
>>>> - Reader.newParquetIterable() maps primitive column types to the newly
>>>> added ArrowParquetValueReader, but for other types (nested types, etc.)
>>>> uses the current *InternalRow*-based ValueReaders.
>>>>
>>>> Sounds good for primitives, but I would just leave the nested types
>>>> unimplemented for now.
>>>>
>>>> - Stitch the column vectors together to create a ColumnarBatch (since
>>>> the *SupportsScanColumnarBatch* mixin currently expects this) .. *although
>>>> I'm a bit lost on how the stitching of columns happens currently*? .. and
>>>> how could the ArrowColumnVectors be stitched alongside regular columns
>>>> that don't have Arrow-based support?
>>>>
>>>> I don't think that you can mix regular columns and Arrow columns. It has
>>>> to be all one or the other. That's why it's easier to start with
>>>> primitives, then add structs, then lists, and finally maps.
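>>>>
>>>> (For reference, a rough sketch of the stitching step itself, using
>>>> Spark's real ArrowColumnVector and ColumnarBatch classes; the vectors
>>>> array and row count are assumed to come from the value readers:)
>>>>
>>>> import org.apache.arrow.vector.FieldVector;
>>>> import org.apache.spark.sql.vectorized.ArrowColumnVector;
>>>> import org.apache.spark.sql.vectorized.ColumnVector;
>>>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>>>
>>>> class BatchStitcher {
>>>>   // Wrap each Arrow vector in Spark's ArrowColumnVector and combine
>>>>   // them into one ColumnarBatch; every vector must hold the same
>>>>   // number of rows.
>>>>   static ColumnarBatch stitch(FieldVector[] vectors, int rowCount) {
>>>>     ColumnVector[] columns = new ColumnVector[vectors.length];
>>>>     for (int i = 0; i < vectors.length; i++) {
>>>>       columns[i] = new ArrowColumnVector(vectors[i]);
>>>>     }
>>>>     ColumnarBatch batch = new ColumnarBatch(columns);
>>>>     batch.setNumRows(rowCount);
>>>>     return batch;
>>>>   }
>>>> }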
>>>>
>>>> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that
>>>> DataSourceV2ScanExec starts using ColumnarBatch scans.
>>>>
>>>> We will probably need two paths: one for columnar batches and one for
>>>> row-based reads. That doesn't need to be done right away, and what you
>>>> already have in your working copy makes sense as a start.
>>>>
>>>> That's a lot of questions! :-) But I hope I'm making sense.
>>>>
>>>> -Gautam.
>>>>
>>>> [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>
> --
> Ryan Blue
> Software Engineer
> Netflix