Correct.

On Tue, May 28, 2019 at 3:13 PM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> Alright, so we are talking about reading Parquet data into
> ArrowRecordBatches and then exposing them as ColumnarBatches in Spark,
> where Spark ColumnVectors actually wrap Arrow FieldVectors, correct?
>
> - Anton
>
> > On 28 May 2019, at 21:24, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > From a performance viewpoint, this isn't a great solution. The
> > row-by-row approach will substantially hurt performance compared to the
> > vectorized reader. I've seen a 30% or more speedup when removing
> > row-by-row access. So putting a row-by-row adapter in the middle of two
> > vectorized representations is pretty costly.
> >
> > Iceberg doesn't impose this requirement; it is how Spark consumes the
> > rows itself, one at a time:
> > https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L138
> >
> > By exposing Arrow data as Spark's ColumnarBatch, we should pick up any
> > benefits from improved execution when Spark is updated.
> >
> > On Tue, May 28, 2019 at 12:33 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
> > >
> > > On Fri, May 24, 2019 at 8:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an
> > > > > Iterator[InternalRow] interface, it would still not work right?
> > > > > Because it seems to me there is a lot more going on upstream in the
> > > > > operator execution path that would need to be done here.
> > > >
> > > > There's already a wrapper to adapt Arrow to ColumnarBatch, as well as
> > > > an iterator to read a ColumnarBatch as a sequence of InternalRow.
> > > > That's what we want to take advantage of. You're right that the first
> > > > thing that Spark does is to get each row as InternalRow. But we still
> > > > get a benefit from vectorizing the data materialization to Arrow
> > > > itself. Spark execution is not vectorized, but that can be updated in
> > > > Spark later (I think there's a proposal).
> > >
> > > From a performance viewpoint, this isn't a great solution. The
> > > row-by-row approach will substantially hurt performance compared to the
> > > vectorized reader. I've seen a 30% or more speedup when removing
> > > row-by-row access. So putting a row-by-row adapter in the middle of two
> > > vectorized representations is pretty costly.
> > >
> > > .. Owen
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
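
For readers following the thread, here is a minimal sketch of the kind of wrapper being discussed, assuming the public ArrowColumnVector and ColumnarBatch classes that ship with Spark 2.3+. ArrowBatchAdapter and toColumnarBatch are hypothetical names for illustration, not Iceberg code:

// Hypothetical sketch, not the Iceberg implementation: wrap the Arrow
// FieldVectors of a VectorSchemaRoot in Spark's ArrowColumnVector so the
// whole batch can be handed to Spark as a ColumnarBatch. No data is
// copied; the Spark vectors only hold references to the Arrow buffers.
import java.util.List;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowBatchAdapter {
  public static ColumnarBatch toColumnarBatch(VectorSchemaRoot root) {
    List<FieldVector> vectors = root.getFieldVectors();
    ColumnVector[] columns = new ColumnVector[vectors.size()];
    for (int i = 0; i < vectors.size(); i++) {
      // Each Spark ColumnVector wraps one Arrow FieldVector.
      columns[i] = new ArrowColumnVector(vectors.get(i));
    }
    ColumnarBatch batch = new ColumnarBatch(columns);
    batch.setNumRows(root.getRowCount());
    return batch;
  }
}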
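
The "iterator to read a ColumnarBatch as a sequence of InternalRow" mentioned above is ColumnarBatch.rowIterator(). A rough sketch of that row-by-row consumer side, which is the adapter the performance discussion is about (RowByRowConsumer and countNonNullFirstColumn are hypothetical names for illustration):

// Sketch of row-by-row consumption: even when the source is columnar,
// Spark 2.x reads the batch back one InternalRow at a time, roughly as
// the linked ColumnarBatchScan code does in generated form.
import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class RowByRowConsumer {
  public static long countNonNullFirstColumn(ColumnarBatch batch) {
    long count = 0;
    Iterator<InternalRow> rows = batch.rowIterator();
    while (rows.hasNext()) {
      InternalRow row = rows.next();  // one row at a time
      if (!row.isNullAt(0)) {
        count++;
      }
    }
    return count;
  }
}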