Correct.

On Tue, May 28, 2019 at 3:13 PM Anton Okolnychyi <aokolnyc...@apple.com> wrote:
> Alright, so we are talking about reading Parquet data into
> ArrowRecordBatches and then exposing them as ColumnarBatches in Spark,
> where Spark ColumnVectors actually wrap Arrow FieldVectors, correct?
>
> - Anton
>
> > On 28 May 2019, at 21:24, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > From a performance viewpoint, this isn't a great solution. The
> > row-by-row approach will substantially hurt performance compared to the
> > vectorized reader. I've seen a 30% or more speedup when removing
> > row-by-row access. So putting a row-by-row adapter in the middle of two
> > vectorized representations is pretty costly.
> >
> > Iceberg doesn't impose this requirement; it is how Spark consumes the
> > rows itself, one at a time:
> > https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L138
> >
> > By exposing Arrow data as Spark's ColumnarBatch, we should pick up any
> > benefits from improved execution when Spark is updated.
> >
> > On Tue, May 28, 2019 at 12:33 PM Owen O'Malley <owen.omal...@gmail.com> wrote:
> > >
> > > On Fri, May 24, 2019 at 8:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> > > > > if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an
> > > > > Iterator[InternalRow] interface, it would still not work right?
> > > > > Because it seems to me there is a lot more going on upstream in the
> > > > > operator execution path that would need to be done here.
> > > >
> > > > There's already a wrapper to adapt Arrow to ColumnarBatch, as well as
> > > > an iterator to read a ColumnarBatch as a sequence of InternalRow.
> > > > That's what we want to take advantage of. You're right that the first
> > > > thing that Spark does is to get each row as InternalRow. But we still
> > > > get a benefit from vectorizing the data materialization to Arrow
> > > > itself. Spark execution is not vectorized, but that can be updated in
> > > > Spark later (I think there's a proposal).
> > >
> > > From a performance viewpoint, this isn't a great solution. The
> > > row-by-row approach will substantially hurt performance compared to the
> > > vectorized reader. I've seen a 30% or more speedup when removing
> > > row-by-row access. So putting a row-by-row adapter in the middle of two
> > > vectorized representations is pretty costly.
> > >
> > > .. Owen
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
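
For readers following the thread, here is a minimal sketch of the kind of wrapper being discussed, assuming the public ArrowColumnVector and ColumnarBatch classes that ship with Spark 2.3+. ArrowBatchAdapter and toColumnarBatch are hypothetical names for illustration, not Iceberg code:

// Hypothetical sketch, not the Iceberg implementation: wrap the Arrow
// FieldVectors of a VectorSchemaRoot in Spark's ArrowColumnVector so the
// whole batch can be handed to Spark as a ColumnarBatch. No data is
// copied; the Spark vectors only hold references to the Arrow buffers.
import java.util.List;
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowBatchAdapter {
  public static ColumnarBatch toColumnarBatch(VectorSchemaRoot root) {
    List<FieldVector> vectors = root.getFieldVectors();
    ColumnVector[] columns = new ColumnVector[vectors.size()];
    for (int i = 0; i < vectors.size(); i++) {
      // Each Spark ColumnVector wraps one Arrow FieldVector.
      columns[i] = new ArrowColumnVector(vectors.get(i));
    }
    ColumnarBatch batch = new ColumnarBatch(columns);
    batch.setNumRows(root.getRowCount());
    return batch;
  }
}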
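
The "iterator to read a ColumnarBatch as a sequence of InternalRow" mentioned above is ColumnarBatch.rowIterator(). A rough sketch of that row-by-row consumer side, which is the adapter the performance discussion is about (RowByRowConsumer and countNonNullFirstColumn are hypothetical names for illustration):

// Sketch of row-by-row consumption: even when the source is columnar,
// Spark 2.x reads the batch back one InternalRow at a time, roughly as
// the linked ColumnarBatchScan code does in generated form.
import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class RowByRowConsumer {
  public static long countNonNullFirstColumn(ColumnarBatch batch) {
    long count = 0;
    Iterator<InternalRow> rows = batch.rowIterator();
    while (rows.hasNext()) {
      InternalRow row = rows.next();  // one row at a time
      if (!row.isNullAt(0)) {
        count++;
      }
    }
    return count;
  }
}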