Thanks Russell! I wonder whether the performance gain comes mainly from
vectorization rather than from the Arrow format itself. My understanding is
that the benefit of using Arrow is avoiding serialization/deserialization;
I'm just having a hard time understanding how Iceberg uses Arrow to get that
benefit.

On Sat, Feb 5, 2022 at 5:39 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> One thing to note is that we never really go to "RDD" records, since we
> are always working through the DataFrame API. Spark builds the RDDs but
> expects us to deliver data in one of two ways: row-based
> <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L255-L262>
> (InternalRows)
> or columnar
> <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L267>
> (Arrow vectors).
> Columnar reads are generally more efficient and parallelizable; usually,
> when someone talks about *vectorizing Parquet reads*, they mean columnar
> reads.
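
A minimal sketch of those two delivery modes against Spark's DataSourceV2
API (SketchReaderFactory and its empty reader bodies are hypothetical; the
PartitionReaderFactory interface and its three methods are the actual Spark
3.x API):

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.read.InputPartition;
    import org.apache.spark.sql.connector.read.PartitionReader;
    import org.apache.spark.sql.connector.read.PartitionReaderFactory;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class SketchReaderFactory implements PartitionReaderFactory {
      private final boolean vectorized; // e.g. true when the file schema supports it

      public SketchReaderFactory(boolean vectorized) {
        this.vectorized = vectorized;
      }

      // Spark calls this first to decide which of the two paths to plan for.
      @Override
      public boolean supportColumnarReads(InputPartition partition) {
        return vectorized;
      }

      // Row-based path: hand Spark one InternalRow at a time.
      @Override
      public PartitionReader<InternalRow> createReader(InputPartition partition) {
        return new PartitionReader<InternalRow>() {
          @Override public boolean next() { return false; } // empty partition in this sketch
          @Override public InternalRow get() { throw new IllegalStateException("no rows"); }
          @Override public void close() {}
        };
      }

      // Columnar path: hand Spark whole batches backed by (for Iceberg) Arrow vectors.
      @Override
      public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
        return new PartitionReader<ColumnarBatch>() {
          @Override public boolean next() { return false; } // empty partition in this sketch
          @Override public ColumnarBatch get() { throw new IllegalStateException("no batches"); }
          @Override public void close() {}
        };
      }
    }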
>
> While this gives us much better performance (see the various perf test
> modules in the code base if you would like to run them yourself), Spark is
> still a row-oriented engine. Spark wants to take advantage of the columnar
> format, which is why it provides the ColumnarBatch interface, but it still
> does all codegen and other operations on a per-row basis. This means that
> although we can generally load the data much faster than with row-based
> loading, Spark still has to work on the data in a row format most of the
> time. There are a variety of projects working to fix this as well.
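
A minimal sketch of that row-over-columnar consumption, using Spark's
on-heap column vectors for brevity (Iceberg's actual batches are
Arrow-backed; the setup here is assumed, but ColumnarBatch.rowIterator() is
the real mechanism a row-wise consumer uses):

    import java.util.Iterator;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class RowOverBatchSketch {
      public static void main(String[] args) {
        // Data is loaded column-at-a-time into a vector...
        OnHeapColumnVector col = new OnHeapColumnVector(3, DataTypes.IntegerType);
        col.putInt(0, 10);
        col.putInt(1, 20);
        col.putInt(2, 30);
        ColumnarBatch batch = new ColumnarBatch(new ColumnVector[] {col}, 3);

        // ...but a row-oriented engine still walks it one row at a time,
        // which is essentially what Spark's generated code does today.
        Iterator<InternalRow> rows = batch.rowIterator();
        while (rows.hasNext()) {
          System.out.println(rows.next().getInt(0));
        }
        col.close();
      }
    }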
>
> On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mike.zhang.2001...@gmail.com>
> wrote:
>
>> I am reading the Iceberg code along the Parquet read path and see that
>> Parquet files are read into Arrow format first. I wonder how much
>> performance gain we get by doing that. Let's take the example of a Spark
>> application with Iceberg. If a Parquet file were read directly into Spark
>> RDD records, shouldn't that be faster than Parquet -> Arrow -> Spark
>> records? Since Iceberg converts to Arrow first today, there must be some
>> benefit to doing so, so I feel I am missing something. Can somebody help
>> explain?
>>
>
