One thing to note is that we never really go down to "RDD" records, since
we are always working through the DataFrame API. Spark builds RDDs under
the hood but expects us to deliver data in one of two ways: row-based
<https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L255-L262>
(InternalRows)
or columnar
<https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L267>
(Arrow vectors).
Columnar reads are generally more efficient and parallelizable; usually
when someone talks about *vectorized Parquet reads* they mean columnar
reads.
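
For anyone curious what those two delivery paths look like at the API
level, here is a rough, hypothetical sketch of a DataSource V2
PartitionReaderFactory. It is not Iceberg's actual reader code; the
newRowReader/newBatchReader helpers are placeholders I made up for
illustration. Spark asks supportColumnarReads() first and then calls the
matching create method:

    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.connector.read.InputPartition;
    import org.apache.spark.sql.connector.read.PartitionReader;
    import org.apache.spark.sql.connector.read.PartitionReaderFactory;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    // Hypothetical factory showing the two ways a source can hand data
    // to Spark: one InternalRow at a time, or whole ColumnarBatches.
    public class ExampleReaderFactory implements PartitionReaderFactory {

      // Assumption: the scan decides up front whether every column type
      // in this read can be vectorized.
      private final boolean allTypesSupportVectorization;

      public ExampleReaderFactory(boolean allTypesSupportVectorization) {
        this.allTypesSupportVectorization = allTypesSupportVectorization;
      }

      @Override
      public PartitionReader<InternalRow> createReader(InputPartition partition) {
        // Row-based path: Spark pulls one InternalRow at a time.
        return newRowReader(partition); // placeholder
      }

      @Override
      public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
        // Columnar path: Spark pulls whole batches backed by column vectors.
        return newBatchReader(partition); // placeholder
      }

      @Override
      public boolean supportColumnarReads(InputPartition partition) {
        // Spark only calls createColumnarReader when this returns true.
        return allTypesSupportVectorization;
      }

      private PartitionReader<InternalRow> newRowReader(InputPartition partition) {
        throw new UnsupportedOperationException("sketch only");
      }

      private PartitionReader<ColumnarBatch> newBatchReader(InputPartition partition) {
        throw new UnsupportedOperationException("sketch only");
      }
    }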

While this gives us much better performance (see the various perf test
modules in the code base if you would like to run them yourself), Spark is
still a row-oriented engine. Spark wants to take advantage of the columnar
format, which is why it provides the ColumnarBatch interface, but it still
does all of its codegen and other operations on a per-row basis. This means
that although we can generally load the data much faster than with
row-based loading, Spark still has to work on the data in a row format most
of the time. There are a variety of projects working to fix this as well.
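
To make that "columnar storage, row-oriented processing" point concrete,
here is a small standalone sketch that uses Spark's ColumnarBatch directly
(OnHeapColumnVector is a Spark-internal class, used here only for
illustration). The batch is filled column-at-a-time, but consumers can
still walk it row by row through rowIterator():

    import java.util.Iterator;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class RowViewOfColumnarBatch {
      public static void main(String[] args) {
        // Build one int column and fill it column-at-a-time.
        OnHeapColumnVector ids = new OnHeapColumnVector(3, DataTypes.IntegerType);
        for (int i = 0; i < 3; i++) {
          ids.putInt(i, i * 10);
        }

        ColumnarBatch batch = new ColumnarBatch(new ColumnVector[] { ids });
        batch.setNumRows(3);

        // Row-oriented operators see the batch through this per-row view;
        // the underlying column data is not copied.
        Iterator<InternalRow> rows = batch.rowIterator();
        while (rows.hasNext()) {
          System.out.println(rows.next().getInt(0));
        }

        batch.close();
      }
    }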

On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mike.zhang.2001...@gmail.com>
wrote:

> I am reading the Iceberg code regarding the Parquet reading path and see
> that Parquet files are read into Arrow format first. I wonder how much
> performance gain we get by doing that. Let’s take the example of a Spark
> application with Iceberg. If the Parquet file were read directly into
> Spark RDD records, shouldn’t that be faster than Parquet->Arrow->Spark
> Record? Since Iceberg is converting to Arrow first today, there must be
> some benefit to that. So I feel I am missing something. Can somebody help
> explain?
>