One thing to note is that we never really go to "RDD" records, since we are always working with the DataFrame API. Spark builds RDDs under the hood but expects us to deliver data in one of two ways: row-based <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L255-L262> (InternalRows) or columnar <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L267> (Arrow vectors). Columnar reads are generally more efficient and parallelizable; usually when someone talks about *vectorizing Parquet reads* they mean columnar reads.
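For context, those two delivery paths correspond to Spark's DataSourceV2 PartitionReaderFactory contract. Below is a minimal sketch of the shape of that contract (not Iceberg's actual reader classes, and it returns no real data; Iceberg's columnar readers are backed by Arrow vectors wrapped as Spark ColumnVectors):

import java.io.IOException;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class SketchReaderFactory implements PartitionReaderFactory {

  // Row path: Spark calls next()/get() and receives one InternalRow at a time.
  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    return new PartitionReader<InternalRow>() {
      @Override public boolean next() throws IOException { return false; } // empty sketch
      @Override public InternalRow get() { return null; }
      @Override public void close() throws IOException { }
    };
  }

  // Columnar path: Spark receives whole ColumnarBatches (e.g. backed by Arrow vectors).
  @Override
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
    return new PartitionReader<ColumnarBatch>() {
      @Override public boolean next() throws IOException { return false; } // empty sketch
      @Override public ColumnarBatch get() { return null; }
      @Override public void close() throws IOException { }
    };
  }

  // Spark only uses the columnar path when the factory claims support for it.
  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    return true;
  }
}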
While this gives us much better performance (see our various perf test modules in the code base if you would like to run them yourself), Spark is still a row-oriented engine. Spark wants to take advantage of the columnar format, which is why it provides the ColumnarBatch interface, but it still does all codegen and other operations on a per-row basis. This means that although we can generally load the data in a much faster way than with row-based loading, Spark still has to work on the data in a row format most of the time (a rough sketch of that row-by-row consumption is below the quoted message). There are a variety of projects working to fix this as well.

On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mike.zhang.2001...@gmail.com> wrote:

> I am reading the Iceberg code regarding the Parquet reading path and see
> that the Parquet files are read to Arrow format first. I wonder how much
> performance gain we could have by doing that. Let’s take the example of the
> Spark application with Iceberg. If the Parquet file is read directly to
> Spark RDD records, shouldn’t it be faster than Parquet->Arrow->Spark
> Record? Since Iceberg is converting to Arrow first today, there must be
> some benefits to that. So I feel I miss something. Can somebody help to
> explain?
>
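As an illustration of the point above, here is a small self-contained sketch (using Spark's on-heap ColumnVector just to keep it simple, not the Arrow-backed vectors Iceberg actually uses): even when a batch is filled a whole column at a time, downstream operators still walk it row by row through ColumnarBatch.rowIterator().

import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class RowIterationSketch {
  public static void main(String[] args) {
    // Fill a single int column "vectorized", i.e. the whole column at once.
    OnHeapColumnVector ids = new OnHeapColumnVector(3, DataTypes.IntegerType);
    for (int i = 0; i < 3; i++) {
      ids.putInt(i, i * 10);
    }

    ColumnarBatch batch = new ColumnarBatch(new ColumnVector[] { ids });
    batch.setNumRows(3);

    // Spark's row-oriented operators consume the batch one InternalRow at a time.
    Iterator<InternalRow> rows = batch.rowIterator();
    while (rows.hasNext()) {
      InternalRow row = rows.next();
      System.out.println(row.getInt(0));
    }

    batch.close();
  }
}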