Thanks Russell! I wonder if the performance gain is mainly from vectorization rather than from the Arrow format itself? My understanding is that the main benefit of Arrow is avoiding serialization/deserialization; I'm just having a hard time understanding how Iceberg uses Arrow to get that benefit.
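
To make my question concrete, here is roughly the handoff I'm picturing, as a minimal sketch using Spark's public ArrowColumnVector wrapper (this is just for illustration, not Iceberg's actual reader code):

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class ArrowToSparkSketch {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         IntVector ids = new IntVector("id", allocator)) {
      // Pretend a vectorized Parquet reader decoded a column chunk into this vector.
      ids.allocateNew(3);
      ids.set(0, 10);
      ids.set(1, 20);
      ids.set(2, 30);
      ids.setValueCount(3);

      // ArrowColumnVector wraps the Arrow buffers in place -- no per-row
      // copy or serialization happens at this step.
      ColumnVector wrapped = new ArrowColumnVector(ids);
      ColumnarBatch batch = new ColumnarBatch(new ColumnVector[] {wrapped}, 3);

      // Spark's row iterator over the batch reads straight out of the Arrow buffers.
      batch.rowIterator().forEachRemaining(row -> System.out.println(row.getInt(0)));
    }
  }
}
```

If the buffers are shared like this, the wrapping step looks essentially free, which is why I suspect the real win is the vectorized decoding rather than the format conversion.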
On Sat, Feb 5, 2022 at 5:39 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> One thing to note is that we never really go to "RDD" records, since we
> are always working through the DataFrame API. Spark builds RDDs but
> expects us to deliver data in one of two ways: row-based
> <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L255-L262>
> (InternalRows) or columnar
> <https://github.com/apache/iceberg/blob/a2260faa1d1177342d453c1de91c15ed9592e0e9/spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java#L267>
> (Arrow vectors). Columnar reads are generally more efficient and
> parallelizable; usually when someone talks about *vectorizing Parquet
> reads*, they mean columnar reads.
>
> While this gives us much better performance (see the various perf test
> modules in the code base if you would like to run them yourself), Spark
> is still a row-oriented engine. Spark wants to take advantage of the
> columnar format, which is why it provides the ColumnarBatch interface,
> but it still does all codegen and other operations on a per-row basis.
> This means that although we can generally load the data much faster than
> with row-based loading, Spark still has to work on the data in a row
> format most of the time. There are a variety of projects working to fix
> this as well.
>
> On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mike.zhang.2001...@gmail.com>
> wrote:
>
>> I am reading the Iceberg code along the Parquet reading path and see
>> that Parquet files are read into Arrow format first. I wonder how much
>> performance gain we get by doing that. Let's take the example of a
>> Spark application with Iceberg. If the Parquet files were read directly
>> into Spark RDD records, shouldn't that be faster than
>> Parquet -> Arrow -> Spark record? Since Iceberg converts to Arrow first
>> today, there must be some benefit to doing so, so I feel I am missing
>> something. Can somebody help explain?
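
For anyone following along: the two delivery paths Russell links to correspond to Spark's DataSource V2 PartitionReaderFactory contract. Below is a minimal sketch of that contract; the class name and the openVectorizedParquetReader helper are hypothetical stand-ins, not Iceberg's actual SparkBatchScan code:

```java
import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class SketchReaderFactory implements PartitionReaderFactory {

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    // When this returns true, Spark calls createColumnarReader for the
    // partition instead of createReader.
    return true;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    // Row-based path: Spark pulls one InternalRow per next()/get() cycle.
    throw new UnsupportedOperationException("columnar-only sketch");
  }

  @Override
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
    // Columnar path: each get() hands Spark a whole batch of column vectors
    // (Arrow-backed in Iceberg's case) rather than individual rows.
    Iterator<ColumnarBatch> batches = openVectorizedParquetReader(partition);
    return new PartitionReader<ColumnarBatch>() {
      private ColumnarBatch current;

      @Override
      public boolean next() {
        if (batches.hasNext()) {
          current = batches.next();
          return true;
        }
        return false;
      }

      @Override
      public ColumnarBatch get() {
        return current;
      }

      @Override
      public void close() {}
    };
  }

  // Hypothetical helper standing in for a real vectorized Parquet-to-Arrow reader.
  private Iterator<ColumnarBatch> openVectorizedParquetReader(InputPartition partition) {
    return Collections.emptyIterator();
  }
}
```

supportColumnarReads is the switch that decides which of the two code paths linked above Spark exercises for a given partition.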