SongYadong created SPARK-25354:
----------------------------------
Summary: Parquet vectorized record reader has unneeded operation
in several methods
Key: SPARK-25354
URL: https://issues.apache.org/jira/browse/SPARK-25354
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: SongYadong
The VectorizedParquetRecordReader class performs unneeded operations in its
nextKeyValue() method and in the functions called from it:
1. In the nextKeyValue() method, we call resultBatch() only to initialize the
columnar batch if it is not yet initialized, not to use the returned batch. So
we can move the initBatch() operation to nextBatch().
2. In the nextBatch() method, we need not reset columnVectors every time. When
rowsReturned >= totalRowCount, the function returns immediately and the cost of
the reset is wasted. So we can put "if (rowsReturned >= totalRowCount) return
false;" before the columnVectors reset for performance.
3. In the nextBatch() method, we need not call checkEndOfRowGroup() every time.
When rowsReturned != totalCountLoadedSoFar, checkEndOfRowGroup() does nothing
but return, so we can call it only when rowsReturned == totalCountLoadedSoFar
to reduce the number of function calls.
4. In the checkEndOfRowGroup() function, we need not get the columns of
requestedSchema every time. We can get the columns only the first time and save
them for future use.
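
The four points above can be sketched on a simplified stand-in class. This is
not Spark's actual code: SketchReader, loadNextRowGroup(), lookupColumns(), the
row-group size array and the counters are hypothetical names introduced only to
make the saved work visible. Only the field names rowsReturned, totalRowCount
and totalCountLoadedSoFar follow the description.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for the reader; NOT Spark's VectorizedParquetRecordReader. */
class SketchReader {
    private final long[] rowGroupSizes;      // rows per row group (hypothetical input)
    private final int capacity;              // maximum rows per batch
    private final long totalRowCount;
    private int nextRowGroup = 0;
    private long rowsReturned = 0;
    private long totalCountLoadedSoFar = 0;
    private boolean batchInitialized = false;
    private List<String> columns;            // cached after the first lookup (item 4)

    // Counters that exist only to observe how much work is done.
    int resets = 0;
    int schemaLookups = 0;
    int rowGroupLoads = 0;

    SketchReader(long[] rowGroupSizes, int capacity) {
        this.rowGroupSizes = rowGroupSizes.clone();
        this.capacity = capacity;
        this.totalRowCount = Arrays.stream(rowGroupSizes).sum();
    }

    boolean nextBatch() {
        // Item 1: lazy batch initialization happens here, not in nextKeyValue().
        if (!batchInitialized) {
            batchInitialized = true;
        }
        // Item 2: return before the reset when no rows are left.
        if (rowsReturned >= totalRowCount) {
            return false;
        }
        resets++;                            // stands in for resetting the column vectors
        // Item 3: only enter the row-group check when a new group is needed.
        if (rowsReturned == totalCountLoadedSoFar) {
            loadNextRowGroup();
        }
        long num = Math.min((long) capacity, totalCountLoadedSoFar - rowsReturned);
        rowsReturned += num;
        return true;
    }

    private void loadNextRowGroup() {
        // Item 4: resolve the requested columns once and reuse them afterwards.
        if (columns == null) {
            columns = lookupColumns();
        }
        totalCountLoadedSoFar += rowGroupSizes[nextRowGroup++];
        rowGroupLoads++;
    }

    private List<String> lookupColumns() {
        schemaLookups++;                     // stands in for requestedSchema.getColumns()
        return Arrays.asList("a", "b");
    }

    public static void main(String[] args) {
        SketchReader reader = new SketchReader(new long[] {3, 2}, 2);
        int batches = 0;
        while (reader.nextBatch()) {
            batches++;
        }
        System.out.println(batches + " batches, " + reader.resets + " resets, "
            + reader.schemaLookups + " schema lookups");
    }
}
```

In this sketch the final call to nextBatch(), which returns false, performs no
reset, and the column lookup runs once even though two row groups are loaded.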
According to our analysis of a Spark application with the JMC tool, the Parquet
vectorized record reader calls nextKeyValue() and the functions it invokes very
frequently, so the performance gain from optimizing this path is worthwhile.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)