Chao Sun created SPARK-35640:
--------------------------------
Summary: Refactor Parquet vectorized reader to remove duplicated
code paths
Key: SPARK-35640
URL: https://issues.apache.org/jira/browse/SPARK-35640
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun
Currently, the Parquet vectorized read path contains a lot of duplicated code,
such as the following:
{code:java}
public void readIntegers(
    int total,
    WritableColumnVector c,
    int rowId,
    int level,
    VectorizedValuesReader data) throws IOException {
  int left = total;
  while (left > 0) {
    if (this.currentCount == 0) this.readNextGroup();
    int n = Math.min(left, this.currentCount);
    switch (mode) {
      case RLE:
        if (currentValue == level) {
          data.readIntegers(n, c, rowId);
        } else {
          c.putNulls(rowId, n);
        }
        break;
      case PACKED:
        for (int i = 0; i < n; ++i) {
          if (currentBuffer[currentBufferIdx++] == level) {
            c.putInt(rowId + i, data.readInteger());
          } else {
            c.putNull(rowId + i);
          }
        }
        break;
    }
    rowId += n;
    left -= n;
    currentCount -= n;
  }
}
{code}
This makes the code hard to maintain, since any change to this logic has to be
replicated in 20+ places. The issue will become more serious once we implement
column index and complex type support for the vectorized path.
The duplication was originally introduced for performance. However, nowadays JIT
compilers tend to be smart about this and will inline virtual calls as much as
possible.
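One possible direction, given here only as a minimal sketch under simplifying
assumptions: keep a single copy of the definition-level loop and move the
type-specific part behind a small interface. All names below (DefLevelSketch,
Run, Updater, IntegerUpdater, LongUpdater, readBatch) are hypothetical and are
not taken from the Spark code base. A java.nio.ByteBuffer stands in for
VectorizedValuesReader, an Object[] stands in for WritableColumnVector, and the
runs are passed in already decoded instead of being produced by readNextGroup().

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: one shared loop, with type-specific reads behind an interface.
final class DefLevelSketch {
  enum Mode { RLE, PACKED }

  // One decoded run of definition levels (stand-in for the reader's internal state).
  // For RLE, levels[0] is the repeated value; for PACKED, levels holds one entry per row.
  static final class Run {
    final Mode mode;
    final int count;
    final int[] levels;

    Run(Mode mode, int count, int... levels) {
      this.mode = mode;
      this.count = count;
      this.levels = levels;
    }
  }

  // The only part that differs per value type.
  interface Updater {
    void readValues(int n, int rowId, Object[] column, ByteBuffer data);
    void readValue(int rowId, Object[] column, ByteBuffer data);
  }

  static final class IntegerUpdater implements Updater {
    @Override
    public void readValues(int n, int rowId, Object[] column, ByteBuffer data) {
      for (int i = 0; i < n; i++) column[rowId + i] = data.getInt();
    }

    @Override
    public void readValue(int rowId, Object[] column, ByteBuffer data) {
      column[rowId] = data.getInt();
    }
  }

  static final class LongUpdater implements Updater {
    @Override
    public void readValues(int n, int rowId, Object[] column, ByteBuffer data) {
      for (int i = 0; i < n; i++) column[rowId + i] = data.getLong();
    }

    @Override
    public void readValue(int rowId, Object[] column, ByteBuffer data) {
      column[rowId] = data.getLong();
    }
  }

  // The loop that is currently copied once per type, written a single time.
  static void readBatch(
      Run[] runs, int level, Object[] column, int rowId, ByteBuffer data, Updater updater) {
    for (Run run : runs) {
      int n = run.count;
      switch (run.mode) {
        case RLE:
          if (run.levels[0] == level) {
            updater.readValues(n, rowId, column, data);
          } else {
            Arrays.fill(column, rowId, rowId + n, null);
          }
          break;
        case PACKED:
          for (int i = 0; i < n; i++) {
            if (run.levels[i] == level) {
              updater.readValue(rowId + i, column, data);
            } else {
              column[rowId + i] = null;
            }
          }
          break;
      }
      rowId += n;
    }
  }
}
```

With this shape, supporting another value type means adding one more Updater
implementation rather than another copy of the loop. Whether the interface
calls are actually inlined by the JIT on the hot path would need to be
confirmed with benchmarks.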
--
This message was sent by Atlassian Jira
(v8.3.4#803005)