Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11624#issuecomment-194987680
Right now, it's not safe to re-use the objects coming from the reader or from UnsafeRow, because some of the expressions may hold onto the object (for example, an aggregate without grouping keys, and some string functions). That's why, even though we know the cost of creating a new object every time you access a UTF8String/Decimal/Array/MapArray/Struct, we have not optimized it yet.
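As a rough illustration of the hazard (a toy Python sketch, not the actual Spark code; the class and function names are made up): if the reader re-uses one value object for every row and an aggregate stores that object directly, the stored "min" silently follows whatever the last row held.
```
class MutableValue(object):
    """Toy stand-in for a re-used object such as a Decimal or UTF8String."""

    def __init__(self):
        self.value = None

    def set(self, value):
        # The reader overwrites this same instance for every row it returns.
        self.value = value
        return self


def min_holding_reference(values):
    shared = MutableValue()           # one object re-used for all rows
    best = None
    for v in values:
        row_value = shared.set(v)     # "reading" the next row
        if best is None or row_value.value < best.value:
            best = row_value          # BUG: stores a reference, not a snapshot
    return best.value


# With rows 0, 99, 24 the "min" ends up being whatever the last row held.
print(min_holding_reference([0, 99, 24]))  # prints 24 instead of 0
```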
I tried this patch locally: I generated a Parquet file with one decimal column, then read it back and aggregated with max(d) and min(d); min(d) returns the wrong result:
```
>>> sqlContext.sql("select min(d), max(d) from t").show()
+------+------+
|min(d)|max(d)|
+------+------+
| 0.00| 99.00|
+------+------+
>>> sqlContext.sql("select min(d), max(d) from t2").show()
+------+------+
|min(d)|max(d)|
+------+------+
| 24.00| 99.00|
+------+------+
```
Here `t` is the table before saving it as a Parquet file, and `t2` is the table loaded back from the Parquet file.
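For reference, the scenario can be reproduced along these lines (a hedged sketch assuming the `sqlContext` provided by the pyspark shell; the path and the exact values are hypothetical, chosen only to match the 0.00/99.00 range above):
```
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

# Hypothetical reproduction: one decimal(10, 2) column with values 0..99.
schema = StructType([StructField("d", DecimalType(10, 2))])
df = sqlContext.createDataFrame([(Decimal(i),) for i in range(100)], schema)

df.registerTempTable("t")                       # the original, in-memory table
df.write.parquet("/tmp/decimal_repro")          # hypothetical output path
sqlContext.read.parquet("/tmp/decimal_repro").registerTempTable("t2")

sqlContext.sql("select min(d), max(d) from t").show()   # 0.00 / 99.00
sqlContext.sql("select min(d), max(d) from t2").show()  # min(d) comes back wrong with this patch
```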
In order to have these optimizations, we need to prove that we always make a copy before holding a reference to an object that could be re-used. There are still some places where we use MutableGenericInternalRow; we should also make the copy when updating it.
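Continuing the toy sketch from above (again hypothetical, not the actual aggregate code), the copy-before-hold rule would look something like this:
```
import copy

class MutableValue(object):
    """Same toy stand-in as above: one instance re-used for every row."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value
        return self


def min_with_copy(values):
    shared = MutableValue()
    best = None
    for v in values:
        row_value = shared.set(v)
        if best is None or row_value.value < best.value:
            best = copy.copy(row_value)   # defensive copy before holding the reference
    return best.value


print(min_with_copy([0, 99, 24]))  # prints 0, as expected
```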
If we only re-use the object for the new Parquet reader, but make the copy everywhere else, this may cause a performance regression for other data sources.