Github user davies commented on the pull request:
https://github.com/apache/spark/pull/11624#issuecomment-194987680
Right now, it's not safe to re-use the objects coming from the reader or from UnsafeRow, because some of the expressions may hold onto the object (for example, an aggregate without grouping keys, and some string functions). That's why, even though we know the cost of creating a new object every time you access a UTF8String/Decimal/Array/MapArray/Struct, we have not optimized it yet.
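As a rough illustration of the hazard (a toy Python sketch, not the actual Spark code; the class and function names are made up): if the reader re-uses one value object for every row and an aggregate stores that object directly, the stored "min" silently follows whatever the last row held.
```
class MutableValue(object):
    """Toy stand-in for a re-used object such as a Decimal or UTF8String."""

    def __init__(self):
        self.value = None

    def set(self, value):
        # The reader overwrites this same instance for every row it returns.
        self.value = value
        return self


def min_holding_reference(values):
    shared = MutableValue()           # one object re-used for all rows
    best = None
    for v in values:
        row_value = shared.set(v)     # "reading" the next row
        if best is None or row_value.value < best.value:
            best = row_value          # BUG: stores a reference, not a snapshot
    return best.value


# With rows 0, 99, 24 the "min" ends up being whatever the last row held.
print(min_holding_reference([0, 99, 24]))  # prints 24 instead of 0
```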
I tried this patch locally: I generated a Parquet file with one decimal column, then read it back and aggregated with max(d) and min(d); min(d) returns the wrong result:
```
>>> sqlContext.sql("select min(d), max(d) from t").show()
+------+------+
|min(d)|max(d)|
+------+------+
| 0.00| 99.00|
+------+------+
>>> sqlContext.sql("select min(d), max(d) from t2").show()
+------+------+
|min(d)|max(d)|
+------+------+
| 24.00| 99.00|
+------+------+
```
Here `t` is the table before saving it as a Parquet file, and `t2` is the table loaded back from the Parquet file.
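For reference, the scenario can be reproduced along these lines (a hedged sketch assuming the `sqlContext` provided by the pyspark shell; the path and the exact values are hypothetical, chosen only to match the 0.00/99.00 range above):
```
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

# Hypothetical reproduction: one decimal(10, 2) column with values 0..99.
schema = StructType([StructField("d", DecimalType(10, 2))])
df = sqlContext.createDataFrame([(Decimal(i),) for i in range(100)], schema)

df.registerTempTable("t")                       # the original, in-memory table
df.write.parquet("/tmp/decimal_repro")          # hypothetical output path
sqlContext.read.parquet("/tmp/decimal_repro").registerTempTable("t2")

sqlContext.sql("select min(d), max(d) from t").show()   # 0.00 / 99.00
sqlContext.sql("select min(d), max(d) from t2").show()  # min(d) comes back wrong with this patch
```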
In order to have these optimizations, we need to prove that we always make a copy before holding a reference to an object that could be re-used. There are still some places where we use MutableGenericInternalRow; we should also make the copy when updating it.
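Continuing the toy sketch from above (again hypothetical, not the actual aggregate code), the copy-before-hold rule would look something like this:
```
import copy

class MutableValue(object):
    """Same toy stand-in as above: one instance re-used for every row."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value
        return self


def min_with_copy(values):
    shared = MutableValue()
    best = None
    for v in values:
        row_value = shared.set(v)
        if best is None or row_value.value < best.value:
            best = copy.copy(row_value)   # defensive copy before holding the reference
    return best.value


print(min_with_copy([0, 99, 24]))  # prints 0, as expected
```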
If we only re-use the object for the new Parquet reader, but make the copy everywhere else, this may cause a performance regression for other data sources.