[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

JoshRosen Fri, 29 Apr 2016 23:13:01 -0700

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215941197
  
    @NathanHowell, I played around with your code from 
https://github.com/apache/spark/pull/12750#issuecomment-215607825 and it was a 
bit slower than mine because it turns out that comparing `StructType`s for 
equality is really expensive and the benefits of avoiding object allocation 
were outweighed by the additional equality checks.
    
    There are a couple of minor tricks which can improve the performance of 
your approach, though. In
    
    ```scala
    if (biggerVal.dataType != smallerVal.dataType) {
              val merged = compatibleType(biggerVal.dataType, 
smallerVal.dataType)
              // test to see if the merged type is equivalent to one of the 
existing
              // StructField instances, reuse will reduce GC pressure
              if (smallerVal.dataType == merged) {
                bigger.update(biggerIdx, smallerVal)
              } else if (biggerVal.dataType == merged) {
                // do nothing, the bigger struct already has the correct field
              } else {
                // we can't reuse an existing field so allocate a new one
                bigger.update(biggerIdx, biggerVal.copy(dataType = merged))
              }
    [...]`
    ```
    
    note that  `compatibleType`'s first step is to compare its inputs for 
equality, so if the two input types are equal then one of those inputs will be 
re-used as the value of `merged`. Thus, I think you can remove some checks by 
rewriting this as
    
    ```scala
              val merged = compatibleType(biggerVal.dataType, 
smallerVal.dataType)
              if (smallerVal.dataType eq merged) {
                bigger.update(biggerIdx, smallerVal)
              } else if (biggerVal.dataType eq merged) {
                // do nothing, the bigger struct already has the correct field
              } else {
                // we can't reuse an existing field so allocate a new one
                bigger.update(biggerIdx, biggerVal.copy(dataType = merged))
              }
    ```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Reply via email to