[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

NathanHowell Thu, 28 Apr 2016 18:30:06 -0700

Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215607825
  
    Alright, here are a few ideas that will at least reduce allocations a bit. 
Your version with the merge sort is likely better than the insertion sort here, 
but I thought I'd try something different :trollface: Any chance you can run it 
through your benchmark?
    
    I'll put my other comments inline; generally this looks good to me.
    
    ``` scala
      /**
        * Combine two sorted arrays of StructFields into a new StructType
        */
      private def compatibleFields(
          fields1: Array[StructField],
          fields2: Array[StructField]): StructType = {
        // perform an insertion sort of the smaller struct into the larger one
        val (bigger, smaller) = if (fields1.length > fields2.length) {
          (fields1.toBuffer, fields2)
        } else {
          (fields2.toBuffer, fields1)
        }
    
        var biggerIdx = 0
        var smallerIdx = 0
        while (biggerIdx < bigger.length && smallerIdx < smaller.length) {
          val biggerVal = bigger(biggerIdx)
          val smallerVal = smaller(smallerIdx)
          val comp = biggerVal.name.compareTo(smallerVal.name)
          if (comp == 0) {
            if (biggerVal.dataType != smallerVal.dataType) {
              val merged = compatibleType(biggerVal.dataType, smallerVal.dataType)
              // test to see if the merged type is equivalent to one of the
              // existing StructField instances, reuse will reduce GC pressure
              if (smallerVal.dataType == merged) {
                bigger.update(biggerIdx, smallerVal)
              } else if (biggerVal.dataType == merged) {
                // do nothing, the bigger struct already has the correct field
              } else {
                // we can't reuse an existing field so allocate a new one
                bigger.update(biggerIdx, biggerVal.copy(dataType = merged))
              }
            }
            biggerIdx += 1
            smallerIdx += 1
          } else if (comp > 0) {
            bigger.insert(biggerIdx, smallerVal)
            // bump both indexes, the bigger struct has grown
            biggerIdx += 1
            smallerIdx += 1
          } else { // comp < 0
            // advance to the next field on the bigger struct
            // nothing else to do here
            biggerIdx += 1
          }
        }
    
        // assumed completion (the posted snippet stops at the loop above):
        // append whatever is left of the smaller struct, then build the result
        while (smallerIdx < smaller.length) {
          bigger += smaller(smallerIdx)
          smallerIdx += 1
        }
    
        StructType(bigger.toArray)
      }
    ```
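
For comparison, here is a minimal sketch of a linear two-pointer merge into a preallocated array, which avoids the element shifting that `insert` does on a buffer. It is only an illustration and not the code from the pull request: `DataType`, `StructField`, and `compatibleType` below are simplified, hypothetical stand-ins for the Spark ones so that the snippet runs on its own, and `MergeSketch` and `mergeFields` are made-up names.

```scala
// Simplified stand-ins for Spark's DataType and StructField (hypothetical,
// for illustration only), so the sketch runs without Spark on the classpath.
sealed trait DataType
case object LongType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType

final case class StructField(name: String, dataType: DataType)

object MergeSketch {
  // Hypothetical stand-in for compatibleType: equal types stay as they are,
  // Long/Double widens to Double, anything else falls back to String.
  def compatibleType(t1: DataType, t2: DataType): DataType = (t1, t2) match {
    case (a, b) if a == b => a
    case (LongType, DoubleType) | (DoubleType, LongType) => DoubleType
    case _ => StringType
  }

  // Merge two name-sorted arrays in one linear pass, writing into a
  // preallocated array so no element is ever shifted.
  def mergeFields(
      fields1: Array[StructField],
      fields2: Array[StructField]): Array[StructField] = {
    val out = new Array[StructField](fields1.length + fields2.length)
    var i = 0
    var j = 0
    var n = 0
    while (i < fields1.length && j < fields2.length) {
      val f1 = fields1(i)
      val f2 = fields2(j)
      val comp = f1.name.compareTo(f2.name)
      if (comp == 0) {
        val merged = compatibleType(f1.dataType, f2.dataType)
        // reuse an existing instance when its type already matches
        out(n) =
          if (f1.dataType == merged) f1
          else if (f2.dataType == merged) f2
          else f1.copy(dataType = merged)
        i += 1
        j += 1
      } else if (comp < 0) {
        out(n) = f1
        i += 1
      } else {
        out(n) = f2
        j += 1
      }
      n += 1
    }
    // copy whichever tail is left over
    while (i < fields1.length) { out(n) = fields1(i); i += 1; n += 1 }
    while (j < fields2.length) { out(n) = fields2(j); j += 1; n += 1 }
    // trim unused slots, which exist only when some names overlapped
    if (n == out.length) out else out.take(n)
  }
}
```

The trade-off in this sketch is one upfront allocation sized for the worst case, plus a trimming copy when names overlap, in exchange for never shifting elements. Whether that beats the insertion approach for typical JSON schemas would need to be measured.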


