siknezevic commented on issue #27246: [SPARK-30536][CORE][SQL] Sort-merge join 
operator spilling performance improvements
URL: https://github.com/apache/spark/pull/27246#issuecomment-577816507
 
 
   > Regarding the `ExternalAppendOnlyUnsafeRowArray` change:
   > 
   > * It looks like there's some benchmarks for this class in 
`ExternalAppendOnlyUnsafeRowArrayBenchmark`; we should probably re-run those 
after this change.
   > * It seems like a lot of the complexity in 
`ExternalAppendOnlyUnsafeRowArray` comes from the fact that we want to define a 
"do not spill" threshold in terms of an absolute row count. If it wasn't for 
this, I think we could just directly use an `UnsafeExternalSorter` here (the 
performance overhead inserting a record when you don't spill should be roughly 
the same AFAIK, since both paths are more-or-less just copying the row's memory 
into the end of buffer). It looks like this was previously discussed at [#16909 
(comment)](https://github.com/apache/spark/pull/16909#discussion_r101214678), 
where it sounds like the current design was motivated by performance 
considerations.
   > * Your idea of having a "merged iterator" between the in-memory and spill 
iterators might make sense; I left some feedback / comments on the current 
implementation of that idea (I think I've spotted a couple of bugs).
   
   I agree it would be good if we could use `UnsafeExternalSorter` only. 
The `ExternalAppendOnlyUnsafeRowArray` class has the problem that, unlike 
`UnsafeExternalSorter`, it has no protection against OutOfMemory (OOM) errors. 
The OOM (and the resulting loss of the Spark executor) happens at the point 
where the in-memory buffer of the `ExternalAppendOnlyUnsafeRowArray` object 
doubles its capacity: the buffer (an `ArrayBuffer`) fills up and needs to 
expand, and the OOM is thrown while allocating the new, double-capacity 
backing array.
   The OOM is raised in `ResizableArray.scala`, in `ensureSize()` at line 103: 
`val newArray: Array[AnyRef] = new Array(newSize.toInt)`. I have seen this 
over and over while testing query 14 against a 100 TB data set. To avoid the 
OOM you have to spill, and then you run into the "spilling performance" issue.
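
   To make the doubling problem concrete, here is a simplified model of 
growth by doubling. It is a sketch only, not Spark or Scala library code: the 
object and method names (`DoublingGrowthSketch`, `grownCapacity`, 
`peakSlotsDuringResize`) are hypothetical, and it counts reference slots in 
the backing array rather than row bytes. It assumes the old array stays live 
until the copy into the new array has finished, which is why the resize is the 
moment where memory demand peaks.

```scala
// Hypothetical model of a doubling resize; names are illustrative only.
object DoublingGrowthSketch {

  /** Capacity after doubling `current` until it can hold `needed` slots. */
  def grownCapacity(current: Int, needed: Int): Long = {
    var newSize = math.max(current.toLong, 1L) * 2
    while (newSize < needed) newSize *= 2
    // A JVM array cannot have more than Int.MaxValue elements.
    math.min(newSize, Int.MaxValue.toLong)
  }

  /** Slots live during the copy: the old array plus the newly allocated one. */
  def peakSlotsDuringResize(current: Int, needed: Int): Long =
    current.toLong + grownCapacity(current, needed)

  def main(args: Array[String]): Unit = {
    val capacity = 1 << 20
    // Appending one element to a full buffer triggers the resize.
    val peak = peakSlotsDuringResize(capacity, capacity + 1)
    println(s"capacity=$capacity peakSlots=$peak ratio=${peak.toDouble / capacity}")
  }
}
```

   In this model, appending a single element to a full buffer briefly needs 
about three times the old capacity, so a buffer that fits comfortably just 
before the resize can still fail during it unless it spills first.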
   
