siknezevic commented on issue #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements URL: https://github.com/apache/spark/pull/27246#issuecomment-577816507 > Regarding the `ExternalAppendOnlyUnsafeRowArray` change: > > * It looks like there's some benchmarks for this class in `ExternalAppendOnlyUnsafeRowArrayBenchmark`; we should probably re-run those after this change. > * It seems like a lot of the complexity in `ExternalAppendOnlyUnsafeRowArray` comes from the fact that we want to define a "do not spill" threshold in terms of an absolute row count. If it wasn't for this, I think we could just directly use an `UnsafeExternalSorter` here (the performance overhead inserting a record when you don't spill should be roughly the same AFAIK, since both paths are more-or-less just copying the row's memory into the end of buffer). It looks like this was previously discussed at [#16909 (comment)](https://github.com/apache/spark/pull/16909#discussion_r101214678), where it sounds like the current design was motivated by performance considerations. > * Your idea of having a "merged iterator" between the in-memory and spill iterators might make sense; I left some feedback / comments on the current implementation of that idea (I think I've spotted a couple of bugs). I agree it would be good if we could use ExternalUnsafeSorter only. ExternalAppendOnlyUnsafeRowArray class has issue that it does not have protection from OutOfMemory (OOM) like ExternalUnsafeSorter does. OOM issue (and Spark executor loss) would happen at point of doubling of capacity of the ExternalAppendOnlyUnsafeRowArray object. This object gets full and needs to expand and then OOM happens when trying to allocate memory for new “double in capacity” object (ArrayBuffer). OOM would happen in ResizableArray.scala in ensureSize() line 103 : val newArray: Array[AnyRef] = new Array(newSize.toInt) . I have seen that over and over while testing with 100 TB data set and query 14. To avoid OOM issue you have to spill and then you run into “spilling performance” issue.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
