JoshRosen edited a comment on issue #20184: [SPARK-22987][Core] 
UnsafeExternalSorter cases OOM when invoking `getIterator` function.
URL: https://github.com/apache/spark/pull/20184#issuecomment-511240279
 
 
   I stumbled across this PR while looking through the open Spark core PRs.
   
   It sounds like the problem here is that we don't need to allocate the input 
stream and read buffer until it's actually time to read the spill, but we're 
currently doing that too early:
   
   - In `getSortedIterator()`, we have to construct all readers before we can 
return the first record because we must find the first record according to the 
sorted ordering and that requires looking at all spill files.
   - However, we do not have this constraint in `getIterator()`, which returns 
an unsorted iterator and is used in `ExternalAppendOnlyUnsafeRowArray` (which 
uses the sorter only for its spilling capabilities, not for sorting). In this 
case, we can initialize one-at-a-time only once we actually need to read the 
spill.
   
   Given this context, lazy initialization makes sense to me. However, this PR 
is a bit outdated and has some merge conflicts. I would be supportive of this 
change if the conflicts are resolved and the PR description is updated.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to