westonpace commented on PR #14158:
URL: https://github.com/apache/arrow/pull/14158#issuecomment-1265504491

   > I care about this work very much as well and hope to understand this 
better. If I remember correctly, the high-level idea is that there are nodes 
that require ordering (e.g., asof join) and if the input batches are out of 
order (indicated by batch index), the consumer node will cache/reorder the 
out-of-order batches before processing them?
   
   Yes.  If a node relies on ordering then it will resequence the batches 
before processing them.  I try to keep "reorder" and "resequence" distinct 
because they describe two rather different problems.
   
   The first problem is when the input has no known ordering or is in a 
completely random order.  In that case we must "reorder", which is "not 
streaming" and a "pipeline breaker": we have to cache all data in memory 
(or spill) before we can assign the order.
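
As a rough illustration of why "reorder" breaks the pipeline, here is a minimal Python sketch.  It is not the code in this PR; `Batch` and `reorder` are hypothetical names, and a batch is modelled as a plain `(sort key, payload)` tuple.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a record batch: (sort key, payload).
Batch = Tuple[int, str]


def reorder(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Emit batches in sort-key order when the input order is unknown."""
    # Pipeline breaker: nothing can be emitted until the input is exhausted,
    # because any later batch could sort before everything seen so far.
    buffered: List[Batch] = list(batches)  # caches all data in memory
    buffered.sort(key=lambda batch: batch[0])
    yield from buffered
```

The first output only appears after the last input has been consumed, which is what makes this "not streaming".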
   
   The second problem is when the input is mostly ordered but might be a bit 
noisy due to something like a parallel scan.  In that case we already have a 
sequence number and we assume each batch arrives, generally, within some 
max delta of its correct position.  In that case we only need to resequence 
(not reorder).  This operation is "mostly streaming" and only sometimes a 
"pipeline breaker".

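A minimal Python sketch of "resequence", again not the code in this PR: `Batch`, `resequence` and `first_index` are hypothetical names, and the sketch assumes sequence numbers are unique and contiguous starting at `first_index`.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a record batch: (sequence number, payload).
Batch = Tuple[int, str]


def resequence(batches: Iterable[Batch], first_index: int = 0) -> Iterator[Batch]:
    """Emit batches in sequence-number order, buffering only early arrivals."""
    pending: List[Batch] = []  # min-heap keyed on sequence number
    next_index = first_index
    for batch in batches:
        heapq.heappush(pending, batch)
        # Drain every batch that is now in order.  When the input is already
        # ordered, each batch passes straight through (fully streaming).
        while pending and pending[0][0] == next_index:
            yield heapq.heappop(pending)
            next_index += 1
    if pending:
        # Assumption of this sketch: a gap in the sequence is an error.
        raise ValueError(f"missing batch with sequence number {next_index}")
```

The heap only holds batches that arrived ahead of their turn, so if arrivals stay within some max delta of their correct position the buffer stays small.  It only stalls (acts as a "pipeline breaker") while waiting for a late batch.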

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
