Ted-Jiang commented on PR #2580:
URL: 
https://github.com/apache/arrow-datafusion/pull/2580#issuecomment-1133607491

   > Thank you for the contribution @Ted-Jiang , howe I don't think this is a 
valid optimization.
   > 
   > Specifically, because a join can filter out rows, if you limit the input 
you may actually end up with fewer output rows than the limit.
   > 
   > Consider this input:
   > 
   > `left`:
   > 
   > l
   > 1
   > 2
   > ...
   > 100
   > `right`:
   > 
   > r
   > 99
   > 100
   > The output of `select * from left LEFT JOIN right ON (l = r)` should be:
   > 
   > l  r
   > 99 99
   > 100        100
   > However, if you push the limit down to the scan on `left` it would only 
send this into the JOIN
   > 
   > `left`:
   > 
   > l
   > 1
   > 2
   > ```
   > 
   > And thus would produce no output. 
   > 
   > If we want to optimize limits in Joins, I think it would have to be done 
in the Join Operator itself (to stop producing rows once the limit is hit). 
However, as long as the Join operator is producing rows in batches, the effect 
of implementing an internal limit will likely be small (because it would save 
only a part of one output batch, when the limit on the output of a join is hit)
   > ```
   
   I think this situation is `select * from left LEFT JOIN right ON (l = r)` 
without limit. There will no limit in left table_scan
   will still produce
   
   > `left`:
   > 
   > l
   > 1
   > 
   > ...
   > 100
   
   But this rule will apply `select * from left LEFT JOIN right ON (l = r) 
limit 2`
   which left table will get the limit  will send two values to `join`, i think 
this is right
   and the result will be
   
   ```
   l    r
   1       Null
   2    Null
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to