Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/10689#issuecomment-170597130
@gatorsmile I do see the performance benefits of ```limit``` during
processing. My reservation is about reasoning over non-top-level
```limit``` statements. A set-operator example:
```sql
select a from db.tbl_a
intersect
select b from db.tbl_b
```
The result should be all distinct rows in ```a``` for which we can find an
equal tuple in ```b```. Let's add ```limit``` to this:
```sql
select a from db.tbl_a limit 10
intersect
select b from db.tbl_b limit 10
```
The result would now be the first (distinct?) 10 rows from ```a```, filtered
by checking whether they exist in the first 10 rows of ```b``` (I think). I
am not sure this is what a user expects; furthermore:
- You will probably end up with fewer than 10 rows here.
- The results will probably be non-deterministic (unless you would also
allow some kind of ordering in a subquery).
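To make the fewer-than-10-rows point concrete, here is a small simulation in plain Python, where list slicing stands in for ```limit``` and the column values are made up for illustration:

```python
# Simulate: (select a from tbl_a limit 10) INTERSECT (select b from tbl_b limit 10)
# Hypothetical column contents: both tables hold the same 12 values,
# but the storage/scan order differs (no ORDER BY, so either order is "valid").
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
b = [11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# INTERSECT without limits: every value appears in both tables -> 12 rows.
full_intersect = set(a) & set(b)

# Limits pushed into the operands: only the overlap of the two 10-row
# prefixes survives, even though the full intersection has 12 rows.
limited_intersect = set(a[:10]) & set(b[:10])

print(len(full_intersect))     # 12
print(len(limited_intersect))  # 8 (values 1..8 appear in both prefixes)
```

A different scan order for either table would change which 8 (or fewer) rows come out, which is the non-determinism concern above.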
Do you have a concrete real-world example where you need this?
I don't really mind if we put this back in the parser (the engine
supports it anyway), but I don't think we should do something like this
without some consideration.