Github user hvanhovell commented on the pull request:
https://github.com/apache/spark/pull/10689#issuecomment-170597130
@gatorsmile I do see the performance benefits of ```limit``` during
processing. My reservation is about reasoning over non-top-level
```limit``` statements. A set-operator example:
```sql
select a from db.tbl_a
intersect
select b from db.tbl_b
```
The result should be all distinct rows in ```a``` for which we can find an
equal tuple in ```b```. Let's add ```limit``` to this:
```sql
select a from db.tbl_a limit 10
intersect
select b from db.tbl_b limit 10
```
The result would now be the first (distinct?) 10 rows from ```a```, filtered
by checking whether they exist in the first 10 rows of ```b``` (I think). I
am not sure this is what a user expects; furthermore:
- You will probably end up with fewer than 10 rows here.
- The results will probably be non-deterministic (unless you would also
allow some kind of ordering in a subquery).
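To make the fewer-than-10-rows point concrete, here is a small simulation in plain Python, where list slicing stands in for ```limit``` and the column values are made up for illustration:

```python
# Simulate: (select a from tbl_a limit 10) INTERSECT (select b from tbl_b limit 10)
# Hypothetical column contents: both tables hold the same 12 values,
# but the storage/scan order differs (no ORDER BY, so either order is "valid").
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
b = [11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# INTERSECT without limits: every value appears in both tables -> 12 rows.
full_intersect = set(a) & set(b)

# Limits pushed into the operands: only the overlap of the two 10-row
# prefixes survives, even though the full intersection has 12 rows.
limited_intersect = set(a[:10]) & set(b[:10])

print(len(full_intersect))     # 12
print(len(limited_intersect))  # 8 (values 1..8 appear in both prefixes)
```

A different scan order for either table would change which 8 (or fewer) rows come out, which is the non-determinism concern above.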
Do you have a concrete real-world example where you need this?
I don't really mind if we put this back in the parser (the engine
supports it anyway), but I don't think we should do something like this
without some consideration.