Sorry for the typo in my last mail. Query-2 behaves as we expected; our questions are about Query-1 and Query-3. Also, may I know the difference between CollectLimit and BaseLimit? Thanks so much.
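
To put numbers on the gap, here is a rough sketch in Python that compares Query-1 and Query-3 against Query-2, using only the figures from the table in the quoted mail below. It assumes the sizes are in binary units (1 MB = 1024 KB); the names `queries` and `ratios_vs_baseline` are illustrative only and not part of our test setup.

```python
# Figures copied from the summary table in the quoted mail.
# Assumption: sizes use binary units (1 MB = 1024 KB).
KB = 1024
MB = 1024 * KB

queries = {
    1: {"runtime_s": 216, "input_bytes": 205 * MB},
    2: {"runtime_s": 22, "input_bytes": 38.3 * KB},
    3: {"runtime_s": 229, "input_bytes": 1776.4 * MB},
}


def ratios_vs_baseline(queries, baseline=2):
    """Return {query_no: (runtime_ratio, input_size_ratio)} relative to the baseline query."""
    base = queries[baseline]
    return {
        no: (
            q["runtime_s"] / base["runtime_s"],
            q["input_bytes"] / base["input_bytes"],
        )
        for no, q in queries.items()
    }


ratios = ratios_vs_baseline(queries)
for no, (runtime_ratio, size_ratio) in sorted(ratios.items()):
    print(f"Query-{no}: runtime x{runtime_ratio:.1f}, input size x{size_ratio:,.0f}")
```

Under that assumption, Query-1 and Query-3 each take roughly ten times as long as Query-2 and read thousands of times more input.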

Best,
Liz

> On 26 Oct 2016, at 7:25 PM, Liz Bai <liz...@icloud.com> wrote:
>
> Hi all,
>
> We used Parquet and Spark 2.0 to do the testing. The table below is the
> summary of what we have found about `Limit` keyword. Query-2 reveals that
> SparkSQL does early stop upon getting adequate results. But we are curious of
> Query-1 and Query-2.
> *But we are curious of Query-1 and Query-3.
> It seems that, either writing result RDD as Parquet or filtering on columns
> will lead to scanning much more data.
>
> No.  SQL statement                                   Filter  Method of saving result  Runtime(s)  Input data size
> 1    select ColA from Table limit 1                  no      writeParquet             216         205MB
> 2    select ColA from Table limit 1                  no      Collect                  22          38.3KB
> 3    select ColA from Table where ColB = 50 limit 1  yes     Collect                  229         1776.4MB
>
> We are wondering if this is a bug or something else. Could you please help on
> it?
> Thanks.
>
> Best regards,
> Liz