Luca Bruno created SPARK-13908:
----------------------------------

             Summary: Limit not pushed down
                 Key: SPARK-13908
                 URL: https://issues.apache.org/jira/browse/SPARK-13908
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
         Environment: Spark compiled from git with commit 53ba6d6
            Reporter: Luca Bruno


Hello,
I'm running a simple query like this against a single Parquet file:

{noformat}
SELECT *
FROM someparquet
LIMIT 1
{noformat}

The someparquet table is just a Parquet file read in and registered as a temporary table.
The query takes as long (minutes) as a full scan of all the records,
instead of returning as soon as the first record is read.
By contrast, parquet-tools head is very fast (seconds), so I suspect Spark is
missing an optimization opportunity here: the limit is not pushed down into the scan.
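To make the expected behavior concrete, here is a toy sketch in plain Python (not Spark code) of the difference between applying a limit after a full scan and pushing the limit into the scan itself. The names (`scan`, `data`) are illustrative only:

```python
from itertools import islice

def scan(rows):
    """Simulated data-source scan: yields rows one at a time and counts how many were read."""
    for r in rows:
        scan.count += 1
        yield r

scan.count = 0
data = range(1_000_000)  # stand-in for the rows of the Parquet file

# Limit pushed into the scan: islice stops pulling rows after the first one,
# so only a single row is ever read from the source.
first = list(islice(scan(data), 1))

print(first)       # [0]
print(scan.count)  # 1, not 1_000_000
```

The report is that Spark's CollectLimit behaves like the non-pushed-down variant (read everything, then keep one row), whereas a pushed-down limit would stop the Parquet scan after the first record, as the toy version does.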

The physical plan is the following:

{noformat}
== Physical Plan ==
CollectLimit 1
+- WholeStageCodegen
   :  +- Scan ParquetFormat part: struct<>, data: struct<........>[...] InputPaths: ...
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
