Luca Bruno created SPARK-13908:
----------------------------------
Summary: Limit not pushed down
Key: SPARK-13908
URL: https://issues.apache.org/jira/browse/SPARK-13908
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Environment: Spark compiled from git with commit 53ba6d6
Reporter: Luca Bruno
Hello,
I'm doing a simple query like this on a single parquet file:
{noformat}
SELECT *
FROM someparquet
LIMIT 1
{noformat}
The someparquet table is just a parquet file read and registered as a temporary table.
The query takes as long (minutes) as a full scan of all the records would,
instead of just returning the first record.
Using parquet-tools head is very fast (seconds) by comparison, so I suspect this is a
missed optimization opportunity in Spark.
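For illustration only (plain Python, not Spark code): the expected behavior is that a limit pushed into the scan short-circuits reading, while the observed behavior is a full scan followed by the limit. A minimal sketch of the difference, using a hypothetical counting row source:

{noformat}
from itertools import islice

def make_reader(n_rows, touched):
    """Hypothetical lazy row source that records each row it actually reads."""
    for i in range(n_rows):
        touched.append(i)
        yield {"id": i}

# Observed: full scan, then limit -- every row is read before LIMIT applies.
touched_full = []
first_full = list(make_reader(1_000_000, touched_full))[:1]

# Expected: limit pushed into the scan -- reading stops after the first row.
touched_limited = []
first_limited = list(islice(make_reader(1_000_000, touched_limited), 1))

print(len(touched_full), len(touched_limited))  # 1000000 vs 1
{noformat}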
The physical plan is the following:
{noformat}
== Physical Plan ==
CollectLimit 1
+- WholeStageCodegen
: +- Scan ParquetFormat part: struct<>, data: struct<........>[...]
InputPaths: ...
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)