[
https://issues.apache.org/jira/browse/SPARK-13908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Luca Bruno updated SPARK-13908:
-------------------------------
Description:
Hello,
I'm doing a simple query like this on a single parquet file:
{noformat}
SELECT *
FROM someparquet
LIMIT 1
{noformat}
The someparquet table is just a parquet file read and registered as a temporary table.
The query takes as much time (minutes) as it would take to scan all the records,
instead of returning after the first record.
Using parquet-tools head is instead very fast (seconds), so I suspect this is a
missing optimization opportunity in Spark.
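For context, the setup is roughly the following (a sketch only; the path and view name are placeholders, and this assumes the Spark 2.0 SparkSession API):
{noformat}
// Sketch of the reproduction (hypothetical path, Spark 2.0 API):
val df = spark.read.parquet("hdfs://namenode/path/to/file.parquet")
df.createOrReplaceTempView("someparquet")
spark.sql("SELECT * FROM someparquet LIMIT 1").show()
{noformat}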
The physical plan is the following:
{noformat}
== Physical Plan ==
CollectLimit 1
+- WholeStageCodegen
: +- Scan ParquetFormat part: struct<>, data: struct<........>[...]
InputPaths: hdfs://...
{noformat}
was:
Hello,
I'm doing a simple query like this on a single parquet file:
{noformat}
SELECT *
FROM someparquet
LIMIT 1
{noformat}
The someparquet table is just a parquet file read and registered as a temporary table.
The query takes as much time (minutes) as it would take to scan all the records,
instead of returning after the first record.
Using parquet-tools head is instead very fast (seconds), so I suspect this is a
missing optimization opportunity in Spark.
The physical plan is the following:
{noformat}
== Physical Plan ==
CollectLimit 1
+- WholeStageCodegen
: +- Scan ParquetFormat part: struct<>, data: struct<........>[...]
InputPaths: ...
{noformat}
> Limit not pushed down
> ---------------------
>
> Key: SPARK-13908
> URL: https://issues.apache.org/jira/browse/SPARK-13908
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Environment: Spark compiled from git with commit 53ba6d6
> Reporter: Luca Bruno
> Labels: performance
>
> Hello,
> I'm doing a simple query like this on a single parquet file:
> {noformat}
> SELECT *
> FROM someparquet
> LIMIT 1
> {noformat}
> The someparquet table is just a parquet file read and registered as a temporary
> table.
> The query takes as much time (minutes) as it would take to scan all the
> records, instead of returning after the first record.
> Using parquet-tools head is instead very fast (seconds), so I suspect this is a
> missing optimization opportunity in Spark.
> The physical plan is the following:
> {noformat}
> == Physical Plan ==
>
> CollectLimit 1
> +- WholeStageCodegen
> : +- Scan ParquetFormat part: struct<>, data: struct<........>[...]
> InputPaths: hdfs://...
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]