[https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253044#comment-17253044]
Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:
--------------------------------------------------------------------
I think this is still an open issue. On Spark 2.4.6:
{code:python}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 23:=====> (2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:bash}
$ aws s3 ls s3://some/dataset/
PRE file_date=2018-10-02/
PRE file_date=2018-10-08/
PRE file_date=2018-10-15/
...
{code}
Schema merging is not enabled, as far as I can tell.
Shouldn't Spark be able to answer this query from partition metadata alone, without scanning all ~21K files?
Is the problem that {{FileScan}} needs to be enhanced to recognize when a query projects only partitioning columns?
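For contrast, a _filter_ on the same column does get pushed down as a partition filter, which suggests the partition metadata is available to the scan; it just isn't exploited when a query only projects partitioning columns. A minimal sketch (the expected plan text is my reconstruction, not captured output):
{code:python}
>>> # Filtering on the partition column prunes at the directory level: the
>>> # resulting plan should show PartitionFilters: [isnotnull(file_date#144),
>>> # (file_date#144 = 2018-10-02)] and PartitionCount: 1.
>>> spark.read.parquet('s3a://some/dataset').where("file_date = '2018-10-02'").explain()
{code}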
For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/65724151/877069], as sketched below. But it seems like Spark should handle this situation better.
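Here's a minimal sketch of that workaround, assuming the dataset is registered in the metastore as a partitioned table; the table name {{some_table}} is hypothetical:
{code:python}
>>> from pyspark.sql import functions as F
>>> # SHOW PARTITIONS reads only catalog metadata, so no data files are scanned.
>>> # Each row of the 'partition' column is a string like 'file_date=2018-10-02'.
>>> (spark.sql('SHOW PARTITIONS some_table')
...     .select(F.regexp_replace('partition', '^file_date=', '').alias('file_date'))
...     .agg(F.max('file_date'))
...     .show())
{code}
This answers the question from catalog metadata alone, instead of the ~21K-task scan above.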
> Spark SQL query related to only partition fields should not scan the whole
> data.
> --------------------------------------------------------------------------------
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Prakash Chockalingam
> Priority: Minor
>
> I have a SQL query that references only partition fields. The query ends up
> scanning all the data, which is unnecessary.
> Example: {{select max(date) from table}}, where the table is partitioned by date.