[https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253044#comment-17253044]
Nicholas Chammas edited comment on SPARK-12890 at 1/14/21, 5:41 PM:
--------------------------------------------------------------------
I think this is still an open issue. On Spark 2.4.6:
{code:python}
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).explain()
== Physical Plan ==
TakeOrderedAndProject(limit=1, orderBy=[file_date#144 DESC NULLS LAST], output=[file_date#144])
+- *(1) FileScan parquet [file_date#144] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionCount: 74, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
>>> spark.read.parquet('s3a://some/dataset').select('file_date').orderBy('file_date', ascending=False).limit(1).show()
[Stage 23:=====> (2110 + 12) / 21049]
{code}
{{file_date}} is a partitioning column:
{code:bash}
$ aws s3 ls s3://some/dataset/
PRE file_date=2018-10-02/
PRE file_date=2018-10-08/
PRE file_date=2018-10-15/
...
{code}
Schema merging is not enabled, as far as I can tell.
Shouldn't Spark be able to answer this query from partition metadata alone, without scanning all ~21K files?
Is the problem that {{FileScan}} needs to be enhanced to recognize when a query projects only partitioning columns?
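For contrast, a _filter_ on the same column does get pushed down as a partition filter, which suggests the partition metadata is available to the scan; it just isn't exploited when a query only projects partitioning columns. A minimal sketch (the expected plan text is my reconstruction, not captured output):
{code:python}
>>> # Filtering on the partition column prunes at the directory level: the
>>> # resulting plan should show PartitionFilters: [isnotnull(file_date#144),
>>> # (file_date#144 = 2018-10-02)] and PartitionCount: 1.
>>> spark.read.parquet('s3a://some/dataset').where("file_date = '2018-10-02'").explain()
{code}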
For the record, the best current workaround appears to be to [use the catalog to list partitions and extract what's needed that way|https://stackoverflow.com/a/65724151/877069], as sketched below. But it seems like Spark should handle this situation better.
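Here's a minimal sketch of that workaround, assuming the dataset is registered in the metastore as a partitioned table; the table name {{some_table}} is hypothetical:
{code:python}
>>> from pyspark.sql import functions as F
>>> # SHOW PARTITIONS reads only catalog metadata, so no data files are scanned.
>>> # Each row of the 'partition' column is a string like 'file_date=2018-10-02'.
>>> (spark.sql('SHOW PARTITIONS some_table')
...     .select(F.regexp_replace('partition', '^file_date=', '').alias('file_date'))
...     .agg(F.max('file_date'))
...     .show())
{code}
This answers the question from catalog metadata alone, instead of the ~21K-task scan above.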
> Spark SQL query related to only partition fields should not scan the whole
> data.
> --------------------------------------------------------------------------------
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Prakash Chockalingam
> Priority: Minor
>
> I have a SQL query that references only partition fields. The query ends up
> scanning all the data, which is unnecessary.
> Example: {{select max(date) from table}}, where the table is partitioned by date.