Re: [PR] [GLUTEN-8580][CORE][Part-2] Don't validate project generated by PushDownInputFileExpression [incubator-gluten]

via GitHub Tue, 21 Jan 2025 20:09:20 -0800


zhztheplayer commented on PR #8585:
URL: https://github.com/apache/incubator-gluten/pull/8585#issuecomment-2606246178


   Attaching a query optimization example for this feature to help readers better understand how it works:
   
   ```
   1. Input plan:
   
   CollectLimit 100
   +- Project [input_file_name() AS input_file_name()#208, a#195L]
      +- Union
         :- Project [a#195L]
         :  +- BatchScan json file:/tmp/spark-5de024cd-776a-4b52-bddc-d592d63abaf1[a#195L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-5de024cd-776a-4b52-bddc-d592d63abaf1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint> RuntimeFilters: []
         +- Project [l_orderkey#76L AS a#207L]
            +- BatchScan parquet file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-data-parquet-velox/lineitem[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedAggregation: [], PushedFilters: [], PushedGroupBy: [], ReadSchema: struct<l_orderkey:bigint> RuntimeFilters: []
   
   2. Plan after applying the pre-offload rule:
   
   Project [input_file_name#169 AS input_file_name()#164, a#151L]
   +- Union
      :- Project [a#151L, input_file_name#169]
      :  +- Project [a#151L, input_file_name() AS input_file_name#169]
      :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
      +- Project [l_orderkey#76L AS a#163L, input_file_name#170]
         +- Project [l_orderkey#76L, input_file_name() AS input_file_name#170]
            +- BatchScan[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []
   
   3. Plan after applying offload rules:
   
   CollectLimit 100
   +- ProjectExecTransformer [input_file_name#169 AS input_file_name()#164, a#151L]
      +- ColumnarUnion
         :- ProjectExecTransformer [a#151L, input_file_name#169]
         :  +- Project [a#151L, input_file_name() AS input_file_name#169]
         :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
         +- ProjectExecTransformer [l_orderkey#76L AS a#163L, input_file_name#170]
            +- Project [l_orderkey#76L, input_file_name() AS input_file_name#170]
               +- BatchScanExecTransformer[l_orderkey#76L] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []
   
   4. Plan after applying post-offload rule:
   
   CollectLimit 100
   +- ProjectExecTransformer [input_file_name#169 AS input_file_name()#164, a#151L]
      +- ColumnarUnion
         :- ProjectExecTransformer [a#151L, input_file_name#169]
         :  +- Project [a#151L, input_file_name() AS input_file_name#169]
         :     +- BatchScan[a#151L] JsonScan DataFilters: [], Format: json, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-efaf98cf-a5f0-4d62-ae94-23dec424764e], PartitionFilters: [], ReadSchema: struct<a:bigint>, PushedFilters: [] RuntimeFilters: []
         +- ProjectExecTransformer [l_orderkey#76L AS a#163L, input_file_name#170]
            +- BatchScanExecTransformer[l_orderkey#76L, input_file_name#170] ParquetScan DataFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[file:/opt/gluten/backends-velox/target/scala-2.12/test-classes/tpch-da..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<l_orderkey:bigint>, PushedFilters: [] RuntimeFilters: []
   ```
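
For readers who want to trace the four plans above step by step, here is a toy Python model of the three rewrite stages. It is only a sketch and is not Gluten code: the `Node` class, the `pre_offload` / `offload` / `post_offload` functions, and the assumption that only Parquet scans can be offloaded are all hypothetical names and simplifications chosen to mirror the example. Expression IDs, `CollectLimit`, and row/columnar transitions are left out.

```python
from dataclasses import dataclass, replace
from typing import Tuple

FILE_COL = "input_file_name"


@dataclass(frozen=True)
class Node:
    """Hypothetical, heavily simplified stand-in for a physical plan node."""
    op: str                          # "Scan", "Project" or "Union"
    output: Tuple[str, ...]          # output column names
    children: Tuple["Node", ...] = ()
    fmt: str = ""                    # file format, meaningful for scans only
    generated: bool = False          # Project inserted by the pre-offload stage
    native: bool = False             # True once the node has been offloaded


def pre_offload(node: Node) -> Node:
    """Stage 2 above: put a Project computing input_file_name() on every scan
    and carry the new column up through the parents."""
    if node.op == "Scan":
        return Node("Project", node.output + (FILE_COL,), (node,), generated=True)
    kids = tuple(pre_offload(c) for c in node.children)
    out = node.output if FILE_COL in node.output else node.output + (FILE_COL,)
    return replace(node, children=kids, output=out)


def offload(node: Node, native_formats=("parquet",)) -> Node:
    """Stage 3 above: offload what can be offloaded. The generated Project is
    skipped on purpose, and (by assumption) only some scan formats qualify."""
    kids = tuple(offload(c, native_formats) for c in node.children)
    if node.generated:
        native = False
    elif node.op == "Scan":
        native = node.fmt in native_formats
    else:
        native = True
    return replace(node, children=kids, native=native)


def post_offload(node: Node) -> Node:
    """Stage 4 above: when the generated Project sits on an offloaded scan,
    drop it and let the scan itself produce the file-name column."""
    kids = tuple(post_offload(c) for c in node.children)
    if node.generated and kids[0].op == "Scan" and kids[0].native:
        scan = kids[0]
        return replace(scan, output=scan.output + (FILE_COL,))
    return replace(node, children=kids)


def render(node: Node, depth: int = 0) -> str:
    """Pretty-print the toy plan, marking offloaded nodes with a suffix."""
    name = node.op + ("Transformer" if node.native else "")
    line = "  " * depth + f"{name} [{', '.join(node.output)}]"
    return "\n".join([line] + [render(c, depth + 1) for c in node.children])


# Toy version of the input plan: a union of a JSON scan and a Parquet scan.
plan = Node("Project", (FILE_COL, "a"), (
    Node("Union", ("a",), (
        Node("Project", ("a",), (Node("Scan", ("a",), fmt="json"),)),
        Node("Project", ("a",), (Node("Scan", ("l_orderkey",), fmt="parquet"),)),
    )),
))

final_plan = post_offload(offload(pre_offload(plan)))
print(render(final_plan))
```

In this model the JSON branch keeps its non-offloaded generated `Project` on top of the non-offloaded scan, while on the Parquet branch the generated `Project` disappears and the offloaded scan gains the `input_file_name` column, which is the same shape as plan 4 above.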


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

