Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/14240#discussion_r71585784
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/sources/PrunedScanSuite.scala ---
@@ -114,16 +114,15 @@ class PrunedScanSuite extends DataSourceTest with SharedSQLContext {
testPruning("SELECT * FROM oneToTenPruned", "a", "b")
testPruning("SELECT a, b FROM oneToTenPruned", "a", "b")
testPruning("SELECT b, a FROM oneToTenPruned", "b", "a")
- testPruning("SELECT b, b FROM oneToTenPruned", "b")
+ testPruning("SELECT b, b FROM oneToTenPruned", "b", "b")
+ testPruning("SELECT b as alias_b, b FROM oneToTenPruned", "b")
--- End diff ---
me too. : ( I am also learning this part while refactoring the Hive table scan; that is how I found the behavioral inconsistency between the different types of table scans. Let me try to summarize it in a more organized way.
Currently, when converting logical plans to physical plans, we have two different strategies for the table scan (plus the adjacent `Filter` and `Project`, if any):
- Hive Table Scan and In-memory Table Scan
- Data Source Table Scan

In both cases, the basic functionality is the same.
The inputs are the filter predicates, the output project list, and a leaf relation: `LogicalRelation`, `MetastoreRelation`, or `InMemoryRelation`. These inputs are produced by the `PhysicalOperation` pattern, which extracts and normalizes the `Project` and `Filter` operators sitting on top of the scan.
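For illustration, a planner strategy consumes this pattern roughly as follows (a minimal sketch: `ExampleScanStrategy` and the `buildScan` helper are hypothetical, while `PhysicalOperation`, `Strategy`, and `LogicalRelation` are the real Spark types):
```Scala
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression}
import org.apache.spark.sql.catalyst.planning.PhysicalOperation
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.datasources.LogicalRelation

// Hypothetical strategy: PhysicalOperation hands us the normalized
// (projects, filters, relation) triple extracted from the logical plan.
object ExampleScanStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case PhysicalOperation(projects, filters, relation: LogicalRelation) =>
      // projects: the output project list (may contain aliases)
      // filters:  predicates collected from the adjacent Filter nodes
      buildScan(relation, projects, filters) :: Nil
    case _ => Nil
  }

  // Stub so the sketch is self-contained; the real pruneFilterProject
  // logic is summarized below.
  private def buildScan(
      relation: LogicalRelation,
      projects: Seq[NamedExpression],
      filters: Seq[Expression]): SparkPlan = ???
}
```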
The goal is to generate a physical plan like:
```
ProjectExec
+- FilterExec
+- TableScan
```
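For example (a hypothetical query against the test table above), `SELECT a FROM oneToTenPruned WHERE b > 5` references `b` only in the filter, so it would plan roughly as:
```
ProjectExec [a]
+- FilterExec (b > 5)
   +- TableScan [a, b]
```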
`ProjectExec` is optional. When the filter predicates reference attributes that are not part of the output, or the output project list contains aliases, we have to add a `ProjectExec`; otherwise, it is OK to exclude it. This logic is implemented by the following condition in both `pruneFilterProject` functions:
```Scala
AttributeSet(projectList.map(_.toAttribute)) == projectSet &&
  filterSet.subsetOf(projectSet)
```
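To make the branching concrete, here is a simplified paraphrase of the shared `pruneFilterProject` logic (a sketch following the linked sources, with signatures trimmed, so do not read it as the exact implementation):
```Scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.execution.{FilterExec, ProjectExec, SparkPlan}

// Decide whether the scan output can be used as-is, or whether an explicit
// ProjectExec must be added on top of the (possibly filtered) scan.
def pruneFilterProject(
    projectList: Seq[NamedExpression],
    filterPredicates: Seq[Expression],
    scanBuilder: Seq[Attribute] => SparkPlan): SparkPlan = {
  val projectSet = AttributeSet(projectList.flatMap(_.references))
  val filterSet = AttributeSet(filterPredicates.flatMap(_.references))
  val condition = filterPredicates.reduceLeftOption(And)

  if (AttributeSet(projectList.map(_.toAttribute)) == projectSet &&
      filterSet.subsetOf(projectSet)) {
    // No aliases and no filter-only attributes: the scan can directly
    // produce the requested output, so ProjectExec is unnecessary.
    val scan = scanBuilder(projectList.asInstanceOf[Seq[Attribute]])
    condition.map(FilterExec(_, scan)).getOrElse(scan)
  } else {
    // Aliases in the project list, or attributes referenced only by the
    // filter, force a ProjectExec on top of a wider scan.
    val scan = scanBuilder((projectSet ++ filterSet).toSeq)
    ProjectExec(projectList, condition.map(FilterExec(_, scan)).getOrElse(scan))
  }
}
```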
The relevant source:
- [The original `pruneFilterProject` for Data Source Table Scan](https://github.com/apache/spark/blob/b1e5281c5cb429e338c3719c13c0b93078d7312a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L364-L366)
- [The `pruneFilterProject` for Hive Table Scan and In-memory Table Scan](https://github.com/apache/spark/blob/865ec32dd997e63aea01a871d1c7b4947f43c111/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L88-L89)
The above code shows that the Data Source Table Scan has one extra condition:
```Scala
projectSet.size == projects.size
```
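As a concrete illustration of when this condition matters (a self-contained sketch; the attribute construction is just for demonstration):
```Scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
import org.apache.spark.sql.types.IntegerType

// For "SELECT b, b FROM oneToTenPruned" the project list has two entries,
// but AttributeSet deduplicates, so the sizes differ and the Data Source
// path adds a ProjectExec to reproduce the duplicated output column.
val b = AttributeReference("b", IntegerType)()
val projects = Seq(b, b)                 // SELECT b, b
val projectSet = AttributeSet(projects.map(_.toAttribute))

assert(projects.size == 2)
assert(projectSet.size == 1)             // deduplicated
assert(projectSet.size != projects.size) // extra condition fails => add ProjectExec
```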
This condition covers a very specific, rare case: users selecting duplicate columns without any alias. To make the two code paths consistent, we are facing two options:
- add this condition to both scan scenarios
- remove this condition from both

Either is fine with me, but I think we need to make them consistent. Let me know if my explanation is clear.