Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14690#discussion_r77425776
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -79,8 +79,16 @@ object FileSourceStrategy extends Strategy with Logging {
             ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
           logInfo(s"Pruning directories with: ${partitionKeyFilters.mkString(",")}")
     
    +      val prunedFsRelation = fsRelation.location match {
    --- End diff --
    
    It appears `FileSourceScanExec` still does that. It's just redundant in 
this case.
    
    If I understand correctly, what you're asking for is how things worked in my first commit. In my second commit, I moved this code (along with the `TableFileCatalog.filterPartitions` method) into the planning stage for two reasons.
    
    First, if we need to perform file schema discovery and reconciliation against the metastore schema, we don't want to do that over the entire table. In my first commit, I avoided this operation by removing it completely; as a consequence, however, that broke compatibility with Parquet files with mixed-case column names. As a fix, I planned to defer this schema reconciliation to physical planning, after partition pruning. In the analysis phase, the analyzer would still use the metastore schema. This would be a compromise solution wherein only the pruned partitions' file schemas would be scanned. However, I believe the work to retrieve and use the schema stored in the Hive table properties will make this file scanning entirely unnecessary. That remains to be seen; I'm waiting for #14750 to be merged into master first.
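    For context, the planning-stage pruning described above could look roughly like this (a sketch only, built from the names visible in this diff; the exact constructor and `copy` signatures may differ):
    
    ```scala
    // Sketch: prune the file catalog during physical planning, so that only
    // partitions matching the pushed-down partition-key filters are listed.
    // Names follow this discussion; signatures are approximate.
    val prunedFsRelation = fsRelation.location match {
      case catalog: TableFileCatalog =>
        // Ask the metastore for only the matching partitions instead of
        // listing files for the whole table.
        fsRelation.copy(
          location = catalog.filterPartitions(partitionKeyFilters.toSeq))(
          fsRelation.sparkSession)
      case _ =>
        fsRelation  // not a metastore-backed partitioned table; nothing to prune
    }
    ```
    
    The point of the `match` is that only metastore-backed tables (via `TableFileCatalog`) benefit; other file catalogs fall through unchanged.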
    
    Second, I've long been frustrated by the planner's inability to use the data size of a partition-pruned relation when deciding whether to apply an automatic broadcast join. I just realized that this commit won't help there, since that decision is made against the logical plan before it reaches this code. I could explore pruning in the optimizer instead; in fact, that's how we've done it in an internal Spark clone, and it helps to see the pruning reflected in the optimized plan.
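    For illustration, pruning in the optimizer could be sketched as a logical-plan rule along these lines (a sketch only: `PrunePartitionsEarly` is a hypothetical name, and the `LogicalRelation` pattern and `copy` signatures here are approximate, not the code from our internal clone):
    
    ```scala
    // Hypothetical optimizer rule that prunes partitions in the logical plan,
    // so that sizeInBytes of the relation already reflects pruning when the
    // broadcast-join threshold is checked.
    object PrunePartitionsEarly extends Rule[LogicalPlan] with PredicateHelper {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
        case filter @ Filter(condition,
            rel @ LogicalRelation(fsRelation: HadoopFsRelation, _, _))
            if fsRelation.location.isInstanceOf[TableFileCatalog] =>
          // Resolve the partition columns against the relation's output.
          val partitionSet = AttributeSet(fsRelation.partitionSchema.flatMap { f =>
            rel.output.find(_.name == f.name)
          })
          // Keep only the conjuncts that reference partition columns alone.
          val partitionFilters = splitConjunctivePredicates(condition)
            .filter(_.references.subsetOf(partitionSet))
          if (partitionFilters.nonEmpty) {
            val catalog = fsRelation.location.asInstanceOf[TableFileCatalog]
            val pruned = fsRelation.copy(
              location = catalog.filterPartitions(partitionFilters))(
              fsRelation.sparkSession)
            // Re-apply the full filter; data filters still need evaluation.
            Filter(condition, rel.copy(relation = pruned))
          } else {
            filter
          }
      }
    }
    ```
    
    Doing this in the optimizer, rather than in `FileSourceStrategy`, means the pruned statistics are visible to join-selection rules downstream.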
