Github user mallman commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14690#discussion_r77425776
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala ---
    @@ -79,8 +79,16 @@ object FileSourceStrategy extends Strategy with Logging {
             ExpressionSet(normalizedFilters.filter(_.references.subsetOf(partitionSet)))
           logInfo(s"Pruning directories with: ${partitionKeyFilters.mkString(",")}")
     
    +      val prunedFsRelation = fsRelation.location match {
    --- End diff --
    
    It appears `FileSourceScanExec` still does that. It's just redundant in 
this case.
    
    If I understand correctly, what you're asking for is how things worked in my first commit. In my second commit, I moved this code (along with the `TableFileCatalog.filterPartitions` method) into the planning stage for two reasons.
    
    First, if we need to perform file schema discovery and reconciliation against the metastore schema, we don't want to do that over the entire table. In my first commit, I avoided this operation by removing it completely; as a consequence, however, that broke compatibility with Parquet files with mixed-case column names. As a fix, I planned to defer this schema reconciliation to physical planning, after partition pruning. In the analysis phase, the analyzer would still use the metastore schema. This would be a compromise solution wherein only the pruned partitions' file schemas would be scanned. However, I believe the work to retrieve and use the schema stored in the Hive table properties will make this file scanning entirely unnecessary. That remains to be seen; I'm waiting for #14750 to be merged into master first.
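    For context, the planning-stage pruning described above could look roughly like this (a sketch only, built from the names visible in this diff; the exact constructor and `copy` signatures may differ):
    
    ```scala
    // Sketch: prune the file catalog during physical planning, so that only
    // partitions matching the pushed-down partition-key filters are listed.
    // Names follow this discussion; signatures are approximate.
    val prunedFsRelation = fsRelation.location match {
      case catalog: TableFileCatalog =>
        // Ask the metastore for only the matching partitions instead of
        // listing files for the whole table.
        fsRelation.copy(
          location = catalog.filterPartitions(partitionKeyFilters.toSeq))(
          fsRelation.sparkSession)
      case _ =>
        fsRelation  // not a metastore-backed partitioned table; nothing to prune
    }
    ```
    
    The point of the `match` is that only metastore-backed tables (via `TableFileCatalog`) benefit; other file catalogs fall through unchanged.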
    
    Second, I've long been frustrated by the planner's inability to use the data size of a partition-pruned relation when deciding whether to apply an automatic broadcast join. I just realized that this commit won't help there, since that decision is made against the logical plan before it reaches this code. I could explore pruning in the optimizer instead; in fact, that's how we've done it in an internal Spark clone, and it helps to see the pruning reflected in the optimized plan.
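    For illustration, pruning in the optimizer could be sketched as a logical-plan rule along these lines (a sketch only: `PrunePartitionsEarly` is a hypothetical name, and the `LogicalRelation` pattern and `copy` signatures here are approximate, not the code from our internal clone):
    
    ```scala
    // Hypothetical optimizer rule that prunes partitions in the logical plan,
    // so that sizeInBytes of the relation already reflects pruning when the
    // broadcast-join threshold is checked.
    object PrunePartitionsEarly extends Rule[LogicalPlan] with PredicateHelper {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
        case filter @ Filter(condition,
            rel @ LogicalRelation(fsRelation: HadoopFsRelation, _, _))
            if fsRelation.location.isInstanceOf[TableFileCatalog] =>
          // Resolve the partition columns against the relation's output.
          val partitionSet = AttributeSet(fsRelation.partitionSchema.flatMap { f =>
            rel.output.find(_.name == f.name)
          })
          // Keep only the conjuncts that reference partition columns alone.
          val partitionFilters = splitConjunctivePredicates(condition)
            .filter(_.references.subsetOf(partitionSet))
          if (partitionFilters.nonEmpty) {
            val catalog = fsRelation.location.asInstanceOf[TableFileCatalog]
            val pruned = fsRelation.copy(
              location = catalog.filterPartitions(partitionFilters))(
              fsRelation.sparkSession)
            // Re-apply the full filter; data filters still need evaluation.
            Filter(condition, rel.copy(relation = pruned))
          } else {
            filter
          }
      }
    }
    ```
    
    Doing this in the optimizer, rather than in `FileSourceStrategy`, means the pruned statistics are visible to join-selection rules downstream.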
