aokolnychyi commented on issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - foundation
URL: https://github.com/apache/spark/pull/21320#issuecomment-446020655
 
 
   @mallman @dbtsai @gatorsmile 
   
   I have one question about non-deterministic expressions. For example, consider a non-deterministic UDF:
   
   ```
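   import org.apache.spark.sql.functions.{col, udf}

   // `data` is assumed to be the DataFrame over the Parquet `contacts` relation.
   val nonDeterministicUdf = udf((first: String) => first + " " + Math.random()).asNondeterministic()
   val query = data.select(col("id"), nonDeterministicUdf(col("name.first")))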
   ```
   
   As it stands today, no schema pruning happens for this query because of how `collectProjectsAndFilters` is defined in `PhysicalOperation`. The `ReadSchema` in the physical plan below still contains the full `name` struct.
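   
   To illustrate the behaviour, here is a toy sketch, assuming the relevant case only matches a `Project` whose fields are all deterministic. The types and the `collectProjects` helper are hypothetical stand-ins, not Spark's classes, and the real method does more than this.
   
   ```scala
   // Toy model only: none of these types are Spark's own.
   object PruningSketch {
     sealed trait Expr { def deterministic: Boolean }
     case class FieldRef(path: String) extends Expr { val deterministic = true }
     case class Udf(child: Expr, deterministic: Boolean) extends Expr

     sealed trait Plan
     case class Relation(columns: Seq[String]) extends Plan
     case class Project(fields: Seq[Expr], child: Plan) extends Plan

     // Assumed shape of the guard: only look through a projection
     // when every projected expression is deterministic.
     def collectProjects(plan: Plan): Option[Seq[Expr]] = plan match {
       case Project(fields, _) if fields.forall(_.deterministic) => Some(fields)
       case _ => None // nothing collected, so nothing to prune against
     }

     val relation: Plan = Relation(Seq("id", "name"))
     val deterministicPlan: Plan =
       Project(Seq(FieldRef("id"), Udf(FieldRef("name.first"), deterministic = true)), relation)
     val nonDeterministicPlan: Plan =
       Project(Seq(FieldRef("id"), Udf(FieldRef("name.first"), deterministic = false)), relation)
   }
   ```
   
   In this sketch the projection list is collected for `deterministicPlan` but not for `nonDeterministicPlan`, even though both reference only `id` and `name.first`.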
   
   ```
   == Analyzed Logical Plan ==
   id: int, UDF(name.first): string
   Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
   +- Project [id#222, name#223, address#224, pets#225, friends#226, relatives#227, employer#228, p#229]
      +- SubqueryAlias `contacts`
         +- Relation[id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] parquet
   
   == Optimized Logical Plan ==
   Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
   +- Relation[id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] parquet
   
   == Physical Plan ==
   *(1) Project [id#222, UDF(name#223.first) AS UDF(name.first)#246]
   +- *(1) FileScan parquet [id#222,name#223,address#224,pets#225,friends#226,relatives#227,employer#228,p#229] Batched: false, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/private/var/folders/f3/6jyczfzd15ndvh49zq0d_sg80000gn/T/spark-6b69e4e9-c6..., PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:struct<first:string,middle:string,last:string>,address:string,pets:int,friends...
   ```
   
   To me, it seems valid to apply schema pruning in this case. What do you think?
