[GitHub] [spark] peter-toth commented on a change in pull request #33825: [SPARK-36568][SQL] Better FileScan statistics estimation

GitBox Wed, 25 Aug 2021 01:17:57 -0700


peter-toth commented on a change in pull request #33825:
URL: https://github.com/apache/spark/pull/33825#discussion_r695515564




##########
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -187,7 +189,10 @@ trait FileScan extends Scan
     new Statistics {
       override def sizeInBytes(): OptionalLong = {
         val compressionFactor = 
sparkSession.sessionState.conf.fileCompressionFactor
-        val size = (compressionFactor * fileIndex.sizeInBytes).toLong
+        val size = (compressionFactor * fileIndex.sizeInBytes /
+          (dataSchema.defaultSize + fileIndex.partitionSchema.defaultSize) *
+          (readDataSchema.defaultSize + 
readPartitionSchema.defaultSize)).toLong

Review comment:
       Hmm, I don't think that file format has anything to do with this issue. 
Let me try to explain what I'm trying to fix in this PR. If you take the 
example V2 example from the description of SPARK-36568 the cost plan looks like 
this:
   
   ```
   == Optimized Logical Plan ==
   Project [key#251L, col1#252, col2#253, col3#254], 
Statistics(sizeInBytes=1696.5 GiB)
   +- Join Inner, (key#251L = key#259L), Statistics(sizeInBytes=1875.1 GiB)
      :- Filter isnotnull(key#251L), Statistics(sizeInBytes=1385.7 KiB)
      :  +- RelationV2[key#251L, col1#252, col2#253, col3#254] parquet 
file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-9e9cfa69-b858-493a-bbfc-abf83248bf3c/tbl,
 Statistics(sizeInBytes=1385.7 KiB)
      +- Filter isnotnull(key#259L), Statistics(sizeInBytes=1385.7 KiB)
         +- RelationV2[key#259L] parquet 
file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-9e9cfa69-b858-493a-bbfc-abf83248bf3c/lookup,
 Statistics(sizeInBytes=1385.7 KiB)
   ```
   The issue here is that `lookup` (2nd `RelationV2`) has the same stats 
(sizeInBytes=1385.7 KiB) as `tbl` (1st `RelationV2`) does, but we know that 
`lookup` is column pruned and only returns the `key` column.
   Please note that the cost plan is the same if you use other, non-columnar 
formats.
   
   After this PR the stats will change to:
   ```
   == Optimized Logical Plan ==
   Project [key#251L, col1#252, col2#253, col3#254], 
Statistics(sizeInBytes=199.6 GiB)
   +- Join Inner, (key#251L = key#259L), Statistics(sizeInBytes=220.6 GiB)
      :- Filter isnotnull(key#251L), Statistics(sizeInBytes=1385.7 KiB)
      :  +- RelationV2[key#251L, col1#252, col2#253, col3#254] parquet 
file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-be132ca2-1ebb-42a4-b2e1-7518fc0210e8/tbl,
 Statistics(sizeInBytes=1385.7 KiB)
      +- Filter isnotnull(key#259L), Statistics(sizeInBytes=163.0 KiB)
         +- RelationV2[key#259L] parquet 
file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-be132ca2-1ebb-42a4-b2e1-7518fc0210e8/lookup,
 Statistics(sizeInBytes=163.0 KiB)
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] peter-toth commented on a change in pull request #33825: [SPARK-36568][SQL] Better FileScan statistics estimation

Reply via email to