peter-toth commented on a change in pull request #33825:
URL: https://github.com/apache/spark/pull/33825#discussion_r695515564
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScan.scala
##########
@@ -187,7 +189,10 @@ trait FileScan extends Scan
new Statistics {
override def sizeInBytes(): OptionalLong = {
val compressionFactor = sparkSession.sessionState.conf.fileCompressionFactor
- val size = (compressionFactor * fileIndex.sizeInBytes).toLong
+ val size = (compressionFactor * fileIndex.sizeInBytes /
+ (dataSchema.defaultSize + fileIndex.partitionSchema.defaultSize) *
+ (readDataSchema.defaultSize + readPartitionSchema.defaultSize)).toLong
Review comment:
Hmm, I don't think the file format has anything to do with this issue.
Let me try to explain what I'm trying to fix in this PR. If you take the
V2 example from the description of SPARK-36568, the cost plan looks like
this:
```
== Optimized Logical Plan ==
Project [key#251L, col1#252, col2#253, col3#254], Statistics(sizeInBytes=1696.5 GiB)
+- Join Inner, (key#251L = key#259L), Statistics(sizeInBytes=1875.1 GiB)
   :- Filter isnotnull(key#251L), Statistics(sizeInBytes=1385.7 KiB)
   :  +- RelationV2[key#251L, col1#252, col2#253, col3#254] parquet file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-9e9cfa69-b858-493a-bbfc-abf83248bf3c/tbl, Statistics(sizeInBytes=1385.7 KiB)
   +- Filter isnotnull(key#259L), Statistics(sizeInBytes=1385.7 KiB)
      +- RelationV2[key#259L] parquet file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-9e9cfa69-b858-493a-bbfc-abf83248bf3c/lookup, Statistics(sizeInBytes=1385.7 KiB)
```
The issue here is that `lookup` (the 2nd `RelationV2`) has the same stats
(`sizeInBytes=1385.7 KiB`) as `tbl` (the 1st `RelationV2`), even though we
know that `lookup` is column-pruned and only returns the `key` column.
Please note that the cost plan is the same if you use other, non-columnar
formats.
After this PR the stats will change to:
```
== Optimized Logical Plan ==
Project [key#251L, col1#252, col2#253, col3#254], Statistics(sizeInBytes=199.6 GiB)
+- Join Inner, (key#251L = key#259L), Statistics(sizeInBytes=220.6 GiB)
   :- Filter isnotnull(key#251L), Statistics(sizeInBytes=1385.7 KiB)
   :  +- RelationV2[key#251L, col1#252, col2#253, col3#254] parquet file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-be132ca2-1ebb-42a4-b2e1-7518fc0210e8/tbl, Statistics(sizeInBytes=1385.7 KiB)
   +- Filter isnotnull(key#259L), Statistics(sizeInBytes=163.0 KiB)
      +- RelationV2[key#259L] parquet file:/private/var/folders/zb/vz4yf0k97j13m39rsn5s1l5m0000gp/T/spark-be132ca2-1ebb-42a4-b2e1-7518fc0210e8/lookup, Statistics(sizeInBytes=163.0 KiB)
```
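To make the change from 1385.7 KiB to 163.0 KiB concrete, here is a rough sketch of the arithmetic from the diff, using plain numbers instead of Spark's `FileScan` and `StructType`. The object name `PrunedSizeEstimate` and its `estimate` helper are hypothetical. The column types are an assumption as well: `key` as a long (default size 8) and `col1`..`col3` as strings (default size 20 each), with `fileCompressionFactor` left at 1.0. These values are not stated in this comment, but they are consistent with the ratio between the two plans.

```scala
// Sketch only: mirrors the arithmetic in the diff with plain numbers.
// It does not use Spark's FileScan, FileIndex or StructType classes.
object PrunedSizeEstimate {
  // Default per-value sizes in bytes, matching Spark's LongType and StringType.
  val LongDefaultSize = 8
  val StringDefaultSize = 20

  /** Scales the file size by the share of the full row width that is read. */
  def estimate(
      compressionFactor: Double,
      fileSizeInBytes: Long,
      fullRowSize: Int,
      readRowSize: Int): Long =
    (compressionFactor * fileSizeInBytes / fullRowSize * readRowSize).toLong

  def main(args: Array[String]): Unit = {
    // ASSUMPTION: key is a long and col1..col3 are strings.
    val fullRowSize = LongDefaultSize + 3 * StringDefaultSize
    // Size reported for both relations before the change (1385.7 KiB).
    val fileSize = (1385.7 * 1024).toLong

    // tbl reads every column, so its estimate is unchanged.
    val tbl = estimate(1.0, fileSize, fullRowSize, fullRowSize)
    // lookup reads only key, so its estimate shrinks to 8/68 of the file size.
    val lookup = estimate(1.0, fileSize, fullRowSize, LongDefaultSize)

    println(f"tbl:    ${tbl / 1024.0}%.1f KiB")
    println(f"lookup: ${lookup / 1024.0}%.1f KiB")
  }
}
```

Under these assumptions the pruned estimate is 1385.7 KiB * 8 / 68, which is about 163.0 KiB, in line with the `lookup` stats in the second plan.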
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]