[GitHub] [iceberg] rdblue commented on a change in pull request #1221: Spark: Fix estimateStatistics when called without filters

GitBox Tue, 21 Jul 2020 10:44:25 -0700


rdblue commented on a change in pull request #1221:
URL: https://github.com/apache/iceberg/pull/1221#discussion_r458276824




##########
File path: site/docs/configuration.md
##########
@@ -109,14 +110,14 @@ spark.read
     .table("catalog.db.table")
 ```
 
-| Spark option    | Default               | Description                                                                               |
-| --------------- | --------------------- | ----------------------------------------------------------------------------------------- |
-| snapshot-id     | (latest)              | Snapshot ID of the table snapshot to read                                                 |
-| as-of-timestamp | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
-| split-size      | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
-| lookback        | As per table property | Overrides this table's read.split.planning-lookback                                       |
-| file-open-cost  | As per table property | Overrides this table's read.split.open-file-cost                                          |
-
+| Spark option               | Default               | Description                                                                               |
+| -------------------------- | --------------------- | ----------------------------------------------------------------------------------------- |
+| snapshot-id                | (latest)              | Snapshot ID of the table snapshot to read                                                 |
+| as-of-timestamp            | (latest)              | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. |
+| split-size                 | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size         |
+| lookback                   | As per table property | Overrides this table's read.split.planning-lookback                                       |
+| file-open-cost             | As per table property | Overrides this table's read.split.open-file-cost                                          |
+| use-approximate-statistics | As per table property | Overrides this table's read.spark.read.spark.use-approximate-statistics                   |

Review comment:
       I'm not sure I understand the case you're talking about. What I'm saying 
is that because we are using reliable stats that are maintained in the table 
metadata, we can always use them if there are no filters. That takes care of 
the bad case in Spark 2.4. Since Spark 2.4 doesn't ever push filters before 
calling `estimateStatistics`, it will always use metadata stats and will avoid 
the issue entirely.
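
A rough sketch of the rule described in the comment above, written in Python only to keep it self-contained. This is not the code from the pull request. `Stats`, `estimate_statistics`, `plan_scan`, and `row_size_bytes` are hypothetical names, and reading the row count from a `total-records` entry in the snapshot summary is an assumption about where the metadata total lives.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Stats:
    size_in_bytes: int
    num_rows: int


def estimate_statistics(
    snapshot_summary: Dict[str, str],
    filters: List[str],
    plan_scan: Callable[[List[str]], Stats],
    row_size_bytes: int,
) -> Stats:
    """Hypothetical stand-in for the reader's estimateStatistics call."""
    if not filters and "total-records" in snapshot_summary:
        # No filters were pushed, so the total kept in table metadata
        # covers exactly the rows the scan will read.
        rows = int(snapshot_summary["total-records"])
        return Stats(size_in_bytes=rows * row_size_bytes, num_rows=rows)
    # Filters are present (or the total is missing): fall back to
    # whatever estimate comes from planning the filtered scan.
    return plan_scan(filters)
```

Under these assumptions, a caller that never pushes filters before asking for statistics (the Spark 2.4 behavior described above) always takes the first branch and never triggers scan planning.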




