rdblue commented on a change in pull request #2805:
URL: https://github.com/apache/iceberg/pull/2805#discussion_r668347737



##########
File path: site/docs/spec.md
##########
@@ -375,7 +375,7 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo
 | _optional_ | _optional_ | **`109  value_counts`**           | `map<119: int, 120: long>`   | Map from column id to number of values in the column (including null and NaN values) |
 | _optional_ | _optional_ | **`110  null_value_counts`**      | `map<121: int, 122: long>`   | Map from column id to number of null values in the column |
 | _optional_ | _optional_ | **`137  nan_value_counts`**       | `map<138: int, 139: long>`   | Map from column id to number of NaN values in the column |
-| _optional_ |            | ~~**`111 distinct_counts`**~~     | `map<123: int, 124: long>`   | **Deprecated. Do not write.** |
+| _optional_ | _optional_ | **`111  distinct_counts`**        | `map<123: int, 124: long>`   | Map from column id to number of distinct values in the column; distinct counts must be produced using values in the file, not on merged counts from other metadata |

Review comment:
       I think it is clearer to state explicitly that this can't be a merged 
count. If someone is compacting data files without reading the data, for 
example by merging Parquet row groups without re-encoding, it may seem 
reasonable to merge the counts somehow. I'm not sure people would consider 
merging to be an operation that doesn't reflect the distinct count in the 
file, which is why I mentioned it specifically.
   
   Are there other cases you have in mind that would not "reflect the 
distinct count in the file"? I'm trying to use language that allows non-exact 
values. I think we should be in agreement that sketch-based counts are okay.
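
   To illustrate the concern, here is a minimal sketch with made-up values 
(the lists and variable names are hypothetical, not Iceberg APIs). It shows 
that per-file distinct counts cannot be combined from metadata alone: summing 
over-counts values shared between files, and taking the max under-counts 
values unique to one file.

```python
# Hypothetical values of one column in two data files being compacted.
file_a = ["us", "de", "fr"]
file_b = ["de", "fr", "jp"]

# Per-file distinct counts, as they would be stored in each file's metadata.
count_a = len(set(file_a))
count_b = len(set(file_b))

# Distinct count produced from the values actually in the compacted file.
true_count = len(set(file_a + file_b))

# Counts derived from the per-file metadata alone, without reading the data.
merged_sum = count_a + count_b      # over-counts values present in both files
merged_max = max(count_a, count_b)  # under-counts values unique to one file

print(true_count, merged_sum, merged_max)
```

   Only the overlap between the files determines the real count, and that 
overlap is not recoverable from the two stored numbers.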




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.