JanKaul commented on issue #8608: URL: https://github.com/apache/arrow-rs/issues/8608#issuecomment-3445935443
You're right, a sketch would be better, since I'm only interested in approximate values for query planning. The issue is about having some kind of standard across different query engines. I think it would be very difficult to establish a new standard for data sketches in Parquet now: you would have to agree on an appropriate sketch and define a new metadata field that is accepted and used across the industry.

The benefit of the 'distinct_count' metadata field is that it already exists. The drawback is that it wasn't designed to hold an approximate value, but I think that is fine as long as writing an approximation is an optional configuration setting.

Generally, a sketch will be more accurate than the distinct_count field. But the default column chunk size is 1 million rows, which should be large enough to provide reasonable distinct count approximations. For query engines, having somewhat inaccurate metadata is typically better than having no metadata.
