JFinis commented on PR #221: URL: https://github.com/apache/parquet-format/pull/221#issuecomment-2937628530
@orlp I actually had `nan_counts` in my [initial proposal](https://github.com/apache/parquet-format/pull/196). Let me summarize the gist of the discussion in that post (feel free to read it yourself) that prompted us to switch to this approach instead:

* It was argued that adding such special handling just for floating point is not that clean. What if the next type needs special handling for some of its values? Do we add fields for those as well? If yes, the statistics might over time become a mess where type-specific information spills into the generic statistics definition. If not, then why is float so special that it warrants extra fields while other types do not?
* NaNs are rare. The current main problem of Parquet is not that NaNs, if present, make filtering inefficient. The problem is that NaNs make filtering impossible, *even if they are not present*. So the main focus of this proposal is to have sane semantics that work especially well when there are no NaNs, while still working in the presence of NaNs.
* The "NaN poisoning" you mention, which I also brought up in my initial proposal, is somewhat arbitrary: yes, NaN is considered an extreme value in this proposal and therefore overwrites other extreme values, making filtering less efficient, but is it really worth making a case distinction here? E.g., what about infinity? You could argue that having infinity in the statistics is similarly unhelpful, so we could also introduce `infinity_counts` and then have the statistics contain only non-infinite, non-NaN values.

So in the end, this boils down to: are we willing to special-case float to make filtering in the presence of NaNs more efficient, or do we go with a more streamlined implementation without special fields that only make sense for a single data type, at the cost of worse filtering performance when NaNs are present? The consensus was, more or less, the former, and over time I have come to agree with it.

Regarding your second comment about making the sign bit significant:
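To make the "NaN poisoning" trade-off concrete, here is a minimal, hypothetical sketch (the names are mine, not from the Parquet spec) of page-level min/max statistics where NaN is treated as the largest value; a single NaN widens the max and defeats range pruning for the whole page, which is exactly the inefficiency `nan_counts` would have avoided:

```python
import math

def page_min_max(values):
    """Min/max under an order where NaN compares greater than
    every real number (the sign bit of NaN is ignored for brevity)."""
    mn = mx = values[0]
    for v in values[1:]:
        # NaN wins every max comparison, so one NaN "poisons" the max.
        if math.isnan(v) or (not math.isnan(mx) and v > mx):
            mx = v
        # ...and never wins a min comparison here.
        if not math.isnan(v) and (math.isnan(mn) or v < mn):
            mn = v
    return mn, mx

# A single NaN turns the max into NaN, so a reader can no longer
# prune this page for a predicate like `x > 100`.
mn, mx = page_min_max([1.5, 2.0, float("nan"), 3.0])
```

When no NaNs are present, the statistics are exactly the ordinary min/max, which is the common case the proposal optimizes for.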
I am working on an engine myself that does *not* care about sign bits in NaNs, but I still consider having them in Parquet okay. When reading, our engine will simply normalize the sign bit away (or just keep the bit pattern as is). When writing, we will simply never write negative NaNs (or just write whatever bit pattern we have). Therefore, I don't see the danger you mention, IMHO.

Of course, any writer that wants to conform to the Parquet spec would have to handle this correctly: either normalize the sign bit away or take it into account when computing min/max; otherwise it is an incorrect writer. A writing engine cannot just do whatever it does internally; it has to do whatever the Parquet spec mandates. Again, the engine I am working on handles NaNs differently internally as well, but of course our Parquet read/write code does what Parquet mandates, not what we otherwise do internally.

So basically your danger boils down to: "There is a chance that a sloppily programmed writer gets this wrong." That is indeed a risk. But IMHO, there are way more complex things in Parquet to get wrong, so this shouldn't be our bar. On the upside, what we gain by considering the sign bit of NaNs is:

* a very efficient and straightforward software implementation of the comparison predicate
* a comparison predicate that is actually standardized by IEEE, so one could argue that sticking to standards is preferable over rolling our own.
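The "efficient and straightforward implementation" refers to the well-known bit trick behind the IEEE 754 `totalOrder` predicate: reinterpret the float's bits as an integer and flip them so that plain unsigned integer comparison yields the total order, sign-significant NaNs included. A sketch in Python (a production writer would do this branch-free on the raw bits; Rust's `f64::total_cmp` implements the same idea):

```python
import struct

def total_order_key(x: float) -> int:
    """Map a float to an unsigned 64-bit key whose integer order
    matches IEEE 754 totalOrder:
    -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:
        # Negative floats (including negative NaNs) sort in reverse
        # bit order, so invert all 64 bits.
        return bits ^ 0xFFFF_FFFF_FFFF_FFFF
    # Non-negative floats: set the sign bit so they sort above
    # every negative value.
    return bits | (1 << 63)
```

Note that this order also distinguishes -0.0 from +0.0, which falls out of the same bit pattern comparison for free.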
