JFinis commented on PR #221:
URL: https://github.com/apache/parquet-format/pull/221#issuecomment-2937628530

   @orlp I actually had `nan_counts` in my [initial 
proposal](https://github.com/apache/parquet-format/pull/196). I'll try to 
summarize the gist of the discussion there (feel free to read it 
yourself) that prompted us to move to this approach instead.
   
   * It was argued that adding such special handling just for floating point is 
not actually that clean. What if the next type needs special handling for some 
values? Do we add those as well? If yes, the statistics might over time become 
a mess where type-specific information spills into the generic statistics 
definition. If not, then why is float so special that it requires special 
fields while other types do not?
   * NaNs are rare. The current main problem in Parquet is not that NaNs, if 
present, make filtering inefficient. The problem is that NaNs make filtering 
impossible, *even if they are not present*. So the main focus of this proposal 
is to have sane semantics that work especially well if there are no NaNs, 
while still working in the presence of NaNs.
   * The "NaN poisoning" you mention, which I also raised in my initial 
proposal, is somewhat arbitrary: yes, NaN is considered an extreme value in 
this proposal and therefore overwrites other extreme values, making filtering 
less efficient, but is it really worth making a case distinction here? E.g., 
what about infinity? You could argue that having infinity in statistics is 
similarly unhelpful, so we could also introduce `infinity_counts` and then have 
the statistics only contain non-infinite, non-NaN values.
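
   To make the poisoning concrete, here is a minimal Python sketch (not 
Parquet code; `column_max` and `can_skip_gt` are hypothetical names I made up 
for illustration) of a stats writer that treats NaN as the largest value, and 
a reader trying to prune a row group for `col > literal`:

   ```python
   import math

   def column_max(values):
       # Hypothetical stats computation where NaN counts as the largest
       # value, i.e. a single NaN "poisons" the max.
       m = values[0]
       for v in values[1:]:
           if math.isnan(v) or (not math.isnan(m) and v > m):
               m = v
       return m

   def can_skip_gt(max_stat, literal):
       # A row group can be skipped for `col > literal` only if
       # max <= literal. A NaN max never satisfies this, so nothing
       # is skipped once a NaN has poisoned the statistics.
       return not math.isnan(max_stat) and max_stat <= literal
   ```

   With no NaNs, `can_skip_gt(column_max([1.0, 2.0]), 100.0)` is true and the 
row group is skipped; a single NaN turns the max into NaN and disables 
pruning.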
   
   So in the end, this boils down to: are we willing to special-case float to 
make filtering in the presence of NaNs more efficient, or do we go with a more 
streamlined implementation without special fields that only make sense for a 
single data type, at the cost of worse filtering performance when NaNs are 
present? The consensus leaned toward the latter, and over time I have come to 
agree with it.
   
   To your second comment about making the sign bit significant: I am working 
on an engine myself that does *not* care about sign bits in NaNs, but I still 
consider having them in Parquet okay. Our engine, when reading, will simply 
normalize the sign bit out (or just keep it as is). When writing, we will 
simply never write negative NaNs (or just write whatever bit pattern we have).
   
   Therefore, I don't see the danger you mention. Of course, each writer 
that wants to conform to the Parquet spec would have to handle this correctly, 
so either normalize the sign bit out or consider it when computing min/max; 
otherwise it's an incorrect writer. The writing engine cannot just do whatever 
it does internally; it has to do whatever the Parquet spec mandates. Again, the 
engine I am working on handles NaNs internally differently as well, but of 
course our Parquet read/write code does what Parquet mandates, not what we 
otherwise do internally. So basically your danger boils down to: "There is a 
chance that a sloppily programmed writer gets this wrong." That is indeed a 
risk. But IMHO there are far more complex things in Parquet to get wrong, so 
this shouldn't be our bar.
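
   As a sketch of the first option (normalizing on write), in Python; 
`normalize_for_write` and the choice of canonical bit pattern are my 
illustrative assumptions, not Parquet API:

   ```python
   import math
   import struct

   # Canonical positive quiet NaN (sign bit clear). Illustrative choice.
   CANONICAL_NAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]

   def normalize_for_write(x: float) -> float:
       # Never emit a negative NaN: replace every NaN with the canonical
       # positive one before it reaches the Parquet writer.
       return CANONICAL_NAN if math.isnan(x) else x
   ```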
   
   On the upside, what we gain by considering the sign bit for NaNs is:
   * a very efficient and straightforward software implementation of the 
comparison predicate
   * a comparison predicate that is actually standardized by IEEE, so one could 
argue that sticking to standards is preferable to rolling our own.
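
   For illustration, the IEEE 754 totalOrder predicate can be implemented with 
a couple of bit tricks; a Python sketch (the helper name `total_order_key` is 
mine):

   ```python
   import struct

   def total_order_key(x: float) -> int:
       # Map a double's bit pattern to an integer whose natural ordering
       # matches IEEE 754 totalOrder: for negatives (sign bit set), flip
       # all bits; for non-negatives, set the sign bit. This yields
       # -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN.
       bits = struct.unpack("<Q", struct.pack("<d", x))[0]
       if bits & (1 << 63):
           return bits ^ 0xFFFFFFFFFFFFFFFF
       return bits | (1 << 63)
   ```

   Rust's `f64::total_cmp` implements exactly this ordering in the standard 
library.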
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

