gszadovszky commented on PR #196: URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1614549513
@mapleFU, as I've written before, that is why we introduced [ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863): to keep the format open to specifying orderings. I don't know how the other implementations use it yet. In the current parquet-mr (since we introduced `ColumnOrder`) there is logic that drops any statistics if the defined column order is not known, so we can safely introduce a new one. We can say that if the min/max value would contain a NaN, we write the new `IEEE_754` column order, otherwise `TYPE_ORDER`. In this case we can simply skip the additional lists for marking all-NaN pages and write the NaN values into the statistics instead. The open question is how older readers of the other implementations would handle an unknown `ColumnOrder`.

It is an implementation detail that NaN handling in Java differs from what IEEE 754 says. Java effectively has only one NaN bit pattern, so handling this ordering will require additional work. I hope it can be implemented in a performant way.
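
To make the Java point concrete, here is a minimal sketch of the two pieces discussed above: choosing the column order from the min/max values, and comparing doubles in an IEEE 754 totalOrder style instead of with `Double.compare`. The class `NanOrderingSketch`, the enum `SketchColumnOrder`, and both methods are hypothetical names for illustration only; they are not parquet-mr API, and `IEEE_754` is only the column order proposed in this thread, not one that exists in the format today.

```java
// Hypothetical sketch, not parquet-mr code.
final class NanOrderingSketch {

    // Stand-in for the existing TYPE_ORDER and the proposed IEEE_754 order.
    enum SketchColumnOrder { TYPE_ORDER, IEEE_754 }

    // Writer side (assumption): only switch to the new order when the
    // statistics would actually contain a NaN.
    static SketchColumnOrder chooseColumnOrder(double min, double max) {
        return (Double.isNaN(min) || Double.isNaN(max))
                ? SketchColumnOrder.IEEE_754
                : SketchColumnOrder.TYPE_ORDER;
    }

    // Comparison in the style of IEEE 754 totalOrder, done on the raw bits.
    // Unlike Double.compare, it keeps the NaN sign and payload:
    // negative NaNs sort below -Infinity, positive NaNs above +Infinity.
    static int totalOrderCompare(double a, double b) {
        long x = Double.doubleToRawLongBits(a);
        long y = Double.doubleToRawLongBits(b);
        // For negative values, flip the lower 63 bits so that a signed
        // long comparison matches the floating-point total order.
        x ^= (x >> 63) >>> 1;
        y ^= (y >> 63) >>> 1;
        return Long.compare(x, y);
    }

    public static void main(String[] args) {
        double negativeNaN = Double.longBitsToDouble(0xFFF8000000000000L);
        System.out.println("Double.compare(negNaN, NaN) = "
                + Double.compare(negativeNaN, Double.NaN));
        System.out.println("totalOrderCompare(negNaN, -Infinity) = "
                + totalOrderCompare(negativeNaN, Double.NEGATIVE_INFINITY));
        System.out.println("chooseColumnOrder(1.0, NaN) = "
                + chooseColumnOrder(1.0, Double.NaN));
    }
}
```

`Double.compare` treats every NaN as the same value and places it above `+Infinity`, which is why a sign-aware ordering needs the raw-bits comparison (or something equivalent) on the Java side.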
