gszadovszky commented on pull request #855: URL: https://github.com/apache/parquet-mr/pull/855#issuecomment-762721754
Thanks a lot for working on this. It is a good catch! I had to investigate a bit why we do not catch this in the unit tests. The answer is that in most cases a [NOOP](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/filter2/compat/FilterCompat.java#L62) filter is created instead of having a `null` value. We use this `NOOP` if no filter is specified in the builder/config. However, it is still possible to set a `null` explicitly in the builder/config, so the NPE may occur.

I've also realized that in the older filter implementations (row group level min/max, dictionary) the `NOOP` filter is fine and does not have a significant performance cost over a `null` check. Column index or bloom filter based filtering, however, does have a performance cost, because it reads the related data from the file even for `NOOP`.

If you don't mind, @wangyum, I would like to take care of this (including the NPE and the potential performance issues).
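
To make the distinction concrete, here is a minimal, self-contained sketch of the two behaviours described above. All names in it (`NoopFilterSketch`, `Filter`, `filterNaive`, `filterGuarded`, `indexReads`) are hypothetical stand-ins and not the parquet-mr API; the counter only simulates reading column index or bloom filter data from the file. It assumes that a guard treating both `null` and `NOOP` as "no filtering" would be one possible way to avoid both the NPE and the unnecessary read.

```java
import java.util.Objects;

// Hypothetical sketch; these are NOT the parquet-mr classes.
final class NoopFilterSketch {
    // Stand-in for a filter type.
    interface Filter {}

    // Null-object placeholder used when no filter is specified.
    static final Filter NOOP = new Filter() {
        @Override
        public String toString() {
            return "NOOP";
        }
    };

    // Simulated count of index/bloom-filter reads from the file.
    static int indexReads = 0;

    // Naive path: fails on an explicit null and reads index data even for NOOP.
    static int filterNaive(Filter filter, int rowGroups) {
        Objects.requireNonNull(filter, "filter"); // throws NullPointerException on null
        indexReads++;                             // simulated read, wasted for NOOP
        return rowGroups;                         // nothing is filtered out in this sketch
    }

    // Guarded path: null and NOOP both mean "no filtering", so the read is skipped.
    static int filterGuarded(Filter filter, int rowGroups) {
        if (filter == null || filter == NOOP) {
            return rowGroups;
        }
        indexReads++;
        return rowGroups;
    }

    public static void main(String[] args) {
        filterNaive(NOOP, 3);
        System.out.println("reads after naive NOOP: " + indexReads);
        filterGuarded(NOOP, 3);
        filterGuarded(null, 3);
        System.out.println("reads after guarded NOOP and null: " + indexReads);
    }
}
```

In this sketch the guarded variant never touches the simulated index data for `null` or `NOOP`, while the naive variant pays the read for `NOOP` and throws for `null`.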
