Hi. In the Hive project, some of us are interested in adding the capability to cache Parquet data. I was hoping to do the work mostly on the Hive side; however, after examining the code (ParquetFileReader, etc.), it looks like the best one can do for data pages is to make a Hadoop FileSystem that wraps the real one and adds caching, register it under a custom scheme, and rewrite the split paths accordingly. That is viable, but not great from a design standpoint.

In addition, we'd like the cached data to be kept separately per column; also, with ORC we found that caching uncompressed, but still RLE/etc.-encoded, data is a good tradeoff between space usage and read cost. The former is possible with an additional hack (using the column-chunk boundaries from the metadata to determine caching boundaries in the custom caching FileSystem); the latter seems impossible without massive code duplication with ParquetFileReader, since pages are read by slicing a single column-chunk ByteBuffer and decompression is called unconditionally on every page.
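To make the workaround above a bit more concrete, here is a rough sketch of the wrapping FileSystem (the class name is made up, the cache lookup itself is elided, and registering it under a custom scheme would additionally need a no-arg constructor and initialize(), omitted here):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

// Wraps the real FileSystem; split paths would be rewritten to the custom
// scheme so that Parquet's reads go through this class.
public class CachingFileSystem extends FilterFileSystem {

  public CachingFileSystem(FileSystem realFs) {
    super(realFs);
  }

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    // A real implementation would consult the cache here, keyed by path and
    // byte range (e.g. the column-chunk boundaries taken from the file
    // metadata), and fall back to the wrapped FileSystem on a miss.
    return fs.open(f, bufferSize);
  }
}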
* With that in mind, would it make sense to the community if I added support for a pluggable source (and destination) for uncompressed cached pages in ParquetFileReader, both in the read planning (ConsecutiveChunkList, etc.) and in the PageStore/PageReader classes? (A rough sketch of the kind of hook I have in mind is at the bottom of this mail.) With ORC, the caching was added while it was still part of the Hive project and was later separated out not very cleanly, so there is now code duplication there that I would like to avoid. Or would you prefer/recommend a different approach?
* Related to that, I also have a question about read granularity: it appears that ParquetFileReader always reads entire column chunks, yet the documentation at https://parquet.apache.org/documentation/latest/ mentions that "smaller data pages allow for more fine grained reading". Is that handled somewhere else in the codebase, not implemented, or intended for other potential readers to implement?

Thanks in advance!
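P.S. For reference, the pluggable hook I'm imagining would look roughly like this (the names are made up, this is not existing Parquet API, just an illustration of the shape of the thing):

import java.io.IOException;
import java.util.List;

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.page.DataPage;

// ParquetFileReader would ask this store for already-uncompressed (but still
// encoded) pages before planning I/O for a column chunk, and would offer back
// pages it had to read and decompress itself.
public interface CachedPageStore {

  // Uncompressed pages for the given column chunk, or null on a cache miss.
  List<DataPage> getPages(String file, ColumnDescriptor column, long chunkStartOffset)
      throws IOException;

  // Offer pages read from the file so later reads can be served from the cache.
  void putPages(String file, ColumnDescriptor column, long chunkStartOffset,
      List<DataPage> pages) throws IOException;
}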
