Hi.
In the Hive project, some of us are interested in adding the capability to
cache Parquet data. I was hoping to do the work mostly in Hive; however,
examining the code (ParquetFileReader, etc.), it looks like the best one
can do for data pages is to make a Hadoop FileSystem that wraps a real
one and incorporates caching, register it with a custom scheme, and
alter the split paths accordingly.
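For reference, a minimal sketch of what I mean by such a wrapper is below; the
"cachedfile" scheme, the class name, and the spot where cache lookups would go
are all hypothetical, and the wrapped FS is assumed to be the local file:// FS
purely for illustration:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FilterFileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical caching FS, registered via fs.cachedfile.impl in the job
// Configuration; split paths would be rewritten to the cachedfile:// scheme.
public class CachingFileSystem extends FilterFileSystem {
  @Override
  public void initialize(URI uri, Configuration conf) throws IOException {
    // Wrap a real FS (the local FS here, just for the sketch).
    fs = FileSystem.get(URI.create("file:///"), conf);
    super.initialize(uri, conf);
  }

  @Override
  public String getScheme() {
    return "cachedfile"; // hypothetical custom scheme
  }

  @Override
  public FSDataInputStream open(Path path, int bufferSize) throws IOException {
    // This is the only hook the wrapper gets: raw byte ranges. Any cache
    // lookup would go here, with no visibility into Parquet pages or
    // column chunks.
    return fs.open(path, bufferSize);
  }
}

Registration would just be conf.set("fs.cachedfile.impl",
CachingFileSystem.class.getName()) plus the path rewriting on the splits.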
That is viable, but not great from a design standpoint. In addition, we'd
like the cached data to be kept separate per column; also, with ORC we found
that caching data that is uncompressed but still RLE/etc.-encoded is a good
tradeoff between space usage and read cost. The former is possible with an
additional hack (using the column chunk boundaries from the file metadata to
determine caching boundaries in the custom caching FS, e.g. as sketched
below); the latter seems impossible without massive code duplication with
ParquetFileReader, since pages are read by slicing a single column-chunk
ByteBuffer, and decompression is called unconditionally on each page.
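(A rough illustration of the boundary hack, assuming the readFooter/metadata
calls work the way I think they do; exact package and method names may differ
by Parquet version:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ChunkBoundaries {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);
    // Read the footer once; the (startingPos, totalSize) pairs per column
    // chunk would become the cache-granularity boundaries in the caching FS.
    ParquetMetadata footer = ParquetFileReader.readFooter(conf, file);
    for (BlockMetaData block : footer.getBlocks()) {
      for (ColumnChunkMetaData chunk : block.getColumns()) {
        System.out.println(chunk.getPath() + ": offset=" + chunk.getStartingPos()
            + ", length=" + chunk.getTotalSize());
      }
    }
  }
}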

* With that in mind, would it make sense to the community if I added support
for a pluggable source (and destination) for uncompressed cached pages in
ParquetFileReader, hooked into read planning (ConsecutiveChunkList, etc.) and
the PageStore/PageReaders? A rough sketch of the kind of plug-in point I have
in mind follows after these questions. With ORC, the caching was added while
it was still part of the Hive project and was later separated not so cleanly,
so there is now code duplication there that I would like to avoid.
Or would you prefer/recommend a different approach?
* Related to that, I also have a question about read granularity - it
appears that ParquetFileReader always reads entire column chunks; however,
the documentation at https://parquet.apache.org/documentation/latest/
mentions that “smaller data pages allow for more fine grained reading” -
is that handled somewhere else in the codebase, not implemented yet, or
intended for other potential readers to implement?

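For the first question, the kind of plug-in point I'm imagining is roughly
the following; all of the names here are hypothetical, not existing Parquet
API:

import java.nio.ByteBuffer;

// Hypothetical SPI that ParquetFileReader could consult before reading a
// column chunk from the stream, and feed after decompressing a page, so an
// external cache (e.g. Hive's) can hold uncompressed-but-still-encoded pages.
public interface PageCache {

  // Key identifying one page of one column chunk of one row group; the
  // exact fields are illustrative only (equals/hashCode omitted for brevity).
  final class PageKey {
    public final String file;
    public final String columnPath;
    public final long rowGroupOffset;
    public final int pageIndex;
    public PageKey(String file, String columnPath, long rowGroupOffset, int pageIndex) {
      this.file = file;
      this.columnPath = columnPath;
      this.rowGroupOffset = rowGroupOffset;
      this.pageIndex = pageIndex;
    }
  }

  // Returns the cached uncompressed (but still encoded) page bytes, or null on a miss.
  ByteBuffer getPage(PageKey key);

  // Offers uncompressed page bytes to the cache after a miss; the cache may ignore them.
  void putPage(PageKey key, ByteBuffer uncompressedPage);
}

The idea would be for the reader to check getPage before slicing and
decompressing, call putPage after decompression, and for read planning to
skip byte ranges that are fully covered by the cache.
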
Thanks in advance!
