Hi,

We're using Parquet as a storage format for data that we need fast random 
access to. Our data is sorted by a key and split across many Parquet files, and 
we need to support fast retrieval of a key range (for example, if the keys are 
text, fetching all records whose keys start with anything from AAAF through 
AAAH). Each file name includes the key range stored in that file, so we can 
scan the names of hundreds of files in a directory and pick out the 1 or 2 
files that cover the key range we're looking for, without opening the files or 
reading their footers.
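
To make the file-selection step concrete, here is a rough sketch in Python. The 
"<first_key>_<last_key>.parquet" naming scheme and both function names are 
made up for illustration (our real names differ), and the query is treated as 
an inclusive range on whole keys; a prefix query would need its upper bound 
widened accordingly.

```python
def parse_key_range(file_name):
    """Split a hypothetical '<first_key>_<last_key>.parquet' name into its keys."""
    stem = file_name[: -len(".parquet")]
    first_key, last_key = stem.split("_", 1)
    return first_key, last_key


def files_overlapping(file_names, lo, hi):
    """Return the files whose [first_key, last_key] range overlaps [lo, hi]."""
    selected = []
    for name in file_names:
        first_key, last_key = parse_key_range(name)
        # Two closed ranges overlap unless one ends before the other starts.
        if first_key <= hi and last_key >= lo:
            selected.append(name)
    return selected


files = [
    "AAAA_AAAE.parquet",
    "AAAF_AAAG.parquet",
    "AAAH_AAAZ.parquet",
    "AABA_AABZ.parquet",
]
print(files_overlapping(files, "AAAF", "AAAH"))
```

Only the names are inspected, so no file is opened until it is known to 
overlap the requested range.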

The data within each .parquet file is also sorted. We use the newer filters (in 
the "filter2" namespace) to filter within a single file. This is nice 
because it lets us throw out entire row groups that we know don't contain the 
data we're looking for, using the row group metadata (min/max values).
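
The row-group check boils down to a range-overlap test against the min/max 
statistics. The sketch below is plain Python over a hypothetical RowGroupStats 
record, not the filter2 API itself; it only illustrates the idea.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    # Hypothetical stand-in for the key column's statistics in one row group.
    index: int
    min_key: str
    max_key: str


def row_groups_to_read(stats, lo, hi):
    """Keep only the row groups whose [min_key, max_key] can contain [lo, hi]."""
    return [
        rg.index
        for rg in stats
        if not (rg.max_key < lo or rg.min_key > hi)
    ]


stats = [
    RowGroupStats(0, "AAAA", "AAAC"),
    RowGroupStats(1, "AAAD", "AAAG"),
    RowGroupStats(2, "AAAH", "AAAK"),
    RowGroupStats(3, "AAAL", "AAAZ"),
]
print(row_groups_to_read(stats, "AAAF", "AAAH"))
```

Row groups 0 and 3 are dropped without reading any of their pages.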

Unfortunately, the filters don't seem to support filtering record ranges by 
looking at page-level metadata (min/max values). We'd like to be able to skip 
entire pages' worth of records quickly. My understanding is that this is 
complicated because each column has its values split into pages at different 
boundaries, so for a given range of records we'd need to be able to skip 
multiple whole pages on some columns and only part of a page on others.
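
Here is a rough sketch of what I have in mind, in plain Python rather than 
against the real reader. It assumes we somehow know, for every page, the index 
of its first row, its row count and (for the key column) its min/max; the Page 
record and both function names are hypothetical. The key column's pages give a 
row range, and every other column then maps that row range onto its own page 
boundaries.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical per-page metadata; first_row is the row index within the row group.
    first_row: int
    num_rows: int
    min_key: str = ""
    max_key: str = ""


def matching_row_range(key_pages, lo, hi):
    """Half-open [start, end) row range of the key pages that may hold [lo, hi].

    Because the keys are sorted, the matching pages are contiguous.
    """
    hits = [p for p in key_pages if p.max_key >= lo and p.min_key <= hi]
    if not hits:
        return None
    return hits[0].first_row, hits[-1].first_row + hits[-1].num_rows


def pages_for_rows(pages, start, end):
    """For one column, list (page_index, rows_to_skip, rows_to_read) per needed page."""
    plan = []
    for i, p in enumerate(pages):
        page_end = p.first_row + p.num_rows
        if page_end <= start or p.first_row >= end:
            continue  # the whole page lies outside the row range: skip it
        skip = max(start - p.first_row, 0)
        read = min(end, page_end) - max(start, p.first_row)
        plan.append((i, skip, read))
    return plan


key_pages = [
    Page(0, 100, "AAAA", "AAAC"),
    Page(100, 100, "AAAD", "AAAG"),
    Page(200, 100, "AAAH", "AAAK"),
    Page(300, 100, "AAAL", "AAAZ"),
]
# Another column in the same row group, paged at different boundaries.
value_pages = [Page(0, 150), Page(150, 250)]

start, end = matching_row_range(key_pages, "AAAF", "AAAH")
print(pages_for_rows(key_pages, start, end))
print(pages_for_rows(value_pages, start, end))
```

In this example the key column skips its first and last pages entirely, while 
the other column has to read the tail of its first page and the head of its 
second, which is exactly the partial-page case described above.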

My question is: Has anyone tried this before? Is there a reason why it's not 
done, or are there any complications we'd run into?

Right now, we're writing out small row groups (~4 MB) as a workaround. It works 
well enough for now, but we'd like to do better.

- Ethan