Hi,

We're using Parquet as a storage format for some data that we need fast random access to. Our data is sorted by a key and then split into many Parquet files. We need to support fast retrieval of a key range (for example, if the keys are text, we'd want to quickly fetch all records whose keys start with AAAF through AAAH). Each file name includes the key range stored in that file, so we can quickly scan through hundreds of files in a directory and pick out the one or two files that cover the key range we're looking for, without opening the files or reading their footers.
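To illustrate the file-selection step, here is a minimal sketch. The naming scheme (`FIRSTKEY_LASTKEY.parquet`) and the helper names `parse_key_range` and `files_overlapping` are made up for this example and are not our real layout; bounds are treated as inclusive and compared as plain strings.

```python
from typing import List, Tuple


def parse_key_range(file_name: str) -> Tuple[str, str]:
    """Extract (first_key, last_key) from a name like 'AAAA_AAAE.parquet'.

    The 'FIRST_LAST.parquet' pattern is a hypothetical naming scheme.
    """
    stem = file_name.rsplit(".", 1)[0]
    first_key, last_key = stem.split("_", 1)
    return first_key, last_key


def files_overlapping(file_names: List[str], lo: str, hi: str) -> List[str]:
    """Return files whose [first_key, last_key] range intersects [lo, hi]."""
    selected = []
    for name in file_names:
        first_key, last_key = parse_key_range(name)
        # Two closed intervals intersect when each starts before the other ends.
        if first_key <= hi and last_key >= lo:
            selected.append(name)
    return selected


names = [
    "AAAA_AAAE.parquet",
    "AAAF_AAAG.parquet",
    "AAAH_AAAZ.parquet",
    "AABA_AABZ.parquet",
]
print(files_overlapping(names, "AAAF", "AAAH"))
# prints ['AAAF_AAAG.parquet', 'AAAH_AAAZ.parquet']
```

Only the names are inspected, so no file is opened and no footer is read at this stage.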
The data within each .parquet file is also sorted. We use the newer filters (in the "filter2" namespace) to filter within a single file. This is nice because it lets us throw out entire row groups that we know don't contain the data we're looking for, using the row group metadata (min/max values).

Unfortunately, the filters don't seem to support filtering record ranges by looking at page-level metadata (min/max values). We'd like to be able to skip entire pages' worth of records quickly. My understanding is that this is complicated because each column has its values split into pages at different boundaries, so we'd need to implement the ability to skip multiple whole pages on some columns and only part of a page on others.

My question is: has anyone tried this before? Is there a reason why it's not done, or are there any complications we'd run into?

Right now, we're writing out small row groups (~4 MB) as a workaround. It works well enough for now, but we'd like to do better.

- Ethan
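To make the page-boundary complication concrete, here is a rough sketch of the bookkeeping we have in mind. It is a simplified model, not Parquet's actual reader API: the `Page` and `KeyPage` classes and both functions are hypothetical, and it assumes flat (non-nested) columns where one value corresponds to one row and where each page's starting row and row count are known.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    first_row: int  # index of the first row stored in this page
    num_rows: int   # number of rows stored in this page


@dataclass
class KeyPage(Page):
    min_key: str    # smallest key in the page (page-level statistics)
    max_key: str    # largest key in the page


def candidate_rows(key_pages: List[KeyPage], lo: str, hi: str) -> Optional[Tuple[int, int]]:
    """Row range [start, end) covered by key pages that may contain [lo, hi]."""
    hits = [p for p in key_pages if p.min_key <= hi and p.max_key >= lo]
    if not hits:
        return None
    # Keys are sorted, so the matching pages are contiguous.
    return hits[0].first_row, hits[-1].first_row + hits[-1].num_rows


def plan_column_read(pages: List[Page], row_start: int, row_end: int) -> List[Tuple[int, int, int]]:
    """For one column, list (page_index, rows_to_skip_in_page, rows_to_take)."""
    plan = []
    for i, page in enumerate(pages):
        page_end = page.first_row + page.num_rows
        if page_end <= row_start or page.first_row >= row_end:
            continue  # the whole page lies outside the range and is skipped
        begin = max(page.first_row, row_start)
        end = min(page_end, row_end)
        plan.append((i, begin - page.first_row, end - begin))
    return plan


key_pages = [
    KeyPage(0, 100, "AAAA", "AAAE"),
    KeyPage(100, 100, "AAAE", "AAAG"),
    KeyPage(200, 100, "AAAG", "AAAK"),
    KeyPage(300, 100, "AAAK", "AAAZ"),
]
wide_pages = [Page(0, 150), Page(150, 250)]              # large values, few pages
narrow_pages = [Page(i * 50, 50) for i in range(8)]      # small values, many pages

rows = candidate_rows(key_pages, "AAAF", "AAAH")
print(rows)                                # (100, 300)
print(plan_column_read(wide_pages, *rows))    # [(0, 100, 50), (1, 0, 150)]
print(plan_column_read(narrow_pages, *rows))  # [(2, 0, 50), (3, 0, 50), (4, 0, 50), (5, 0, 50)]
```

In this model the narrow column can skip four whole pages, while the wide column has to read both of its pages and discard part of each. The rows that remain would still need record-level filtering, since a page's min/max only says the range may be present.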
