Hi, 

This looks very interesting and similar to a pattern we use in our custom file 
format (that we are gradually moving to parquet).

We use a skip list in a random accessible footer but this works by referencing 
column name (or name prefix) not IDs.
The reason is we can have very wide files (10^6 to 10^9 columns) where a 
hash-table mapping would add too much overhead.

https://www.datadoghq.com/blog/engineering/husky-storage-compaction/#the-fragment-file-storage-format

I understand we dont want to generally waste space by mandating column name 
(even truncated) in footer, but would it make sense to support a custom 
metadata map in each skip list node to not block this kind of optimisation ?

Thanks 

On 2026/03/27 10:32:44 Will Edwards via dev wrote:
> Thx for the interest :D
> 
> I ran some quick micro-benchmarks on my Java footer parser to see if the
> idea might work.  Sorry I can't share code, but I can share recipes etc.
> 
> This is running on a weak cloud VM.  The test files were created using an
> obvious Python pyarrow script.  The test runs a few hundred iterations for
> warmup then runs a few thousand iterations of each test to get a timing.
> 
> The *parquet-format* column is the
> standard org.apache.parquet.format.Util.readFileMetaData(new
> ByteArrayInputStream(footerBytes)).  This excludes IO (the Hadoop libraries
> are so slow, that's actually the number one thing to avoid).
> 
> The *full* column shows my footer parser performing a full footer decode.
> It spends about 5-10% of the time in the schema making the name tree and so
> on; the rest of the time is spent on column metadata.  It creates a long[]
> of num_row_groups * num_columns * num_interesting_fields to decode into.
> Strings/bytes are stored in the long[] as offsets into the Thrift footer.
> This is about 8x+ faster than the standard library, and more importantly
> for my use-cases it generates significantly less garbage.
> 
> The *by ordinal* column shows my footer parser skipping all columns except
> one.  The skipping is barely faster than the full decode (you can't early
> out after the interesting column in the last row-group if you are still
> waiting for metadata etc) but, importantly for my use-cases, it uses
> significantly less memory.
> 
> The *single col* column is O(1) random access using a jump table into the
> Thrift footer.  This is the speed of the proposal.
> 
> | Cols | RGs | Footer | parquet-format | full | by ordinal | single col |
> |------|-----|--------|---------------|------|-----------|------------|
> | 100 | 1 | 19 KB | 0.27 | 0.040 | 0.017 | 0.001 |
> | 100 | 2 | 30 KB | 0.35 | 0.040 | 0.031 | 0.001 |
> | 100 | 4 | 52 KB | 0.57 | 0.077 | 0.061 | 0.001 |
> | 1K | 1 | 195 KB | 2.58 | 0.21 | 0.17 | 0.000 |
> | 1K | 2 | 304 KB | 2.88 | 0.35 | 0.32 | 0.001 |
> | 1K | 4 | 528 KB | 5.39 | 0.69 | 0.63 | 0.001 |
> | 10K | 1 | 2.0 MB | 16.4 | 1.89 | 1.77 | 0.000 |
> | 10K | 2 | 3.1 MB | 28.9 | 3.58 | 3.25 | 0.001 |
> | 10K | 4 | 5.4 MB | 54.2 | 6.99 | 6.33 | 0.001 |
> | 50K | 1 | 10.3 MB | 85.1 | 9.94 | 9.54 | 0.000 |
> | 50K | 2 | 15.9 MB | 148.9 | 18.8 | 17.6 | 0.001 |
> | 50K | 4 | 27.4 MB | 280.7 | 36.7 | 34.1 | 0.001 |
> | 100K | 1 | 20.6 MB | 174.1 | 20.0 | 19.2 | 0.000 |
> | 100K | 2 | 31.9 MB | 305.7 | 37.8 | 35.3 | 0.001 |
> | 100K | 4 | 55.1 MB | 565.3 | 73.6 | 68.5 | 0.002 |
> 
> Performance is measured in milliseconds.  My fast footer parser is very
> consistent, but the parquet-format has higher variability because of
> garbage collections, etc.
> 
> The O(1) access is very stable: approximately 100 ns to create the arrays
> etc for the result and then approximately 270 ns per decoded ColumnChunk.
> 
> The jump table extension might work like this: we have a normal Thrift
> footer, and immediately after the normal version field, we add a new field
> that signals the presence of a footer index.  This is a binary field with a
> new high field-id.  It doesn't have to contain the index; that can be
> sandwiched between the end of the Thrift footer and the end of the file.
> It contains an uncompressed long that is the offset to the footer index's
> start.  Because the field is not a varint, a writer can patch it after
> writing the footer etc.  Readers who don't know about the index will just
> skip it.
> 
> The footer index has a little header indicating its contents: whether it
> includes a name hash table, a column ordinals jump table etc.  These are
> all flat, fixed-size arrays.  There is also lists the positions for any
> fields not in the index e.g. the position of the metadata and items the
> reader may want to access by skipping the parts that the index allows them
> to.
> 
> Would love to know how Java performance compares to rust etc :D
> 
> Yours hopefully,
> Will
> 
> On Thu, 26 Mar 2026 at 10:38, Andrew Lamb <[email protected]> wrote:
> 
> > Thank you Will,
> >
> > I think this idea is very interesting.
> >
> > >  it’s entirely possible this index-only approach was evaluated and
> > discarded early on.
> >
> > I don't know of any such discussion or evaluation. One challenge of this
> > approach might be getting thrift-compiler generated parsers to cooperate.
> >
> > >  would love to hear your thoughts on whether a lightweight index could
> > give us the O(1) read performance we want without the file bloat we are
> > trying to avoid.
> >
> > I personally think it sounds like a great idea -- and I suggest as a next
> > step a prototype in one of the open source parquet readers so we can
> > measure its performance. The arrow-rs implementation (now) has its own
> > custom thrift decoder, so it might be relatively easy to test out such a
> > scheme.
> >
> > Andrew
> >
> 

Reply via email to