Hello Team,

I had some time to poke into this a bit more. The current spec is:

A file data block consists of:

   - A long indicating the count of objects in this block.
   - A long indicating the size in bytes of the serialized objects in the
   current block, after any codec is applied
   - The serialized objects. If a codec is specified, this is compressed by
   that codec.
   - The file’s 16-byte sync marker.

I am proposing that this be changed to:

A file data block consists of:

   - A long indicating the count of objects in this block.
   - A long indicating the size in bytes of the serialized objects in the
   current block, after any codec is applied
   - Block metadata written as if defined by the following map
   <https://avro.apache.org/docs/1.11.1/specification/#schema-maps> schema:
   {"type": "map", "values": "bytes"}
   - The serialized objects. If a codec is specified, this is compressed by
   that codec.
   - The file’s 16-byte sync marker.
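
To make the proposed layout concrete, here is a minimal sketch of writing and reading such a block. The function names are made up for illustration, and it assumes the size field continues to cover only the serialized objects (not the metadata map), which the proposal above does not yet pin down. The long, bytes, string, and map encodings follow the existing Avro binary encoding rules.

```python
import io

def write_long(out, n):
    """Avro long: zigzag-encode, then write base-128 groups, low group first."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))

def read_long(inp):
    result, shift = 0, 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return (result >> 1) ^ -(result & 1)
        shift += 7

def write_bytes(out, data):
    write_long(out, len(data))
    out.write(data)

def read_bytes(inp):
    return inp.read(read_long(inp))

def write_meta(out, meta):
    """Avro map<bytes>, written as a single map block plus the terminator."""
    if meta:
        write_long(out, len(meta))
        for key, value in meta.items():
            write_bytes(out, key.encode("utf-8"))  # Avro string: length + UTF-8
            write_bytes(out, value)
    write_long(out, 0)  # a zero count ends the map

def read_meta(inp):
    meta = {}
    while True:
        count = read_long(inp)
        if count == 0:
            return meta
        if count < 0:  # a negative count is followed by the block's byte size
            count = -count
            read_long(inp)
        for _ in range(count):
            key = read_bytes(inp).decode("utf-8")
            meta[key] = read_bytes(inp)

def write_block_v2(out, count, payload, meta, sync):
    # Hypothetical "v2" block: count, size, metadata map, objects, sync marker.
    write_long(out, count)
    write_long(out, len(payload))  # assumption: size covers the objects only
    write_meta(out, meta)
    out.write(payload)
    out.write(sync)

def read_block_v2(inp):
    count = read_long(inp)
    size = read_long(inp)
    meta = read_meta(inp)
    payload = inp.read(size)
    sync = inp.read(16)
    return count, payload, meta, sync
```

With this encoding, a block that carries no metadata costs one extra byte (the zero count that terminates the empty map).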


In particular, I am interested in storing block-level metadata such as:

   - Bloom Filters
      - https://parquet.apache.org/docs/file-format/bloomfilter/
   - "Paranoid" Checksums
      - https://github.com/google/leveldb/blob/main/doc/index.md#checksums
   - Statistics (e.g., min/max, distinct values, null counts, etc.)
      - https://parquet.apache.org/docs/file-format/pageindex/
      - https://arrow.apache.org/docs/python/generated/pyarrow.parquet.Statistics.html
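
As an illustration of how statistics in the block metadata could help scans, here is a rough sketch of skipping a block based on min/max values. The keys "stats.min" and "stats.max" and the 8-byte big-endian integer encoding are assumptions made up for this example; nothing here is part of any spec.

```python
def encode_int(value):
    # Hypothetical value encoding: 8-byte big-endian signed integer.
    return value.to_bytes(8, "big", signed=True)

def decode_int(raw):
    return int.from_bytes(raw, "big", signed=True)

def block_may_match(meta, lo, hi):
    """Return False only if the block's [min, max] cannot overlap [lo, hi]."""
    if "stats.min" not in meta or "stats.max" not in meta:
        return True  # no statistics recorded, so the block must be read
    block_min = decode_int(meta["stats.min"])
    block_max = decode_int(meta["stats.max"])
    return block_min <= hi and lo <= block_max
```

A reader that gets False here could seek past the serialized objects using the size field instead of decompressing and decoding them.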


Adding the block metadata map is "trivial"; the biggest concern, of
course, is backwards compatibility.

Ideally, existing readers would simply skip over this extra map if it is
present. I have not yet thought of a "clever" way of doing this, but I
am open to suggestions.

Secondarily, we can allow new reader versions to read both old and new
blocks by setting an entry (e.g., block version) in the file header
metadata; the absence of such an entry implies "v1" of the block spec.
However, existing readers (until updated to skip) would fail in some
undefined way, since the spec has changed without their knowledge.
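
A new reader's version check could look roughly like the sketch below. The key name "avro.block.version" and the ASCII-decimal value are assumptions for illustration only; the file header metadata is treated as a map from string keys to bytes values.

```python
BLOCK_VERSION_KEY = "avro.block.version"  # hypothetical key, not in the spec

def block_version(file_meta):
    """Pick the block layout from the file header metadata."""
    raw = file_meta.get(BLOCK_VERSION_KEY)
    if raw is None:
        return 1  # no entry: the file uses the current ("v1") block layout
    version = int(raw.decode("ascii"))
    if version not in (1, 2):
        raise ValueError(f"unsupported block version: {version}")
    return version
```

This only helps readers that know to look for the entry; an existing reader ignores unknown metadata keys and would still misparse the blocks.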

A third option: we can prepend some static value onto codec names
(e.g., "block-meta+snappy") that new readers can handle (v2 block +
snappy codec), while existing readers would fail fast with a "missing
codec" exception.
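
A sketch of how a new reader might interpret such a codec name, next to what an existing reader would effectively do. The "block-meta+" prefix comes from the example above; the function names and the codec set used here are illustrative assumptions.

```python
BLOCK_META_PREFIX = "block-meta+"

def parse_codec(name):
    """New reader: return (block_version, underlying_codec_name)."""
    if name.startswith(BLOCK_META_PREFIX):
        return 2, name[len(BLOCK_META_PREFIX):]
    return 1, name

def legacy_lookup(name, known=frozenset({"null", "deflate", "snappy"})):
    """Existing reader: an unrecognized codec name is rejected up front."""
    if name not in known:
        raise ValueError(f"missing codec: {name}")
    return name
```

The appeal of this option is that the failure in old readers happens while reading the header, before any block is parsed.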

Thanks.

On Tue, Sep 17, 2024 at 9:29 AM David <dam6...@gmail.com> wrote:

> Hello Gang,
>
> I've had some space to look at Avro again recently (I enjoy
> contributing to something that has such a wide industry impact).
>
> In thinking about the block format of Avro, it currently stores metadata
> about the number of records in each block. I'm performing a thought
> exercise of replacing the count field with a map and allowing for a more
> generic set of metadata. In particular, I would want to add better scan
> support: Bloom filters, min/max values.
>
> Making this backwards compatible looks hard at first, but does anyone in
> the community see value here?
>
>
> Thanks.
>
