Hello,
I feel like this is the kind of use case where a hypothetical extension
type mechanism would be a better fit than hardcoding dedicated logical
types in the Thrift definition.
Regards
Antoine.
On 07/03/2026 at 01:57, Burak Yavuz wrote:
Hello Parquet community,
Unstructured data ingestion is getting extremely popular with the advances
in Generative AI. Today, our only means of dealing with unstructured data
is to store it as a byte array inside Parquet, or to store a string that
points to a file in some object store. Neither approach serves these use
cases well, because of scalability, usability, and governance issues.
We would like to introduce a new logical type annotation in Parquet called
“File” for storing a struct that contains a path reference to a file with
additional metadata.
We propose that the struct contains the following fields:
path          STRING NOT NULL  -- the opaque path to a file
size          INT64            -- the size of the file in bytes
content_type  STRING           -- the MIME/content type of the file
etag          STRING           -- the ETag identifier of the file. Can be used
                               -- to detect changes to a file
The path will be stored as an opaque string, exactly as the user provides
it; we don’t apply any special encoding to it. The size will be the size of
the file in bytes, stored as a 64-bit integer (long). We also store the
content_type of the file and its etag.
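As an illustration only, the struct described above could be modelled on the
application side roughly as follows. This is a minimal sketch, not part of
the proposal: the class name FileRef, the has_changed helper, the example
path, and the rule that size must be non-negative are all assumptions made
for the example.

```python
from dataclasses import dataclass
from typing import Optional

INT64_MAX = 2**63 - 1


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory view of the proposed File struct."""

    path: str                           # required, opaque, stored as given
    size: Optional[int] = None          # file size in bytes (INT64)
    content_type: Optional[str] = None  # MIME/content type
    etag: Optional[str] = None          # ETag, usable for change detection

    def __post_init__(self) -> None:
        # path is the only NOT NULL field in the proposal.
        if not isinstance(self.path, str):
            raise ValueError("path is required and must be a string")
        # Assumption: a size, when present, is non-negative and fits in INT64.
        if self.size is not None and not (0 <= self.size <= INT64_MAX):
            raise ValueError("size must fit in a non-negative INT64")

    def has_changed(self, current_etag: Optional[str]) -> Optional[bool]:
        """Compare the stored etag with one freshly read from the store.

        Returns None when either side is missing, since no conclusion
        can be drawn in that case.
        """
        if self.etag is None or current_etag is None:
            return None
        return self.etag != current_etag


# Example usage with a made-up object-store path.
ref = FileRef("s3://bucket/a.png", size=10, content_type="image/png", etag="abc")
print(ref.has_changed("abc"))
```

Only path is mandatory here, which mirrors the NOT NULL constraint in the
field list; the other three fields default to absent.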
We believe that this set of fields is bare-bones and can easily be extended
in the future with new optional fields that wouldn’t impact the correctness
of reading the file. We would also like to introduce a versioning field in
the specification, in case we later need new fields that may impact
correctness when accessing a file.
We would represent this in parquet.thrift
<https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
as:
/**
 * File logical type annotation
 */
struct FileType {
  // Versioning specification of the File struct contents. Can be used if a
  // new field is introduced to the struct representing the file, which may
  // impact correctness when accessing the file.
  1: optional i8 specification_version
}
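The proposal does not spell out how a reader should act on
specification_version. One possible interpretation, sketched below purely as
an assumption, is that a reader accepts any version up to the highest one it
implements and treats an absent value as the baseline layout. The constant
SUPPORTED_SPEC_VERSION and the function name are hypothetical.

```python
from typing import Optional

# Hypothetical: the highest File struct version this reader understands.
SUPPORTED_SPEC_VERSION = 1


def can_read_file_struct(
    specification_version: Optional[int],
    supported: int = SUPPORTED_SPEC_VERSION,
) -> bool:
    """Decide whether a File-annotated column is safe to interpret.

    Assumption: the field is optional, so an absent version means the
    baseline layout (path, size, content_type, etag), which every
    reader can handle.
    """
    if specification_version is None:
        return True
    # A newer version may carry fields that affect correctness,
    # so a reader that does not know it should decline.
    return specification_version <= supported


print(can_read_file_struct(None), can_read_file_struct(2))
```

A reader that declines could still expose the column as a plain struct
rather than fail outright; that choice is left open here.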
We believe that natively supporting File references in Parquet will make it
much simpler to build AI workloads on top of data stored in Parquet
across table formats and data processing engines. Looking forward to your
feedback!