Hi Burak,

Thanks for starting this discussion. I was also interested in raising this
topic within the Parquet community (unless it has already been discussed in
the past).
For users working with unstructured data such as large text, images, or
video, a dedicated data type such as "file" or "blob" would be useful.

I wanted to better understand one point. The spec you shared includes the
following field:
> size INT64 -- the size of the file in bytes

Are you proposing that the "File" type always writes the binary content of
something such as an image or video directly within the Parquet file (i.e.,
"inlining")? Or would it make sense for the spec to have a field that
distinguishes between storing the content's bytes in the Parquet file itself
and simply tracking a pointer to the actual file in storage (i.e., keeping
it "out of line")? I would assume there are use cases for both: inlining the
binary content of something small, such as a small image, while storing
only a pointer to something large, such as a video file in object storage.
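
To make the question concrete, here is a rough Python sketch of the two
modes. It is only an illustration and not part of the proposal: the
`content` field, the relaxed nullability of `path`, the `FileValue` class
name, and the example object-store path are all assumptions on my part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileValue:
    """Hypothetical in-memory view of one "File" value.

    path, size, content_type and etag come from the proposal.
    `content` is an assumed extra field that would hold inlined bytes.
    """

    path: Optional[str] = None          # out of line: pointer to the file
    size: Optional[int] = None          # size of the file in bytes
    content_type: Optional[str] = None  # mime/content type
    etag: Optional[str] = None          # change-detection identifier
    content: Optional[bytes] = None     # assumption: inlined bytes

    def __post_init__(self) -> None:
        # Require exactly one of `path` (out of line) or `content` (inline).
        if (self.path is None) == (self.content is None):
            raise ValueError("set exactly one of path or content")
        # When bytes are inlined, a declared size must match them.
        if (
            self.content is not None
            and self.size is not None
            and self.size != len(self.content)
        ):
            raise ValueError("size does not match inlined content")

    @property
    def is_inline(self) -> bool:
        return self.content is not None


# A small image stored inline (placeholder bytes, not a real PNG).
thumb = FileValue(content=b"\x89PNG", size=4, content_type="image/png")

# A large video kept out of line (hypothetical path).
video = FileValue(
    path="s3://example-bucket/videos/clip.mp4",
    size=2_000_000_000,
    content_type="video/mp4",
)
```

In this sketch a reader could branch on `is_inline` to decide whether to
read the bytes from the Parquet file or fetch them from storage. Whether
the spec should model the distinction this way is exactly what I am asking.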

Regards,
Rahil Chertara

On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> I feel like this is the kind of use case where a hypothetical extension
> type mechanism would be a better fit than hardcoding dedicated logical
> types in the Thrift definition.
>
> Regards
>
> Antoine.
>
>
> Le 07/03/2026 à 01:57, Burak Yavuz a écrit :
> > Hello Parquet community,
> >
> > Unstructured data ingestion is getting extremely popular with the advances
> > in Generative AI. Today, our only means of dealing with unstructured data
> > is to store it as a byte array inside Parquet, or point to files that exist
> > in some object store with a string. These solutions fail to address these
> > use cases, because of scalability, usability, and governance issues.
> >
> > We would like to introduce a new logical type annotation in Parquet called
> > “File” for storing a struct that contains a path reference to a file with
> > additional metadata.
> >
> > We propose that the struct contains the following fields:
> >
> > path STRING NOT NULL -- the opaque path to a file
> >
> > size INT64 -- the size of the file in bytes
> >
> > content_type STRING -- the mime/content type of the file
> >
> > etag STRING -- the eTag identifier of the file. Can be used to detect
> > -- changes to a file
> >
> > The path will be stored as an opaque string; whatever the user provides. We
> > don’t do any special encoding on it. The size will be the size of the file
> > in bytes as long. We also store the content_type of the file, and its etag.
> >
> > We believe that this set of options is bare-bones and can be easily
> > extended by new optional fields in the future if desired that wouldn’t
> > impact the correctness of the file being read. We would like to introduce a
> > versioning field to the specification in case we need new fields in the
> > specification that may impact correctness when accessing a file.
> >
> > We would represent this in parquet.thrift
> > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > as:
> >
> > /**
> >   * File logical type annotation
> >   */
> > struct FileType {
> >    // Versioning specification of the File struct contents. Can be used if a
> >    // new field is introduced to the struct representing the file, which may
> >    // impact correctness when accessing the file.
> >    1: optional i8 specification_version
> > }
> >
> > We believe that by natively supporting File references in Parquet, it will
> > become much simpler to build AI workloads on top of data stored in Parquet
> > across table formats and data processing engines. Looking forward to your
> > feedback!
> >
>
>
>
