Do we have a proposal for this yet? I'm excited to go over it and I thought
one was mentioned in the last sync but I haven't seen it.

On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:

> Hi all,
>
> Very sorry for the late reply, and thanks for the questions! The messages
> were not landing in my inbox properly.
>
> @Antoine
> > I feel like this is the kind of use case where a hypothetical extension
> > type mechanism would be a better fit than hardcoding dedicated logical
> > types in the Thrift definition.
>
> What would that look like? We wanted to introduce this logical type to
> Parquet specifically, so that table formats such as Delta and Iceberg can
> have a simpler protocol change, and that we could provide this as a
> consistent format across multiple data processing engines.
>
>
> @Rahil
> > I wanted to better understand one point. Based on the current spec you
> > shared I see you have a parameter for the following:
> > > size INT64 -- the size of the file in bytes
> > Are you proposing that the "File" type always writes the binary content
> > of something such as an image or video directly within the Parquet file
> > (i.e., "inlining")? Or would it make sense for the spec to have some field
> > distinguishing whether to store the content's bytes in the file itself vs
> > simply track a pointer to the actual file in storage (i.e., keeping it
> > "out of line").
>
> This is a great question. When it comes to FileType, the data will
> primarily be external to the parquet file, so the FileType would just store
> the pointer to the data.
> Now, can that data be inlined anyway? That is an optimization that can
> certainly be done. However, that requires some benchmarks to see how much
> the benefit would be.
> If compute engines were to carry this struct without any column pruning
> across all operations, having inline binary content would make operations
> like sorting and shuffling a lot more expensive.
> We couldn't instinctively justify whether this would be worth it just yet.
> However, the current proposed spec doesn't prevent you from also storing
> the content inline side by side with the pointer information.
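The "pointer plus optional inline copy" layout described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sibling column name `content`, the helper `load`, the stand-in `fake_fetch`, and the example paths are all hypothetical and not part of the proposed spec.

```python
from typing import Callable, Optional

# One row per file. The "file" struct carries only the pointer and metadata;
# "content" is a hypothetical sibling binary column that may hold an inline copy.
row_external = {
    "file": {"path": "videos/long.mp4", "size": 5_000_000_000,
             "content_type": "video/mp4", "etag": "v1"},
    "content": None,
}
row_inlined = {
    "file": {"path": "thumbs/small.png", "size": 4,
             "content_type": "image/png", "etag": "t1"},
    "content": b"\x89PNG",
}


def load(row: dict, fetch: Callable[[str], bytes]) -> bytes:
    """Prefer the inline copy; otherwise fetch through the pointer."""
    inline: Optional[bytes] = row["content"]
    if inline is not None:
        return inline
    return fetch(row["file"]["path"])


def fake_fetch(path: str) -> bytes:
    # Stand-in for an object-store read (assumption, for illustration only).
    return b"<bytes of " + path.encode() + b">"


print(load(row_inlined, fake_fetch))
print(load(row_external, fake_fetch))
```

With this layout, an engine that projects only the `file` column would not carry the bytes through sorts and shuffles, which is the cost concern raised above.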
>
>
>
> On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
>
> > Hi Burak,
> >
> > Thanks for starting this discussion. I was also interested in raising this
> > topic within the Parquet community (unless it has already been discussed in
> > the past).
> > For users working with unstructured data today such as large text, images,
> > or video, a data type such as a "file" or "blob" would be useful.
> >
> > I wanted to better understand one point. Based on the current spec you
> > shared I see you have a parameter for the following:
> > > size INT64 -- the size of the file in bytes
> >
> > Are you proposing that the "File" type always writes the binary content of
> > something such as an image or video directly within the Parquet file (i.e.,
> > "inlining")? Or would it make sense for the spec to have some field
> > distinguishing whether to store the content's bytes in the file itself vs
> > simply track a pointer to the actual file in storage (i.e., keeping it "out
> > of line"). I would assume there are use cases where you would want to store
> > the binary content of something, like a small image within the Parquet file
> > instead of storing a pointer to a large video file in object storage.
> >
> > Regards,
> > Rahil Chertara
> >
> > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> >
> > >
> > > Hello,
> > >
> > > I feel like this is the kind of use case where a hypothetical extension
> > > type mechanism would be a better fit than hardcoding dedicated logical
> > > types in the Thrift definition.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > > Hello Parquet community,
> > > >
> > > > Unstructured data ingestion is getting extremely popular with the advances
> > > > in Generative AI. Today, our only means of dealing with unstructured data
> > > > is to store it as a byte array inside Parquet, or point to files that exist
> > > > in some object store with a string. These solutions fail to address these
> > > > use cases, because of scalability, usability, and governance issues.
> > > >
> > > > We would like to introduce a new logical type annotation in Parquet called
> > > > “File” for storing a struct that contains a path reference to a file with
> > > > additional metadata.
> > > >
> > > > We propose that the struct contains the following fields:
> > > >
> > > > path STRING NOT NULL -- the opaque path to a file
> > > > size INT64           -- the size of the file in bytes
> > > > content_type STRING  -- the MIME/content type of the file
> > > > etag STRING          -- the eTag identifier of the file. Can be used to
> > > >                      -- detect changes to a file
> > > >
> > > > The path will be stored as an opaque string: whatever the user provides.
> > > > We don’t do any special encoding on it. The size will be the size of the
> > > > file in bytes, as a long. We also store the content_type of the file and
> > > > its etag.
> > > >
> > > > We believe that this set of options is bare-bones and can be easily
> > > > extended in the future, if desired, with new optional fields that wouldn’t
> > > > impact the correctness of the file being read. We would like to introduce a
> > > > versioning field to the specification in case we need new fields in the
> > > > specification that may impact correctness when accessing a file.
> > > >
> > > > We would represent this in parquet.thrift
> > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > as:
> > > >
> > > > /**
> > > >  * File logical type annotation
> > > >  */
> > > > struct FileType {
> > > >   // Versioning specification of the File struct contents. Can be used if a
> > > >   // new field is introduced to the struct representing the file, which may
> > > >   // impact correctness when accessing the file.
> > > >   1: optional i8 specification_version
> > > > }
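For readers following the thread, the fields proposed above could be mirrored in memory roughly as in the sketch below. Only the four field names and the NOT NULL constraint on `path` come from the proposal; the class name `FileRef`, the helper `has_changed`, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory mirror of the proposed File struct."""
    path: str                           # opaque path, stored as provided (NOT NULL)
    size: Optional[int] = None          # size of the file in bytes (INT64)
    content_type: Optional[str] = None  # MIME/content type of the file
    etag: Optional[str] = None          # eTag identifier, usable to detect changes


def has_changed(stored: FileRef, current_etag: Optional[str]) -> Optional[bool]:
    """Compare the stored eTag with a freshly observed one.

    Returns None when either eTag is missing, since nothing can be concluded.
    """
    if stored.etag is None or current_etag is None:
        return None
    return stored.etag != current_etag


# Illustrative values only.
ref = FileRef(path="s3://bucket/images/cat.png", size=2048,
              content_type="image/png", etag="abc123")
print(has_changed(ref, "abc123"))
print(has_changed(ref, "def456"))
```

How an engine would actually obtain the current eTag depends on the storage system and is left out here.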
> > > >
> > > > We believe that by natively supporting File references in Parquet, it will
> > > > become much simpler to build AI workloads on top of data stored in Parquet
> > > > across table formats and data processing engines. Looking forward to your
> > > > feedback!
> > > >
> > >
> > >
> > >
> >
>
