Hi all,

Thank you all for the great discussion on the document! I made another pass
on the doc. During the Parquet sync, there was alignment around keeping the
File struct as simple and minimal as possible. I updated the doc accordingly
(removed content_type from the struct) so that every remaining field is a
functional one, i.e. needed to read a file correctly.
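
As a rough illustration of how a reader might use the remaining fields, here
is a minimal Python sketch. It assumes the struct keeps the path, size, and
etag fields from the original proposal in this thread; the FileRef and
is_stale names are hypothetical and not part of any Parquet API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory view of one File value (not a Parquet API)."""
    path: str                   # opaque path, stored exactly as provided
    size: Optional[int] = None  # size of the file in bytes
    etag: Optional[str] = None  # eTag identifier, usable to detect changes


def is_stale(ref: FileRef, current_etag: Optional[str]) -> bool:
    """Return True when both eTags are known and they differ."""
    if ref.etag is None or current_etag is None:
        return False  # nothing to compare, so no change can be detected
    return ref.etag != current_etag
```

A reader could compare the stored etag with the one reported by the object
store before trusting a cached copy of the referenced file.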

Please let me know if you have more feedback!

If there are no strong arguments against the current proposal, may I follow
up with a pull request to apache/parquet-format
<https://github.com/apache/parquet-format>? What would be the next steps?
Or would I need to start a vote first?

Thanks,
Burak

On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:

> Hello all,
>
> I'm sharing the design document for File Type here
> <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing>.
> Please let me know what you think!
> I want to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for their
> feedback while working on this document.
>
> Steve, regarding your questions, my thoughts are inline:
> > 1. small inline blob somewhere within the parquet file (|data| = bytes)
>
> We have a lot of design options here. Does it need to be part of "File"?
> That's debatable. Engines/table formats can decide to coalesce a File
> reference with an inline value when available, for example. Carrying an
> inline binary blob may make analytics workloads less efficient,
> especially if you have to carry it around as baggage through sorts and
> shuffles.
>
> > 2. Medium blob: data stored range limited within a larger file (|data| =
> > kilo to megabytes)
>
> Again, it can be up to the table format to decide whether to create sidecar
> files, where the sidecar may be built on top of these file references.
>
> > 3. completely separate file (GB +), or somehow the data lifecycle isn't
> > managed with parquet file.
>
> This file reference solves this problem as well.
>
> > - lifecycle management you don't want to discover that your photo
> >   collection has been deleted by accident, and a data rewrite such as
> >   applying DVs shouldn't mandate rebuilding of external binary files.
> > - security, esp when providing credential access to tables. Credential
> >   providers would also need to provide file access, so have to know which
> >   binary files are associated with parquet files, somehow.
>
> To me, these all sound like problems that should be handled at different
> layers:
>   - table format
>   - engine
>   - catalog
>
>
> Looking forward to your feedback! Also @Antoine, I put in a blurb around
> the extension framework in there. Would love your thoughts on that.
>
> Best,
> Burak
>
>
> On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]>
> wrote:
>
>> I do think FILE would be good, even though it gets complicated fast.
>>
>> It'd be good to support all of
>>
>>    1. small inline blob somewhere within the parquet file (|data| = bytes)
>>    2. Medium blob: data stored range limited within a larger file
>>    (|data| = kilo to megabytes)
>>    3. completely separate file (GB +), or somehow the data lifecycle isn't
>>    managed with parquet file.
>>
>> Issues I can see
>>
>>    - lifecycle management you don't want to discover that your photo
>>    collection has been deleted by accident, and a data rewrite such as
>>    applying DVs shouldn't mandate rebuilding of external binary files.
>>    - security, esp when providing credential access to tables. Credential
>>    providers would also need to provide file access, so have to know which
>>    binary files are associated with parquet files, somehow.
>>
>> What have other formats done here?
>>
>> On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
>>
>> > For some reason, the original email never came through for me. This
>> > thread starts with Rahil's email. In case other people are having the
>> > same problem, here's the thread Burak is talking about:
>> > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
>> >
>> > Ryan
>> >
>> > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
>> >
>> > > I'll share something early next week. The original proposal is in the
>> > > first email in this thread.
>> > >
>> > > Best,
>> > > Burak
>> > >
>> > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]>
>> > > wrote:
>> > >
>> > > > Do we have a proposal for this yet? I'm excited to go over it, and I
>> > > > thought one was mentioned in the last sync, but I haven't seen it.
>> > > >
>> > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > Very sorry for the late reply, and thanks for the questions! The
>> > > > > messages were not landing in my inbox properly.
>> > > > >
>> > > > > @Antoine
>> > > > > > I feel like this is the kind of use case where a hypothetical
>> > > > > > extension type mechanism would be a better fit than hardcoding
>> > > > > > dedicated logical types in the Thrift definition.
>> > > > >
>> > > > > What would that look like? We wanted to introduce this logical type
>> > > > > to Parquet specifically, so that table formats such as Delta and
>> > > > > Iceberg can have a simpler protocol change, and so that we could
>> > > > > provide this as a consistent format across multiple data processing
>> > > > > engines.
>> > > > >
>> > > > >
>> > > > > @Rahil
>> > > > > > I wanted to better understand one point. Based on the current spec
>> > > > > > you shared I see you have a parameter for the following:
>> > > > > > > size INT64 -- the size of the file in bytes
>> > > > > > Are you proposing that the "File" type always writes the binary
>> > > > > > content of something such as an image or video directly within the
>> > > > > > Parquet file (i.e., "inlining")? Or would it make sense for the
>> > > > > > spec to have some field distinguishing whether to store the
>> > > > > > content's bytes in the file itself vs simply track a pointer to
>> > > > > > the actual file in storage (i.e., keeping it "out of line").
>> > > > >
>> > > > > This is a great question. When it comes to FileType, the data will
>> > > > > primarily be external to the parquet file, so the FileType would
>> > > > > just store the pointer to the data.
>> > > > > Now, can that data be inlined anyway? That is an optimization that
>> > > > > can certainly be done. However, it requires some benchmarks to see
>> > > > > how much benefit it would bring.
>> > > > > If compute engines were to carry this struct without any column
>> > > > > pruning across all operations, having inline binary content would
>> > > > > make operations like sorting and shuffling a lot more expensive.
>> > > > > We couldn't justify on instinct alone whether this would be worth
>> > > > > it just yet.
>> > > > > However, the current proposed spec doesn't prevent you from also
>> > > > > storing the content inline side by side with the pointer
>> > > > > information.
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
>> > > > >
>> > > > > > Hi Burak,
>> > > > > >
>> > > > > > Thanks for starting this discussion. I was also interested in
>> > > > > > raising this topic within the Parquet community (unless it has
>> > > > > > already been discussed in the past).
>> > > > > > For users working with unstructured data today such as large text,
>> > > > > > images, or video, a data type such as a "file" or "blob" would be
>> > > > > > useful.
>> > > > > >
>> > > > > > I wanted to better understand one point. Based on the current spec
>> > > > > > you shared I see you have a parameter for the following:
>> > > > > > > size INT64 -- the size of the file in bytes
>> > > > > >
>> > > > > > Are you proposing that the "File" type always writes the binary
>> > > > > > content of something such as an image or video directly within the
>> > > > > > Parquet file (i.e., "inlining")? Or would it make sense for the
>> > > > > > spec to have some field distinguishing whether to store the
>> > > > > > content's bytes in the file itself vs simply track a pointer to
>> > > > > > the actual file in storage (i.e., keeping it "out of line"). I
>> > > > > > would assume there are use cases where you would want to store the
>> > > > > > binary content of something, like a small image within the Parquet
>> > > > > > file instead of storing a pointer to a large video file in object
>> > > > > > storage.
>> > > > > >
>> > > > > > Regards,
>> > > > > > Rahil Chertara
>> > > > > >
>> > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]>
>> > > > > > wrote:
>> > > > > >
>> > > > > > >
>> > > > > > > Hello,
>> > > > > > >
>> > > > > > > I feel like this is the kind of use case where a hypothetical
>> > > > > > > extension type mechanism would be a better fit than hardcoding
>> > > > > > > dedicated logical types in the Thrift definition.
>> > > > > > >
>> > > > > > > Regards
>> > > > > > >
>> > > > > > > Antoine.
>> > > > > > >
>> > > > > > >
>> > > > > > > Le 07/03/2026 à 01:57, Burak Yavuz a écrit :
>> > > > > > > > Hello Parquet community,
>> > > > > > > >
>> > > > > > > > Unstructured data ingestion is getting extremely popular with
>> > > > > > > > the advances in Generative AI. Today, our only means of dealing
>> > > > > > > > with unstructured data is to store it as a byte array inside
>> > > > > > > > Parquet, or point to files that exist in some object store with
>> > > > > > > > a string. These solutions fail to address these use cases,
>> > > > > > > > because of scalability, usability, and governance issues.
>> > > > > > > >
>> > > > > > > > We would like to introduce a new logical type annotation in
>> > > > > > > > Parquet called “File” for storing a struct that contains a path
>> > > > > > > > reference to a file with additional metadata.
>> > > > > > > >
>> > > > > > > > We propose that the struct contains the following fields:
>> > > > > > > >
>> > > > > > > > path STRING NOT NULL -- the opaque path to a file
>> > > > > > > >
>> > > > > > > > size INT64 -- the size of the file in bytes
>> > > > > > > >
>> > > > > > > > content_type STRING -- the mime/content type of the file
>> > > > > > > >
>> > > > > > > > etag STRING -- the eTag identifier of the file. Can be used to
>> > > > > > > > -- detect changes to a file
>> > > > > > > >
>> > > > > > > > The path will be stored as an opaque string: whatever the user
>> > > > > > > > provides. We don’t do any special encoding on it. The size will
>> > > > > > > > be the size of the file in bytes, as a long. We also store the
>> > > > > > > > content_type of the file, and its etag.
>> > > > > > > >
>> > > > > > > > We believe that this set of fields is bare-bones and can easily
>> > > > > > > > be extended in the future, if desired, with new optional fields
>> > > > > > > > that wouldn’t impact the correctness of the file being read. We
>> > > > > > > > would like to introduce a versioning field to the specification
>> > > > > > > > in case we need new fields that may impact correctness when
>> > > > > > > > accessing a file.
>> > > > > > > >
>> > > > > > > > We would represent this in parquet.thrift
>> > > > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
>> > > > > > > > as:
>> > > > > > > >
>> > > > > > > > /**
>> > > > > > > >  * File logical type annotation
>> > > > > > > >  */
>> > > > > > > > struct FileType {
>> > > > > > > >   // Versioning specification of the File struct contents. Can
>> > > > > > > >   // be used if a new field is introduced to the struct
>> > > > > > > >   // representing the file, which may impact correctness when
>> > > > > > > >   // accessing the file.
>> > > > > > > >   1: optional i8 specification_version
>> > > > > > > > }
>> > > > > > > >
>> > > > > > > > We believe that by natively supporting File references in
>> > > > > > > > Parquet, it will become much simpler to build AI workloads on
>> > > > > > > > top of data stored in Parquet across table formats and data
>> > > > > > > > processing engines. Looking forward to your feedback!
