I do think FILE would be good, even though it gets complicated fast.

It'd be good to support all of:

   1. Small blob: data stored inline somewhere within the parquet file
   (|data| = bytes)
   2. Medium blob: data stored as a byte range within a larger file
   (|data| = kilobytes to megabytes)
   3. Large blob: a completely separate file (GB+), or any case where the
   data lifecycle isn't managed with the parquet file.
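
As a rough sketch of how those three tiers might be told apart (Python,
purely illustrative: the class names, the choose_tier helper, and both
size thresholds are my own assumptions, not part of the proposal):

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical cut-offs; the tiers above only give rough orders of magnitude.
INLINE_MAX_BYTES = 4 * 1024            # "bytes" tier
RANGED_MAX_BYTES = 64 * 1024 * 1024    # "kilobytes to megabytes" tier


@dataclass(frozen=True)
class InlineBlob:
    """Tier 1: bytes stored directly inside the parquet file."""
    data: bytes


@dataclass(frozen=True)
class RangedBlob:
    """Tier 2: a byte range within a larger container file."""
    path: str
    offset: int
    length: int


@dataclass(frozen=True)
class ExternalFile:
    """Tier 3: a separate file whose lifecycle parquet does not manage."""
    path: str
    size: int


BlobRef = Union[InlineBlob, RangedBlob, ExternalFile]


def choose_tier(size: int) -> str:
    """Pick a storage tier from the blob size alone (assumed policy)."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size <= INLINE_MAX_BYTES:
        return "inline"
    if size <= RANGED_MAX_BYTES:
        return "ranged"
    return "external"
```

A real writer would probably need more than size to decide (for example,
whether the file already exists elsewhere and must not be copied), so
treat the size-only policy as a placeholder.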

Issues I can see:

   - Lifecycle management: you don't want to discover that your photo
   collection has been deleted by accident, and a data rewrite such as
   applying deletion vectors (DVs) shouldn't mandate rebuilding the
   external binary files.
   - Security, especially when providing credential access to tables.
   Credential providers would also need to grant access to the binary
   files, so they have to know which binary files are associated with
   which parquet files, somehow.

What have other formats done here?

On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:

> For some reason, the original email never came through for me. This thread
> starts with Rahil's email. In case other people are having the same
> problem, here's the thread Burak is talking about:
> https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
>
> Ryan
>
> On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
>
> > I'll share something early next week. The original proposal is in the
> > first email in this thread.
> >
> > Best,
> > Burak
> >
> > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]>
> > wrote:
> >
> > > Do we have a proposal for this yet? I'm excited to go over it and I
> > > thought one was mentioned in the last sync but I haven't seen it.
> > >
> > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Very sorry for the late reply, and thanks for the questions! The
> > > > messages were not landing in my inbox properly.
> > > >
> > > > @Antoine
> > > > > I feel like this is the kind of use case where a hypothetical
> > > > > extension type mechanism would be a better fit than hardcoding
> > > > > dedicated logical types in the Thrift definition.
> > > >
> > > > What would that look like? We wanted to introduce this logical type
> > > > to Parquet specifically, so that table formats such as Delta and
> > > > Iceberg can have a simpler protocol change, and so that we could
> > > > provide this as a consistent format across multiple data processing
> > > > engines.
> > > >
> > > >
> > > > @Rahil
> > > > > I wanted to better understand one point. Based on the current
> > > > > spec you shared I see you have a parameter for the following:
> > > > > > size INT64 -- the size of the file in bytes
> > > > > Are you proposing that the "File" type always writes the binary
> > > > > content of something such as an image or video directly within
> > > > > the Parquet file (i.e., "inlining")? Or would it make sense for
> > > > > the spec to have some field distinguishing whether to store the
> > > > > content's bytes in the file itself vs simply track a pointer to
> > > > > the actual file in storage (i.e., keeping it "out of line").
> > > >
> > > > This is a great question. When it comes to FileType, the data will
> > > > primarily be external to the parquet file, so the FileType would
> > > > just store the pointer to the data.
> > > > Now, can that data be inlined anyway? That is an optimization that
> > > > can certainly be done. However, that requires some benchmarks to see
> > > > how much the benefit would be.
> > > > If compute engines were to carry this struct without any column
> > > > pruning across all operations, having inline binary content would
> > > > make operations like sorting and shuffling a lot more expensive.
> > > > We couldn't instinctively justify whether this would be worth it
> > > > just yet.
> > > > However, the current proposed spec doesn't prevent you from also
> > > > storing the content inline side by side with the pointer
> > > > information.
> > > >
> > > >
> > > >
> > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> > > >
> > > > > Hi Burak,
> > > > >
> > > > > Thanks for starting this discussion. I was also interested in
> > > > > raising this topic within the Parquet community (unless it has
> > > > > already been discussed in the past).
> > > > > For users working with unstructured data today such as large text,
> > > > > images, or video, a data type such as a "file" or "blob" would be
> > > > > useful.
> > > > >
> > > > > I wanted to better understand one point. Based on the current
> > > > > spec you shared I see you have a parameter for the following:
> > > > > > size INT64 -- the size of the file in bytes
> > > > >
> > > > > Are you proposing that the "File" type always writes the binary
> > > > > content of something such as an image or video directly within
> > > > > the Parquet file (i.e., "inlining")? Or would it make sense for
> > > > > the spec to have some field distinguishing whether to store the
> > > > > content's bytes in the file itself vs simply track a pointer to
> > > > > the actual file in storage (i.e., keeping it "out of line"). I
> > > > > would assume there are use cases where you would want to store
> > > > > the binary content of something, like a small image within the
> > > > > Parquet file instead of storing a pointer to a large video file
> > > > > in object storage.
> > > > >
> > > > > Regards,
> > > > > Rahil Chertara
> > > > >
> > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]>
> > > > > wrote:
> > > > >
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I feel like this is the kind of use case where a hypothetical
> > > > > > extension type mechanism would be a better fit than hardcoding
> > > > > > dedicated logical types in the Thrift definition.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > > >
> > > > > >
> > > > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > > > > > Hello Parquet community,
> > > > > > >
> > > > > > > Unstructured data ingestion is getting extremely popular with
> > > > > > > the advances in Generative AI. Today, our only means of dealing
> > > > > > > with unstructured data is to store it as a byte array inside
> > > > > > > Parquet, or point to files that exist in some object store with
> > > > > > > a string. These solutions fail to address these use cases,
> > > > > > > because of scalability, usability, and governance issues.
> > > > > > >
> > > > > > > We would like to introduce a new logical type annotation in
> > > > > > > Parquet called “File” for storing a struct that contains a path
> > > > > > > reference to a file with additional metadata.
> > > > > > >
> > > > > > > We propose that the struct contains the following fields:
> > > > > > >
> > > > > > > path STRING NOT NULL -- the opaque path to a file
> > > > > > >
> > > > > > > size INT64 -- the size of the file in bytes
> > > > > > >
> > > > > > > content_type STRING -- the mime/content type of the file
> > > > > > >
> > > > > > > etag STRING -- the eTag identifier of the file. Can be used to
> > > > > > > -- detect changes to a file
> > > > > > >
> > > > > > > The path will be stored as an opaque string; whatever the user
> > > > > > > provides. We don’t do any special encoding on it. The size will
> > > > > > > be the size of the file in bytes as a long. We also store the
> > > > > > > content_type of the file, and its etag.
> > > > > > >
> > > > > > > We believe that this set of options is bare-bones and can be
> > > > > > > easily extended by new optional fields in the future if desired
> > > > > > > that wouldn’t impact the correctness of the file being read. We
> > > > > > > would like to introduce a versioning field to the specification
> > > > > > > in case we need new fields in the specification that may impact
> > > > > > > correctness when accessing a file.
> > > > > > >
> > > > > > > We would represent this in parquet.thrift
> > > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > > > > as:
> > > > > > >
> > > > > > > /**
> > > > > > >  * File logical type annotation
> > > > > > >  */
> > > > > > > struct FileType {
> > > > > > >   // Versioning specification of the File struct contents. Can be
> > > > > > >   // used if a new field is introduced to the struct representing
> > > > > > >   // the file, which may impact correctness when accessing the file.
> > > > > > >   1: optional i8 specification_version
> > > > > > > }
> > > > > > >
> > > > > > > We believe that by natively supporting File references in
> > > > > > > Parquet, it will become much simpler to build AI workloads on
> > > > > > > top of data stored in Parquet across table formats and data
> > > > > > > processing engines. Looking forward to your feedback!
> > > > > > >
