> If there are no strong arguments against the current proposal, may I follow
> up with a pull request to apache/parquet-format
> <https://github.com/apache/parquet-format>? What would be the next steps?
> Or would I need to start a vote first?

Hi Burak,

New feature steps are listed in the format contributors guide [1]. If there are no objections we can move to step 2 (completeness): A PR against parquet-format and updates to the reference implementations (hopefully these are pretty trivial for this case).

I think we can probably start the PRs next week to give people a chance to digest the current proposal and speak up if there are hard objections.

Cheers,
Micah

[1] https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format

On Fri, Jun 5, 2026 at 8:25 AM Burak Yavuz <[email protected]> wrote:

> Hi all,
>
> Thank you all for the great discussion on the document! I made another pass on the doc. During the Parquet sync, there was alignment around keeping the field as simple and minimalistic as possible. I updated the doc in that way (removed content_type from the field) to ensure that the fields available are all functional fields for correctly reading a file.
>
> Please let me know if you have more feedback!
>
> If there are no strong arguments against the current proposal, may I follow up with a pull request to apache/parquet-format <https://github.com/apache/parquet-format>? What would be the next steps? Or would I need to start a vote first?
>
> Thanks,
> Burak
>
> On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:
>
> > Hello all,
> >
> > I'm sharing the design document for File Type here <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing>.
> > Please let me know what you think!
> > Wanted to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for their feedback working on this document.
> >
> > Steve, regarding your questions, my thoughts are inline:
> >
> > > 1. small inline blob somewhere within the parquet file (|data| = bytes)
> >
> > We have a lot of design options here. Does it need to be part of "File"? That's debatable.
> > Engines/table formats can decide to coalesce a File reference with an inline value when available for example. Carrying an inline binary blob may make analytics workloads more inefficient, specifically if you have to carry them around as baggage through sorts and shuffles.
> >
> > > 2. Medium blob: data stored range limited within a larger file (|data| = kilo to megabytes)
> >
> > Again, can be up to a table format to decide creating sidecar files, where the sidecar may be built on top of these file references.
> >
> > > 3. completely separate file (GB +), or somehow the data lifecycle isn't managed with parquet file.
> >
> > This file reference solves this problem as well.
> >
> > > lifecycle management you don't want to discover that your photo collection has been deleted by accident, and a data rewrite such as applying DVs shouldn't mandate rebuilding of external binary files.
> > > security, esp when providing credential access to tables. Credential providers would also need to provide file access, so have to know which binary files are associated with parquet files, somehow.
> >
> > These all sound like problems that should be handled at different layers of:
> > - table format
> > - engine
> > - catalog
> > to me.
> >
> > Looking forward to your feedback! Also @Antoine, I put in a blurb around the extension framework in there. Would love your thoughts on that.
> >
> > Best,
> > Burak
> >
> > On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]> wrote:
> >
> >> I do think FILE would be good, even though it gets complicated fast.
> >>
> >> It'd be good to support all of
> >>
> >> 1. small inline blob somewhere within the parquet file (|data| = bytes)
> >> 2. Medium blob: data stored range limited within a larger file (|data| = kilo to megabytes)
> >> 3. completely separate file (GB +), or somehow the data lifecycle isn't managed with parquet file.
> >>
> >> Issues I can see
> >>
> >> - lifecycle management you don't want to discover that your photo collection has been deleted by accident, and a data rewrite such as applying DVs shouldn't mandate rebuilding of external binary files.
> >> - security, esp when providing credential access to tables. Credential providers would also need to provide file access, so have to know which binary files are associated with parquet files, somehow.
> >>
> >> What have other formats done here?
> >>
> >> On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
> >>
> >> > For some reason, the original email never came through for me. This thread starts with Rahil's email. In case other people are having the same problem, here's the thread Burak is talking about:
> >> > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
> >> >
> >> > Ryan
> >> >
> >> > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
> >> >
> >> > > I'll share something early next week. The original proposal is in the first email in this thread.
> >> > >
> >> > > Best,
> >> > > Burak
> >> > >
> >> > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]> wrote:
> >> > >
> >> > > > Do we have a proposal for this yet? I'm excited to go over it and I thought one was mentioned in the last sync but I haven't seen it.
> >> > > >
> >> > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > Very sorry for the late reply, and thanks for the questions! The messages were not landing in my inbox properly.
> >> > > > >
> >> > > > > @Antoine
> >> > > > > > I feel like this is the kind of use case where a hypothetical extension type mechanism would be a better fit than hardcoding dedicated logical types in the Thrift definition.
> >> > > > >
> >> > > > > What would that look like? We wanted to introduce this logical type to Parquet specifically, so that table formats such as Delta and Iceberg can have a simpler protocol change, and that we could provide this as a consistent format across multiple data processing engines.
> >> > > > >
> >> > > > > @Rahil
> >> > > > > > I wanted to better understand one point. Based on the current spec you shared I see you have a parameter for the following:
> >> > > > > > > size INT64 -- the size of the file in bytes
> >> > > > > > Are you proposing that the "File" type always writes the binary content of something such as an image or video directly within the Parquet file (i.e., "inlining")? Or would it make sense for the spec to have some field distinguishing whether to store the content's bytes in the file itself vs simply track a pointer to the actual file in storage (i.e., keeping it "out of line").
> >> > > > >
> >> > > > > This is a great question. When it comes to FileType, the data will primarily be external to the parquet file, so the FileType would just store the pointer to the data.
> >> > > > > Now, can that data be inlined anyway? That is an optimization that can certainly be done. However, that requires some benchmarks to see how much the benefit would be.
> >> > > > > If compute engines were to carry this struct without any column pruning across all operations, having inline binary content would make operations like sorting and shuffling a lot more expensive.
> >> > > > > We couldn't instinctively justify whether this would be worth it just yet.
> >> > > > > However, the current proposed spec doesn't prevent you from also storing the content inline side by side with the pointer information.
> >> > > > >
> >> > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> >> > > > >
> >> > > > > > Hi Burak,
> >> > > > > >
> >> > > > > > Thanks for starting this discussion. I was also interested in raising this topic within the Parquet community (unless it has already been discussed in the past).
> >> > > > > > For users working with unstructured data today such as large text, images, or video, a data type such as a "file" or "blob" would be useful.
> >> > > > > >
> >> > > > > > I wanted to better understand one point. Based on the current spec you shared I see you have a parameter for the following:
> >> > > > > > > size INT64 -- the size of the file in bytes
> >> > > > > >
> >> > > > > > Are you proposing that the "File" type always writes the binary content of something such as an image or video directly within the Parquet file (i.e., "inlining")? Or would it make sense for the spec to have some field distinguishing whether to store the content's bytes in the file itself vs simply track a pointer to the actual file in storage (i.e., keeping it "out of line").
> >> > > > > > I would assume there are use cases where you would want to store the binary content of something, like a small image within the Parquet file instead of storing a pointer to a large video file in object storage.
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Rahil Chertara
> >> > > > > >
> >> > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> >> > > > > >
> >> > > > > > > Hello,
> >> > > > > > >
> >> > > > > > > I feel like this is the kind of use case where a hypothetical extension type mechanism would be a better fit than hardcoding dedicated logical types in the Thrift definition.
> >> > > > > > >
> >> > > > > > > Regards
> >> > > > > > >
> >> > > > > > > Antoine.
> >> > > > > > >
> >> > > > > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> >> > > > > > > > Hello Parquet community,
> >> > > > > > > >
> >> > > > > > > > Unstructured data ingestion is getting extremely popular with the advances in Generative AI. Today, our only means of dealing with unstructured data is to store it as a byte array inside Parquet, or point to files that exist in some object store with a string. These solutions fail to address these use cases, because of scalability, usability, and governance issues.
> >> > > > > > > >
> >> > > > > > > > We would like to introduce a new logical type annotation in Parquet called “File” for storing a struct that contains a path reference to a file with additional metadata.
> >> > > > > > > >
> >> > > > > > > > We propose that the struct contains the following fields:
> >> > > > > > > >
> >> > > > > > > > path          STRING NOT NULL  -- the opaque path to a file
> >> > > > > > > > size          INT64            -- the size of the file in bytes
> >> > > > > > > > content_type  STRING           -- the mime/content type of the file
> >> > > > > > > > etag          STRING           -- the eTag identifier of the file. Can be used to detect
> >> > > > > > > >                                -- changes to a file
> >> > > > > > > >
> >> > > > > > > > The path will be stored as an opaque string; whatever the user provides. We don’t do any special encoding on it. The size will be the size of the file in bytes, as a long. We also store the content_type of the file, and its etag.
> >> > > > > > > >
> >> > > > > > > > We believe that this set of options is bare-bones and can be easily extended by new optional fields in the future if desired that wouldn’t impact the correctness of the file being read. We would like to introduce a versioning field to the specification in case we need new fields in the specification that may impact correctness when accessing a file.
> >> > > > > > > >
> >> > > > > > > > We would represent this in parquet.thrift <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift> as:
> >> > > > > > > >
> >> > > > > > > > /**
> >> > > > > > > >  * File logical type annotation
> >> > > > > > > >  */
> >> > > > > > > > struct FileType {
> >> > > > > > > >   // Versioning specification of the File struct contents. Can be used if a new field is introduced to the
> >> > > > > > > >   // struct representing the file, which may impact correctness when accessing the file.
> >> > > > > > > >   1: optional i8 specification_version
> >> > > > > > > > }
> >> > > > > > > >
> >> > > > > > > > We believe that by natively supporting File references in Parquet, it will become much simpler to build AI workloads on top of data stored in Parquet across table formats and data processing engines. Looking forward to your feedback!
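
For readers skimming the thread, the File struct from the original proposal (path, size, content_type, etag) can be sketched in plain Python as below. This is only an illustration, not part of the proposal or of any Parquet library: the names FileReference and has_changed are hypothetical, the non-negative size check is an assumption, and content_type is kept only because the first message lists it (the June 5 update says it was removed from the doc).

```python
from dataclasses import dataclass
from typing import Optional

INT64_MAX = 2**63 - 1


@dataclass(frozen=True)
class FileReference:
    """Hypothetical in-memory view of one value of the proposed File struct."""

    path: str                           # path STRING NOT NULL: opaque, stored as given
    size: Optional[int] = None          # size INT64: file size in bytes
    content_type: Optional[str] = None  # listed in the first proposal, later removed
    etag: Optional[str] = None          # etag STRING: can be used to detect changes

    def __post_init__(self) -> None:
        if not isinstance(self.path, str):
            raise TypeError("path is required and must be a string")
        # Assumption: a size, when present, is non-negative and fits in INT64.
        if self.size is not None and not (0 <= self.size <= INT64_MAX):
            raise ValueError("size must be a non-negative INT64 value")


def has_changed(
    ref: FileReference,
    current_etag: Optional[str],
    current_size: Optional[int] = None,
) -> Optional[bool]:
    """Compare a stored reference with what the object store reports now.

    Returns True if the file looks changed, False if the etags match,
    and None when there is not enough information to decide.
    """
    if ref.etag is not None and current_etag is not None:
        return ref.etag != current_etag
    if ref.size is not None and current_size is not None and ref.size != current_size:
        return True  # a different size implies a change; an equal size proves nothing
    return None


ref = FileReference(path="s3://bucket/images/cat.png", size=1024, etag="abc123")
print(has_changed(ref, current_etag="abc123"))
print(has_changed(ref, current_etag="def456"))
```

The path is never parsed or normalized here, matching the "opaque string" wording in the proposal; lifecycle and credential handling are left out, since the thread places those at the table format, engine, or catalog layer.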
