I'll share something early next week. The original proposal is in the first email in this thread.
Best,
Burak

On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]> wrote:

> Do we have a proposal for this yet? I'm excited to go over it and I thought
> one was mentioned in the last sync but I haven't seen it.
>
> On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
>
> > Hi all,
> >
> > Very sorry for the late reply, and thanks for the questions! The messages
> > were not landing in my inbox properly.
> >
> > @Antoine
> > > I feel like this is the kind of use case where a hypothetical extension
> > > type mechanism would be a better fit than hardcoding dedicated logical
> > > types in the Thrift definition.
> >
> > What would that look like? We wanted to introduce this logical type to
> > Parquet specifically, so that table formats such as Delta and Iceberg can
> > have a simpler protocol change, and so that we could provide this as a
> > consistent format across multiple data processing engines.
> >
> > @Rahil
> > > I wanted to better understand one point. Based on the current spec you
> > > shared I see you have a parameter for the following:
> > >
> > > > size INT64 -- the size of the file in bytes
> > >
> > > Are you proposing that the "File" type always writes the binary content
> > > of something such as an image or video directly within the Parquet file
> > > (i.e., "inlining")? Or would it make sense for the spec to have some field
> > > distinguishing whether to store the content's bytes in the file itself vs
> > > simply track a pointer to the actual file in storage (i.e., keeping it
> > > "out of line").
> >
> > This is a great question. When it comes to FileType, the data will
> > primarily be external to the parquet file, so the FileType would just
> > store the pointer to the data.
> > Now, can that data be inlined anyway? That is an optimization that can
> > certainly be done. However, that requires some benchmarks to see how much
> > the benefit would be.
> > If compute engines were to carry this struct without any column pruning
> > across all operations, having inline binary content would make operations
> > like sorting and shuffling a lot more expensive.
> > We couldn't instinctively justify whether this would be worth it just yet.
> > However, the current proposed spec doesn't prevent you from also storing
> > the content inline side by side with the pointer information.
> >
> > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> >
> > > Hi Burak,
> > >
> > > Thanks for starting this discussion. I was also interested in raising this
> > > topic within the Parquet community (unless it has already been discussed in
> > > the past).
> > > For users working with unstructured data today such as large text, images,
> > > or video, a data type such as a "file" or "blob" would be useful.
> > >
> > > I wanted to better understand one point. Based on the current spec you
> > > shared I see you have a parameter for the following:
> > >
> > > > size INT64 -- the size of the file in bytes
> > >
> > > Are you proposing that the "File" type always writes the binary content of
> > > something such as an image or video directly within the Parquet file (i.e.,
> > > "inlining")? Or would it make sense for the spec to have some field
> > > distinguishing whether to store the content's bytes in the file itself vs
> > > simply track a pointer to the actual file in storage (i.e., keeping it "out
> > > of line"). I would assume there are use cases where you would want to store
> > > the binary content of something, like a small image within the Parquet file
> > > instead of storing a pointer to a large video file in object storage.
> > >
> > > Regards,
> > > Rahil Chertara
> > >
> > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I feel like this is the kind of use case where a hypothetical extension
> > > > type mechanism would be a better fit than hardcoding dedicated logical
> > > > types in the Thrift definition.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > > > Hello Parquet community,
> > > > >
> > > > > Unstructured data ingestion is getting extremely popular with the advances
> > > > > in Generative AI. Today, our only means of dealing with unstructured data
> > > > > is to store it as a byte array inside Parquet, or point to files that exist
> > > > > in some object store with a string. These solutions fail to address these
> > > > > use cases, because of scalability, usability, and governance issues.
> > > > >
> > > > > We would like to introduce a new logical type annotation in Parquet called
> > > > > “File” for storing a struct that contains a path reference to a file with
> > > > > additional metadata.
> > > > >
> > > > > We propose that the struct contains the following fields:
> > > > >
> > > > > path STRING NOT NULL -- the opaque path to a file
> > > > > size INT64 -- the size of the file in bytes
> > > > > content_type STRING -- the mime/content type of the file
> > > > > etag STRING -- the eTag identifier of the file. Can be used to detect
> > > > >             -- changes to a file
> > > > >
> > > > > The path will be stored as an opaque string; whatever the user provides. We
> > > > > don’t do any special encoding on it. The size will be the size of the file
> > > > > in bytes as a long. We also store the content_type of the file, and its etag.
> > > > >
> > > > > We believe that this set of options is bare-bones and can be easily
> > > > > extended by new optional fields in the future if desired that wouldn’t
> > > > > impact the correctness of the file being read. We would like to introduce a
> > > > > versioning field to the specification in case we need new fields in the
> > > > > specification that may impact correctness, when accessing a file.
> > > > >
> > > > > We would represent this in parquet.thrift
> > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > > as:
> > > > >
> > > > > /**
> > > > >  * File logical type annotation
> > > > >  */
> > > > > struct FileType {
> > > > >   // Versioning specification of the File struct contents. Can be used if a
> > > > >   // new field is introduced to the struct representing the file, which may
> > > > >   // impact correctness when accessing the file.
> > > > >   1: optional i8 specification_version
> > > > > }
> > > > >
> > > > > We believe that by natively supporting File references in Parquet, it will
> > > > > become much simpler to build AI workloads on top of data stored in Parquet
> > > > > across table formats and data processing engines. Looking forward to your
> > > > > feedback!
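
For readers skimming the thread, the field list in the original proposal can be sketched as a small in-memory record. This is only an illustration under stated assumptions: the names `FileRef` and `has_changed` are hypothetical and are not part of the proposal or of any Parquet library, and the non-negative size check is an assumption the proposal does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory mirror of the proposed File struct."""

    path: str                           # opaque path, kept exactly as provided
    size: Optional[int] = None          # size of the file in bytes (INT64)
    content_type: Optional[str] = None  # MIME/content type of the file
    etag: Optional[str] = None          # eTag, usable to detect changes

    def __post_init__(self) -> None:
        # path is the only NOT NULL field in the proposed struct.
        if self.path is None:
            raise ValueError("path must not be null")
        # Assumption: sizes are non-negative and must fit in a signed INT64.
        if self.size is not None and not (0 <= self.size < 2**63):
            raise ValueError("size must be a non-negative INT64 value")


def has_changed(ref: FileRef, current_etag: Optional[str]) -> Optional[bool]:
    """Compare the stored eTag with a freshly observed one.

    Returns None when either eTag is missing, since no decision is possible.
    """
    if ref.etag is None or current_etag is None:
        return None
    return ref.etag != current_etag


ref = FileRef(path="s3://bucket/cat.png", size=1024,
              content_type="image/png", etag="abc")
print(has_changed(ref, "abc"))
print(has_changed(ref, "xyz"))
```

The record holds only the pointer and its metadata, which matches the reply above that the data is expected to live primarily outside the Parquet file.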
