Thanks for pushing this along. I think it is important. Sorry I'm coming late to the conversation. A couple of thoughts:
- Should we reconsider having this be an independent optional field as opposed to overloading custom_metadata? It would avoid the weird string-prefixing behavior.
- I'd be inclined to be much more stringent about type naming. Maybe even make the name multiple parts to force the issue?

On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Micah,
>
> I have just updated my PR per your comments with more examples of
> extension types.
>
> https://github.com/apache/arrow/pull/4332
>
> Are there more comments about this? I can start a vote in a couple of
> days absent further opinions.
>
> Can someone volunteer to review David's Java PR? I would like to move
> this along so we have a chance of having working extension types in
> the 0.14 release. A number of people are also interested in bridging
> between pandas's ExtensionArray facility (for custom DataFrame column
> types [1]) and Arrow's ExtensionType
>
> Thanks
> Wes
>
> [1]:
> https://pandas.pydata.org/pandas-docs/stable/development/extending.html
>
> On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Hi Wes,
> > Like I said I think this approach looks good, I think what I'm looking
> > for is a little more documentation/examples on how additional types would
> > be handled. I think Tensor would be a good example, we also had questions
> > about INET addresses previously, maybe this would be a another good
> > illustrative example. Providing examples of serialized metadata in the
> > docs would be useful (clarifying that these are opaque binary blobs, that
> > will be passed along to extension type factories?)
> >
> > In this regard, I think it might be good to provide a further
> > recommendations for the name of extension types: What do you think about
> > recommend organization/projects namespace them to according to some
> > convention, so that there aren't conflicts and extensions can be shared?
> >
> > Thanks,
> > Micah
> >
> > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> >>
> >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> Hi Micah,
> >>>
> >>> The use cases I'm aware of are mostly coming from proprietary
> >>> applications. My idea was for the extension metadata to be as unobtrusive
> >>> as possible. The only alternative as I see it would be to have an Extension
> >>> value in the Type union which would be more intrusive to applications
> >>> handling data for which they have no special handling. That doesn't seem
> >>> desirable if there are alternatives.
> >>
> >> The other (3rd) option would be to add an extra member to Field. This
> >> is also a bit more intrusive than having fields in the custom_metadata
> >> dictionary.
> >>
> >>>
> >>> As an immediate use case we could use extension types to embed Tensor
> >>> values in Binary arrays.
> >>>
> >>> Wes
> >>>
> >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <emkornfi...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi Wes,
> >>>> This approach seems reasonable to me. I'm a little concerned we haven't
> >>>> validated many use-cases against the approach (but I don't see any obvious
> >>>> flaws).
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>>
> >>>> > As Micah brought up, as part of this we would like to formalize the
> >>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar to
> >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If someone
> >>>> > has a different idea about what the prefix should be I'm open to other
> >>>> > ideas
> >>>> >
> >>>> > [1] :
> >>>> > https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
> >>>> >
> >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>> > >
> >>>> > > hi folks,
> >>>> > >
> >>>> > > In a prior mailing list thread from February [1] I brought up some
> >>>> > > work I'd done in C++ to create an API to define custom data types that
> >>>> > > can be embedded in built-in Arrow logical types. These are serialized
> >>>> > > through IPC by adding special fields to the `custom_metadata` member
> >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if an
> >>>> > > implementation does not understand the custom type, then they can
> >>>> > > still interact with the underlying data if need be, or pass on the
> >>>> > > extension metadata in subsequent IPC messages.
> >>>> > >
> >>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
> >>>> > > help the project move forward I think it's a good time to formalize
> >>>> > > this, and if there are disagreements to hash them out now. I have just
> >>>> > > opened a PR to the Arrow specification documents [3] that describes
> >>>> > > the current state of C++ and also the WIP Java PR.
> >>>> > >
> >>>> > > Any thought about this? If there is consensus about this solution
> >>>> > > approach then I can hold a vote.
> >>>> > >
> >>>> > > Thanks
> >>>> > > Wes
> >>>> > >
> >>>> > > [1]:
> >>>> > > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
> >>>> > > [2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
> >>>> > > [3]: https://github.com/apache/arrow/pull/4332
> >>>> > > [4]: https://github.com/apache/arrow/pull/4251
> >>>> > >
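
For anyone skimming the thread, here is a rough sketch of the custom_metadata approach under discussion, written as plain Python dictionaries rather than against any Arrow library. It is only an illustration under assumptions: the thread fixes just the "ARROW:" prefix, so the two full key names, the "myorg.inet_address" type name, and all three helper functions are hypothetical.

```python
# Hypothetical key names: the thread only settles on the "ARROW:" prefix.
RESERVED_PREFIX = "ARROW:"
EXT_NAME_KEY = RESERVED_PREFIX + "extension:name"
EXT_METADATA_KEY = RESERVED_PREFIX + "extension:metadata"


def tag_field(custom_metadata, type_name, serialized):
    """Return a copy of a field's custom_metadata carrying extension info."""
    tagged = dict(custom_metadata)
    tagged[EXT_NAME_KEY] = type_name
    tagged[EXT_METADATA_KEY] = serialized  # opaque blob, not interpreted here
    return tagged


def resolve_extension(custom_metadata, factories):
    """Return (type_name, value) if a factory is registered, else None.

    A caller that gets None keeps working with the underlying storage type
    and forwards the metadata unchanged.
    """
    name = custom_metadata.get(EXT_NAME_KEY)
    factory = factories.get(name)
    if factory is None:
        return None
    return name, factory(custom_metadata.get(EXT_METADATA_KEY, b""))


def user_keys(custom_metadata):
    """Application-defined keys, i.e. those outside the reserved prefix."""
    return {
        key: value
        for key, value in custom_metadata.items()
        if not key.startswith(RESERVED_PREFIX)
    }


# Example: a namespaced (hypothetical) type name with a one-byte blob.
field_meta = tag_field({"owner": "analytics"}, "myorg.inet_address", b"\x04")
unknown = resolve_extension(field_meta, {})  # no factory registered
known = resolve_extension(field_meta, {"myorg.inet_address": len})
```

The sketch also shows the trade-off raised above: because extension info shares a dictionary with application keys, readers have to filter on the string prefix, which a dedicated optional field would avoid.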