On Wed, Mar 11, 2020 at 2:22 PM Antoine Pitrou <solip...@pitrou.net> wrote: > > On Wed, 11 Mar 2020 12:44:26 -0500 > Wes McKinney <wesmck...@gmail.com> wrote: > > On this note, in Python we should probably re-evaluate the data > > structure returned when accessing the "metadata" field. > > I think it's ok for the convenience API to return a dict, if we also > expose e.g. a "metadata_items" that returns an iterable of key/value > pairs.
The alternative would be a dict-like object that implements some of the dict APIs. Might also be useful to expose KeyValueMetadata::Merge in Python > Regards > > Antoine. > > > > > > On Wed, Mar 11, 2020 at 12:42 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > In the C++ library at least, uniqueness is never asserted when reading > > > and writing the IPC metadata [1] [2]. If you use > > > KeyValueMetadata::FindKey and the keys are non-unique, it will return > > > the first one it finds. KeyValueMetadata::Merge assumes uniqueness, > > > and the KeyValueMetadata::ToUnorderedMap function will drop all but > > > one duplicate. > > > > > > In Parquet, the metadata is also a list of KeyValue pairs with no > > > qualifications [3] > > > > > > My weak preference is to leave it to applications to make assertions > > > about uniqueness. In either case since the metadata is ordered in the > > > integration tests it would make sense to serialize as a list of > > > key/value pairs like {"key": $key, "value": $value} > > > > > > [1]: > > > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/metadata_internal.cc#L463 > > > [2]: > > > https://github.com/apache/arrow/blob/apache-arrow-0.16.0/cpp/src/arrow/ipc/metadata_internal.cc#L471 > > > [3]: > > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728 > > > > > > On Wed, Mar 11, 2020 at 12:11 PM Ben Kietzman <ben.kietz...@rstudio.com> > > > wrote: > > > > > > > > While working on https://issues.apache.org/jira/browse/ARROW-2255 > > > > (serialize custom_metadata in the integration tests), we had the > > > > following > > > > discussion on GitHub: > > > > https://github.com/apache/arrow/pull/6556#pullrequestreview-372405940 > > > > > > > > In short, although in Schema.fbs custom_metadata is declared as an > > > > array of > > > > KeyValue pairs (so duplicate keys would be possible), all reference > > > > implementations assume it to represent an associative map with unique > > > > keys. > > > > > > > > Is there a use case for duplicate metadata keys? It seems that an > > > > acceptable resolution might be to note in Schema.fbs that > > > > implementations > > > > are allowed to assume that keys are unique > > > > > > > > Ben > > > > >