Hello,

Sorry this hasn't gotten much attention recently. I just brought this up at
the Arrow community meeting, as I'd like to revive it.

It looks like there is a draft implementation up already [1].

I'm generally supportive of this, but I have a few questions:

1. Would we be able to make this extension type work on top of any of the
string types, including Utf8, LargeUtf8, and the (under consideration [2])
StringView types?
2. Does this imply a potential canonical extension type for every
text-based data format, such as HOCON, XML, and so on? If we agree JSON is
special, I think it's fine to have its own extension type. On the other
hand, it might be worth considering a generic extension type for
serialized data that is parameterized by the media type
("application/json" in this case). This doesn't preclude the possibility
of building an extension type class / struct within Arrow implementations
that is specific to JSON; I don't think there's any hard rule that there
has to be a 1-1 correspondence between extension types in the format and
the concrete data structures in libraries.
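
To make the idea in question 2 concrete, here is a minimal, stdlib-only
Python sketch of how a generic "serialized data" extension type could carry
its media type, while a library still exposes a JSON-specific class. The
extension name `example.serialized`, the `media_type` parameter, and the
class names are hypothetical and not part of any proposal; only the two
`ARROW:extension:*` field-metadata keys are taken from the Arrow format.

```python
import json
from dataclasses import dataclass

# Real Arrow field-metadata keys that carry extension type information.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

# Hypothetical: neither this name nor the "media_type" parameter is specified anywhere.
GENERIC_NAME = "example.serialized"

# Storage types raised in question 1 (StringView was still under discussion).
STRING_STORAGE = {"utf8", "large_utf8", "string_view"}


@dataclass(frozen=True)
class SerializedType:
    """Generic extension type for serialized data, parameterized by media type."""
    media_type: str
    storage: str = "utf8"

    def __post_init__(self):
        if self.storage not in STRING_STORAGE:
            raise ValueError(f"unsupported storage type: {self.storage}")

    def to_field_metadata(self) -> dict:
        # The parameter travels in the extension metadata as a JSON object.
        return {
            EXT_NAME_KEY: GENERIC_NAME,
            EXT_META_KEY: json.dumps({"media_type": self.media_type}),
        }

    @classmethod
    def from_field_metadata(cls, metadata: dict, storage: str = "utf8"):
        if metadata.get(EXT_NAME_KEY) != GENERIC_NAME:
            raise ValueError("not a serialized-data extension type")
        media_type = json.loads(metadata[EXT_META_KEY])["media_type"]
        # A library may still hand back a JSON-specific class for this one media type.
        if media_type == "application/json":
            return JsonType(storage=storage)
        return cls(media_type=media_type, storage=storage)


@dataclass(frozen=True)
class JsonType(SerializedType):
    """Library-level convenience class; same format-level type as SerializedType."""
    media_type: str = "application/json"
```

In this sketch the format defines one extension type, and `JsonType` exists
only in the library, which is the "no 1-1 correspondence" point above.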

Best,

Will Jones

[1] https://github.com/apache/arrow/pull/13901
[2] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt


On Thu, Dec 1, 2022 at 12:23 AM Antoine Pitrou <anto...@python.org> wrote:

>
> HOCON is a superset of JSON, so I'm not sure making it an extension type
> based on JSON would be a good idea.
>
>
> On Dec 1, 2022 at 06:23, Micah Kornfield wrote:
> >>
> >> Can a logical extension be based on another logical extension?
> >
> > Potentially, but this is mostly an implementation detail; each type should
> > have its own specification IMO.
> >
> > HOCON support might be nice..
> >
> > I'm not sure if this is common enough to warrant a canonical type within
> > Arrow but you are welcome to propose something if you would like.
> >
> > Cheers,
> > Micah
> >
> > On Mon, Nov 28, 2022 at 11:55 AM Lee, David <david....@blackrock.com.invalid>
> > wrote:
> >
> >> Can a logical extension be based on another logical extension?
> >>
> >> HOCON support might be nice..
> >>
> >> -----Original Message-----
> >> From: Micah Kornfield <emkornfi...@gmail.com>
> >> Sent: Monday, November 28, 2022 11:50 AM
> >> To: dev@arrow.apache.org
> >> Subject: Re: [DISCUSS] JSON Canonical Extension Type
> >>
> >>
> >> This seems like a reasonable definition to me.  Since there hasn't been
> >> much feedback, I think maybe following through an implementation + this
> >> description in a PR would be the next steps.  If there isn't further
> >> feedback on this, once the PR is up we can try to vote (which might
> >> bring up some more feedback, but hopefully wouldn't cause too much
> >> implementation churn).
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Thu, Nov 17, 2022 at 3:58 PM Pradeep Gollakota
> >> <pgollak...@google.com.invalid> wrote:
> >>
> >>> Hi folks!
> >>>
> >>> I put together this specification for canonicalizing the JSON type in
> >>> Arrow.
> >>>
> >>> ## Introduction
> >>> JSON is a widely used text-based data interchange format. There are
> >>> many use cases where a user has a column whose contents are a JSON
> >>> encoded string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical
> >>> Type][2] are two such examples.
> >>>
> >>> The JSON specification is defined in [RFC-8259][3]. However, many of
> >>> the most popular parsers support non-standard extensions. Examples of
> >>> non-standard extensions to JSON include comments, unquoted keys,
> >>> trailing commas, etc.
> >>>
> >>> ## Extension Specification
> >>> * The name of the extension is `arrow.json`
> >>> * The storage type of the extension is `utf8`
> >>> * The extension type has no parameters
> >>> * The metadata MUST be either empty or a valid JSON object
> >>>      - There is no canonical metadata
> >>>      - Implementations MAY include implementation-specific metadata by
> >>> using a namespaced key. For example `{"google.bigquery": {"my":
> >>> "metadata"}}`
> >>> * Implementations...
> >>>      - MUST produce valid UTF-8 encoded text
> >>>      - SHOULD produce valid standard JSON
> >>>      - MAY produce valid non-standard JSON
> >>>      - MUST support parsing standard JSON
> >>>      - MAY support parsing non-standard JSON
> >>>      - SHOULD pass through contents that they do not understand
> >>>
> >>> ## Forward compatibility
> >>> In the future we might allow this logical type to annotate a byte
> >>> storage type with a different text encoding. Implementations
> >>> consuming JSON logical types should therefore verify the storage type.
> >>>
> >>>      [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
> >>>      [2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
> >>>      [3]: https://datatracker.ietf.org/doc/html/rfc8259
> >>>
> >>
> >>
> >>
> >
>
