Paul Rogers created DRILL-7598:
----------------------------------

             Summary: PostgreSQL-like functions for working with JSON
                 Key: DRILL-7598
                 URL: https://issues.apache.org/jira/browse/DRILL-7598
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


>From a contributor on the Drill user mailing list:
{quote}PostgreSQL has a practical way to manipulate the json data. You can 
read: [https://www.postgresql.org/docs/12/datatype-json.html].

{quote}
The user's use case is as follows:
{code:json}
{"a":"horses","b":"28","c":{"c1":"black","c2":"blue"}}
{"a":"rabbit","b":"14","c":{"c1":"green" ,"c4":"vanilla"}}
{"a":"cow"  ,"b":"28","c":{"c1":"blue" ,"c3":"black" ,"c5":{"d":"2","e":"3"}}}
{code}

Notice that the {{`c`}} column changes types. This causes Drill to fail in 
execution. Hence the suggestion to work with column {{c}} as JSON without 
parsing that JSON into Drill's relational schema.

Drill should offer such support. We've recently discussed introducing a similar 
feature in Drill which one could, with some humor, call "let JSON be JSON." The 
idea would be, as in PostreSQL, to simply represent JSON as text and allow the 
user to work with JSON using JSON-oriented functions. The PostreSQL link 
suggest that this is, in fact, a workable approach (though, as you not, doing 
so is slower than converting JSON to a relational structure.)

Today, however, Drill attempts to map JSON into a relational model so that the 
user can use [SQL operations to work on the 
data|https://drill.apache.org/docs/json-data-model/]. The Drill approach works 
well when the JSON is the output of a relational model (a dump of a relational 
table or query, say.) The approach does not work for "native" JSON in all its 
complexity. JSON is a superset of the relational model and so not all JSON 
files map to tables and columns.

To solve the user's use case, Drill would need to adopt a solution similar to 
PostgreSQL. In fact, Drill already has some of the pieces (such as the 
[CONVERT_TO/CONVERT_FROM 
operations|https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from]),
 but even these attempt to convert JSON to or from the relational model. What 
we need, so solve the general use case, are the kind of native JSON functions 
which PostgreSQL provides.

Fortunately, since Drill would store JSON as a VARCHAR, no work would be needed 
in the Drill "core". All that is needed is someone to provide a set of Drill 
functions (UDFs) to call out to some JSON library to perform the desired 
operations.

This feature would work best when the user can parse some parts of a JSON input 
file into relational structure, others as JSON. (This is the use case which the 
user list user faced.) So, we need a way to do that. See DRILL-7597 for a 
request for such a feature.

Combining the PostgreSQL-like JSON functions with the ability to read selected 
columns as JSON, might provide an elegant solution to the "messy JSON" problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to