[
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305690#comment-16305690
]
Paul Rogers commented on DRILL-6035:
------------------------------------
h4. Drill’s Preferred JSON Format
By now it should be clear that JSON supports a huge variety of data formats,
while Drill provides good support for one very specific format. The further the
actual format deviates from Drill’s preference, the more trouble Drill has
reading it. (In this sense, Drill’s claim to be schema-free and based on
arbitrary JSON is more of an aspiration than a reality.)
Drill’s preferred JSON format is:
* Data presented as a series of objects which correspond to Drill rows.
* Every object has the same set of name/value pairs which correspond to Drill
columns.
* Within the top-level object, keys are column names and values are the (scalar)
values of those columns.
* Every field has a single, fixed type.
* Fields with floating point numbers always include a decimal point.
* Null density is low. Specifically, the first batch of every file contains an
actual value for every field. (That is, no long runs of null or missing
columns.)
* If nested objects appear (singly, or in lists) they follow the same rules as
the top-level object, and directly represent application data. (That is, data
is not encoded in any fancy format.)
* Only single-dimension lists are allowed. Preferably, only a single tree of
lists (one that can be expanded with the `flatten()` function).
For example:
{code}
{ "id": 101, "name": "fred", "active": true, "balance": 123.45,
  "ship_address": {"line1": "301 Cobblestone Way", "city": "Bedrock"},
  "bill_address": {"line1": "345 Stonecave Road", "city": "Bedrock"}
}
{code}
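As a rough illustration of the rules above, the following Python sketch scans a file of one-object-per-line JSON and reports where rows deviate from the first row. It is not part of Drill; the function name `check_rows` and the message wording are made up for this example, and it only covers the "same fields" and "single, fixed type" rules for top-level fields.

```python
import json


def check_rows(lines):
    """Return a list of deviations from the first row's fields and types.

    Hypothetical helper for illustration only; Drill has no such function.
    """
    problems = []
    schema = None  # field name -> Python type seen in the first row
    for n, line in enumerate(lines, start=1):
        row = json.loads(line)
        if not isinstance(row, dict):
            problems.append(f"row {n}: not an object")
            continue
        types = {key: type(value) for key, value in row.items()}
        if schema is None:
            # The first row defines the expected columns and types.
            schema = types
            continue
        if set(types) != set(schema):
            problems.append(f"row {n}: field set differs from first row")
        # Compare types only for fields present in both rows.
        for key in sorted(types.keys() & schema.keys()):
            if types[key] is not schema[key]:
                problems.append(
                    f"row {n}: field '{key}' changed type from "
                    f"{schema[key].__name__} to {types[key].__name__}"
                )
    return problems


if __name__ == "__main__":
    sample = [
        '{"id": 101, "balance": 123.45}',
        '{"id": 102, "balance": 200}',  # no decimal point: parsed as an integer
        '{"id": 103}',                  # missing field
    ]
    for problem in check_rows(sample):
        print(problem)
```

Note that the second sample row trips the check only because `200` lacks a decimal point, which is the reason for the "always include a decimal point" rule. A null value also shows up as a type change here, which is stricter than the "low null density" rule.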
Drill works best when the JSON was created to comply with the above rules. If
we run the rules in reverse, we get the format that Drill creates when doing a
CTAS to JSON.
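The list expansion that `flatten()` performs, mentioned in the rules above, can be pictured with a small Python sketch. This only illustrates the idea (one output row per list element, with the other fields copied); it is not Drill's implementation, and the name `flatten_rows` is hypothetical.

```python
def flatten_rows(rows, column):
    """Yield one output row per element of the list stored in `column`.

    Illustration only: in this sketch, rows whose list is empty
    produce no output rows.
    """
    for row in rows:
        for element in row[column]:
            out = dict(row)        # copy the other fields unchanged
            out[column] = element  # replace the list with a single element
            yield out


if __name__ == "__main__":
    orders = [
        {"id": 1, "items": ["rock", "club"]},
        {"id": 2, "items": []},
    ]
    for flat in flatten_rows(orders, "items"):
        print(flat)
```

A single tree of lists can be unrolled by applying this step once per level, which is why that shape is the easiest for Drill to turn into rows.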
> Specify Drill's JSON behavior
> -----------------------------
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests
> that Drill is limited in the kinds of JSON it supports. This ticket
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed
> specification that clarifies what Drill does and does not support (or what
> it should and should not support).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)