[ 
https://issues.apache.org/jira/browse/DRILL-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720708#comment-16720708
 ] 

Paul Rogers commented on DRILL-6390:
------------------------------------

Hi [~tta-scalefree], to recap, the issue occurs when the JSON schema is 
inconsistent and Drill attempts to use the experimental UNION, LIST or REPEATED 
LIST types. The "LATE" data type is used only by UNION, as I recall.

Your best bet remains to ensure your JSON follows a relational schema, which 
means every object (column) always has the same type.

In the example above, the initial null value forces Drill to choose a type, 
which it normally chooses as Nullable Int. Later, Drill discovers the type is 
an object (a nested record.)

One tying you could try would be to include an empty record in place of null:

{noformat}
{ data: [ { application: { } } ] }
{noformat}

Even better would be to use a JSON format closer to a relational model. This 
one has an array of objects containing objects. This model, while it should 
work in Drill, is awkward to use. The following might be easier to use:

{noformat}
{
    "data": 
        [
            {
                "application_id": null,
                "application_name": null,
                "application_object": null,
                "application_fee": null,
                "amount": 1
            },
            {
                "application_id": "some_id",
                "application_name":  "Some App",
                "application_object": "application",
                "application_fee": null,
                "amount": 5
            }
        ],
    "has_more": false,
    "object": "list",
    "url": "/v1/charges"
}
{noformat}

Then, you can use the new lateral join feature to flatten the data into the 
relational format that JDBC and ODBC expect.

Remember: Drill is a relational query engine. Though it reads JSON, it expects 
the JSON to be a valid, self-describing (extended) relational format. Drill 
can't predict the future, meaning that if you file starts with null values, 
Drill has no way to know the actual types.

At the most recent Drill Meetup, the developers described some exciting new 
work to allow you to specify a schema to resolve ambiguities, such as initial 
null values. Watch the dev list for updates as that work proceeds. This case is 
a very typical example of the kinds of problems this new feature is meant to 
solve.

> UnsupportedOperationException: Unable to get holder type for minor type 
> [LATE] and mode [OPTIONAL]
> --------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-6390
>                 URL: https://issues.apache.org/jira/browse/DRILL-6390
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types, Storage - JSON
>    Affects Versions: 1.11.0
>            Reporter: Trung Ta
>            Priority: Major
>
> I tried this query on some of our JSON files that are stored in a S3 Bucket, 
> in which I tried to convert JSON objects (application, application_fee) into 
> varchar:
> {code:java}
> SELECT
>      CONCAT(
>          '"', CONVERT_TO(_stg.data.application, 'JSON'), '"'
>          , ',', '"', CONVERT_TO(_stg.data.application_fee, 'JSON'), '"'
>      ) as JSONs
>  FROM (
>      SELECT
>          flatten(js.data) AS data
>      FROM
>          s3_bucket.`<PATH_TO_FILE_IN_S3>` AS js
>      ) _stg
> {code}
> On some of the datasets the query failed and I got this error message
> {quote}org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
> UnsupportedOperationException: Unable to get holder type for minor type 
> [LATE] and mode [OPTIONAL] Fragment 0:0 [Error Id: 
> 5e74e125-de22-46fa-8d31-233d0fc40140
> {quote}
> The dataset, on which the query failed looks like following:
> {code:java}
> {
>     "data": 
>         [
>             {
>                 "application": null,
>                 "application_fee": null,
>                 "amount": 1
>             },
>             {
>                 "application": 
>                     {
>                          "id": "some_id",
>                          "name": "Some App",
>                          "object": "application"
>                     },
>                 "application_fee": null,
>                 "amount": 5
>             }
>         ],
>     "has_more": false,
>     "object": "list",
>     "url": "/v1/charges"
> }
> {code}
> I ran some further tests on the query and found out that the query only fails 
> at object "application_fee", which is null in both arrays of object "data". 
> Which might explain why I never ran into this error in the other datasets (no 
> object arrays there are null).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to