[
https://issues.apache.org/jira/browse/DRILL-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621555#comment-14621555
]
ASF GitHub Bot commented on DRILL-3353:
---------------------------------------
GitHub user StevenMPhillips opened a pull request:
https://github.com/apache/drill/pull/86
DRILL-3353: Fix dropping nested fields
Use the SchemaChangeCallBack in more places to track schema changes
Reset the ephemeral transfer pair when making a new transfer pair for Map
or RepeatedMap
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StevenMPhillips/incubator-drill drill-3353
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/86.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #86
----
commit 6598d5efa99e7516a882ea17582d3014e13d3ca6
Author: Steven Phillips <[email protected]>
Date: 2015-07-09T00:35:09Z
DRILL-3353: Fix dropping nested fields
Use the SchemaChangeCallBack in more places to track schema changes
Reset the ephemeral transfer pair when making a new transfer pair for Map
or RepeatedMap
----
> Non data-type related schema changes errors
> -------------------------------------------
>
> Key: DRILL-3353
> URL: https://issues.apache.org/jira/browse/DRILL-3353
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - JSON
> Affects Versions: 1.0.0
> Reporter: Oscar Bernal
> Assignee: Steven Phillips
> Fix For: 1.2.0
>
> Attachments: i-bfbc0a5c-ios-PulsarEvent-2015-06-23_19.json.zip
>
>
> I'm having trouble querying a data set with a varying schema for a nested
> object's fields. The majority of my data for a specific type of record has the
> following nested data:
> {code}
> "attributes":{"daysSinceInstall":0,"destination":"none","logged":"no","nth":1,"type":"organic","wearable":"no"}}
> {code}
> Among those records (hundreds of them) I have only two with a slightly
> different schema:
> {code}
> "attributes":{"adSet":"Teste-Adwords-Engagement-Branch-iOS-230615-adset","campaign":"Teste-Adwords-Engagement-Branch-iOS-230615","channel":"Adwords","daysSinceInstall":0,"destination":"none","logged":"no","nth":4,"type":"branch","wearable":"no"}}
> {code}
> When trying to query the "new" fields, my queries fail:
> With {code:sql}ALTER SYSTEM SET `store.json.all_text_mode` = true;{code}
> {noformat}
> 0: jdbc:drill:zk=local> select log.event.attributes from
> `dfs`.`root`.`/file.json` as log where log.si =
> '07A3F985-4B34-4A01-9B83-3B14548EF7BE' and log.event.attributes.ad =
> 'Teste-FB-Engagement-Puro-iOS-230615';
> Error: SYSTEM ERROR: java.lang.NumberFormatException:
> Teste-FB-Engagement-Puro-iOS-230615"
> Fragment 0:0
> [Error Id: 22d37a65-7dd0-4661-bbfc-7a50bbee9388 on
> ip-10-0-1-16.sa-east-1.compute.internal:31010] (state=,code=0)
> {noformat}
> With {code:sql}ALTER SYSTEM SET `store.json.all_text_mode` = false;{code}
> {noformat}
> 0: jdbc:drill:zk=local> select log.event.attributes from
> `dfs`.`root`.`/file.json` as log where log.si =
> '07A3F985-4B34-4A01-9B83-3B14548EF7BE';
> Error: DATA_READ ERROR: Error parsing JSON - You tried to write a Bit type
> when you are using a ValueWriter of type NullableVarCharWriterImpl.
> File file.json
> Record 35
> Fragment 0:0
> [Error Id: 5746e3e9-48c0-44b1-8e5f-7c94e7c64d0f on
> ip-10-0-1-16.sa-east-1.compute.internal:31010] (state=,code=0)
> {noformat}
> If I try to extract all "attributes" from those events, Drill will only
> return a subset of the fields, ignoring the others.
> {noformat}
> 0: jdbc:drill:zk=local> select log.event.attributes from
> `dfs`.`root`.`/file.json` as log where log.si =
> '07A3F985-4B34-4A01-9B83-3B14548EF7BE' and log.type ='Opens App';
> +----------------------------------------------------+
> | EXPR$0 |
> +----------------------------------------------------+
> | {"logged":"no","wearable":"no","type":"xxxx"} |
> | {"logged":"no","wearable":"no","type":"xxxx"} |
> | {"logged":"no","wearable":"no","type":"xxxx"} |
> | {"logged":"no","wearable":"no","type":"xxxx"} |
> | {"logged":"no","wearable":"no","type":"xxxx"} |
> +----------------------------------------------------+
> {noformat}
> What I find strange is that I have thousands of records in the same file with
> different schemas for different record types, and all other queries seem to run
> well.
> Is there something about how Drill infers schema that I might be missing
> here? Does it infer based on a sample % of the data and fail for records that
> were not taken into account while inferring schema? I suspect I wouldn't have
> this error if I had 100's of records with that other schema inside the file,
> but I can't find anything in the docs or code to support that hypothesis.
> Perhaps it's just a bug? Is it expected?
> The troubleshooting guide seems to mention something about this, but it is
> vague and only implies that Drill doesn't fully support schema changes. I
> thought that applied mostly to data type changes, for which there are other
> well-documented issues.
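
To reproduce the report, it can help to first find which records carry the rare fields. The following is a minimal sketch for that, separate from Drill and from the pull request. It assumes the file holds one JSON record per line and that the nested object sits at `event.attributes`, as the queries above suggest. The function names `attribute_key_sets` and `rare_fields` are hypothetical.

```python
import json
from collections import Counter


def attribute_key_sets(lines):
    """Count how often each distinct set of `attributes` keys occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: record["event"]["attributes"], as in the queries.
        attrs = (record.get("event") or {}).get("attributes") or {}
        counts[frozenset(attrs)] += 1
    return counts


def rare_fields(counts):
    """Return the fields that are absent from the most common key set."""
    if not counts:
        return set()
    common, _ = counts.most_common(1)[0]
    all_keys = set().union(*counts)
    return all_keys - common
```

Passing an open file handle works, since the function only iterates over lines: `with open("file.json") as f: counts = attribute_key_sets(f)`. For the data described in the issue, `rare_fields(counts)` would be expected to list fields such as `adSet`, `campaign` and `channel`, which appear in only a few records.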
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)