[
https://issues.apache.org/jira/browse/DRILL-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Oscar Bernal updated DRILL-3353:
--------------------------------
Description:
I'm having trouble querying a data set whose nested object fields have a
varying schema. Most of my data for a specific type of record has the following
nested data:
{code}
"attributes":{"daysSinceInstall":0,"destination":"none","logged":"no","nth":1,"type":"organic","wearable":"no"}}
{code}
Among those records (hundreds of them) I have only two with a slightly
different schema:
{code}
"attributes":{"adSet":"Teste-Adwords-Engagement-Branch-iOS-230615-adset","campaign":"Teste-Adwords-Engagement-Branch-iOS-230615","channel":"Adwords","daysSinceInstall":0,"destination":"none","logged":"no","nth":4,"type":"branch","wearable":"no"}}
{code}
When trying to query the "new" fields, my queries fail:
With {code:sql}ALTER SYSTEM SET `store.json.all_text_mode` = true;{code}
{noformat}
0: jdbc:drill:zk=local> select log.event.attributes from
`dfs`.`root`.`/file.json` as log where log.si =
'07A3F985-4B34-4A01-9B83-3B14548EF7BE' and log.event.attributes.ad =
'Teste-FB-Engagement-Puro-iOS-230615';
Error: SYSTEM ERROR: java.lang.NumberFormatException:
Teste-FB-Engagement-Puro-iOS-230615"
Fragment 0:0
[Error Id: 22d37a65-7dd0-4661-bbfc-7a50bbee9388 on
ip-10-0-1-16.sa-east-1.compute.internal:31010] (state=,code=0)
{noformat}
With {code:sql}ALTER SYSTEM SET `store.json.all_text_mode` = false;{code}
{noformat}
0: jdbc:drill:zk=local> select log.event.attributes from
`dfs`.`root`.`/file.json` as log where log.si =
'07A3F985-4B34-4A01-9B83-3B14548EF7BE';
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a Bit type when
you are using a ValueWriter of type NullableVarCharWriterImpl.
File file.json
Record 35
Fragment 0:0
[Error Id: 5746e3e9-48c0-44b1-8e5f-7c94e7c64d0f on
ip-10-0-1-16.sa-east-1.compute.internal:31010] (state=,code=0)
{noformat}
If I try to extract all "attributes" from those events, Drill will only return
a subset of the fields, ignoring the others.
{noformat}
0: jdbc:drill:zk=local> select log.event.attributes from
`dfs`.`root`.`/file.json` as log where log.si =
'07A3F985-4B34-4A01-9B83-3B14548EF7BE' and log.type ='Opens App';
+----------------------------------------------------+
| EXPR$0 |
+----------------------------------------------------+
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
+----------------------------------------------------+
{noformat}
What I find strange is that I have thousands of records in the same file with
different schemas for different record types, and all other queries seem to run
well.
Is there something about how Drill infers schema that I might be missing here?
Does it infer based on a sample % of the data and fail for records that were
not taken into account while inferring schema? I suspect I wouldn't have this
error if I had hundreds of records with that other schema inside the file, but I
can't find anything in the docs or code to support that hypothesis. Perhaps
it's just a bug? Is it expected?
The troubleshooting guide seems to mention something about this, but it's very
vague in implying Drill doesn't fully support schema changes. I thought that was
mostly for data type changes, for which there are other well-documented issues.
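To check, independently of Drill, how many records carry each variant of the nested schema, a small script along these lines may help. It is only a sketch: it assumes the file holds one JSON record per line and that the nested object lives at `event.attributes` (as the queries above suggest); the function name `attribute_key_sets` is made up for illustration.

```python
import json
from collections import Counter


def attribute_key_sets(lines):
    """Count how often each distinct set of keys appears under event.attributes.

    `lines` is any iterable of strings, one JSON record per line
    (an assumption about the file layout, not something Drill requires).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Records without an "event" object or without "attributes" are skipped.
        event = record.get("event") if isinstance(record, dict) else None
        attrs = event.get("attributes") if isinstance(event, dict) else None
        if isinstance(attrs, dict):
            # A sorted tuple of key names identifies one schema variant.
            counts[tuple(sorted(attrs))] += 1
    return counts
```

Passing an open file handle to `attribute_key_sets` and iterating over the result's `most_common()` lists each key set with its record count, which makes rare variants (such as the two records with `adSet`, `campaign` and `channel`) easy to spot.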
was:
I'm having trouble querying a data set with varying schema for a nested object
fields. The majority of my data for a specific type of record has the following
nested data:
{code}
"attributes":{"daysSinceInstall":0,"destination":"none","logged":"no","nth":1,"type":"organic","wearable":"no"}}
{code}
Among those records (hundreds of them) I have only two with a slightly
different schema:
{code}
"attributes":{"adSet":"Teste-Adwords-Engagement-Branch-iOS-230615-adset","campaign":"Teste-Adwords-Engagement-Branch-iOS-230615","channel":"Adwords","daysSinceInstall":0,"destination":"none","logged":"no","nth":4,"type":"branch","wearable":"no"}}
{code}
When trying to query the "new" fields, my queries fail:
With {code:sql}ALTER SYSTEM SET `store.json.all_text_mode` = true;{code}
{noformat}
0: jdbc:drill:zk=local> select log.event.attributes from
`dfs`.`root`.`/file.json` as log where log.si =
'07A3F985-4B34-4A01-9B83-3B14548EF7BE' and log.event.attributes.ad =
'Teste-FB-Engagement-Puro-iOS-230615"';
Error: SYSTEM ERROR: java.lang.NumberFormatException:
Teste-FB-Engagement-Puro-iOS-230615"
Fragment 0:0
[Error Id: 22d37a65-7dd0-4661-bbfc-7a50bbee9388 on
ip-10-0-1-16.sa-east-1.compute.internal:31010] (state=,code=0)
{noformat}
With {code:sql}ALTER SYSTEM SET `store.json.all_text_mode` = false;`{code}
{noformat}
0: jdbc:drill:zk=local> select log.event.attributes from
`dfs`.`root`.`/file.json` as log where log.si =
'07A3F985-4B34-4A01-9B83-3B14548EF7BE';
Error: DATA_READ ERROR: Error parsing JSON - You tried to write a Bit type when
you are using a ValueWriter of type NullableVarCharWriterImpl.
File file.json
Record 35
Fragment 0:0
[Error Id: 5746e3e9-48c0-44b1-8e5f-7c94e7c64d0f on
ip-10-0-1-16.sa-east-1.compute.internal:31010] (state=,code=0)
{noformat}
If I try to extract all "attributes" from those events, Drill will only return
a subset of the fields, ignoring the others.
{noformat}
0: jdbc:drill:zk=local> select log.event.attributes from
`dfs`.`root`.`/file.json` as log where log.si =
'07A3F985-4B34-4A01-9B83-3B14548EF7BE' and log.type ='Opens App';
+----------------------------------------------------+
| EXPR$0 |
+----------------------------------------------------+
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
| {"logged":"no","wearable":"no","type":"xxxx"} |
+----------------------------------------------------+
{noformat}
What I find strange is that I have thousands of records in the same file with
different schema for different record types and all other queries seem run well.
Is there something about how Drill infers schema that I might be missing here?
Does it infer based on a sample % of the data and fail for records that were
not taken into account while inferring schema? I suspect I wouldn't have this
error if I had 100's of records with that other schema inside the file, but I
can't find anything in the docs or code to support that hypothesis. Perhaps
it's just a bug? Is it expected?
Troubleshooting guide seems to mention something about this but it's very vague
in implying Drill doesn't fully support schema changes. I thought that was for
data type changes mostly, for which there are other well documented issues.
> Non data-type related schema changes errors
> -------------------------------------------
>
> Key: DRILL-3353
> URL: https://issues.apache.org/jira/browse/DRILL-3353
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - JSON
> Affects Versions: 1.0.0
> Reporter: Oscar Bernal
> Assignee: Steven Phillips
> Fix For: 1.2.0
>
> Attachments: i-bfbc0a5c-ios-PulsarEvent-2015-06-23_19.json.zip