[
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15630075#comment-15630075
]
Khurram Faraaz commented on DRILL-4653:
---------------------------------------
I don't think this is fixed; there are still some cases that need to be taken
care of. Please see below.
Also, and more importantly, this check for malformed JSON should be enabled by
default in Drill. Users would rather ignore bad records than see an
Exception/Error and then have our support team suggest that they enable
skip_invalid_records. I believe this should be ON by default in Drill.
[test@cent01 drill_4653]# cat badjson_01.json
{"key":"test string"}
{"key":"foo"}
{"key":"foobar"
{"key":"blah"}
{"key":"temp"}
[test@cent01 drill_4653]# cat badjson_02.json
{
"key":"foo",
"badarray":[1,3,4,5,6,7,8,,
"key":"test string",
"key":"foobar"
}
[test@cent01 drill_4653]#
[test@cent01 drill_4653]# cat badjson_03.json
{
"key":"foo",
"key":"foobar",
"key":"test string",
"key":"string",
"key":
}
[test@cent01 drill_4653]#
[test@cent01 drill_4653]# cat badjson_04.json
{"key":1}
{"key":2}
{"key":3}
{"key":
[test@cent01 drill_4653]#
[test@cent01 drill_4653]# cat badjson_05.json
{
"key1":"foobar",
"key2":[1,3,4,5,6,7,8,9],
"key3":{ "key4":},
"key5":"foo"
}
[test@cent01 drill_4653]#
[test@cent01 drill_4653]# cat badjson_06.json
{
"name":"John Doe",
"age":33,
"dept":"IT",
"address":{
"street":"some street",
"city":"some city",
"zip":
}
"isManager":"yes"
}
[test@cent01 drill_4653]#
{noformat}
0: jdbc:drill:schema=dfs.tmp> alter session set
`store.json.reader.skip_invalid_records`=true;
+-------+--------------------------------------------------+
| ok | summary |
+-------+--------------------------------------------------+
| true | store.json.reader.skip_invalid_records updated. |
+-------+--------------------------------------------------+
1 row selected (0.334 seconds)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_01.json`;
+--------------+
| key |
+--------------+
| test string |
| foo |
| temp |
+--------------+
3 rows selected (0.466 seconds)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_01.json`;
+--------------+
| key |
+--------------+
| test string |
| foo |
| temp |
+--------------+
3 rows selected (0.222 seconds)
{noformat}
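For comparison, here is a minimal line-oriented sketch in Python of the skip-on-error behaviour being asked for. It is not Drill's JsonReader: `read_records` and its `skip_invalid` parameter are hypothetical names, and the sketch assumes one JSON object per line, as in badjson_01.json. Because it works line by line, it also recovers "blah", which the Drill output above did not return.

```python
import json

# Contents of badjson_01.json from this comment; the third record
# is missing its closing brace.
SAMPLE = """{"key":"test string"}
{"key":"foo"}
{"key":"foobar"
{"key":"blah"}
{"key":"temp"}
"""


def read_records(text, skip_invalid=False):
    """Yield one parsed object per non-empty line (hypothetical helper, not Drill code)."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            if not skip_invalid:
                # Default behaviour: fail the whole read on the first bad line.
                raise ValueError(f"line {line_no}: {err.msg}") from err
            # skip_invalid=True: drop the bad line and keep reading.


if __name__ == "__main__":
    for rec in read_records(SAMPLE, skip_invalid=True):
        print(rec["key"])
```

With `skip_invalid=False` the same call raises on line 3, which corresponds to the fail-fast behaviour without the option. This per-line approach only applies to files like badjson_01.json and badjson_04.json; the multi-line objects in the other test files would need a stream-based recovery strategy.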
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_02.json`;
Error: DATA_READ ERROR: Unexpected character (',' (code 44)): expected a valid
value (number, String, array, object, 'true', 'false' or 'null')
at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712;
line: 3, column: 32]
Line 3
Column 33
Field badarray
Fragment 0:0
[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 on centos-01.qa.lab:31010]
(state=,code=0)
{noformat}
Stack trace from drillbit.log for the above failure:
{noformat}
org.apache.drill.common.exceptions.UserException: DATA_READ ERROR: Unexpected
character (',' (code 44)): expected a valid value (number, String, array,
object, 'true', 'false' or 'null')
at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712;
line: 3, column: 32]
Line 3
Column 33
Field badarray
[Error Id: 6da211b5-a287-4239-82b4-26a35e47ed10 ]
at
org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:543)
~[drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:586)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:372)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.ScanBatch.next(ScanBatch.java:178)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext(AbstractSingleRecordBatch.java:51)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext(ProjectRecordBatch.java:135)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext(ScreenCreator.java:81)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:232)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:226)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at java.security.AccessController.doPrivileged(Native Method)
[na:1.8.0_91]
at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_91]
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1595)
[hadoop-common-2.7.0-mapr-1607.jar:na]
at
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:226)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
[drill-common-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_91]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character
(',' (code 44)): expected a valid value (number, String, array, object, 'true',
'false' or 'null')
at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@3e4c2712;
line: 3, column: 32]
at
com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1586)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:521)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:450)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2628)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:854)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:748)
~[jackson-core-2.7.1.jar:2.7.1]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:537)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
... 24 common frames omitted
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_02.json`;
+------+
| key |
+------+
+------+
No rows selected (0.477 seconds)
{noformat}
The following query should return "foo", "foobar", "test string", "string" in 4 rows.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json`;
+------+
| key |
+------+
+------+
No rows selected (0.208 seconds)
{noformat}
The following query should return "foobar".
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_03.json` where key
='foobar';
+------+
| key |
+------+
+------+
No rows selected (0.253 seconds)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key from `badjson_04.json`;
+------+
| key |
+------+
| 1 |
| 2 |
| 3 |
+------+
3 rows selected (0.232 seconds)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_04.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input
within/between OBJECT entries
File /tmp/badjson_04.json
Record 4
Column 39
Fragment 0:0
[Error Id: a30668ff-8bdc-44bc-aeac-c566e2f731b6 on centos-01.qa.lab:31010]
(state=,code=0)
Stack trace from drillbit.log
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected
end-of-input within/between OBJECT entries
at [Source: org.apache.drill.exec.store.dfs.DrillFSDataInputStream@37039ebe;
line: 5, column: 39]
at
com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1586)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipColon2(UTF8StreamJsonParser.java:3038)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._skipColon(UTF8StreamJsonParser.java:2950)
~[jackson-core-2.7.1.jar:2.7.1]
at
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:756)
~[jackson-core-2.7.1.jar:2.7.1]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeData(JsonReader.java:350)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeDataSwitch(JsonReader.java:306)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.writeToVector(JsonReader.java:247)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.vector.complex.fn.JsonReader.write(JsonReader.java:202)
~[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
at
org.apache.drill.exec.store.easy.json.JSONRecordReader.next(JSONRecordReader.java:206)
[drill-java-exec-1.9.0-SNAPSHOT.jar:1.9.0-SNAPSHOT]
... 19 common frames omitted
{noformat}
The following query should return "foobar" in key1 and the array [1,3,4,5,6,7,8,9] in key2.
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_05.json`;
+-------+-------+
| key1 | key2 |
+-------+-------+
+-------+-------+
No rows selected (0.229 seconds)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key1 from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code
125)): expected a value
File /tmp/badjson_05.json
Record 1
Column 22
Fragment 0:0
[Error Id: 01a8ce3b-b0c0-41c5-92cd-3467265b60a6 on centos-01.qa.lab:31010]
(state=,code=0)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select key2 from `badjson_05.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code
125)): expected a value
File /tmp/badjson_05.json
Record 1
Column 22
Fragment 0:0
[Error Id: 40bb646b-18e7-4dff-812d-f409ea1fcf27 on centos-01.qa.lab:31010]
(state=,code=0)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select * from `badjson_06.json`;
+-------+------+-------+----------+
| name | age | dept | address |
+-------+------+-------+----------+
+-------+------+-------+----------+
No rows selected (0.205 seconds)
{noformat}
{noformat}
0: jdbc:drill:schema=dfs.tmp> select name from `badjson_06.json`;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected character ('}' (code
125)): expected a value
File /tmp/badjson_06.json
Record 1
Column 16
Fragment 0:0
[Error Id: b549023e-1f54-418c-adc5-9a21cf0ec3aa on centos-01.qa.lab:31010]
(state=,code=0)
{noformat}
> Malformed JSON should not stop the entire query from progressing
> ----------------------------------------------------------------
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - JSON
> Affects Versions: 1.6.0
> Reporter: subbu srinivasan
> Fix For: 1.9.0
>
>
> Currently a Drill query terminates upon first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something
> similar to a setting of (ignore.malformed.json) would help.