[
https://issues.apache.org/jira/browse/DRILL-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212257#comment-15212257
]
Shankar commented on DRILL-4520:
--------------------------------
As you said, "Note that this is probably why you have success with Hive: it
simply deletes records with inconsistent schema."
Hive (running on YARN with MR2):
It seems there is no data loss; Hive does not delete a single record. It keeps
all the records as they are, in the format/data types (STRING, in my case)
declared when the Hive table was created, so there is no question of a mixed
schema. My assumption is that Hive internally treats mixed data types as STRING
and stores them that way in Parquet.
We have verified this on our 150GB of data, where:
{noformat}
[line counts] => [from JSON flat log files] = [from HDFS logs (uploaded
JSON flat logs)] = [from Hive Parquet tables]
{noformat}
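For reference, a minimal sketch of how such a Hive table could be declared so
that every field, including the mixed-type ajaxUrl, lands in Parquet uniformly
as STRING (the table name and column list here are illustrative, not our
production DDL):
{noformat}
-- Hypothetical Hive DDL sketch: declaring every column as STRING means a
-- field whose JSON value is sometimes a string, sometimes an array, and
-- sometimes null is still stored as a single STRING column in Parquet.
CREATE TABLE logs_parquet (
  `timestamp`  STRING,
  sessionid    STRING,
  gameid       STRING,
  ajaxUrl      STRING,
  ajaxData     STRING
)
STORED AS PARQUET;
{noformat}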
Please find below the complete stack traces for the second problem (without
and with casting):
{noformat}
0: jdbc:drill:> set `store.json.all_text_mode` = true;
+-------+------------------------------------+
| ok | summary |
+-------+------------------------------------+
| true | store.json.all_text_mode updated. |
+-------+------------------------------------+
0: jdbc:drill:> set `exec.enable_union_type` = true;
+-------+----------------------------------+
| ok | summary |
+-------+----------------------------------+
| true | exec.enable_union_type updated. |
+-------+----------------------------------+
1 row selected (1.451 seconds)
0: jdbc:drill:> select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > ajaxUrl,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t;
+----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
|   timestamp    |           sessionid            |                                            gameid                                           |        ajaxUrl        | ajaxData  |
+----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
| 1457658600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus1  | null      |
| 1457771458873  | D18104E8CA3071C7A8F4E141B127   | https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043 | []                    | null      |
| 1457958600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus2  | null      |
+----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
3 rows selected (1.985 seconds)
0: jdbc:drill:>
0: jdbc:drill:> create table dfs.tmp.test1 AS
. . . . . . . > select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > ajaxUrl,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: NullPointerException
Fragment 0:0
[Error Id: 41047931-3d29-46eb-a18f-f29d393d3bd0 on namenode:31010]
(state=,code=0)
0: jdbc:drill:>
0: jdbc:drill:> SET `exec.errors.verbose` = true ;
+-------+-------------------------------+
| ok | summary |
+-------+-------------------------------+
| true | exec.errors.verbose updated. |
+-------+-------------------------------+
1 row selected (0.095 seconds)
0: jdbc:drill:> create table dfs.tmp.test1 AS
. . . . . . . > select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > ajaxUrl,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: NullPointerException
Fragment 0:0
[Error Id: 0f0a0349-4e32-4e4c-aa17-9219641def64 on namenode:31010]
(java.lang.NullPointerException) null
org.apache.parquet.column.statistics.Statistics.getStatsBasedOnType():48
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>():78
org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>():52
org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>():232
org.apache.parquet.hadoop.ColumnChunkPageWriteStoreExposer.newColumnChunkPageWriteStore():36
org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():192
org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:>
0: jdbc:drill:> select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > cast(ajaxUrl as varchar(65536)) as ajaxurl1,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: DrillRuntimeException: Unable to cast union to LIST
Fragment 0:0
[Error Id: 37c95bde-09ef-4162-bc92-6bec04487c94 on namenode:31010]
(org.apache.drill.common.exceptions.DrillRuntimeException) Unable to cast
union to LIST
org.apache.drill.exec.expr.fn.ExceptionFunction.throwException():51
org.apache.drill.exec.test.generated.ProjectorGen5.doEval():145
org.apache.drill.exec.test.generated.ProjectorGen5.projectRecords():62
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:>
0: jdbc:drill:>
0: jdbc:drill:>
0: jdbc:drill:> create table dfs.tmp.test1 AS
. . . . . . . > select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > CONVERT_TO(ajaxUrl, 'UTF8') as ajaxurl1,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: DrillRuntimeException: Unable to cast union to LIST
Fragment 0:0
[Error Id: 96529240-8f8f-437a-a424-ea865bd68a95 on namenode:31010]
(org.apache.drill.common.exceptions.DrillRuntimeException) Unable to cast
union to LIST
org.apache.drill.exec.expr.fn.ExceptionFunction.throwException():51
org.apache.drill.exec.test.generated.ProjectorGen6.doEval():147
org.apache.drill.exec.test.generated.ProjectorGen6.projectRecords():62
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:>
{noformat}
> Error parsing JSON (a column with different datatypes)
> -------------------------------------------------------
>
> Key: DRILL-4520
> URL: https://issues.apache.org/jira/browse/DRILL-4520
> Project: Apache Drill
> Issue Type: Test
> Reporter: Shankar
>
> I am stuck and could use some help resolving the error below.
> I am running queries on Drill 1.6.0 on a cluster against JSON log data (150GB
> of log files, one JSON object per line).
> {quote}
> Possible solutions, in my opinion:
> 1. Drill should be able to ignore those lines (ANY data type) while reading
> or while creating the table (CTAS); a rough query-time approximation of this
> is sketched just after this quote.
> 2. Or the data should be stored as-is with an ANY data type whenever a field
> differs in data type across records. This would be useful when the other
> columns (excluding the ANY-typed columns) carry important information.
> {quote}
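> A minimal sketch of what option 1 could look like at query time, assuming the
> typeof() data type function is available in this Drill build and that
> exec.enable_union_type is on; this is hypothetical and untested here, and the
> CAST may still hit the "Unable to cast union to LIST" error reported above
> depending on how the plan orders the projection and the filter:
> {noformat}
> set `exec.enable_union_type` = true;
>
> -- Hypothetical workaround sketch: keep only the records whose ajaxUrl value
> -- arrived as a plain string and skip the ones where it arrived as a list.
> create table dfs.tmp.test1 AS
> select
>     `timestamp`,
>     sessionid,
>     gameid,
>     cast(ajaxUrl as varchar(65536)) as ajaxUrl,
>     ajaxData
> from dfs.`/tmp/test.json` t
> where typeof(ajaxUrl) = 'VARCHAR';
> {noformat}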
> h4. -------------- test.json --------------
> About the data:
> 1. I have extracted just 3 lines from the logs for test purposes.
> 2. The field called "ajaxUrl" differs in datatype: sometimes it contains a
> string, sometimes an array of JSON objects, and sometimes null.
> 3. In our case, some events in the 150GB JSON file differ in structure like
> this; I would say only about 0.1% of events (per 150GB JSON file) are like
> that.
> {noformat}
> {"ajaxData":null,"metadata":null,"ajaxUrl":"/player/updatebonus1","selectedItem":null,"sessionid":"BC497C7C39B3C90AC9E6E9E8194C3","timestamp":1457658600032}
> {"gameId":"https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043","ajaxData":null,"metadata":null,"ajaxUrl":[{"R":0,"rNo":1,"gid":4,"wal":0,"d":{"gid":4,"pt":3,"wc":2326,"top":"1","reg":true,"brkt":1457771400268,"sk":"25070010105301000009","id":56312439,"a":0,"st":1457771400000,"e":"0.0","j":0,"n":"Loot
> Qualifier
> 1","tc":94,"et":0,"syst":1457771456,"rc":145770000,"s":5,"t":10000,"tk":false,"prnId":56311896,"jc":10000,"tp":"100000.0","ro":145400000,"rp":0,"isprn":false},"fl":"192.168.35.42","aaid":"5828"}],"selectedItem":null,"sessionid":"D18104E8CA3071C7A8F4E141B127","timestamp":1457771458873}
> {"ajaxData":null,"metadata":null,"ajaxUrl":"/player/updatebonus2","selectedItem":null,"sessionid":"BC497C7C39B3C90AC9E6E9E8194C3","timestamp":1457958600032}
> {noformat}
> h4. -------------- Select Query (ERROR) --------------
> {noformat}
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> {noformat}
> {color:red}
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you are
> using a ValueWriter of type NullableVarCharWriterImpl.
> File /tmp/test.json
> Record 2
> Fragment 0:0
> {color}
> h4. -------------- Select Query (works fine with UNION type) --------------
> Tried the UNION type (an experimental feature) by setting `exec.enable_union_type` = true:
> {noformat}
> set `exec.enable_union_type` = true;
> +-------+----------------------------------+
> | ok | summary |
> +-------+----------------------------------+
> | true | exec.enable_union_type updated. |
> +-------+----------------------------------+
> 1 row selected (0.193 seconds)
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> |   timestamp    |           sessionid            |                                            gameid                                           |        ajaxUrl        | ajaxData  |
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> | 1457658600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus1  | null      |
> | 1457771458873  | D18104E8CA3071C7A8F4E141B127   | https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043 | []                    | null      |
> | 1457958600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus2  | null      |
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> 3 rows selected (0.965 seconds)
> {noformat}
> h4. -------------- CTAS Query (ERROR) --------------
> {noformat}
> set `exec.enable_union_type` = true;
> +-------+----------------------------------+
> | ok | summary |
> +-------+----------------------------------+
> | true | exec.enable_union_type updated. |
> +-------+----------------------------------+
> 1 row selected (0.193 seconds)
> create table dfs.tmp.test1 AS
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> {noformat}
> {color:red}
> Error: SYSTEM ERROR: NullPointerException
> Fragment 0:0
> {color}