[ https://issues.apache.org/jira/browse/DRILL-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212257#comment-15212257 ]

Shankar commented on DRILL-4520:
--------------------------------

As you said: "Note that this is probably why you have success with Hive: it 
simply deletes records with inconsistent schema."

@Hive (running Hive on YARN with MR2):
It seems there is no data loss; Hive does not delete a single record. It keeps 
all records, stored in the data type declared when the Hive table was created 
(STRING in my case), so there is no question of a mixed schema. My assumption 
is that Hive internally treats the mixed data types as strings and stores them 
that way in Parquet.
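
For reference, this is roughly the kind of Hive DDL assumed in the comparison 
above (the table and column names here are only illustrative; the real table 
has more columns). Declaring the mixed-type field as STRING means Hive stores 
whatever appears in the JSON, whether it was a plain string or a serialized 
array:
{noformat}
-- Hypothetical sketch only: every field whose JSON type varies is declared as
-- STRING, so values that are sometimes strings and sometimes arrays all end up
-- stored as text in the Parquet files.
CREATE TABLE logs_parquet (
  `timestamp` BIGINT,
  sessionid   STRING,
  gameid      STRING,
  ajaxurl     STRING,   -- mixed string/array/null field kept as plain text
  ajaxdata    STRING
)
STORED AS PARQUET;
{noformat}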

We have checked this on our 150 GB of data, where the line counts match at every stage:
{noformat}
[count of lines from json flat log files] = [count from hdfs logs (uploaded json flat logs)] = [count from hive parquet tables]
{noformat}
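
For what it's worth, the comparison above can be sketched as two counts that 
should match. This is only illustrative: the paths and table names are made 
up, it assumes the hive storage plugin is enabled in Drill, and in practice 
the raw-file count was probably taken with shell tools rather than SQL:
{noformat}
-- Count the raw JSON lines on HDFS directly through Drill...
SELECT count(*) FROM dfs.`/data/logs/json/`;
-- ...and compare with the row count of the Hive-built Parquet table.
SELECT count(*) FROM hive.logs_parquet;
{noformat}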

Please find below the complete stack traces for the second problem (captured 
both without and with casting):


{noformat}

0: jdbc:drill:> set `store.json.all_text_mode` = true;
+-------+------------------------------------+
|  ok   |              summary               |
+-------+------------------------------------+
| true  | store.json.all_text_mode updated.  |
+-------+------------------------------------+



0: jdbc:drill:> set `exec.enable_union_type` = true;
+-------+----------------------------------+
|  ok   |             summary              |
+-------+----------------------------------+
| true  | exec.enable_union_type updated.  |
+-------+----------------------------------+
1 row selected (1.451 seconds)
0: jdbc:drill:> select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > ajaxUrl,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t;
+----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
|   timestamp    |           sessionid            |                                            gameid                                           |        ajaxUrl        | ajaxData  |
+----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
| 1457658600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus1  | null      |
| 1457771458873  | D18104E8CA3071C7A8F4E141B127   | https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043  | []                    | null      |
| 1457958600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus2  | null      |
+----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
3 rows selected (1.985 seconds)
0: jdbc:drill:> 





0: jdbc:drill:> create table dfs.tmp.test1 AS 
. . . . . . . > select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > ajaxUrl,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 41047931-3d29-46eb-a18f-f29d393d3bd0 on namenode:31010] (state=,code=0)
0: jdbc:drill:>




0: jdbc:drill:> SET `exec.errors.verbose` = true ;
+-------+-------------------------------+
|  ok   |            summary            |
+-------+-------------------------------+
| true  | exec.errors.verbose updated.  |
+-------+-------------------------------+
1 row selected (0.095 seconds)
0: jdbc:drill:> create table dfs.tmp.test1 AS 
. . . . . . . > select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > ajaxUrl,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 0f0a0349-4e32-4e4c-aa17-9219641def64 on namenode:31010]

  (java.lang.NullPointerException) null
    org.apache.parquet.column.statistics.Statistics.getStatsBasedOnType():48
    org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>():78
    org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.<init>():52
    org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>():232
    org.apache.parquet.hadoop.ColumnChunkPageWriteStoreExposer.newColumnChunkPageWriteStore():36
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():192
    org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
    org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:> 




0: jdbc:drill:> select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > cast(ajaxUrl as varchar(65536)) as ajaxurl1,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: DrillRuntimeException: Unable to cast union to LIST

Fragment 0:0

[Error Id: 37c95bde-09ef-4162-bc92-6bec04487c94 on namenode:31010]

  (org.apache.drill.common.exceptions.DrillRuntimeException) Unable to cast union to LIST
    org.apache.drill.exec.expr.fn.ExceptionFunction.throwException():51
    org.apache.drill.exec.test.generated.ProjectorGen5.doEval():145
    org.apache.drill.exec.test.generated.ProjectorGen5.projectRecords():62
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:> 
0: jdbc:drill:> 
0: jdbc:drill:>



 
0: jdbc:drill:> create table dfs.tmp.test1 AS 
. . . . . . . > select
. . . . . . . > `timestamp`,
. . . . . . . > sessionid,
. . . . . . . > gameid,
. . . . . . . > CONVERT_TO(ajaxUrl, 'UTF8') as ajaxurl1,
. . . . . . . > ajaxData
. . . . . . . > from dfs.`/tmp/test.json` t ;
Error: SYSTEM ERROR: DrillRuntimeException: Unable to cast union to LIST

Fragment 0:0

[Error Id: 96529240-8f8f-437a-a424-ea865bd68a95 on namenode:31010]

  (org.apache.drill.common.exceptions.DrillRuntimeException) Unable to cast union to LIST
    org.apache.drill.exec.expr.fn.ExceptionFunction.throwException():51
    org.apache.drill.exec.test.generated.ProjectorGen6.doEval():147
    org.apache.drill.exec.test.generated.ProjectorGen6.projectRecords():62
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():91
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745 (state=,code=0)
0: jdbc:drill:> 




{noformat}
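
As a possible workaround for the "Unable to cast union to LIST" failures above, 
one could try dispatching on the runtime type instead of casting the union 
column directly. This is only an untested sketch: it assumes Drill's typeof() 
function is available in this build and that only the matching CASE branch is 
evaluated per row, so the LIST values never reach the cast; it may well hit 
the same error:
{noformat}
-- Untested sketch: keep ajaxUrl values that are plain strings and replace the
-- LIST (array) values with NULL, so the projected column has a single type
-- before the Parquet writer sees it.
create table dfs.tmp.test1 AS
select
  `timestamp`,
  sessionid,
  gameid,
  case when typeof(ajaxUrl) = 'VARCHAR'
       then cast(ajaxUrl as varchar(65536))
       else cast(null as varchar(65536))
  end as ajaxurl1,
  ajaxData
from dfs.`/tmp/test.json` t;
{noformat}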

> Error parsing JSON ( a column with different datatypes )
> --------------------------------------------------------
>
>                 Key: DRILL-4520
>                 URL: https://issues.apache.org/jira/browse/DRILL-4520
>             Project: Apache Drill
>          Issue Type: Test
>            Reporter: Shankar
>
> I am stuck and could use some help resolving the error below.
> I am running queries on Drill 1.6.0 in a cluster against JSON log data 
> (150 GB of log files, one JSON object per line).
> {quote}
> Possible solutions, in my opinion:
> 1. Either Drill should be able to ignore such lines (ANY data type) while 
> reading or while creating the table (CTAS).
> 2. Or the data should be stored as-is with an ANY data type whenever a 
> field's data type varies across records. This would be useful when the other 
> columns (excluding the ANY-typed ones) carry important information.
> {quote}
> h4. -------------- test.json --------------
> About the data:
> 1. I extracted just 3 lines from the logs for test purposes.
> 2. The field called "ajaxUrl" differs in datatype: sometimes it contains a 
> string, sometimes an array of JSON objects, and sometimes null.
> 3. In our case, some events in the 150 GB JSON file differ in structure like 
> this; I would estimate only about 0.1% of the events (per 150 GB file) are 
> such events.
> {noformat}
> {"ajaxData":null,"metadata":null,"ajaxUrl":"/player/updatebonus1","selectedItem":null,"sessionid":"BC497C7C39B3C90AC9E6E9E8194C3","timestamp":1457658600032}
> {"gameId":"https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043","ajaxData":null,"metadata":null,"ajaxUrl":[{"R":0,"rNo":1,"gid":4,"wal":0,"d":{"gid":4,"pt":3,"wc":2326,"top":"1","reg":true,"brkt":1457771400268,"sk":"25070010105301000009","id":56312439,"a":0,"st":1457771400000,"e":"0.0","j":0,"n":"Loot
>  Qualifier 
> 1","tc":94,"et":0,"syst":1457771456,"rc":145770000,"s":5,"t":10000,"tk":false,"prnId":56311896,"jc":10000,"tp":"100000.0","ro":145400000,"rp":0,"isprn":false},"fl":"192.168.35.42","aaid":"5828"}],"selectedItem":null,"sessionid":"D18104E8CA3071C7A8F4E141B127","timestamp":1457771458873}
> {"ajaxData":null,"metadata":null,"ajaxUrl":"/player/updatebonus2","selectedItem":null,"sessionid":"BC497C7C39B3C90AC9E6E9E8194C3","timestamp":1457958600032}
> {noformat}
> h4. -------------- Select Query  (ERROR) --------------
> {noformat}
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> {noformat}
> {color:red}
> Error: DATA_READ ERROR: Error parsing JSON - You tried to start when you are 
> using a ValueWriter of type NullableVarCharWriterImpl.
> File  /tmp/test.json
> Record  2
> Fragment 0:0
> {color}
> h4. -------------- Select Query (works Fine with UNION type) --------------
> Tried the UNION type (an experimental feature):
> set `exec.enable_union_type` = true;
> {noformat}
> set `exec.enable_union_type` = true;
> +-------+----------------------------------+
> |  ok   |             summary              |
> +-------+----------------------------------+
> | true  | exec.enable_union_type updated.  |
> +-------+----------------------------------+
> 1 row selected (0.193 seconds)
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> |   timestamp    |           sessionid            |                                            gameid                                           |        ajaxUrl        | ajaxData  |
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> | 1457658600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus1  | null      |
> | 1457771458873  | D18104E8CA3071C7A8F4E141B127   | https://daemon2.com/tournDetails.do?type=myGames&userId=1556148&jsonp_callback=jQuery213043  | []                    | null      |
> | 1457958600032  | BC497C7C39B3C90AC9E6E9E8194C3  | null                                                                                         | /player/updatebonus2  | null      |
> +----------------+--------------------------------+----------------------------------------------------------------------------------------------+-----------------------+-----------+
> 3 rows selected (0.965 seconds)
> {noformat}
> h4. -------------- CTAS Query (ERROR) --------------
> {noformat}
> set `exec.enable_union_type` = true;
> +-------+----------------------------------+
> |  ok   |             summary              |
> +-------+----------------------------------+
> | true  | exec.enable_union_type updated.  |
> +-------+----------------------------------+
> 1 row selected (0.193 seconds)
> create table dfs.tmp.test1 AS 
> select
> `timestamp`,
> sessionid,
> gameid,
> ajaxUrl,
> ajaxData
> from dfs.`/tmp/test.json` t
> ;
> {noformat}
> {color:red}
> Error: SYSTEM ERROR: NullPointerException
> Fragment 0:0
> {color}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
