[ https://issues.apache.org/jira/browse/HIVE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723303#comment-13723303 ]
Chaoyu Tang commented on HIVE-4223:
-----------------------------------

The previous comment was not formatted correctly, so I am re-posting it.

I was not able to reproduce the reported problem in hive-0.9.0 and wonder whether it might be related to the data. Here is my test case:

1. create table bcd (col1 array <struct<col1:string, col2:string, col3:string,col4:string,col5:string,col6:string,col7:string,col8:array<struct<col1:string,col2:string,col3:string,col4:string,col5:string,col6:string,col7:string,col8:string,col9:string>>>>) row format delimited fields terminated by '\001' collection items terminated by '\002' lines terminated by '\n' stored as textfile; -- same as the case described in this JIRA
2. load data local inpath '/root/nest_struct.data' overwrite into table bcd; -- see attached nest_struct.data
3. select col1 from bcd; -- got expected result
{code}
[{"col1":"c1v","col2":"c2v","col3":"c3v","col4":"c4v","col5":"c5v","col6":"c6v","col7":"c7v","col8":[{"col1":"c11v","col2":"c22v","col3":"c33v","col4":"c44v","col5":"c55v","col6":"c66v","col7":"c77v","col8":"c88v","col9":"c99v"}]}]
{code}

> LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of hive table
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-4223
>                 URL: https://issues.apache.org/jira/browse/HIVE-4223
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.9.0
>         Environment: Hive 0.9.0
>            Reporter: Yong Zhang
>         Attachments: nest_struct.data
>
>
> The LazySimpleSerDe will throw IndexOutOfBoundsException if the column structure is a struct containing an array of structs.
> I have a table with one column defined like this:
> columnA
>     array <
>         struct<
>             col1:primiType,
>             col2:primiType,
>             col3:primiType,
>             col4:primiType,
>             col5:primiType,
>             col6:primiType,
>             col7:primiType,
>             col8:array<
>                 struct<
>                     col1:primiType,
>                     col2:primiType,
>                     col3:primiType,
>                     col4:primiType,
>                     col5:primiType,
>                     col6:primiType,
>                     col7:primiType,
>                     col8:primiType,
>                     col9:primiType
>                 >
>             >
>         >
>     >
> In this example, the outside struct has 8 columns (including the array), and the inner struct has 9 columns. As long as the outside struct has FEWER columns than the inner struct, I think we will get the following exception and stack trace in LazySimpleSerDe when it tries to serialize a row:
> Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
>         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
>         ... 9 more
> I am not sure of the exact cause of this problem. I believe that the method
> public static void serialize(ByteStream.Output out, Object obj, ObjectInspector objInspector, byte[] separators, int level, Text nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape)
> invokes itself recursively when it encounters a nested structure. But for a nested struct, the list reference gets messed up, and size() returns the wrong value.
> In the example case above, for these 2 lines:
>       List<? extends StructField> fields = soi.getAllStructFieldRefs();
>       list = soi.getStructFieldsDataAsList(obj);
> my StructObjectInspector (soi) returns the CORRECT data from both getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, for one row, the outside struct has 8 columns, and I have 2 elements in the inner array of structs, each with 9 columns (as there are 9 columns in the inner struct). At runtime, after I added more logging to LazySimpleSerDe, I see the following behavior in the log:
> for 8 outside columns, loop
>     for 9 inside columns, loop for serialize
>     for 9 inside columns, loop for serialize
> The code breaks here: the outside loop tries to access a 9th element, which does not exist in the outside struct. This matches the stack trace, which shows an access to location 8 of a list of size 8.
> I changed the following line of code, and that appears to fix the problem.
> I don't know whether this is the right fix, but it did resolve the problem for me. I made the change on the hive 0.9.0 version of the code:
> 481c481,482
> <       for (int i = 0; i < list.size(); i++) {
> ---
> >       int listSize = list.size();
> >       for (int i = 0; i < listSize; i++) {
> I believe the cause of this bug is that, with the current code
>       for (int i = 0; i < list.size(); i++)
> list.size() is invoked on every iteration. In a nested structure, list.size() returns a different result after the recursive call, and that caused the problem I am facing.
> Thanks
> Yong Zhang
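
The failure mode hypothesized in the description can be sketched in isolation. This is only an illustration and is not Hive code: the class SharedListSizeSketch, its Struct type, the fieldsDataAsList() helper, and the cacheSize flag are all hypothetical. It also ASSUMES that the list of field data is a single buffer reused for every struct. The issue suggests something of that kind but does not confirm it. Under that assumption, an outer struct with 8 fields whose last field is an array of two 9-field structs fails when the loop re-reads list.size(), and completes when the size is read once before the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; none of these names come from Hive.
class SharedListSizeSketch {

    // Hypothetical struct value: field names plus field values. A value is a
    // String, or a List of nested Structs (standing in for array<struct<...>>).
    static final class Struct {
        final List<String> names;
        final List<Object> values;

        Struct(List<String> names, List<Object> values) {
            this.names = names;
            this.values = values;
        }
    }

    // ASSUMPTION: the accessor hands back one reused buffer for every struct.
    private static final List<Object> SHARED_BUFFER = new ArrayList<>();

    static List<Object> fieldsDataAsList(Struct s) {
        SHARED_BUFFER.clear();
        SHARED_BUFFER.addAll(s.values);
        return SHARED_BUFFER;
    }

    // cacheSize == false re-reads list.size() on every iteration;
    // cacheSize == true uses the value read once before the loop.
    static void serialize(Struct s, StringBuilder out, boolean cacheSize) {
        List<String> names = s.names;
        List<Object> list = fieldsDataAsList(s);
        int listSize = list.size(); // snapshot taken before any recursion
        for (int i = 0; i < (cacheSize ? listSize : list.size()); i++) {
            if (i > 0) {
                out.append(',');
            }
            String name = names.get(i); // throws once the bound exceeds names.size()
            Object value = list.get(i);
            out.append(name).append('=');
            if (value instanceof List) {
                for (Object element : (List<?>) value) {
                    out.append('{');
                    // The recursive call refills SHARED_BUFFER with the inner fields.
                    serialize((Struct) element, out, cacheSize);
                    out.append('}');
                }
            } else {
                out.append(value);
            }
        }
    }

    // Builds a struct with fieldCount string fields; if lastValue is non-null
    // it replaces the last field.
    static Struct struct(int fieldCount, String prefix, Object lastValue) {
        List<String> names = new ArrayList<>();
        List<Object> values = new ArrayList<>();
        for (int i = 1; i <= fieldCount; i++) {
            names.add("col" + i);
            values.add(prefix + i);
        }
        if (lastValue != null) {
            values.set(fieldCount - 1, lastValue);
        }
        return new Struct(names, values);
    }

    // Outer struct with 8 fields; the 8th is an array of two 9-field structs.
    static Struct sampleRow() {
        List<Object> innerArray = new ArrayList<>();
        innerArray.add(struct(9, "in", null));
        innerArray.add(struct(9, "in", null));
        return struct(8, "out", innerArray);
    }

    public static void main(String[] args) {
        Struct row = sampleRow();
        try {
            serialize(row, new StringBuilder(), false);
            System.out.println("re-reading size(): no exception");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("re-reading size(): " + e);
        }
        StringBuilder out = new StringBuilder();
        serialize(row, out, true);
        System.out.println("cached size: " + out);
    }
}
```

In this sketch, reading the size once only bounds the loop. It works here because the nested array is the last field of the outer struct. If a shared buffer really is involved, any outer field read after the recursive call would still see the inner struct's data, so a fix at the source of the shared list may be the safer direction.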