[ https://issues.apache.org/jira/browse/HIVE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13615020#comment-13615020 ]
Navis commented on HIVE-4223:
-----------------------------

Can I ask whether the query that produced the above exception uses a UDTF?

> LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of
> hive table
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-4223
>                 URL: https://issues.apache.org/jira/browse/HIVE-4223
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.9.0
>         Environment: Hive 0.9.0
>            Reporter: Yong Zhang
>
> The LazySimpleSerDe will throw IndexOutOfBoundsException if the column
> structure is a struct containing an array of structs.
> I have a table with one column defined like this:
> columnA
>   array <
>     struct<
>       col1:primiType,
>       col2:primiType,
>       col3:primiType,
>       col4:primiType,
>       col5:primiType,
>       col6:primiType,
>       col7:primiType,
>       col8:array<
>         struct<
>           col1:primiType,
>           col2:primiType,
>           col3:primiType,
>           col4:primiType,
>           col5:primiType,
>           col6:primiType,
>           col7:primiType,
>           col8:primiType,
>           col9:primiType
>         >
>       >
>     >
>   >
> In this example, the outer struct has 8 columns (including the array), and
> the inner struct has 9 columns.
> As long as the outer struct has FEWER columns than the inner struct, I think
> we will get the following exception and stack trace in LazySimpleSerDe when
> it tries to serialize a row:
> Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
>         at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
>         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
>         ... 9 more
> I am not sure of the exact cause of this problem.
> I believe that the method
>   public static void serialize(ByteStream.Output out, Object obj,
>       ObjectInspector objInspector, byte[] separators, int level,
>       Text nullSequence, boolean escaped, byte escapeChar,
>       boolean[] needsEscape)
> invokes itself recursively when it encounters a nested structure. But for a
> nested struct, the list reference gets messed up, and size() returns the
> wrong value.
> In the case I ran into, for these 2 lines:
>   List<? extends StructField> fields = soi.getAllStructFieldRefs();
>   list = soi.getStructFieldsDataAsList(obj);
> my StructObjectInspector (soi) returns the CORRECT data from both
> getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, for
> one row, the outer 8-column struct holds 2 elements in the inner array of
> structs, and each element has 9 columns (as there are 9 columns in the inner
> struct). At runtime, after I added more logging to LazySimpleSerDe, I see the
> following behavior in the log:
>   for 8 outside columns, loop
>     for 9 inside columns, loop for serialize
>     for 9 inside columns, loop for serialize
>   code breaks here: the outside loop tries to access the 9th element, which
>   does not exist in the outside struct, as the stack trace shows (it tried
>   to access index 8 of a list of size 8).
> I changed the following line of code, and that appears to fix the problem. I
> don't know whether it is the right approach, but it did fix the problem for
> me, on the Hive 0.9.0 version of the code:
> 481c481,482
> <       for (int i = 0; i < list.size(); i++) {
> ---
> >       int listSize = list.size();
> >       for (int i = 0; i < listSize; i++) {
> I believe the cause of this bug is that when the loop is written the current
> way,
>   for (int i = 0; i < list.size(); i++)
> the method list.size() is invoked on every iteration.
> But in the nested structure, list.size() returns a different result during
> the recursive call, and that causes the problem I am currently facing.
> Thanks
> Yong Zhang
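
The loop-bound behavior described in the report can be reproduced outside Hive with a minimal Java sketch. Everything in it is hypothetical: the class SharedListLoopDemo, the SHARED list, and the helpers readStruct and visitOuter are illustration names, not Hive code. The sketch assumes, as the report suggests but does not confirm, that the same list instance is refilled while a nested struct is being serialized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedListLoopDemo {
    // Hypothetical: a single list instance reused for every struct's field data.
    private static final List<String> SHARED = new ArrayList<>();

    // Refill the shared list with `width` placeholder values and return it.
    static List<String> readStruct(int width) {
        SHARED.clear();
        for (int i = 0; i < width; i++) {
            SHARED.add("v" + i);
        }
        return SHARED;
    }

    // Walk an 8-field outer struct whose last field contains a 9-field inner struct.
    // Returns the number of outer fields visited.
    static int visitOuter(boolean cacheSize) {
        List<String> fieldRefs = Arrays.asList(
                "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8");
        List<String> list = readStruct(fieldRefs.size());
        int listSize = list.size(); // bound captured before any nested call
        int visited = 0;
        for (int i = 0; cacheSize ? i < listSize : i < list.size(); i++) {
            // Throws IndexOutOfBoundsException once i reaches 8.
            String name = fieldRefs.get(i);
            if (i == fieldRefs.size() - 1) {
                // Stand-in for the recursive call on the inner struct:
                // it refills the shared list with 9 entries.
                readStruct(9);
            }
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("cached bound visited " + visitOuter(true) + " fields");
        try {
            visitOuter(false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("live bound failed: " + e);
        }
    }
}
```

In this sketch the cached bound only prevents the loop from overrunning; the shared list still holds the inner struct's values after the nested call. If Hive behaves the same way, caching the size may be a workaround rather than a fix for the underlying list reuse.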