[ 
https://issues.apache.org/jira/browse/HIVE-4223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628418#comment-13628418
 ] 

Yong Zhang commented on HIVE-4223:
----------------------------------

I will debug this problem further. I am busy with other work right now, so I 
have not had time to dig into it yet.

Here are some comments:
1) I have read the code again and again. The list is a local variable, and I 
cannot see how the reference could change during the recursive call, but the 
log shows that it did. I need to add more logging to find out why.
2) If you can compile Hive yourself, you can apply the change I made above 
to see if it fixes your problem.
3) My workaround is to add a fake column so that the inner and outer structs 
have the same number of elements.
4) I will test on Hive 0.10 later to see if it is fixed there, but from the 
code it does not look like it.
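
On point 1, one way a local variable can see its size change while the 
reference itself stays the same is if the inspector hands back the same list 
instance on every call. I have not confirmed that this is what happens here. 
The sketch below only shows the effect, using a hypothetical ReusingInspector 
class that is not Hive code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedListDemo {
    // Hypothetical inspector, NOT Hive code: it reuses one backing list
    // for every call instead of allocating a fresh list per struct.
    static class ReusingInspector {
        private final List<Object> buffer = new ArrayList<>();

        List<Object> getStructFieldsDataAsList(Object[] struct) {
            buffer.clear();
            buffer.addAll(Arrays.asList(struct));
            return buffer; // same instance every time
        }
    }

    // Returns {size seen before the second call, size seen after it,
    //          1 if both calls returned the same instance, else 0}.
    public static int[] observe() {
        ReusingInspector inspector = new ReusingInspector();

        // "list" is a local variable holding the outer struct's data.
        List<Object> list = inspector.getStructFieldsDataAsList(new Object[8]);
        int before = list.size();

        // A nested call asks the same inspector for the inner struct's data.
        List<Object> inner = inspector.getStructFieldsDataAsList(new Object[9]);
        int after = list.size();

        return new int[] {before, after, list == inner ? 1 : 0};
    }

    public static void main(String[] args) {
        int[] r = observe();
        System.out.println("before=" + r[0] + " after=" + r[1]
                + " sameInstance=" + (r[2] == 1));
    }
}
```

If something like this is going on, the local reference never changes, but 
the object it points to does, which would match what the log shows.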
                
> LazySimpleSerDe will throw IndexOutOfBoundsException in nested structs of 
> hive table
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-4223
>                 URL: https://issues.apache.org/jira/browse/HIVE-4223
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.9.0
>         Environment: Hive 0.9.0
>            Reporter: Yong Zhang
>
> The LazySimpleSerDe will throw IndexOutOfBoundsException if the column 
> structure is a struct containing an array of structs. 
> I have a table with one column defined like this:
> columnA
> array <
>     struct<
>        col1:primiType,
>        col2:primiType,
>        col3:primiType,
>        col4:primiType,
>        col5:primiType,
>        col6:primiType,
>        col7:primiType,
>        col8:array<
>             struct<
>               col1:primiType,
>               col2:primiType,
>               col3:primiType,
>               col4:primiType,
>               col5:primiType,
>               col6:primiType,
>               col7:primiType,
>               col8:primiType,
>               col9:primiType
>             >
>        >
>     >
> >
> In this example, the outer struct has 8 columns (including the array), and 
> the inner struct has 9 columns. As long as the outer struct has FEWER 
> columns than the inner struct, I think we will get the following exception 
> and stack trace in LazySimpleSerDe when it tries to serialize a row:
> Caused by: java.lang.IndexOutOfBoundsException: Index: 8, Size: 8
>         at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>         at java.util.ArrayList.get(ArrayList.java:322)
>         at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485)
>         at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:443)
>         at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:381)
>         at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:365)
>         at 
> org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:568)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at 
> org.apache.hadoop.hive.ql.exec.FilterOperator.processOp(FilterOperator.java:132)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:83)
>         at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
>         at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
>         at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:531)
>         ... 9 more
> I am not sure of the exact cause of this problem. I believe that 
> public static void serialize(ByteStream.Output out, Object 
> obj, ObjectInspector objInspector, byte[] separators, int level, Text 
> nullSequence, boolean escaped, byte escapeChar, boolean[] needsEscape) 
> invokes itself recursively when it encounters a nested structure. But for a 
> nested struct, the list reference gets messed up, and size() returns the 
> wrong value.
> In the example above, for these 2 lines:
>       List<? extends StructField> fields = soi.getAllStructFieldRefs();
>       list = soi.getStructFieldsDataAsList(obj);
> my StructObjectInspector (soi) returns the CORRECT data from both 
> getAllStructFieldRefs() and getStructFieldsDataAsList(). For example, 
> in one row, the outer 8-column struct has 2 elements in its inner array of 
> structs, and each element has 9 columns (as there are 9 columns in the inner 
> struct). At runtime, after I added more logging to LazySimpleSerDe, I see 
> the following behavior in the log:
> for 8 outside columns, loop
>     for 9 inside columns, loop for serialize
>     for 9 inside columns, loop for serialize
> The code breaks here: the outer loop tries to access a 9th element, which 
> does not exist in the outer struct, as the stack trace shows when it tries 
> to access index 8 of a list of size 8.
> I changed the following line of code, and that appears to fix the problem. 
> I don't know if it is the right fix, but it did resolve the problem for me 
> on the Hive 0.9.0 code:
> 481c481,482
> <         for (int i = 0; i < list.size(); i++) {
> ---
> >         int listSize = list.size();
> >         for (int i = 0; i < listSize; i++) {
> I believe the cause of this bug is that, with the current code
>         for (int i = 0; i < list.size(); i++)
> list.size() is invoked on every iteration. But with a nested 
> structure, list.size() returns a different result after the recursive 
> call, and that causes the problem I am facing.
> Thanks
> Yong Zhang
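
To make the loop-bound difference in the quoted description concrete, here is 
a simplified model of the recursive walk. It is a sketch only, not the real 
LazySimpleSerDe code: the SHARED list and the way structs are modeled as 
Object[] are assumptions made for illustration, and the method names are only 
borrowed from the two lines quoted above. It compares re-evaluating 
list.size() on every iteration with reading it once before the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LoopBoundDemo {
    // Assumption for illustration only: every struct's data comes back
    // in this one reused list, so a nested call overwrites its contents.
    private static final List<Object> SHARED = new ArrayList<>();

    public static List<Object> getStructFieldsDataAsList(Object[] struct) {
        SHARED.clear();
        SHARED.addAll(Arrays.asList(struct));
        return SHARED;
    }

    // Stand-in for the field refs: one name per column of the struct.
    public static List<String> getAllStructFieldRefs(Object[] struct) {
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < struct.length; i++) {
            fields.add("col" + (i + 1));
        }
        return fields;
    }

    // Walks a struct and returns how many fields were visited, nested ones
    // included. A List value is treated as an array of nested structs.
    public static int serialize(Object[] struct, boolean cacheSize) {
        List<String> fields = getAllStructFieldRefs(struct);
        List<Object> list = getStructFieldsDataAsList(struct);
        int listSize = list.size(); // read once, before any recursion
        int visited = 0;
        for (int i = 0; i < (cacheSize ? listSize : list.size()); i++) {
            fields.get(i); // throws once i runs past this struct's column count
            Object value = list.get(i);
            visited++;
            if (value instanceof List) {
                for (Object element : (List<?>) value) {
                    visited += serialize((Object[]) element, cacheSize);
                }
            }
        }
        return visited;
    }

    // Outer struct with 8 columns; the 8th is an array of two 9-column structs.
    public static Object[] sampleRow() {
        Object[] inner = new Object[9];
        Arrays.fill(inner, "x");
        List<Object[]> innerArray = new ArrayList<>();
        innerArray.add(inner);
        innerArray.add(inner);
        Object[] outer = new Object[8];
        Arrays.fill(outer, "x");
        outer[7] = innerArray;
        return outer;
    }

    public static void main(String[] args) {
        System.out.println("cached bound visited "
                + serialize(sampleRow(), true) + " fields");
        try {
            serialize(sampleRow(), false);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("re-evaluated bound failed: " + e.getMessage());
        }
    }
}
```

In this model the cached bound stops the outer loop after 8 columns, while the 
re-evaluated bound sees the inner size of 9 and runs one step too far. Note 
that under the same assumption the cached bound would only hide the symptom: 
if the nested array were not the last column, the values read after the 
recursive call would come from the inner struct.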
