Sumit Kumar created HIVE-7239:
---------------------------------

             Summary: Fix bug in HiveIndexedInputFormat implementation that 
causes incorrect query result when input backed by Sequence/RC files
                 Key: HIVE-7239
                 URL: https://issues.apache.org/jira/browse/HIVE-7239
             Project: Hive
          Issue Type: Bug
          Components: Indexing
    Affects Versions: 0.13.1
            Reporter: Sumit Kumar
            Assignee: Sumit Kumar


For sequence files, it is crucial that splits are calculated around the 
boundaries enforced by the input sequence file. However, by default Hadoop 
creates input splits based on configuration parameters, which may not 
match the boundaries of the input sequence file. Hive provides 
HiveIndexedInputFormat, which adds extra logic and recalculates the 
boundaries of each split according to the sequence file's boundaries.
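
To make the idea of recalculating split boundaries concrete, here is a minimal 
sketch in Java. It is not the HiveIndexedInputFormat code and not the fix 
proposed in this issue. The class SplitAlignmentSketch, its methods, and the 
list of sync offsets are all hypothetical, and offset 0 is assumed to count as 
a boundary. Each end of a raw split is moved forward to the next boundary so 
that adjacent splits neither overlap nor leave gaps.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Hive or Hadoop source code.
final class SplitAlignmentSketch {

    // Returns the first sync offset >= pos, or fileLength if there is none.
    static long nextSync(long[] sortedSyncOffsets, long pos, long fileLength) {
        int idx = Arrays.binarySearch(sortedSyncOffsets, pos);
        if (idx < 0) {
            idx = -idx - 1; // insertion point: index of first element greater than pos
        }
        return idx < sortedSyncOffsets.length ? sortedSyncOffsets[idx] : fileLength;
    }

    // Moves both ends of the raw split [start, start + length) forward to sync offsets.
    static long[] align(long[] sortedSyncOffsets, long start, long length, long fileLength) {
        long alignedStart = nextSync(sortedSyncOffsets, start, fileLength);
        long alignedEnd = nextSync(sortedSyncOffsets, start + length, fileLength);
        return new long[] {alignedStart, alignedEnd};
    }

    public static void main(String[] args) {
        long[] syncs = {0L, 100L, 250L, 400L}; // assumed boundary offsets (made-up data)
        long fileLength = 500L;
        // Raw splits as (start, length) pairs that ignore the file's boundaries.
        long[][] rawSplits = {{0L, 128L}, {128L, 128L}, {256L, 128L}, {384L, 116L}};
        for (long[] raw : rawSplits) {
            long[] a = align(syncs, raw[0], raw[1], fileLength);
            System.out.println("[" + raw[0] + ", " + (raw[0] + raw[1]) + ") -> ["
                    + a[0] + ", " + a[1] + ")");
        }
    }
}
```

With these made-up offsets the aligned ranges are [0, 250), [250, 400), 
[400, 400) and [400, 500), which cover the file exactly once. If the two ends 
were adjusted inconsistently, adjacent ranges could overlap and the records in 
the overlap would be counted twice, which is one way a query could over-report.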

However, we noticed that queries "over" report results when the input is 
backed by a sequence file. We experimented with a sample data set and fixed 
this bug, and we verified the fix by comparing the query output for the same 
input in sequence file, RC file and regular format. However, we have not been 
able to find the right place to include this as a unit test that would execute 
as part of the Hive tests. We tried writing a "clientpositive" test as part of 
the ql module, but the output seems quite verbose and I couldn't interpret it 
well. Can someone please review this change and advise on how to write a test 
that will execute as part of Hive testing?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
