[ https://issues.apache.org/jira/browse/HIVE-7239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Illya Yalovyy updated HIVE-7239:
--------------------------------
Affects Version/s: (was: 2.0.1)
(was: 1.2.1)
(was: 0.13.1)
Status: Patch Available (was: Open)
> Fix bug in HiveIndexedInputFormat implementation that causes incorrect query
> result when input backed by Sequence/RC files
> --------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-7239
> URL: https://issues.apache.org/jira/browse/HIVE-7239
> Project: Hive
> Issue Type: Bug
> Components: Indexing
> Affects Versions: 2.1.0
> Reporter: Sumit Kumar
> Assignee: Illya Yalovyy
> Attachments: HIVE-7239.2.patch, HIVE-7239.patch
>
>
> For sequence files, it is crucial that splits are calculated around the
> boundaries enforced by the input sequence file. However, by default Hadoop
> creates input splits based on configuration parameters, which may not
> match the boundaries of the input sequence file. Hive provides
> HiveIndexedInputFormat, which adds extra logic and recalculates the
> boundaries of each split according to the sequence file's boundaries.
> However, we noticed "over"-reporting in query results for data backed by
> sequence files. We experimented with a sample data set and fixed this
> bug. We verified the fix by comparing the query output for input in
> sequence file format, RC file format, and regular format. However, we have
> not been able to find the right place to include this as a unit test that
> would execute as part of the Hive tests. We tried writing a "clientpositive"
> test as part of the ql module, but the output seems quite verbose and I
> couldn't interpret it well. Can someone please review this change and
> advise on how to write a test that will execute as part of Hive testing?
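
For readers unfamiliar with the split-boundary issue described above, here is a
minimal sketch of the general idea. It is not the code from the attached
patches and does not use any Hive or Hadoop API. It assumes a simplified model
in which a file is described only by the sorted byte offsets where its record
blocks begin (standing in for sequence file sync points). The class name
SplitAlignmentSketch and the methods nextSync and align are hypothetical, as
are the offsets in the usage note.

```java
import java.util.Arrays;

/**
 * Simplified model (an assumption, not Hive's implementation): a file is
 * represented only by the sorted, distinct byte offsets at which its record
 * blocks start.
 */
final class SplitAlignmentSketch {

    /** Returns the first block-start offset >= pos, or fileLength if none. */
    static long nextSync(long[] syncOffsets, long pos, long fileLength) {
        int i = Arrays.binarySearch(syncOffsets, pos);
        if (i < 0) {
            i = -i - 1; // binarySearch returns -(insertionPoint) - 1 on a miss
        }
        return i < syncOffsets.length ? syncOffsets[i] : fileLength;
    }

    /**
     * Moves both ends of a raw split [start, start + length) forward to the
     * next block boundary. Returns {alignedStart, alignedEnd}. Applying the
     * same rule to both ends means adjacent splits neither overlap nor leave
     * a gap, so each block is read by exactly one split.
     */
    static long[] align(long[] syncOffsets, long start, long length, long fileLength) {
        long alignedStart = nextSync(syncOffsets, start, fileLength);
        long alignedEnd = nextSync(syncOffsets, start + length, fileLength);
        return new long[] {alignedStart, alignedEnd};
    }
}
```

With hypothetical block starts at 0, 100, 250 and 400 in a 500-byte file, the
raw splits [0,128), [128,256), [256,384) and [384,500) align to [0,250),
[250,400), [400,400) (empty) and [400,500). If only the start or only the end
of a split were adjusted, neighbouring splits could read the same block twice,
which is one way a query could over-report rows.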
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)