[
https://issues.apache.org/jira/browse/HIVE-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ádám Szita resolved HIVE-24266.
-------------------------------
Fix Version/s: 4.0.0
Resolution: Fixed
Committed to master. Thanks for the review [~pvary]
> Committed rows in hflush'd ACID files may be missing from query result
> ----------------------------------------------------------------------
>
> Key: HIVE-24266
> URL: https://issues.apache.org/jira/browse/HIVE-24266
> Project: Hive
> Issue Type: Bug
> Reporter: Ádám Szita
> Assignee: Ádám Szita
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In an HDFS environment, if a writer uses hflush to write ORC ACID files
> during a transaction commit, committed rows may appear to be missing when
> the table is read before the file is completely persisted to disk (that is,
> synced).
> This happens because hflush does not persist the new buffers to disk; it only
> ensures that new readers can see the new content. As a result, the block
> information on which BISplitStrategy relies is incomplete. Although
> the side file (_flush_length) tracks the proper end of the file being
> written, this information is ignored in favour of the block information,
> and we may end up generating a very short split instead of one covering the
> larger, available length.
> When ETLSplitStrategy is used, there is not even an attempt to consult the
> ACID side file when calculating the file length, so that needs to be fixed too.
> Moreover, newly committed rows may fail to appear because of OrcTail
> caching in ETLSplitStrategy. For now I'm just going to recommend turning that
> cache off for anyone who wants real-time row updates to be readable:
> {code:java}
> set hive.orc.cache.stripe.details.mem.size=0; {code}
> ...as tinkering with that code would probably open a can of worms.
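
The length selection described above can be illustrated with a minimal sketch. It is not the committed patch: the class and method names are hypothetical, and it assumes the side file is a sequence of big-endian 8-byte lengths whose last complete entry marks the logical end of the data file. Under those assumptions, a split generator would prefer the side-file length over the length derived from block information whenever a side file is present.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only (not Hive's actual API).
final class FlushLengthSketch {

    // Returns the last complete 8-byte length recorded in the side file
    // contents, or -1 if no complete entry exists.
    // Assumption: entries are big-endian longs appended one after another.
    static long lastFlushLength(byte[] sideFile) {
        ByteBuffer buf = ByteBuffer.wrap(sideFile); // big-endian by default
        long last = -1L;
        while (buf.remaining() >= Long.BYTES) {
            last = buf.getLong();
        }
        // A trailing partial entry (fewer than 8 bytes) is ignored.
        return last;
    }

    // Picks the length to use for split generation: the side-file length
    // when one was recorded, otherwise the length reported by the file system.
    static long effectiveLength(long reportedLength, long flushLength) {
        return flushLength >= 0 ? flushLength : reportedLength;
    }

    // Test helper: encodes lengths the way the sketch assumes they are stored.
    static byte[] encode(long... lengths) {
        ByteBuffer buf = ByteBuffer.allocate(lengths.length * Long.BYTES);
        for (long length : lengths) {
            buf.putLong(length);
        }
        return buf.array();
    }
}
```

With a side file recording lengths 100 and 4096, and block information reporting only 100 bytes, `effectiveLength` would yield 4096, so the split covers all flushed rows rather than a short prefix.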
--
This message was sent by Atlassian Jira
(v8.3.4#803005)