[ https://issues.apache.org/jira/browse/HIVE-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ádám Szita resolved HIVE-24266.
-------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Committed to master. Thanks for the review [~pvary]

> Committed rows in hflush'd ACID files may be missing from query result
> ----------------------------------------------------------------------
>
>                 Key: HIVE-24266
>                 URL: https://issues.apache.org/jira/browse/HIVE-24266
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In an HDFS environment, if a writer uses hflush to write ORC ACID files
> during a transaction commit, the committed rows may appear to be missing
> when the table is read before the file is completely persisted to disk
> (and thus synced).
> This happens because hflush does not persist the new buffers to disk; it
> only ensures that new readers can see the new content. As a result, the
> block information that BISplitStrategy relies on is incomplete. Although
> the side file (_flush_length) tracks the proper end of the file being
> written, this information is ignored in favour of the block information,
> and we may end up generating a very short split instead of one covering
> the larger, available length.
> When ETLSplitStrategy is used, there is not even an attempt to consult the
> ACID side file when calculating the file length, so that needs to be fixed
> too.
> Moreover, newly committed rows may fail to appear because of OrcTail
> caching in ETLSplitStrategy. For now I'm just going to recommend turning
> that cache off to anyone who wants real-time row updates to be read:
> {code:java}
> set hive.orc.cache.stripe.details.mem.size=0; {code}
> ..as tweaking that code would probably open a can of worms..
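
The length selection described in the issue can be sketched roughly as follows. This is an illustration only, not the committed patch: the class and method names (FlushLengthSketch, sideFileOf, lastFlushLength, effectiveLength) are hypothetical, and the sketch assumes that the _flush_length side file is a sequence of 8-byte big-endian longs whose last complete entry is the most recently flushed length. It prefers the side-file value over the block-reported length whenever a side file entry exists.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: choose the readable length of an hflush'd ACID file.
final class FlushLengthSketch {

    // Builds side-file contents for the demo: one 8-byte big-endian long per flush.
    // (Assumed layout; ByteBuffer is big-endian by default.)
    public static byte[] sideFileOf(long... flushedLengths) {
        ByteBuffer buf = ByteBuffer.allocate(flushedLengths.length * Long.BYTES);
        for (long length : flushedLengths) {
            buf.putLong(length);
        }
        return buf.array();
    }

    // Returns the last complete entry of the side file, or -1 if it has none.
    // A trailing partial entry (fewer than 8 bytes) is ignored.
    public static long lastFlushLength(byte[] sideFileBytes) {
        ByteBuffer buf = ByteBuffer.wrap(sideFileBytes);
        long last = -1;
        while (buf.remaining() >= Long.BYTES) {
            last = buf.getLong();
        }
        return last;
    }

    // Prefers the side-file length when present; otherwise falls back to the
    // length derived from block information, which may lag behind after hflush.
    public static long effectiveLength(long blockReportedLength, byte[] sideFileBytes) {
        long flushed = lastFlushLength(sideFileBytes);
        return flushed >= 0 ? flushed : blockReportedLength;
    }

    public static void main(String[] args) {
        // The writer hflush'd three times; block information still reports 512 bytes.
        byte[] sideFile = sideFileOf(100L, 2048L, 4096L);
        System.out.println(effectiveLength(512L, sideFile));
    }
}
```

With no side file entry available, the sketch falls back to the block-reported length, which matches the behaviour for files that were not written with hflush.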