szlta opened a new pull request #1576:
URL: https://github.com/apache/hive/pull/1576


   In an HDFS environment, if a writer uses hflush to write ORC ACID files 
during a transaction commit, the committed rows may appear to be missing when 
the table is read before the file has been completely persisted to disk (that 
is, synced).
   
   This happens because hflush does not persist the new buffers to disk; it only 
ensures that new readers can see the new content. As a result, the block 
information that BISplitStrategy relies on is incomplete. Although the 
side file (_flush_length) tracks the proper end of the file that is being 
written, this information is ignored in favour of the block information, and 
we may end up generating a very short split instead of one that covers the 
larger, available length.
   When ETLSplitStrategy is used, there is no attempt at all to consult the ACID 
side file when calculating the file length, so that needs to be fixed too.
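
   To illustrate the idea of preferring the side file over block information, here is a minimal, self-contained sketch. It is not the code in this PR. It assumes the side file is a sequence of 8-byte big-endian longs in which the last complete value is the current logical file length. The class and method names (SideFileLength, lastFlushLength, effectiveLength) are hypothetical, and the sketch works on an in-memory byte array rather than on HDFS.

```java
import java.nio.ByteBuffer;

public class SideFileLength {

    /**
     * Returns the last complete 8-byte length recorded in the side file
     * contents, or -1 if no complete value is present.
     * Assumption: values are big-endian longs appended one after another.
     */
    static long lastFlushLength(byte[] sideFileBytes) {
        ByteBuffer buf = ByteBuffer.wrap(sideFileBytes); // big-endian by default
        long result = -1;
        // Ignore a trailing partial write shorter than 8 bytes.
        while (buf.remaining() >= Long.BYTES) {
            result = buf.getLong();
        }
        return result;
    }

    /**
     * Chooses the length to use when building a split: the side file value
     * when one is available, otherwise the length reported by block info.
     */
    static long effectiveLength(long blockReportedLength, long sideFileLength) {
        return sideFileLength >= 0 ? sideFileLength : blockReportedLength;
    }

    public static void main(String[] args) {
        // Two flushes recorded: 100 bytes, then 250 bytes.
        byte[] side = ByteBuffer.allocate(16).putLong(100L).putLong(250L).array();
        long blockLength = 40L; // hypothetical stale length from block info
        System.out.println(effectiveLength(blockLength, lastFlushLength(side)));
    }
}
```

   With a stale block-reported length of 40 and a side file that records 250, the sketch picks 250, so the split would cover the flushed data rather than stopping short.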
   
   Moreover, newly committed rows might not appear because of OrcTail 
caching in ETLSplitStrategy. For now I'm just going to recommend turning that 
cache off for anyone who wants real-time row updates to be readable:
   
   set hive.orc.cache.stripe.details.mem.size=0;  
   ...as tweaking that code would probably open a can of worms.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org
For additional commands, e-mail: gitbox-h...@hive.apache.org
