The Hive ACID format uses a side file that holds a sequence of 8-byte 
file offsets, one for each completed file footer. If the side file is present, 
the last offset in it is passed to the reader, which treats that offset as the 
end of the file. 
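
As a rough sketch of reading such a side file (not Hive's actual code): the
function name is made up for illustration, and the big-endian signed 64-bit
encoding is an assumption, since the layout is only described above as a
sequence of 8-byte offsets. A trailing partial entry is ignored on the theory
that a crash may have interrupted the last write.

```python
import struct
from typing import Optional


def last_footer_offset(side_file_bytes: bytes) -> Optional[int]:
    """Return the last complete 8-byte offset in the side file, or None."""
    # Drop any trailing partial entry left behind by an interrupted write.
    usable = len(side_file_bytes) - (len(side_file_bytes) % 8)
    if usable == 0:
        return None
    # Assumption: offsets are stored as big-endian signed 64-bit integers.
    (offset,) = struct.unpack(">q", side_file_bytes[usable - 8:usable])
    return offset
```

The reader would then be told to treat the returned offset as the logical end
of the data file.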

If the side file is not available, searching for the string “\003ORC” works 
really well for finding the tails. In the corrupted files I've seen, I've never 
needed more than that. 
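
A minimal sketch of that backward search, assuming the file (or its last
chunk) has already been read into memory. The function name is hypothetical,
and the sketch only locates candidate positions: each one would still need its
footer parsed and validated before being trusted.

```python
from typing import Iterator

# The byte 0x03 followed by the ASCII characters "ORC".
MARKER = b"\x03ORC"


def candidate_tail_markers(data: bytes) -> Iterator[int]:
    """Yield the start offset of each marker occurrence, last one first."""
    end = len(data)
    while True:
        pos = data.rfind(MARKER, 0, end)
        if pos < 0:
            return
        yield pos
        # Continue the search strictly before the match just reported.
        end = pos
```

A recovery tool could walk the candidates in order and stop at the first one
whose tail parses cleanly.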

.. Owen

> On Jun 14, 2019, at 09:52, Xiening Dai <[email protected]> wrote:
> 
> Hi all,
> 
> In the ORC append scenario, the append operation (writing both the 
> additional data and the new footer) needs to be atomic. Otherwise, if it 
> fails partway through, the file tail becomes unrecognizable. Unfortunately, 
> not all file systems can guarantee atomic writes. When a failure does happen, 
> recovering the data written before the append requires locating the previous 
> file footer by searching backward, and the only way to search for the footer 
> is to look for the “ORC” magic string. But the current magic string has only 
> three characters, so the same string is likely to appear in user data, which 
> would result in parsing a wrong footer, with undefined behavior.
> 
> So I am wondering whether we can change the magic string to a 16-byte 
> UUID. That way we can safely use it to locate the footer. The idea is very 
> similar to the sync marker in Avro.
> 
> Thanks.
