Is this expected behavior of ORC acid writers?  If so, is it documented 
somewhere?

-dain

----
Dain Sundstrom
Co-founder @ Presto Software Foundation, Co-creator of Presto 
(https://prestosql.io)

> On Jun 14, 2019, at 6:17 PM, Owen O'Malley <[email protected]> wrote:
> 
> The Hive ACID format uses a side file that provides a sequence of 8-byte 
> file offsets for completed file footers. If the side file is there, the 
> last offset is passed to the reader, which treats it as the end of the file. 
> 
> In the case where you don't have that, searching for the string “\003ORC” 
> works really well for finding the tails. In the corrupted files I've seen 
> I've never needed more than that. 
> 
> .. Owen
> 
>> On Jun 14, 2019, at 09:52, Xiening Dai <[email protected]> wrote:
>> 
>> Hi all,
>> 
>> In the ORC append scenario, the append operation (writing both the 
>> additional data and the new footer) needs to be atomic. Otherwise, if it 
>> fails partway through, the file tail would be unrecognizable. Unfortunately, 
>> not all file systems can guarantee atomic writes. When a failure does 
>> happen, recovering the data written before the append requires locating 
>> the previous file footer by searching backward, and the only way to search 
>> for the footer is by looking for the “ORC” magic string. But the current 
>> magic string has only three characters, so the same string is likely to 
>> appear in user data, which would result in parsing a wrong footer, with 
>> undefined behavior.
>> 
>> So I am thinking we could change the magic string to a 16-byte UUID. 
>> This way we can safely use it to locate the footer. The idea is very 
>> similar to the sync marker in Avro.
>> 
>> Thanks.
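
For readers following the thread, here is a minimal sketch of the two recovery approaches Owen describes: taking the last offset from a side file, and scanning backward for "\003ORC". It is not taken from the ORC or Hive code. The function names are hypothetical, and the assumption that the side file stores its offsets as big-endian signed 64-bit integers is mine, not something stated in the thread.

```python
import struct
from typing import List, Optional

# The byte string Owen suggests searching for: 0x03 followed by "ORC".
MARKER = b"\x03ORC"


def last_footer_offset(side_file: bytes) -> Optional[int]:
    """Return the last complete 8-byte offset in the side file, or None.

    Assumption: offsets are big-endian signed 64-bit integers. A trailing
    partial record (fewer than 8 bytes) is ignored.
    """
    complete = len(side_file) // 8 * 8
    if complete == 0:
        return None
    (offset,) = struct.unpack(">q", side_file[complete - 8:complete])
    return offset


def candidate_tail_offsets(data: bytes) -> List[int]:
    """Return every offset where MARKER starts, from last to first."""
    offsets = []
    end = len(data)
    while True:
        pos = data.rfind(MARKER, 0, end)
        if pos < 0:
            break
        offsets.append(pos)
        # MARKER cannot overlap itself, so continue strictly before this match.
        end = pos
    return offsets
```

A caller would try each candidate from the end of the file backward and keep the first one whose tail parses cleanly. A match can still come from user data, which is the concern Xiening raises, so the parse check matters.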
