Is this expected behavior of ORC acid writers? If so, is it documented somewhere?
-dain

----
Dain Sundstrom
Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io)

> On Jun 14, 2019, at 6:17 PM, Owen O'Malley <[email protected]> wrote:
>
> The Hive ACID format uses a side file that provides a sequence of the 8-byte
> file offsets for completed file footers. If the file is there, it passes the
> last offset to the reader and it will treat that as the end of the file.
>
> In the case where you don't have that, searching for the string “\003ORC”
> works really well for finding the tails. In the corrupted files I've seen
> I've never needed more than that.
>
> .. Owen
>
>> On Jun 14, 2019, at 09:52, Xiening Dai <[email protected]> wrote:
>>
>> Hi all,
>>
>> In the ORC appending scenario, the append operation (including writing the
>> additional data and the new footer) needs to be atomic. Otherwise, if it
>> fails in between, the file tail would be unrecognizable. Unfortunately, not
>> all file systems can guarantee atomic writes. When a failure does happen, in
>> order to recover the data from before the append, we would need to locate the
>> previous file footer by searching backward. And the only way to search for
>> the footer is by looking for the “ORC” magic string. But the current magic
>> string has only three characters, and it is likely that the same string
>> appears in user data, which would result in parsing a wrong footer, and the
>> behavior is undefined.
>>
>> So I am thinking we could change the magic string into some 16-byte
>> UUID. This way we can safely use it to locate the footer. The idea is very
>> similar to the sync marker in Avro.
>>
>> Thanks.
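
For anyone following along, here is a rough Python sketch of the backward search for “\003ORC” that Owen describes. It is not taken from the ORC code base. The function name find_candidate_tails is made up for illustration, and the sketch assumes that each tail ends with those four bytes followed by a single postscript-length byte. Every offset it returns is only a candidate: as Xiening points out, user data can contain the same bytes, so a real recovery tool would still have to try parsing the postscript and footer at each candidate and reject the ones that fail.

```python
# Magic bytes as quoted in the thread: a length byte (3) followed by "ORC".
MAGIC = b"\x03ORC"


def find_candidate_tails(data: bytes) -> list[int]:
    """Return candidate end-of-tail offsets, scanning from the end backward.

    Assumption (not confirmed in this thread): one postscript-length byte
    follows the magic, so a tail would end at match_offset + len(MAGIC) + 1.
    """
    candidates = []
    pos = len(data)
    while True:
        idx = data.rfind(MAGIC, 0, pos)
        if idx < 0:
            break
        end = idx + len(MAGIC) + 1
        if end <= len(data):  # skip a match with no trailing length byte
            candidates.append(end)
        # Restrict the next search to matches that start before idx.
        pos = idx + len(MAGIC) - 1
    return candidates
```

With the Hive ACID side file present, none of this is needed, since the last recorded offset already marks the end of the last complete footer. The scan is only a fallback for files without it.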
