Thanks for the notes, Ryan. I really appreciate it. I'll take a look at it soon.
-----Original Message-----
From: Ryan Blue [mailto:[email protected]]
Sent: Friday, July 28, 2017 7:52 PM
To: Parquet Dev
Subject: Re: recovering parquet tables with corrupt footers

Mike,

You should be able to. From offset 4 until the end of the data, everything should be stored as a series of pages with no space in between. If you seek to offset 4 and then start reading pages, you should be able to recover all of them; see ParquetFileReader#readAllPages <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L862> for how we do this for row groups (a rough scan loop is sketched below).

Then it is just a matter of matching up pages to columns and using the methods in ParquetFileWriter <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L358> to write the pages out in the same order, calling startColumn / endColumn and startBlock / endBlock to delimit column chunks and row groups (see the rewrite sketch below).

I think the hard part will be mapping the pages to columns. To do that, there are a few good things to know:

* Dictionary pages always start a new column chunk.
* All pages in a column chunk use the same repetition-level encoding and definition-level encoding. If one changes, you know you have a new column.
* Uncompressed page sizes target 1MB by default. If you see a series of 1MB pages followed by a 300k page, that's probably the last page in a column chunk.
* Plain-encoded pages tend to have a consistent value count and size/value ratio. If a run of plain-encoded pages at 3 bytes/value suddenly changes to 10 bytes/value, that's probably a column boundary.
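A minimal sketch of that scan, assuming parquet-format's Util.readPageHeader and a Hadoop FileSystem. The class name and the bail-out-on-IOException handling are mine, and a real recovery tool would want stronger sanity checks on each header:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.Util;

public class PageScan {
  public static void main(String[] args) throws IOException {
    Path file = new Path(args[0]);
    Configuration conf = new Configuration();
    FileSystem fs = file.getFileSystem(conf);
    long fileLength = fs.getFileStatus(file).getLen();

    try (FSDataInputStream in = fs.open(file)) {
      long offset = 4; // skip the 4-byte "PAR1" magic at the start of the file
      while (offset < fileLength) {
        in.seek(offset);
        PageHeader header;
        try {
          header = Util.readPageHeader(in); // thrift header precedes each page body
        } catch (IOException e) {
          break; // hit the truncated/corrupt tail (or garbage); stop the scan
        }
        long bodyStart = in.getPos();
        System.out.println("offset " + offset + ": " + header.getType()
            + ", " + header.getCompressed_page_size() + " compressed bytes");
        offset = bodyStart + header.getCompressed_page_size(); // next header
      }
    }
  }
}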
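The heuristics above could be folded into something like the following guess at chunk boundaries. The method and the thresholds (for example, the 2x bytes-per-value jump) are assumptions for illustration, not anything in parquet-mr:

import org.apache.parquet.format.DataPageHeader;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.PageType;

public class BoundaryHeuristics {
  // Hypothetical heuristic: does `current` look like the start of a new
  // column chunk, given the page that preceded it?
  public static boolean likelyNewColumnChunk(PageHeader previous, PageHeader current) {
    // Rule 1: a dictionary page always starts a new column chunk.
    if (current.getType() == PageType.DICTIONARY_PAGE) {
      return true;
    }
    if (previous == null || previous.getType() != PageType.DATA_PAGE
        || current.getType() != PageType.DATA_PAGE) {
      return false; // this sketch only compares v1 data pages
    }
    DataPageHeader prev = previous.getData_page_header();
    DataPageHeader cur = current.getData_page_header();
    // Rule 2: all pages in a chunk share the same rl/dl encodings.
    if (prev.getRepetition_level_encoding() != cur.getRepetition_level_encoding()
        || prev.getDefinition_level_encoding() != cur.getDefinition_level_encoding()) {
      return true;
    }
    if (prev.getNum_values() == 0 || cur.getNum_values() == 0) {
      return false; // cannot compare ratios without value counts
    }
    // Rule 3: a sudden jump in bytes-per-value suggests a column boundary.
    double prevRatio = (double) previous.getUncompressed_page_size() / prev.getNum_values();
    double curRatio = (double) current.getUncompressed_page_size() / cur.getNum_values();
    return curRatio > 2 * prevRatio || prevRatio > 2 * curRatio;
  }
}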
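And a sketch of the rewrite step, assuming the recovered pages have already been grouped into chunks. The RecoveredPage/RecoveredChunk holders and the empty-statistics placeholder are made up for illustration; the ParquetFileWriter calls follow the startColumn / endColumn and startBlock / endBlock sequence described above:

import java.io.IOException;
import java.util.HashMap;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class FooterRebuilder {

  // Hypothetical container for one recovered page's body and header fields.
  public static class RecoveredPage {
    int valueCount;
    int uncompressedSize;     // from the page header
    BytesInput bytes;         // compressed page body, as read from the file
    Encoding rlEncoding;      // repetition-level encoding from the header
    Encoding dlEncoding;      // definition-level encoding from the header
    Encoding valuesEncoding;  // values encoding from the header
  }

  // Hypothetical container for one recovered column chunk.
  public static class RecoveredChunk {
    ColumnDescriptor descriptor;
    long totalValueCount;
    List<RecoveredPage> pages;
  }

  public static void rewrite(Configuration conf, MessageType schema, Path out,
                             long rowCount, List<RecoveredChunk> chunks,
                             CompressionCodecName codec) throws IOException {
    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, out);
    writer.start();
    writer.startBlock(rowCount);  // one recovered row group
    for (RecoveredChunk chunk : chunks) {
      writer.startColumn(chunk.descriptor, chunk.totalValueCount, codec);
      // A chunk's dictionary page, if any, would be re-emitted first via
      // writeDictionaryPage; this sketch only handles data pages.
      for (RecoveredPage page : chunk.pages) {
        // Re-emit each page in its original order; the writer accumulates
        // the metadata it needs to build a fresh footer.
        writer.writeDataPage(page.valueCount, page.uncompressedSize, page.bytes,
            Statistics.getStatsBasedOnType(
                chunk.descriptor.getType()),  // empty stats placeholder
            page.rlEncoding, page.dlEncoding, page.valuesEncoding);
      }
      writer.endColumn();
    }
    writer.endBlock();
    writer.end(new HashMap<String, String>());  // writes the new footer + magic
  }
}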
rb

On Fri, Jul 28, 2017 at 2:27 PM, Katelman, Michael <[email protected]> wrote:

> Hi,
>
> Is there a way (straightforward or not so straightforward) to recover
> fully written row groups from a parquet file that wasn't closed correctly?
> If it helps, assume the schema is known. Thanks.
>
> -Mike

--
Ryan Blue
Software Engineer
Netflix
