Thanks for the notes, Ryan. I really appreciate it. I'll take a look at it soon.
-----Original Message-----
From: Ryan Blue [mailto:[email protected]]
Sent: Friday, July 28, 2017 7:52 PM
To: Parquet Dev
Subject: Re: recovering parquet tables with corrupt footers

Mike,

You should be able to. From offset 4 until the end of the data, everything should be stored as a series of pages with no space in between. If you seek to offset 4 and then start reading pages, you should be able to recover all of them; see ParquetFileReader#readAllPages <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L862> for how we do this for row groups (a rough scan loop is sketched below).

Then it is just a matter of matching up pages to columns and using the methods in ParquetFileWriter <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L358> to write the pages out in the same order, calling startColumn / endColumn and startBlock / endBlock to delimit column chunks and row groups (see the rewrite sketch below).

I think the hard part will be mapping the pages to columns. To do that, there are a few good things to know:

* Dictionary pages always start a new column chunk.
* All pages in a column chunk use the same repetition-level encoding and definition-level encoding. If one changes, you know you have a new column.
* Uncompressed page sizes target 1MB by default. If you see a series of 1MB pages followed by a 300k page, that's probably the last page in a column chunk.
* Plain-encoded pages tend to have a consistent value count and size/value ratio. If a run of plain-encoded pages at 3 bytes/value suddenly changes to 10 bytes/value, that's probably a column boundary.
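A minimal sketch of that scan, assuming parquet-format's Util.readPageHeader and a Hadoop FileSystem. The class name and the bail-out-on-IOException handling are mine, and a real recovery tool would want stronger sanity checks on each header:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.Util;

public class PageScan {
  public static void main(String[] args) throws IOException {
    Path file = new Path(args[0]);
    Configuration conf = new Configuration();
    FileSystem fs = file.getFileSystem(conf);
    long fileLength = fs.getFileStatus(file).getLen();

    try (FSDataInputStream in = fs.open(file)) {
      long offset = 4; // skip the 4-byte "PAR1" magic at the start of the file
      while (offset < fileLength) {
        in.seek(offset);
        PageHeader header;
        try {
          header = Util.readPageHeader(in); // thrift header precedes each page body
        } catch (IOException e) {
          break; // hit the truncated/corrupt tail (or garbage); stop the scan
        }
        long bodyStart = in.getPos();
        System.out.println("offset " + offset + ": " + header.getType()
            + ", " + header.getCompressed_page_size() + " compressed bytes");
        offset = bodyStart + header.getCompressed_page_size(); // next header
      }
    }
  }
}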
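The heuristics above could be folded into something like the following guess at chunk boundaries. The method and the thresholds (for example, the 2x bytes-per-value jump) are assumptions for illustration, not anything in parquet-mr:

import org.apache.parquet.format.DataPageHeader;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.PageType;

public class BoundaryHeuristics {
  // Hypothetical heuristic: does `current` look like the start of a new
  // column chunk, given the page that preceded it?
  public static boolean likelyNewColumnChunk(PageHeader previous, PageHeader current) {
    // Rule 1: a dictionary page always starts a new column chunk.
    if (current.getType() == PageType.DICTIONARY_PAGE) {
      return true;
    }
    if (previous == null || previous.getType() != PageType.DATA_PAGE
        || current.getType() != PageType.DATA_PAGE) {
      return false; // this sketch only compares v1 data pages
    }
    DataPageHeader prev = previous.getData_page_header();
    DataPageHeader cur = current.getData_page_header();
    // Rule 2: all pages in a chunk share the same rl/dl encodings.
    if (prev.getRepetition_level_encoding() != cur.getRepetition_level_encoding()
        || prev.getDefinition_level_encoding() != cur.getDefinition_level_encoding()) {
      return true;
    }
    if (prev.getNum_values() == 0 || cur.getNum_values() == 0) {
      return false; // cannot compare ratios without value counts
    }
    // Rule 3: a sudden jump in bytes-per-value suggests a column boundary.
    double prevRatio = (double) previous.getUncompressed_page_size() / prev.getNum_values();
    double curRatio = (double) current.getUncompressed_page_size() / cur.getNum_values();
    return curRatio > 2 * prevRatio || prevRatio > 2 * curRatio;
  }
}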
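And a sketch of the rewrite step, assuming the recovered pages have already been grouped into chunks. The RecoveredPage/RecoveredChunk holders and the empty-statistics placeholder are made up for illustration; the ParquetFileWriter calls follow the startColumn / endColumn and startBlock / endBlock sequence described above:

import java.io.IOException;
import java.util.HashMap;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class FooterRebuilder {

  // Hypothetical container for one recovered page's body and header fields.
  public static class RecoveredPage {
    int valueCount;
    int uncompressedSize;     // from the page header
    BytesInput bytes;         // compressed page body, as read from the file
    Encoding rlEncoding;      // repetition-level encoding from the header
    Encoding dlEncoding;      // definition-level encoding from the header
    Encoding valuesEncoding;  // values encoding from the header
  }

  // Hypothetical container for one recovered column chunk.
  public static class RecoveredChunk {
    ColumnDescriptor descriptor;
    long totalValueCount;
    List<RecoveredPage> pages;
  }

  public static void rewrite(Configuration conf, MessageType schema, Path out,
                             long rowCount, List<RecoveredChunk> chunks,
                             CompressionCodecName codec) throws IOException {
    ParquetFileWriter writer = new ParquetFileWriter(conf, schema, out);
    writer.start();
    writer.startBlock(rowCount);  // one recovered row group
    for (RecoveredChunk chunk : chunks) {
      writer.startColumn(chunk.descriptor, chunk.totalValueCount, codec);
      // A chunk's dictionary page, if any, would be re-emitted first via
      // writeDictionaryPage; this sketch only handles data pages.
      for (RecoveredPage page : chunk.pages) {
        // Re-emit each page in its original order; the writer accumulates
        // the metadata it needs to build a fresh footer.
        writer.writeDataPage(page.valueCount, page.uncompressedSize, page.bytes,
            Statistics.getStatsBasedOnType(
                chunk.descriptor.getType()),  // empty stats placeholder
            page.rlEncoding, page.dlEncoding, page.valuesEncoding);
      }
      writer.endColumn();
    }
    writer.endBlock();
    writer.end(new HashMap<String, String>());  // writes the new footer + magic
  }
}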
rb

On Fri, Jul 28, 2017 at 2:27 PM, Katelman, Michael <[email protected]> wrote:

> Hi,
>
> Is there a way (straightforward or not so straightforward) to recover
> fully written row groups from a parquet file that wasn't closed correctly?
> If it helps, assume the schema is known. Thanks.
>
> -Mike

--
Ryan Blue
Software Engineer
Netflix
