Mike,

You should be able to. From offset 4 to the end of the data,
everything should be stored as a series of pages with no space in
between. If you seek to offset 4 and then start reading pages, you should be
able to recover all of them; see ParquetFileReader#readAllPages
<https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L862>
for how we do this for row groups. Then it is just a matter of matching up
pages to columns and using the methods in ParquetFileWriter
<https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L358>
to write the pages out in the same order, calling startColumn /
endColumn and startBlock / endBlock to delimit column chunks and row groups.
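
To make the scanning step concrete, here is a rough Python sketch of the
loop, not the parquet-mr implementation. It assumes you supply a header
decoder: real page headers are Thrift-encoded, and the read_header callback,
its return shape, and the toy_header format below are all made up for
illustration. The only format facts it relies on are the 4-byte PAR1 magic
and pages stored back to back.

```python
from typing import Callable, List, Optional, Tuple

MAGIC = b"PAR1"  # 4-byte magic at the start of every Parquet file

# Hypothetical decoder signature: given the buffer and an offset, return
# (header_length, compressed_page_size, header_object), or None if the bytes
# at that offset do not form a complete header.
HeaderReader = Callable[[bytes, int], Optional[Tuple[int, int, object]]]


def scan_pages(data: bytes, read_header: HeaderReader) -> List[Tuple[int, object]]:
    """Walk back-to-back pages from offset 4; stop at the first incomplete one."""
    if data[:4] != MAGIC:
        raise ValueError("not a Parquet file: missing PAR1 magic")
    pages = []
    pos = 4
    while pos < len(data):
        parsed = read_header(data, pos)
        if parsed is None:
            break  # truncated or unreadable header: end of recoverable data
        header_len, body_len, header = parsed
        end = pos + header_len + body_len
        if end > len(data):
            break  # page body was cut off by the failed write
        pages.append((pos, header))
        pos = end
    return pages


# Toy stand-in for the Thrift decoder, for demonstration only:
# a 1-byte body length followed by a 1-byte page tag.
def toy_header(data: bytes, pos: int):
    if pos + 2 > len(data):
        return None
    return 2, data[pos], chr(data[pos + 1])


# Two complete pages followed by one whose body is cut short.
sample = (MAGIC
          + bytes([3, ord("D")]) + b"abc"
          + bytes([2, ord("P")]) + b"xy"
          + bytes([5, ord("P")]) + b"zz")
recovered = scan_pages(sample, toy_header)
```

The loop drops the trailing partial page, which is what you want for a file
that was not closed: only pages whose header and body are both fully present
are kept.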

I think the hard part will be mapping the pages to columns. To do that,
there are a few good things to know:

* Dictionary pages always start a new column chunk
* All pages in a column chunk will use the same repetition-level encoding
and definition-level encoding. If either changes, you know you have reached
a new column chunk
* Uncompressed page sizes target 1MB by default. If you see a series of 1MB
pages followed by a 300k page, that's probably the last page in a column
chunk
* Plain-encoded pages will tend to have a consistent value count and
size/value ratio. If you have a few plain-encoded pages of 3 bytes/value
that changes suddenly to 10 bytes/value, that's probably a column boundary
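
The hints above could be combined into a boundary detector along these
lines. This is only a sketch: PageInfo and its field names are hypothetical
summaries of what you would pull out of each page header, and the two
thresholds (ratio_jump, short_page) are arbitrary guesses you would need to
tune against your own data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """Hypothetical summary of one page header (field names are illustrative)."""
    is_dictionary: bool
    uncompressed_size: int
    value_count: int = 0
    encoding: str = "PLAIN"
    rl_encoding: Optional[str] = None
    dl_encoding: Optional[str] = None


def starts_new_chunk(prev: List[PageInfo], page: PageInfo,
                     ratio_jump: float = 2.0, short_page: float = 0.5) -> bool:
    """Guess whether `page` begins a new column chunk, given the current chunk."""
    if not prev:
        return False
    if page.is_dictionary:
        return True  # dictionary pages always start a column chunk
    data_pages = [p for p in prev if not p.is_dictionary]
    if not data_pages:
        return False  # first data page right after its dictionary page
    last = data_pages[-1]
    # Level encodings are constant within a chunk.
    if (page.rl_encoding, page.dl_encoding) != (last.rl_encoding, last.dl_encoding):
        return True
    # A markedly short page after a larger one was probably the chunk's last.
    if (len(data_pages) >= 2
            and last.uncompressed_size < short_page * data_pages[-2].uncompressed_size):
        return True
    # Plain-encoded pages: a sudden jump in bytes/value suggests a new column.
    if (page.encoding == "PLAIN" and last.encoding == "PLAIN"
            and page.value_count and last.value_count):
        a = page.uncompressed_size / page.value_count
        b = last.uncompressed_size / last.value_count
        if min(a, b) > 0 and max(a, b) / min(a, b) >= ratio_jump:
            return True
    return False


def group_into_chunks(pages: List[PageInfo]) -> List[List[PageInfo]]:
    chunks: List[List[PageInfo]] = []
    for page in pages:
        if not chunks or starts_new_chunk(chunks[-1], page):
            chunks.append([page])
        else:
            chunks[-1].append(page)
    return chunks


def data(size, count, enc="PLAIN"):
    return PageInfo(False, size, count, enc, "RLE", "RLE")


pages = [
    PageInfo(True, 2_000),                      # dictionary page
    data(1_000_000, 500_000, "RLE_DICTIONARY"),
    data(300_000, 150_000, "RLE_DICTIONARY"),   # short page: likely last in chunk
    data(1_000_000, 250_000),                   # 4 bytes/value
    data(1_000_000, 250_000),
    data(1_000_000, 100_000),                   # 10 bytes/value: likely new column
    PageInfo(True, 2_000),                      # dictionary page
    data(500_000, 250_000, "RLE_DICTIONARY"),
]
chunks = group_into_chunks(pages)
```

Since the schema is known, the number of chunks found per row group should
be a multiple of the number of leaf columns, which gives a quick sanity check
on whatever thresholds you settle on.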

rb


On Fri, Jul 28, 2017 at 2:27 PM, Katelman, Michael <
[email protected]> wrote:

> Hi,
>
> Is there a way (straightforward or not so straightforward) to recover
> fully written row groups from a parquet file that wasn't closed correctly?
> If it helps, assume the schema is known. Thanks.
>
> -Mike


-- 
Ryan Blue
Software Engineer
Netflix
