Mike,

You should be able to. From offset 4 until the end of the data, everything should be stored as a series of pages with no space in between. If you seek to offset 4 and start reading pages, you should be able to recover all of them; see ParquetFileReader#readAllPages <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L862> for how we do this for row groups. Then it is just a matter of matching up pages to columns and using the methods in ParquetFileWriter <https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L358> to write the pages out in the same order, calling startColumn / endColumn and startBlock / endBlock to delimit column chunks and row groups.
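
For reference, offset 4 is where the data starts because a Parquet file begins with the 4-byte magic PAR1; a correctly closed file also ends with the footer metadata, a 4-byte little-endian footer length, and the same magic. A minimal sketch (Python only for brevity; the function name inspect_parquet is made up for this example, it is not part of parquet-mr) that checks whether the footer is present and reports the byte range that would have to be scanned for pages:

```python
import os
import struct

MAGIC = b"PAR1"  # 4-byte magic at the start (and, if closed, the end) of a Parquet file


def inspect_parquet(path):
    """Return (has_header_magic, has_footer, data_start, data_end).

    data_start is always 4 (just past the leading magic). data_end is the
    start of the footer when one is present, otherwise the file size. In an
    unclosed file the last bytes before data_end may be a partial page.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        has_header = f.read(4) == MAGIC
        has_footer = False
        data_end = size
        # Smallest closed file: magic + footer length + magic = 12 bytes.
        if size >= 12:
            f.seek(size - 8)
            tail = f.read(8)
            footer_len = struct.unpack("<I", tail[:4])[0]
            if tail[4:] == MAGIC and footer_len <= size - 12:
                has_footer = True
                data_end = size - 8 - footer_len
    return has_header, has_footer, 4, data_end


# Example use (the path is hypothetical):
#   has_header, has_footer, start, end = inspect_parquet("broken.parquet")
#   if has_header and not has_footer: scan pages in [start, end)
```

If has_footer comes back False, the file was not closed and the pages between data_start and data_end are what there is to recover.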

I think the hard part will be mapping the pages to columns. To do that, there are a few good things to know:

* Dictionary pages always start a new column chunk.
* All pages in a column chunk use the same repetition-level encoding and definition-level encoding. If either one changes, you know you have a new column.
* Uncompressed page sizes target 1MB by default. If you see a series of 1MB pages followed by a 300k page, that's probably the last page in a column chunk.
* Plain-encoded pages tend to have a consistent value count and size/value ratio. If you have a few plain-encoded pages at 3 bytes/value that suddenly change to 10 bytes/value, that's probably a column boundary.

rb

On Fri, Jul 28, 2017 at 2:27 PM, Katelman, Michael <[email protected]> wrote:

> Hi,
>
> Is there a way (straightforward or not so straightforward) to recover
> fully written row groups from a parquet file that wasn't closed correctly?
> If it helps, assume the schema is known. Thanks.
>
> -Mike

--
Ryan Blue
Software Engineer
Netflix
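
A rough sketch of how the page-to-column hints above could be combined, in Python only to keep it short. The Page record and its field names, the 0.9 and 0.5 size thresholds, and the ratio_jump cutoff are all made up for illustration; none of this is parquet-mr API, and real page headers would first have to be decoded from the file. Treat the output as a guess to be checked against the known schema.

```python
from dataclasses import dataclass
from typing import List, Optional

TARGET_PAGE_SIZE = 1024 * 1024  # default uncompressed page size target (1MB)


@dataclass
class Page:
    # Hypothetical, simplified view of an already-decoded page header.
    kind: str                   # "dictionary" or "data"
    encoding: str               # value encoding, e.g. "PLAIN"
    rl_encoding: Optional[str]  # repetition-level encoding (data pages only)
    dl_encoding: Optional[str]  # definition-level encoding (data pages only)
    uncompressed_size: int      # bytes
    num_values: int


def bytes_per_value(page: Page) -> float:
    return page.uncompressed_size / max(page.num_values, 1)


def starts_new_chunk(chunk: List[Page], page: Page, ratio_jump: float) -> bool:
    """Guess whether `page` begins a new column chunk after `chunk`."""
    prev = chunk[-1]
    if page.kind == "dictionary":
        return True   # dictionary pages always start a column chunk
    if prev.kind == "dictionary":
        return False  # first data page after a dictionary stays with it
    if (prev.rl_encoding, prev.dl_encoding) != (page.rl_encoding, page.dl_encoding):
        return True   # level encodings are constant within a chunk
    # A short page after at least one full-size page was probably the last one.
    saw_full = any(p.kind == "data"
                   and p.uncompressed_size >= 0.9 * TARGET_PAGE_SIZE
                   for p in chunk[:-1])
    if saw_full and prev.uncompressed_size < 0.5 * TARGET_PAGE_SIZE:
        return True
    # Sudden change in bytes/value between plain-encoded pages.
    if prev.encoding == "PLAIN" and page.encoding == "PLAIN":
        a, b = bytes_per_value(prev), bytes_per_value(page)
        if max(a, b) > ratio_jump * min(a, b):
            return True
    return False


def split_chunks(pages: List[Page], ratio_jump: float = 2.0) -> List[List[Page]]:
    """Group pages, in file order, into guessed column chunks."""
    chunks: List[List[Page]] = []
    for page in pages:
        if not chunks or starts_new_chunk(chunks[-1], page, ratio_jump):
            chunks.append([page])
        else:
            chunks[-1].append(page)
    return chunks


MB = TARGET_PAGE_SIZE
pages = [
    Page("dictionary", "PLAIN", None, None, 50_000, 5_000),
    Page("data", "PLAIN_DICTIONARY", "RLE", "RLE", MB, 100_000),
    Page("data", "PLAIN_DICTIONARY", "RLE", "RLE", 300_000, 30_000),
    Page("data", "PLAIN", "RLE", "RLE", MB, 349_525),         # ~3 bytes/value
    Page("data", "PLAIN", "RLE", "RLE", MB, 349_525),
    Page("data", "PLAIN", "RLE", "RLE", MB, 104_857),         # ~10 bytes/value
    Page("data", "PLAIN", "BIT_PACKED", "RLE", MB, 104_857),  # level encoding changed
]
chunks = split_chunks(pages)
print([len(c) for c in chunks])  # prints [3, 2, 1, 1]
```

Each rule maps to one hint, checked from most to least reliable; with a known schema, the number of guessed chunks per row group should come out as a multiple of the column count, which is a useful sanity check.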
