alkis commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1609491812


##########
README.md:
##########
@@ -107,12 +113,97 @@ start locations.  More details on what is contained in the metadata can be found
 in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
+This is especially useful when writing to backends such as S3.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read sequentially.
 
  ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>
+...
+<File-level Column N Metadata v3>
+
+File Metadata v3
+4-byte length in bytes of File Metadata v3 (little endian)
+4-byte magic number "PAR3"
+
+File Metadata
+4-byte length in bytes of File Metadata (little endian)
+4-byte magic number "PAR1"
+```
+
+Unlike the legacy File Metadata, the File Metadata v3 is designed to be light-weight
+to decode, regardless of the number of columns in the file. Individual column
+metadata can be opportunistically decoded depending on actual needs.
+
+This file structure is backwards-compatible. Parquet 1 readers will read and
+decode the legacy File Metadata in the file footer, while Parquet 3 readers
+will notice the "PAR3" magic number just before the File Metadata and will
+instead read and decode the File Metadata v3.

Review Comment:
   > you can also optimistically read, say, the last 4 MiB of the file and 
iterate from that
   
   Good engines speculatively fetch the tail of the file so that the footer size and the footer itself arrive in a single fetch. A single fetch matters because fetching the footer is on the critical path. I am not aware of any engine implementation that does 4MB tail reads from object stores for Parquet.
   
   - 512KB S3 fetch is ~20ms.
   - 2MB S3 fetch is ~30ms.
   - 4MB S3 fetch is ~50ms.
   - 8MB S3 fetch is ~100ms.
   
   A good engine does a 256-512KB tail fetch because most footers are small, and shrinking the fetch below that size does not reduce the latency to get the bytes.
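   To make the single-fetch point concrete, here is a minimal sketch of how a reader might check whether a speculative tail fetch already covers the legacy footer. It relies only on the legacy trailer shown in the layout above (4-byte little-endian footer length followed by "PAR1"). The function name `locate_footer`, its return shape, and the 512KB default are assumptions for illustration, not any engine's actual code.

```python
import struct

MAGIC = b"PAR1"
TAIL_FETCH_BYTES = 512 * 1024  # assumed speculative tail size, not a standard


def locate_footer(tail: bytes, file_size: int):
    """Given the last len(tail) bytes of a file, return
    (footer_offset, footer_len, missing_bytes).

    missing_bytes > 0 means the footer starts before the fetched tail,
    so a second fetch is needed.
    """
    if len(tail) < 8 or tail[-4:] != MAGIC:
        raise ValueError("tail does not end with the PAR1 magic number")
    # 4-byte little-endian footer length sits just before the magic.
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    footer_offset = file_size - 8 - footer_len
    if footer_offset < 0:
        raise ValueError("footer length exceeds file size")
    tail_offset = file_size - len(tail)  # file offset of tail[0]
    missing_bytes = max(0, tail_offset - footer_offset)
    return footer_offset, footer_len, missing_bytes
```

   When `missing_bytes` is zero the footer can be parsed straight from the tail buffer; otherwise the reader pays for a second round trip, which is the case the latency comparison in this comment is about.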
   
   For reference, parsing a 4MB footer (wide table) takes ~40ms.
   
   If an engine leaves the tail fetch as is to avoid pessimizing reads for narrow tables, it will have to do two fetches for wide tables. Assuming the new footer is typically 2x smaller than the old one:
   - old: 20ms 256KB fetch + 50ms 4MB fetch + 40ms parse --> 110ms
   - new before old: 20ms 256KB fetch + 80ms 6MB fetch + 2ms parse --> 102ms (1.1x speedup)
   - new after old: 20ms 256KB fetch + 30ms 2MB fetch + 2ms parse --> 52ms (2.1x speedup)
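   The arithmetic behind these three cases can be reproduced with a short sketch. All latencies are the figures quoted in this comment; the dictionary layout and names are illustrative assumptions, and the note that the 6MB fetch covers both footers is my reading of the numbers.

```python
# Latencies in milliseconds, taken from the figures quoted in this comment.
FIRST_TAIL_FETCH_MS = 20  # speculative 256KB tail fetch

# (second fetch ms, parse ms) per footer layout; labels are illustrative.
LAYOUTS = {
    "old": (50, 40),            # 4MB fetch, 40ms parse
    "new before old": (80, 2),  # 6MB fetch (presumably old 4MB + new 2MB)
    "new after old": (30, 2),   # 2MB fetch of the new footer only
}


def total_ms(second_fetch_ms, parse_ms):
    """Time to a decoded footer: tail fetch + follow-up fetch + parse."""
    return FIRST_TAIL_FETCH_MS + second_fetch_ms + parse_ms


totals = {name: total_ms(*cost) for name, cost in LAYOUTS.items()}
speedups = {name: totals["old"] / ms for name, ms in totals.items()}

for name in LAYOUTS:
    print(f"{name}: {totals[name]}ms, {speedups[name]:.2f}x vs old")
```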
   
   If an engine increases the tail fetch, it will pessimize reads for narrow 
tables, with little if any benefit for wide tables.
   
   From the above analysis, and assuming we want to make Parquet efficient for data lakes (where Parquet is stored in S3-like object storage), having the new footer after the old is a *requirement*.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

