alkis commented on code in PR #242:
URL: https://github.com/apache/parquet-format/pull/242#discussion_r1609861669


##########
README.md:
##########
@@ -107,12 +113,97 @@ start locations.  More details on what is contained in the metadata can be found
 in the Thrift definition.
 
 Metadata is written after the data to allow for single pass writing.
+This is especially useful when writing to backends such as S3.
 
 Readers are expected to first read the file metadata to find all the column
 chunks they are interested in.  The columns chunks should then be read sequentially.
 
  ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)
 
+### Parquet 3
+
+Parquet 3 files have the following overall structure:
+
+```
+4-byte magic number "PAR1"
+4-byte magic number "PAR3"
+
+<Column 1 Chunk 1 + Column Metadata>
+<Column 2 Chunk 1 + Column Metadata>
+...
+<Column N Chunk 1 + Column Metadata>
+<Column 1 Chunk 2 + Column Metadata>
+<Column 2 Chunk 2 + Column Metadata>
+...
+<Column N Chunk 2 + Column Metadata>
+...
+<Column 1 Chunk M + Column Metadata>
+<Column 2 Chunk M + Column Metadata>
+...
+<Column N Chunk M + Column Metadata>
+
+<File-level Column 1 Metadata v3>
+...
+<File-level Column N Metadata v3>
+
+File Metadata v3
+4-byte length in bytes of File Metadata v3 (little endian)
+4-byte magic number "PAR3"
+
+File Metadata
+4-byte length in bytes of File Metadata (little endian)
+4-byte magic number "PAR1"
+```
+
+Unlike the legacy File Metadata, the File Metadata v3 is designed to be light-weight
+to decode, regardless of the number of columns in the file. Individual column
+metadata can be opportunistically decoded depending on actual needs.
+
+This file structure is backwards-compatible. Parquet 1 readers will read and
+decode the legacy File Metadata in the file footer, while Parquet 3 readers
+will notice the "PAR3" magic number just before the File Metadata and will
+instead read and decode the File Metadata v3.
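
A minimal sketch of the footer probing described in the hunk above, assuming the layout proposed in this diff (it may change before merge). `locate_footers` and its return shape are hypothetical and not part of any Parquet library; it only finds byte ranges and does not decode Thrift.

```python
import struct

MAGIC_V1 = b"PAR1"
MAGIC_V3 = b"PAR3"


def locate_footers(buf: bytes):
    """Return ((legacy_start, legacy_end), v3_range_or_None) as byte offsets.

    Assumed trailing layout, per the proposal quoted above:
      <v3 metadata> <u32 LE v3 length> "PAR3"
      <legacy metadata> <u32 LE legacy length> "PAR1"
    """
    if len(buf) < 12 or buf[-4:] != MAGIC_V1:
        raise ValueError("missing trailing PAR1 magic")

    # Legacy footer: length sits in the 4 bytes before the trailing magic.
    (legacy_len,) = struct.unpack("<I", buf[-8:-4])
    legacy_end = len(buf) - 8
    legacy_start = legacy_end - legacy_len
    if legacy_start < 4:
        raise ValueError("legacy footer length exceeds file size")

    # A v3 reader checks for "PAR3" immediately before the legacy metadata.
    v3_range = None
    if legacy_start >= 16 and buf[legacy_start - 4:legacy_start] == MAGIC_V3:
        (v3_len,) = struct.unpack("<I", buf[legacy_start - 8:legacy_start - 4])
        v3_end = legacy_start - 8
        v3_start = v3_end - v3_len
        # Must not overlap the leading "PAR1" + "PAR3" magic bytes;
        # otherwise fall back to the legacy footer only.
        if v3_start >= 8:
            v3_range = (v3_start, v3_end)

    return (legacy_start, legacy_end), v3_range
```

A reader that only knows the legacy format uses the first range and never looks at the bytes before it, which is what makes the layout backwards-compatible.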

Review Comment:
   On S3 reads:
   
   S3 guidance favors larger reads because of cost: a read from S3 typically costs one API call, while bandwidth/transfer volume is free (same zone). An 8MB read therefore costs half as much per byte as a 4MB read. Each connection can go up to ~100MB/sec, so just transferring 4MB of data takes at least 25ms. If one is reading large files, doing 100s of 8MB reads in parallel will saturate any network card, which is typically what engines do.
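
A rough sketch of the arithmetic behind the read-size argument. The 1024MB file size is a hypothetical example, decimal megabytes are assumed, and cost is counted purely in GET requests (no real price is used).

```python
import math

MB = 1_000_000  # decimal megabytes, matching the "~100MB/sec" figure


def get_requests(file_bytes: int, read_bytes: int) -> int:
    """Number of ranged GETs needed to fetch the whole file."""
    return math.ceil(file_bytes / read_bytes)


def transfer_ms(read_bytes: int, bytes_per_sec: float = 100 * MB) -> float:
    """Pure transfer time of one read on a single connection, in ms."""
    return 1000 * read_bytes / bytes_per_sec


file_size = 1024 * MB  # hypothetical file size, for illustration only
for read_mb in (4, 8):
    n = get_requests(file_size, read_mb * MB)
    t = transfer_ms(read_mb * MB)
    print(f"{read_mb}MB reads: {n} requests, {t:.0f}ms transfer each")
```

Doubling the read size halves the request count for the same bytes, which is where the per-byte cost halves. At a flat 100MB/sec a 4MB read spends 40ms on transfer alone, consistent with the "at least 25ms" lower bound.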
   
   I posit that the vast majority of Parquet-encoded data is in some cloud somewhere. Hence working well with object stores (high latency, immutable files) is a requirement for any change. This is also lesson 4 from [An Empirical Evaluation of Columnar Storage Formats](https://arxiv.org/pdf/2304.05028).
   
   > With fetch latencies of that order, does the decode latency of the existing thrift payload even matter?
   
   Yes, it does. With the numbers above in mind:
   - cold reads: baseline 110ms (as above). Optimized metadata is down to 2MB and parses in 5ms, which translates to 20ms + 30ms + 5ms = 55ms --> 2x speedup
   - warm reads: footer bytes are cached on disk/s3express/something-with-low-latency. Baseline is 5ms to fetch + 40ms to parse = 45ms. Optimized, it takes 2ms to fetch + 5ms to parse = 7ms --> 6x speedup
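
The speedups can be checked with a few lines. This sketch only uses the figures quoted in the bullets; the breakdown of the 110ms cold baseline is not restated here, so it is taken as a single number.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


# Cold read: figures from the first bullet.
cold_baseline_ms = 110
cold_optimized_ms = 20 + 30 + 5  # 55ms

# Warm read: fetch + parse, from the second bullet.
warm_baseline_ms = 5 + 40  # 45ms
warm_optimized_ms = 2 + 5  # 7ms

print(f"cold: {speedup(cold_baseline_ms, cold_optimized_ms):.1f}x")
print(f"warm: {speedup(warm_baseline_ms, warm_optimized_ms):.1f}x")
```

In the warm case parsing is 40ms of the 45ms baseline, so decode time dominates once fetch latency is low.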
   
   


