r4ntix commented on issue #5942:
URL: https://github.com/apache/arrow-datafusion/issues/5942#issuecomment-1509831466

   > I wonder if running 
[parquet-layout](https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-layout.rs)
 against the parquet file might prove insightful.
   > 
   > DataFusion is currently limited to row group level parallelism, and there 
certainly are parquet writers that write very large row groups which would 
cause issues for this - 
[apache/arrow#34280](https://github.com/apache/arrow/issues/34280). Longer-term 
I would like to eventually get back to #2504 but that is not likely in the next 
couple of months.
   
   The flexibility of the Parquet format leads different writers to use different file generation strategies: the data in a Parquet file can be spread across row groups and pages using whatever encoding and compression the writer or user chooses.
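
   For reference, here is a minimal sketch of dumping a file's row-group layout with the Rust `parquet` crate (similar in spirit to what `parquet-layout` reports, though much less detailed); the file name is just a placeholder:

   ```rust
   use std::fs::File;

   use parquet::file::reader::{FileReader, SerializedFileReader};

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Placeholder path: point this at the Parquet file under test.
       let file = File::open("lineitem.parquet")?;
       let reader = SerializedFileReader::new(file)?;
       let meta = reader.metadata();

       println!("row groups: {}", meta.num_row_groups());
       for (i, rg) in meta.row_groups().iter().enumerate() {
           println!(
               "row group {i}: {} rows, {} columns, {} bytes compressed",
               rg.num_rows(),
               rg.num_columns(),
               rg.compressed_size()
           );
       }
       Ok(())
   }
   ```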
   
   If the physical layout of a Parquet file affects how different query engines `scan` it, should we introduce a standard TPC-H Parquet file and re-run the performance comparison?
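
   As a rough sketch of what such a standardized generator could pin down, the Rust snippet below fixes the rows per row group, the compression, and dictionary encoding via `WriterProperties` (the schema, data, and file names are made up for illustration; the builder exposes similar knobs for page sizes):

   ```rust
   use std::fs::File;
   use std::sync::Arc;

   use arrow::array::Int64Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::basic::Compression;
   use parquet::file::properties::WriterProperties;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Stand-in schema/data; a real generator would write the full TPC-H tables.
       let schema = Arc::new(Schema::new(vec![Field::new(
           "l_orderkey",
           DataType::Int64,
           false,
       )]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int64Array::from_iter_values(0i64..1_000_000))],
       )?;

       // Pin the layout-relevant settings so every file has the same shape:
       // bounded row groups give row-group-level readers something to parallelize.
       let props = WriterProperties::builder()
           .set_max_row_group_size(1_000_000) // rows per row group
           .set_compression(Compression::SNAPPY)
           .set_dictionary_enabled(true)
           .build();

       let out = File::create("lineitem_standard.parquet")?;
       let mut writer = ArrowWriter::try_new(out, schema, Some(props))?;
       writer.write(&batch)?;
       writer.close()?;
       Ok(())
   }
   ```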
   
   I also saw this issue discussed in this paper:
https://dl.gi.de/bitstream/handle/20.500.12116/40316/B3-1.pdf?sequence=1&isAllowed=y
   
   > we look at three different Parquet writers to show how much Parquet files differ even though they store the same data. Parquet Writer Comparison:
   >
   > | Generator          | Rows per Row Group | Pages per Row Group | File Sizes (SF1, SF10, SF100) |
   > | ------------------ | ------------------ | ------------------- | ----------------------------- |
   > | Spark              | 3,000,000          | 150                 | 192 MB, 2.1 GB, 20 GB         |
   > | Spark uncompressed | 3,000,000          | 150                 | 333 MB, 3.3 GB, 33 GB         |
   > | DuckDB             | 100,352            | 1                   | 281 MB, 2.8 GB, 28 GB         |
   > | Arrow              | 67,108,864         | 15 - 1800           | 189 MB, 2.0 GB, 20 GB         |
   >
   > For each generator, we measure the number of rows and the number of pages 
that are stored per row group. The Spark and DuckDB Parquet writers store a 
fixed number of elements per page and a fixed number of pages per row group. 
Since Parquet does not force synchronization between the column chunks, there 
are writers such as Arrow that do not store the same number of elements per 
page. Arrow uses a fixed data page size between roughly 0.5 MB and 1 MB. For 
DuckDB and Spark, the page sizes vary from 0.5 MB to 6 MB. 
   >
   > Even though we only cover three different Parquet writers, we have already 
observed two extremes. DuckDB and Arrow do not take advantage of the 
hierarchical data layout: DuckDB will only use one page per row group, and 
Arrow stores the entire dataset in one row group for scale factor 1 and 10 
since each row group stores 67 million rows.

