tustvold commented on code in PR #36027: URL: https://github.com/apache/arrow/pull/36027#discussion_r1225775542
########## docs/source/status.rst: ########## @@ -348,3 +348,107 @@ Notes: * \(1) Through JNI bindings. (Provided by ``org.apache.arrow.orc:arrow-orc``) * \(2) Through JNI bindings to Arrow C++ Datasets. (Provided by ``org.apache.arrow:arrow-dataset``) + + +Parquet format public API details +================================= + ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Format | C++ | Python | Java | Go | Rust | +| | | | | | | ++===========================================+=======+========+========+=======+=======+ +| Basic compression | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Brotli, LZ4, ZSTD | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| LZ4_RAW | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Hive-style partitioning | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| File metadata | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| RowGroup metadata | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Column metadata | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Chunk metadta | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Sorting column | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| ColumnIndex statistics | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Page statistics | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Statistics min_value | | | | | | 
++-------------------------------------------+-------+--------+--------+-------+-------+ +| xxHash based bloom filter | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| bloom filter length | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Modular encryption | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| External column data | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Nanosecond support | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| FIXED_LEN_BYTE_ARRAY | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Complete Delta encoding support | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Complete RLE support | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| BYTE_STREAM_SPLIT | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Partition pruning on the partition column | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| RowGroup pruning using statistics | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| RowGroup pruning using bloom filter | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Page pruning using projection pushdown | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Page pruning using statistics | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Page pruning using bloom filter | | | | | | 
++-------------------------------------------+-------+--------+--------+-------+-------+ +| Partition append / delete | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| RowGroup append / delete | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Page append / delete | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Page CRC32 checksum | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Parallel partition processing | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Parallel RowGroup processing | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Parallel Page processing | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Storage-aware defaults (1) | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Adaptive concurrency (2) | | | | | | ++-------------------------------------------+-------+--------+--------+-------+-------+ +| Adaptive IO when pruning used (3) | | | | | |
Review Comment: I'm not sure which Parquet reader these features are based on, but my 2 cents is that they indicate a problematic IO abstraction, one that relies on prefetching heuristics instead of pushing vectored IO down into the IO subsystem (which the Rust implementation and the proprietary Databricks implementation do).
