alamb commented on code in PR #101:
URL: https://github.com/apache/parquet-site/pull/101#discussion_r1954224860
##########
content/en/docs/File Format/implementationstatus.md:
##########
@@ -115,12 +115,12 @@ Implementations:

 | Format                                       | C++   | Java  | Go    | Rust  | cuDF  |
 | -------------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| External column data (1)                     | ✅    | ✅    |       |       | (W)   |
-| Row group "Sorting column" metadata (2)      | ✅    | ❌    |       |       | (W)   |
-| Row group pruning using statistics           | ❌    | ✅    |       |       | ✅    |
-| Row group pruning using bloom filter         | ❌    | ✅    |       |       | ✅    |
-| Reading select columns only                  | ✅    | ✅    |       |       | ✅    |
-| Page pruning using statistics                | ❌    | ✅    |       |       | ❌    |
+| External column data (1)                     | ✅    | ✅    |       | ❌    | (W)   |
+| Row group "Sorting column" metadata (2)      | ✅    | ❌    |       | ✅    | (W)   |
+| Row group pruning using statistics           | ❌    | ✅    |       | ✅    | ✅    |
+| Row group pruning using bloom filter         | ❌    | ✅    |       | ✅    | ✅    |
+| Reading select columns only                  | ✅    | ✅    |       | ✅    | ✅    |
+| Page pruning using statistics                | ❌    | ✅    |       | ✅    | ❌    |

Review Comment:
I agree we should mark parquet-rs as supporting pruning.

Specifically, this structure gets the statistics as arrow record batches (for either pages or row groups):
- https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/statistics/struct.StatisticsConverter.html

And then you can specify which row groups to read via:
- https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_groups
- https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderBuilder.html#method.with_row_selection

As @tustvold says, parquet-rs doesn't provide a way to evaluate an expression on those arrow arrays, but you can use a query engine (like DataFusion!) to do so.

##########
content/en/docs/File Format/implementationstatus.md:
##########
@@ -45,66 +45,66 @@ Implementations:

 | Data type                                 | C++   | Java  | Go    | Rust  | cuDF  |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| STRING                                    | ✅    | ✅    |       |       | ✅    |
-| ENUM                                      | ❌    | ✅    |       |       | ❌    |
-| UUID                                      | ❌    | ✅    |       |       | ❌    |
-| 8, 16, 32, 64 bit signed and unsigned INT | ✅    | ✅    |       |       | ✅    |
-| DECIMAL (INT32)                           | ✅    | ✅    |       |       | ✅    |
-| DECIMAL (INT64)                           | ✅    | ✅    |       |       | ✅    |
-| DECIMAL (BYTE_ARRAY)                      | ✅    | ✅    |       |       | ✅    |
-| DECIMAL (FIXED_LEN_BYTE_ARRAY)            | ✅    | ✅    |       |       | ✅    |
-| DATE                                      | ✅    | ✅    |       |       | ✅    |
-| TIME (INT32)                              | ✅    | ✅    |       |       | ✅    |
-| TIME (INT64)                              | ✅    | ✅    |       |       | ✅    |
-| TIMESTAMP (INT64)                         | ✅    | ✅    |       |       | ✅    |
-| INTERVAL                                  | ✅    | ✅(*) |       |       | ❌    |
-| JSON                                      | ✅    | ✅(*) |       |       | ❌    |
-| BSON                                      | ❌    | ✅(*) |       |       | ❌    |
-| LIST                                      | ✅    | ✅    |       |       | ✅    |
-| MAP                                       | ✅    | ✅    |       |       | ✅    |
-| UNKNOWN (always null)                     | ✅    | ✅    |       |       | ✅    |
-| FLOAT16                                   | ✅    | ✅(*) |       |       | ✅    |
+| STRING                                    | ✅    | ✅    |       | ✅    | ✅    |
+| ENUM                                      | ❌    | ✅    |       | ✅(*) | ❌    |
+| UUID                                      | ❌    | ✅    |       | ✅(*) | ❌    |
+| 8, 16, 32, 64 bit signed and unsigned INT | ✅    | ✅    |       | ✅    | ✅    |
+| DECIMAL (INT32)                           | ✅    | ✅    |       | ✅    | ✅    |
+| DECIMAL (INT64)                           | ✅    | ✅    |       | ✅    | ✅    |
+| DECIMAL (BYTE_ARRAY)                      | ✅    | ✅    |       | ✅    | ✅    |
+| DECIMAL (FIXED_LEN_BYTE_ARRAY)            | ✅    | ✅    |       | ✅    | ✅    |
+| DATE                                      | ✅    | ✅    |       | ✅    | ✅    |
+| TIME (INT32)                              | ✅    | ✅    |       | ✅    | ✅    |
+| TIME (INT64)                              | ✅    | ✅    |       | ✅    | ✅    |
+| TIMESTAMP (INT64)                         | ✅    | ✅    |       | ✅    | ✅    |
+| INTERVAL                                  | ✅    | ✅(*) |       | ✅    | ❌    |
+| JSON                                      | ✅    | ✅(*) |       | ✅(*) | ❌    |
+| BSON                                      | ❌    | ✅(*) |       | ✅(*) | ❌    |
+| LIST                                      | ✅    | ✅    |       | ✅    | ✅    |
+| MAP                                       | ✅    | ✅    |       | ✅    | ✅    |
+| UNKNOWN (always null)                     | ✅    | ✅    |       | ✅    | ✅    |
+| FLOAT16                                   | ✅    | ✅(*) |       | ✅    | ✅    |

 (*): Only supported to use its annotated physical type

 ### Encodings

 | Encoding                                  | C++   | Java  | Go    | Rust  | cuDF  |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| PLAIN                                     | ✅    | ✅    |       |       | ✅    |
-| PLAIN_DICTIONARY                          | ✅    | ✅    |       |       | ✅    |
-| RLE_DICTIONARY                            | ✅    | ✅    |       |       | ✅    |
-| RLE                                       | ✅    | ✅    |       |       | ✅    |
-| BIT_PACKED (deprecated)                   | ✅    | ✅    |       |       | (R)   |
-| DELTA_BINARY_PACKED                       | ✅    | ✅    |       |       | ✅    |
-| DELTA_LENGTH_BYTE_ARRAY                   | ✅    | ✅    |       |       | ✅    |
-| DELTA_BYTE_ARRAY                          | ✅    | ✅    |       |       | ✅    |
-| BYTE_STREAM_SPLIT                         | ✅    | ✅    |       |       | ✅    |
+| PLAIN                                     | ✅    | ✅    |       | ✅    | ✅    |
+| PLAIN_DICTIONARY                          | ✅    | ✅    |       | ✅    | ✅    |
+| RLE_DICTIONARY                            | ✅    | ✅    |       | ✅    | ✅    |
+| RLE                                       | ✅    | ✅    |       | ✅    | ✅    |
+| BIT_PACKED (deprecated)                   | ✅    | ✅    |       | ❌    | (R)   |
+| DELTA_BINARY_PACKED                       | ✅    | ✅    |       | ✅    | ✅    |
+| DELTA_LENGTH_BYTE_ARRAY                   | ✅    | ✅    |       | ✅    | ✅    |
+| DELTA_BYTE_ARRAY                          | ✅    | ✅    |       | ✅    | ✅    |
+| BYTE_STREAM_SPLIT                         | ✅    | ✅    |       | ✅    | ✅    |

 ### Compressions

 | Compression                               | C++   | Java  | Go    | Rust  | cuDF  |
 | ----------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| UNCOMPRESSED                              | ✅    | ✅    |       |       | ✅    |
-| BROTLI                                    | ✅    | ✅    |       |       | (R)   |
-| GZIP                                      | ✅    | ✅    |       |       | (R)   |
-| LZ4 (deprecated)                          | ✅    | ❌    |       |       | ❌    |
-| LZ4_RAW                                   | ✅    | ✅    |       |       | ✅    |
-| LZO                                       | ❌    | ❌    |       |       | ❌    |
-| SNAPPY                                    | ✅    | ✅    |       |       | ✅    |
-| ZSTD                                      | ✅    | ✅    |       |       | ✅    |
+| UNCOMPRESSED                              | ✅    | ✅    |       | ✅    | ✅    |
+| BROTLI                                    | ✅    | ✅    |       | ✅    | (R)   |
+| GZIP                                      | ✅    | ✅    |       | ✅    | (R)   |
+| LZ4 (deprecated)                          | ✅    | ❌    |       | ✅    | ❌    |
+| LZ4_RAW                                   | ✅    | ✅    |       | ✅    | ✅    |
+| LZO                                       | ❌    | ❌    |       | ❌    | ❌    |

Review Comment:
https://docs.rs/parquet/latest/parquet/basic/enum.Compression.html claims to support LZO.

However, I did some more digging and I agree that LZO does not appear to be supported:
https://github.com/apache/arrow-rs/blob/7781bc2170c84ada387901e09b2cdfe4235c3570/parquet/src/compression.rs#L195-L194

##########
content/en/docs/File Format/implementationstatus.md:
##########
@@ -115,12 +115,12 @@ Implementations:

 | Format                                       | C++   | Java  | Go    | Rust  | cuDF  |
 | -------------------------------------------- | ----- | ----- | ----- | ----- | ----- |
-| External column data (1)                     | ✅    | ✅    |       |       | (W)   |
-| Row group "Sorting column" metadata (2)      | ✅    | ❌    |       |       | (W)   |
-| Row group pruning using statistics           | ❌    | ✅    |       |       | ✅    |
-| Row group pruning using bloom filter         | ❌    | ✅    |       |       | ✅    |
-| Reading select columns only                  | ✅    | ✅    |       |       | ✅    |
-| Page pruning using statistics                | ❌    | ✅    |       |       | ❌    |
+| External column data (1)                     | ✅    | ✅    |       | ❌    | (W)   |
+| Row group "Sorting column" metadata (2)      | ✅    | ❌    |       | ✅    | (W)   |
+| Row group pruning using statistics           | ❌    | ✅    |       | ✅    | ✅    |
+| Row group pruning using bloom filter         | ❌    | ✅    |       | ✅    | ✅    |
+| Reading select columns only                  | ✅    | ✅    |       | ✅    | ✅    |
+| Page pruning using statistics                | ❌    | ✅    |       | ✅    | ❌    |

Review Comment:
BTW I wonder if we should propose adding a row for "predicate pushdown" (aka evaluating predicates based on scans) -- basically what https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/trait.ArrowPredicate.html provides
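To make the "row group pruning using statistics" row above concrete, here is a minimal, self-contained sketch of the pruning decision itself, in plain Rust. This is not the parquet-rs API: `RowGroupStats` and `prune_row_groups` are hypothetical names, and in practice the min/max values would come from something like `StatisticsConverter` while the resulting indices would be handed to `ArrowReaderBuilder::with_row_groups`.

```rust
// Hypothetical illustration of row group pruning with min/max statistics.
// A row group can be skipped only when the predicate value falls entirely
// outside its [min, max] range; any overlapping group must still be read.

/// Min/max statistics for one column in one row group (illustrative only).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Return the indices of row groups that *might* contain `value`
/// for an equality predicate `column = value`.
fn prune_row_groups(stats: &[RowGroupStats], value: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= value && value <= s.max)
        .map(|(idx, _)| idx)
        .collect()
}

fn main() {
    let stats = vec![
        RowGroupStats { min: 0, max: 99 },    // row group 0
        RowGroupStats { min: 100, max: 199 }, // row group 1
        RowGroupStats { min: 50, max: 150 },  // row group 2 (overlaps 0 and 1)
    ];
    // Predicate: column = 120. Only groups 1 and 2 can contain it.
    let keep = prune_row_groups(&stats, 120);
    println!("{:?}", keep); // [1, 2]
}
```

Note that statistics-based pruning is conservative: it can only prove a row group contains *no* matching rows, so the reader must still evaluate the predicate on the rows of the groups it keeps.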

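On the "predicate pushdown" row proposed in the last comment: the idea behind `ArrowPredicate` is that the reader evaluates a row-level filter on each batch as it is decoded, rather than materializing everything and filtering afterwards. A self-contained sketch of that shape, with hypothetical types standing in for the arrow-rs ones (a "batch" here is just a `Vec<i64>` rather than a `RecordBatch`):

```rust
// Hypothetical sketch of predicate pushdown during a scan. The real trait is
// parquet::arrow::arrow_reader::ArrowPredicate, which evaluates against Arrow
// RecordBatches and returns a BooleanArray mask; this mirrors only the shape.

/// A predicate evaluated against each decoded batch, producing a
/// keep/discard mask with one entry per row.
trait BatchPredicate {
    fn evaluate(&mut self, batch: &[i64]) -> Vec<bool>;
}

/// Keep rows strictly greater than a threshold.
struct GreaterThan(i64);

impl BatchPredicate for GreaterThan {
    fn evaluate(&mut self, batch: &[i64]) -> Vec<bool> {
        batch.iter().map(|v| *v > self.0).collect()
    }
}

/// Apply the predicate while "scanning": rows failing the filter are
/// dropped before the batch is ever returned to the caller.
fn scan_with_pushdown(batches: &[Vec<i64>], pred: &mut dyn BatchPredicate) -> Vec<i64> {
    let mut out = Vec::new();
    for batch in batches {
        let mask = pred.evaluate(batch);
        out.extend(
            batch
                .iter()
                .zip(mask)
                .filter(|(_, keep)| *keep)
                .map(|(v, _)| *v),
        );
    }
    out
}

fn main() {
    let batches = vec![vec![1, 5, 10], vec![20, 2, 30]];
    let rows = scan_with_pushdown(&batches, &mut GreaterThan(4));
    println!("{:?}", rows); // [5, 10, 20, 30]
}
```

As the first review comment notes, parquet-rs supplies the evaluation hook but not an expression language; a query engine such as DataFusion is what typically implements `BatchPredicate`-style filters on top of it.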