alamb commented on code in PR #143:
URL: https://github.com/apache/parquet-site/pull/143#discussion_r2610242332
##########
layouts/shortcodes/implementation-status.html:
##########
@@ -0,0 +1,136 @@
+{{- /*
+ Shortcode to render implementation status tables from structured data.
+ Usage: {{< implementation-status >}}
+
+*/ -}}
Review Comment:
I also suggest adding a breadcrumb here to help people find the data without
having to read the source code. Something like:
```suggestion
{{- /*
Render implementation status tables from structured data in
`data/implementations`
Usage: {{< implementation-status >}}
*/ -}}
```
##########
data/implementations/features/format-features.yaml:
##########
@@ -0,0 +1,31 @@
+category_id: format-features
+features:
+  - id: format-bloom-filters
+    display_name: xxHash-based bloom filters
+    spec_url: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
Review Comment:
Now that you have moved the site to be based on parquet-format, we could
also change these links to point to the parquet.apache.org site, for example:
https://parquet.apache.org/docs/file-format/bloomfilter/
However, that would perhaps be better done as a follow-on PR.
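As a rough sketch of what that follow-on change might look like for this entry (assuming only the `spec_url` value changes, and that the file uses the usual two-space YAML indentation):

```yaml
category_id: format-features
features:
  - id: format-bloom-filters
    display_name: xxHash-based bloom filters
    # Hypothetical follow-on: link to the rendered site page instead of GitHub
    spec_url: https://parquet.apache.org/docs/file-format/bloomfilter/
```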
##########
content/en/docs/File Format/implementationstatus.md:
##########
@@ -26,145 +27,4 @@ Implementations:
* [hyparquet](https://github.com/hyparam/hyparquet) (JavaScript)
* [duckdb](https://github.com/duckdb/duckdb) (C++)
-### Physical types
-
-Physical types are defined by the [`enum Type` in parquet.thrift]
-
-[`enum Type` in parquet.thrift]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32
-
-
-| Data type | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
-| ----------------------------------------- | ----- | ------------- | -------- | -------- | ----- | --------- | ------ |
-| BOOLEAN | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| INT32 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| INT64 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| INT96 (1) | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | (R) |
-| FLOAT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| DOUBLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| BYTE_ARRAY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| FIXED_LEN_BYTE_ARRAY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-
-* \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files
-
-
-### Logical types
-
-Logical types are defined by the [`union LogicalType` in parquet.thrift] and described in [LogicalTypes.md]
-
-[`union LogicalType` in parquet.thrift]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L471
-[LogicalTypes.md]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
-
-| Data type | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
-|-----------------------------------------|------| ------------- | ------- | --------- | ---- | -------- |--------|
-| STRING | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| ENUM | ❌ | ✅ | ✅ | ✅ (1) | ❌ | ✅ | ✅ |
-| UUID | ❌ | ✅ | ✅ | ✅ (1) | ❌ | ✅ | ✅ |
-| 8, 16, 32, 64 bit signed and unsigned INT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| DECIMAL (INT32) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| DECIMAL (INT64) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| DECIMAL (BYTE_ARRAY) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | (R) |
-| DECIMAL (FIXED_LEN_BYTE_ARRAY) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| FLOAT16 | ✅ | ✅ (1) | ✅ | ✅ | ✅ | ✅ | ✅ |
-| DATE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| TIME (INT32) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| TIME (INT64) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| TIMESTAMP (INT64) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| INTERVAL | ✅ | ✅ (1) | ✅ | ✅ | ❌ | ✅ | ✅ |
-| JSON | ✅ | ✅ (1) | ✅ | ✅ (1) | ❌ | ✅ | ✅ |
-| BSON | ❌ | ✅ (1) | ✅ | ✅ (1) | ❌ | ❌ | ❌ |
-| [VARIANT] | | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
-| [GEOMETRY] | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
-| [GEOGRAPHY] | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ |
-| LIST | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-| MAP | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-| UNKNOWN (always null) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-
-* \(1) Only supported to use its annotated physical type
-
-[VARIANT]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
-[GEOMETRY]: https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types
-[GEOGRAPHY]: https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types
-
-
-### Encodings
-
-Encodings are defined by the [`enum Encoding` in parquet.thrift] and described in [Encodings.md]
-
-[`enum Encoding` in parquet.thrift]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L566
-[Encodings.md]: https://github.com/apache/parquet-format/blob/master/Encodings.md
-
-| Encoding | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
-| ----------------------------------------- | ----- | ------------- | -------- | -------- | ----- | --------- | ------ |
-| PLAIN | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| PLAIN_DICTIONARY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | (R) |
-| RLE_DICTIONARY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| RLE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| BIT_PACKED (deprecated) | ✅ | ✅ | ✅ | ❌ (1) | (R) | (R) | ❌ |
-| DELTA_BINARY_PACKED | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-| DELTA_LENGTH_BYTE_ARRAY | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-| DELTA_BYTE_ARRAY | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-| BYTE_STREAM_SPLIT | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-
-* \(1) Partial read support, but only in the case of level data with a bitwidth of 0
-
-### Compressions
-
-Compressions are defined by the [`enum CompressionCodec` in parquet.thrift] and described in [Compression.md]
-
-[`enum CompressionCodec` in parquet.thrift]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L642
-[Compression.md]: https://github.com/apache/parquet-format/blob/master/Compression.md
-
-| Compression | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
-| ----------------------------------------- | ----- | ------------- | -------- | -------- | ----- | --------- | ------ |
-| UNCOMPRESSED | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| BROTLI | ✅ | ✅ | ✅ | ✅ | (R) | (R) | ✅ |
-| GZIP | ✅ | ✅ | ✅ | ✅ | (R) | (R) | ✅ |
-| LZ4 (deprecated) | ✅ | ❌ | ❌ | ✅ | ❌ | (R) | ❌ |
-| LZ4_RAW | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-| LZO | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
-| SNAPPY | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| ZSTD | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | ✅ |
-
-### Other format level features
-
-| Feature | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
-|---------------------------------| ----- | ------------- | -------- | -------- | ---- | --------- | ------ |
-| [xxHash-based bloom filters] | (R) | ✅ | ✅ | ✅ | (R) | | ✅ |
-| Bloom filter length (1) | (R) | ✅ | ✅ | ✅ | (R) | | ✅ |
-| Statistics min_value, max_value | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [Page index] | ✅ | ✅ | ✅ | ✅ | ✅ | (R) | (R) |
-| Page CRC32 checksum | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | (R) |
-| [Modular encryption] | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ (*) |
-| Size statistics (2) | ✅ | ✅ | (R) | ✅ | ✅ | | (R) |
-| Data Page V2 (3) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-
-* \(1) In [parquet.thrift]: ColumnMetaData->bloom_filter_length
-
-* \(2) In [parquet.thrift]: ColumnMetaData->size_statistics
-
-* \(3) In [parquet.thrift]: DataPageHeaderV2
-
-[xxHash-based bloom filters]: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
-[parquet.thrift]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
-[Page index]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
-[Modular encryption]: https://github.com/apache/parquet-format/blob/master/Encryption.md
-
-
-* (*) Partial support
-
-### High level data APIs for Parquet feature usage
-
-| Feature | arrow | parquet-java | arrow-go | arrow-rs | cudf | hyparquet | duckdb |
-| ----------------------------------------- | ----- | ------------- | -------- | -------- | ----- | --------- | ------ |
-| External column data (1) | ✅ | ✅ | ❌ | ❌ | (W) | ✅ | ❌ |
-| Row group "Sorting column" metadata (2) | ✅ | ❌ | ✅ | ✅ | (W) | ❌ | (R) |
-| Row group pruning using statistics | ❌ | ✅ | ✅ (*) | ✅ | ✅ | ❌ | ✅ |
-| Row group pruning using bloom filter | ❌ | ✅ | ✅ (*) | ✅ | ✅ | ❌ | ✅ |
-| Reading select columns only | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| Page pruning using statistics | ❌ | ✅ | ✅ (*) | ✅ | ❌ | ❌ | ❌ |
-
-* \(1) In parquet.thrift: ColumnChunk->file_path
-
-* \(2) In parquet.thrift: RowGroup->sorting_columns
-
-* (*) Partial Support
+{{< implementation-status >}}
Review Comment:
It might help to leave a comment here so people can quickly find the
source data. Something like this, perhaps:
```suggestion
<!-- Status source in data/implementations -->
{{< implementation-status >}}
```