Script 'mail_helper' called by obssrc

Hello community,

here is the log from the commit of package apache-arrow for openSUSE:Factory checked in at 2024-02-25 14:06:15
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/apache-arrow (Old)
 and      /work/SRC/openSUSE:Factory/.apache-arrow.new.1770 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "apache-arrow" Sun Feb 25 14:06:15 2024 rev:9 rq:1150089 version:15.0.1 Changes: -------- --- /work/SRC/openSUSE:Factory/apache-arrow/apache-arrow.changes 2024-01-16 21:38:45.502166222 +0100 +++ /work/SRC/openSUSE:Factory/.apache-arrow.new.1770/apache-arrow.changes 2024-02-25 14:06:31.589784755 +0100 @@ -1,0 +2,246 @@ +Fri Feb 23 17:35:45 UTC 2024 - Ben Greiner <[email protected]> + +- Update to 15.0.1 + ## Bug Fixes + * [C++] "iso_calendar" kernel returns incorrect results for array + length > 32 (#39360) + * [C++] Explicit error in ExecBatchBuilder when appending var + length data exceeds offset limit (int32 max) (#39383) + * [C++][Parquet] Pass memory pool to decoders (#39526) + * [C++][Parquet] Validate page sizes before truncating to int32 + (#39528) + * [C++] Fix tail-word access cross buffer boundary in + `CompareBinaryColumnToRow` (#39606) + * [C++] Fix the issue of ExecBatchBuilder when appending + consecutive tail rows with the same id may exceed buffer + boundary (for fixed size types) (#39585) + * [Release] Update platform tags for macOS wheels to macosx_10_15 + (#39657) + * [C++][FlightRPC] Fix nullptr dereference in PollInfo (#39711) + * [C++] Fix tail-byte access cross buffer boundary in key hash + avx2 (#39800) + * [C++][Acero] Fix AsOfJoin with differently ordered schemas than + the output (#39804) + * [C++] Expression ExecuteScalarExpression execute empty args + function with a wrong result (#39908) + * [C++] Strip extension metadata when importing a registered + extension (#39866) + * [C#] Restore support for .NET 4.6.2 (#40008) + * [C++] Fix out-of-line data size calculation in + BinaryViewBuilder::AppendArraySlice (#39994) + * [C++][CI][Parquet] Fixing parquet column_writer_test building + (#40175) + ## New Features and Improvements + * [C++] PollFlightInfo does not follow rule of 5 + * [C++] Fix filter and take kernel for month_day_nano intervals + (#39795) + * [C++] Thirdparty: Bump zlib to 1.3.1 (#39877) + * [C++] Add missing "#include <algorithm>" (#40010) +- Release 15.0.0 + ## Bug Fixes + * [C++] Bring back case_when tests for union types (#39308) + * [C++] Fix the issue of ExecBatchBuilder when appending + consecutive tail rows with the same id may exceed buffer + boundary (#39234) + * [C++][Python] Add a no-op kernel for + dictionary_encode(dictionary) (#38349) + * [C++] Use the latest tagged version of flatbuffers (#38192) + * [C++] Don't use MSVC_VERSION to determin + -fms-compatibility-version (#36595) + * [C++] Optimize hash kernels for Dictionary ChunkedArrays + (#38394) + * [C++][Gandiva] Avoid registering exported functions multiple + times in gandiva (#37752) + * [C++][Acero] Fix race condition caused by straggling input in + the as-of-join node (#37839) + * [C++][Parquet] add more closed file checks for + ParquetFileWriter (#38390) + * [C++][FlightRPC] Add missing app_metadata arguments (#38231) + * [C++][Parquet] Fix Valgrind memory leak in + arrow-dataset-file-parquet-encryption-test (#38306) + * [C++][Parquet] Don't initialize OpenSSL explicitly with OpenSSL + 1.1 (#38379) + * [C++] Re-generate flatbuffers C++ for Skyhook (#38405) + * [C++] Avoid passing null pointer to LZ4 frame decompressor + (#39125) + * [C++] Add missing explicit size_t cast for i386 (#38557) + * [C++] Fix: add TestingEqualOptions for gtest functions. 
+  * [C++][Gandiva] Use arrow io util to replace std::filesystem::path in gandiva (#38698)
+  * [C++] Protect against PREALLOCATE preprocessor defined on macOS (#38760)
+  * [C++] Check variadic buffer counts in bounds (#38740)
+  * [C++][FS][Azure] Do nothing for CreateDir("/container", true) (#38783)
+  * Fix TestArrowReaderAdHoc.ReadFloat16Files to use new uncompressed files (#38825)
+  * [C++] S3FileSystem: export the s3 sdk config "use_virtual_addressing" to arrow::fs::S3Options (#38858)
+  * [C++][Gandiva] Fix Gandiva to_date function's validation for the suppress errors parameter (#38987)
+  * [C++][Parquet] Fix spelling (#38959)
+  * [C++] Fix spelling (acero) (#38961)
+  * [C++] Fix spelling (compute) (#38965)
+  * [C++] Fix spelling (util) (#38967)
+  * [C++] Fix spelling (dataset) (#38969)
+  * [C++] Fix spelling (filesystem) (#38972)
+  * [C++] Fix spelling (#38978)
+  * [C++] Fix spelling (#38980)
+  * [C++][Acero] Union node output batches should be unordered (#39046)
+  * [C++][CI] Fix Valgrind failures (#39127)
+  * [C++] Remove needless system Protobuf dependency with -DARROW_HDFS=ON (#39137)
+  * [C++][Compute] Fix negative duration division (#39158)
+  * [C++] Add missing data copy in StreamDecoder::Consume(data) (#39164)
+  * [C++] Remove compiler warnings with -Wconversion -Wno-sign-conversion in public headers (#39186)
+  * [C++][Benchmarking] Remove hardcoded min times (#39307)
+  * [C++] Don't use "if constexpr" in lambda (#39334)
+  * [C++] Disable -Werror=attributes for Azure SDK's identity.hpp (#39448)
+  * [C++] Fix compile warning (#39389)
+  * [CI][JS] Force node 20 on JS build on arm64 to fix build issues (#39499)
+  * [C++] Disable parallelism for jemalloc external project (#39522)
+  * [C++][Parquet] Fix crash in test_parquet_dataset_lazy_filtering (#39632)
+  * [C++] Disable parallelism for all `make`-based external projects when CMake >= 3.28 is used
+  ## New Features and Improvements
+  * [C++][JSON] Change the max rows to Unlimited(int_32) (#38582)
+  * [C++][Python] Add "Z" to the end of the timestamp print string when tz is defined (#39272)
+  * [C++][Python] DLPack implementation for Arrow Arrays (producer) (#38472)
+  * [C++] Diffing of Run-End Encoded arrays (#35003)
+  * [C++][Python][R] Allow users to adjust the S3 log level by environment variable (#38267)
+  * [C++][Format] Implementation of the LIST_VIEW and LARGE_LIST_VIEW array formats (#35345)
+  * [C++] Use Cast() instead of CastTo() for Scalar in test (#39044)
+  * [C++][Python][Parquet] Implement Float16 logical type (#36073)
+  * [C++] Add Utf8View and BinaryView to the C ABI (#38443)
+  * [C++][Parquet] Add API to get RecordReader from RowGroupReader (#37003)
+  * [C++] Expose a span converter for Buffer and ArraySpan (#38027)
+  * [C++] Add a dictionary compaction function for DictionaryArray (#37418)
+  * [C++] Add arrow::ipc::StreamDecoder::Reset() (#37970)
+  * [C++] Implement file reads for Azure filesystem (#38269)
+  * [C++][Integration] Add C++ Utf8View implementation (#37792)
+  * [C++][Gandiva] Add external function registry support (#38116)
+  * [C++][Gandiva] Migrate LLVM JIT engine from MCJIT to ORC v2/LLJIT (#39098)
+  * [C++] Support concatenating record batches (#37896)
+  * [C++] Add support for specifying custom Array opening and closing delimiters to arrow::PrettyPrintDelimiters (#38187)
+  * [R] Allow code() to return package name prefix (#38144)
+  * [C++][Benchmark] Add non-stream Codec Compression/Decompression (#38067)
+  * [C++][Parquet] Change DictEncoder dtor checking to a warning log (#38118)
+  * [C++][Parquet] Support reading parquet files with multiple gzip members (#38272)
+  * [C++][Parquet] Check that the decompressed page size matches the size in the page header (#38327)
+  * [C++][Azure] Use properties for input stream metadata (#38524)
+  * [C++][FS][Azure] Implement file writes (#38780)
+  * [C++] Implement GetFileInfo for a single file in Azure filesystem (#38505)
+  * [C++][CMake] Use transitive dependency for system GoogleTest (#38340)
+  * [C++][Parquet] Use new encrypted files for page index encryption test (#38347)
+  * Add validation logic for offsets and values to arrow.array.ListArray.fromArrays (#38531)
+  * [C++][Acero] Create a sorted merge node (#38380)
+  * [C++][Benchmark] Add benchmark for LZ4/Snappy Compression (#38453)
+  * [C++] Support LogicalNullCount for DictionaryArray (#38681)
+  * [C++][Parquet] Faster scalar BYTE_STREAM_SPLIT (#38529)
+  * [C++][Gandiva] Support registering external C functions (#38632)
+  * [C++] Implement GetFileInfo(selector) for Azure filesystem (#39009)
+  * [C++][FS][Azure] Implement CreateDir() (#38708)
+  * [C++][FS][Azure] Implement DeleteDir() (#38793)
+  * [C++][FS][Azure] Implement DeleteDirContents() (#38888)
+  * [C++] Implement AzureFileSystem::DeleteRootDirContents (#39151)
+  * [C++][FS][Azure] Implement CopyFile() (#39058)
+  * [C++][Go][Parquet] Add tests for reading Float16 files in parquet-testing (#38753)
+  * [C++][FS][Azure] Rename AzurePath to AzureLocation (#38773)
+  * [C++] Implement directory semantics even when the storage account doesn't support HNS (#39361)
+  * [C++][Parquet] Update parquet.thrift to sync with 2.10.0 (#38815)
+  * [C++] Replace "#ifdef ARROW_WITH_GZIP" in dataset test with ARROW_WITH_ZLIB (#38853)
+  * [C++][Parquet] Use length to optimize bloom filter reads (#38863)
+  * [C++][Parquet] Make parquet TypedComparator operations const methods (#38875)
+  * [C++] Release the DatasetWriter rows_in_flight_throttle when a write allocation fails (#38885)
+  * [C++][Parquet] Move EstimatedBufferedValueBytes from TypedColumnWriter to ColumnWriter (#39055)
+  * [C++] Stop installing internal bpacking_simd* headers (#38908)
+  * [C++][Gandiva] Refactor function holder to return arrow Result (#38873)
+  * [C++] Use Cast() instead of CastTo() for Dictionary Scalar in test (#39362)
+  * [C++] Use Cast() instead of CastTo() for Timestamp Scalar in test (#39060)
+  * [C++] Use Cast() instead of CastTo() for List Scalar in test (#39353)
+  * [C++][Parquet] Support row group filtering for nested paths for struct fields (#39065)
+  * [C++] Refactor the Azure FS tests and filesystem class instantiation (#39207)
+  * [C++][Parquet] Optimize FLBA record reader (#39124)
+  * Create module info compiler plugin (#39135)
+  * [C++] Try to make Buffer::device_type_ non-optional (#39150)
+  * [C++][Parquet] Remove deprecated AppendRowGroup(int64_t num_rows) (#39209)
+  * [C++][Parquet] Avoid WriteRecordBatch producing zero-sized RowGroups (#39211)
+  * [C++] Support binary to fixed_size_binary cast (#39236)
+  * [C++][Azure][FS] Add default credential auth configuration (#39263)
+  * [C++] Don't install bundled Azure SDK for C++ with CMake 3.28+ (#39269)
+  * [C++][FS] Remove the AzureBackend enum and add more flexible connection options (#39293)
+  * [C++][FS] Inform the caller of a non-existing container when checking for HNS support (#39298)
+  * [C++][FS][Azure] Add workload identity auth configuration (#39319)
+  * [C++][FS][Azure] Add managed identity auth configuration (#39321)
+  * [C++] Forward arguments to ExceptionToStatus all the way to Status::FromArgs (#39323)
+  * [C++] Fix flaky DatasetWriterTestFixture.MaxRowsOneWriteBackpresure test (#39379)
+  * [C++] Add ForceCachedHierarchicalNamespaceSupport to help with testing (#39340)
+  * [C++][FS][Azure] Add client secret auth configuration (#39346)
+  * [C++] Reduce function.h includes (#39312)
+  * [C++] Use Cast() instead of CastTo() for Parquet (#39364)
+  * [C++][Parquet] Vectorize plain decoding of FLBA (#39414)
+  * [C++][Parquet] Use the arrow::Buffer data_as API rather than reinterpret_cast (#39420)
+  * [C++][ORC] Upgrade ORC to 1.9.2 (#39431)
+  * [C++] Use default Azure credentials implicitly and support anonymous credentials explicitly (#39450)
+  * [C++][Parquet] Allow reading a dictionary without reading data via ByteArrayDictionaryRecordReader (#39153)
+- Disable logging until compatibility with glog is restored
+  gh#apache/arrow#40181
+
+-------------------------------------------------------------------

Old:
----
  apache-arrow-14.0.2.tar.gz
  arrow-testing-14.0.2.tar.gz
  parquet-testing-14.0.2.tar.gz

New:
----
  apache-arrow-15.0.1.tar.gz
  arrow-testing-15.0.1.tar.gz
  parquet-testing-15.0.1.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ apache-arrow.spec ++++++
--- /var/tmp/diff_new_pack.Apniq2/_old	2024-02-25 14:06:32.941833696 +0100
+++ /var/tmp/diff_new_pack.Apniq2/_new	2024-02-25 14:06:32.941833696 +0100
@@ -20,13 +20,13 @@
 # Required for runtime dispatch, not yet packaged
 %bcond_with xsimd
-%define sonum 1400
+%define sonum 1500
 # See git submodule /testing pointing to the correct revision
-%define arrow_testing_commit 47f7b56b25683202c1fd957668e13f2abafc0f12
+%define arrow_testing_commit ad82a736c170e97b7c8c035ebd8a801c17eec170
 # See git submodule /cpp/submodules/parquet-testing pointing to the correct revision
-%define parquet_testing_commit b2e7cc755159196e3a068c8594f7acbaecfdaaac
+%define parquet_testing_commit d69d979223e883faef9dc6fe3cf573087243c28a
 Name: apache-arrow
-Version: 14.0.2
+Version: 15.0.1
 Release: 0
 Summary: A development platform for in-memory data
 License: Apache-2.0 AND BSD-3-Clause AND BSD-2-Clause AND MIT
@@ -60,7 +60,7 @@
 BuildRequires: pkgconfig(libbrotlidec) >= 1.0.7
 BuildRequires: pkgconfig(libbrotlienc) >= 1.0.7
 BuildRequires: pkgconfig(libcares) >= 1.15.0
-BuildRequires: pkgconfig(libglog) >= 0.3.5
+#BuildRequires: pkgconfig(libglog) >= 0.3.5
 BuildRequires: pkgconfig(liblz4) >= 1.8.3
 BuildRequires: pkgconfig(libopenssl)
 BuildRequires: pkgconfig(liburiparser) >= 0.9.3
@@ -282,7 +282,7 @@
        -DARROW_JSON:BOOL=ON \
        -DARROW_ORC:BOOL=OFF \
        -DARROW_PARQUET:BOOL=ON \
-       -DARROW_USE_GLOG:BOOL=ON \
+       -DARROW_USE_GLOG:BOOL=OFF \
        -DARROW_USE_OPENSSL:BOOL=ON \
        -DARROW_WITH_BACKTRACE:BOOL=ON \
        -DARROW_WITH_BROTLI:BOOL=ON \

++++++ apache-arrow-14.0.2.tar.gz -> apache-arrow-15.0.1.tar.gz ++++++
/work/SRC/openSUSE:Factory/apache-arrow/apache-arrow-14.0.2.tar.gz /work/SRC/openSUSE:Factory/.apache-arrow.new.1770/apache-arrow-15.0.1.tar.gz differ: char 12, line 1

++++++ arrow-testing-14.0.2.tar.gz -> arrow-testing-15.0.1.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/README.md new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/README.md
--- old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/README.md	2023-03-20 19:38:00.000000000 +0100
+++ new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/README.md	2023-10-13 18:42:40.000000000 +0200
@@ -35,3 +35,12 @@
 }
 }
 ```
+
+Additional notes:
+
+| File | Description |
+|:--|:--|
+| alltypes_nulls_plain.avro | Contains a single row with null values for each scalar data type, i.e., `{"string_col":null,"int_col":null,"bool_col":null,"bigint_col":null,"float_col":null,"double_col":null,"bytes_col":null}`. Generated from https://gist.github.com/nenorbot/5a92e24f8f3615488f75e2a18a105c76 |
+| nested_records.avro | Contains two rows of nested record types. Generated from https://github.com/sarutak/avro-data-generator/blob/master/src/bin/nested-records.rs |
+| simple_enum.avro | Contains four rows of enum types. Generated from https://github.com/sarutak/avro-data-generator/blob/master/src/bin/simple-enum.rs |
+| simple_fixed.avro | Contains two rows of fixed types. Generated from https://github.com/sarutak/avro-data-generator/blob/master/src/bin/simple-fixed.rs |
\ No newline at end of file
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/alltypes_nulls_plain.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/alltypes_nulls_plain.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/alltypes_plain.bzip2.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/alltypes_plain.bzip2.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/alltypes_plain.snappy.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/alltypes_plain.snappy.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/alltypes_plain.xz.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/alltypes_plain.xz.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/alltypes_plain.zstandard.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/alltypes_plain.zstandard.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/nested_records.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/nested_records.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/simple_enum.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/simple_enum.avro differ
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/avro/simple_fixed.avro and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/avro/simple_fixed.avro differ
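The Avro fixtures above are ordinary Avro container files, so any Avro reader can inspect them. A minimal sketch, assuming the third-party `fastavro` package (not part of this changeset) and a checkout of the data/avro/ directory:

```python
# Dump the rows of one of the new Avro fixtures. For
# alltypes_nulls_plain.avro the README says to expect a single record
# whose scalar columns are all None (null).
# Assumes: `pip install fastavro`.
from fastavro import reader

with open("data/avro/alltypes_nulls_plain.avro", "rb") as f:
    for record in reader(f):
        print(record)
```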
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/parquet/README.md new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/parquet/README.md
--- old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/parquet/README.md	2023-03-20 19:38:00.000000000 +0100
+++ new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/parquet/README.md	2023-10-13 18:42:40.000000000 +0200
@@ -21,4 +21,5 @@
 | File | Description |
 | --- | --- |
-| ARROW-17100.parquet | Parquet file written by PyArrow 2.0 with DataPageV2 and compressed columns. Prior to PyArrow 3.0, pages were compressed even if the is_compressed flag was 0. This was fixed in ARROW-10353, but for backwards compatibility readers may wish to support such a file. |
\ No newline at end of file
+| ARROW-17100.parquet | Parquet file written by PyArrow 2.0 with DataPageV2 and compressed columns. Prior to PyArrow 3.0, pages were compressed even if the is_compressed flag was 0. This was fixed in ARROW-10353, but for backwards compatibility readers may wish to support such a file. |
+| alltypes-java.parquet | Parquet file written using the Java DatasetWriter class in Arrow 14.0. Supported types do not include Map, Sparse and Dense Union, Interval (Day, Year, MonthDayNano), or Float16. This file is used by https://github.com/apache/arrow/pull/38249 and was generated using the TestAllTypes#testAllTypesParquet() test case in the Java Dataset module. |
Binary files old/arrow-testing-47f7b56b25683202c1fd957668e13f2abafc0f12/data/parquet/alltypes-java.parquet and new/arrow-testing-ad82a736c170e97b7c8c035ebd8a801c17eec170/data/parquet/alltypes-java.parquet differ

++++++ parquet-testing-14.0.2.tar.gz -> parquet-testing-15.0.1.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/README.md new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/README.md
--- old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/README.md	2023-03-06 12:29:32.000000000 +0100
+++ new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/README.md	2023-11-23 17:41:57.000000000 +0100
@@ -28,6 +28,7 @@
 | delta_encoding_optional_column.parquet | optional INT64 and STRING columns with delta encoding. See [delta_encoding_optional_column.md](delta_encoding_optional_column.md) for details. |
 | nested_structs.rust.parquet | Used to test that the Rust Arrow reader can lookup the correct field from a nested struct. See [ARROW-11452](https://issues.apache.org/jira/browse/ARROW-11452) |
 | data_index_bloom_encoding_stats.parquet | optional STRING column. Contains optional metadata: bloom filters, column index, offset index and encoding stats. |
+| data_index_bloom_encoding_with_length.parquet | Same as `data_index_bloom_encoding_stats.parquet` but has `bloom_filter_length` populated in the ColumnMetaData |
 | null_list.parquet | an empty list. Generated from this json `{"emptylist":[]}` and for the purposes of testing correct read/write behaviour of this base case. |
 | alltypes_tiny_pages.parquet | small page sizes with dictionary encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
 | alltypes_tiny_pages_plain.parquet | small page sizes with plain encoding with page index [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
@@ -44,6 +45,10 @@
 | rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
 | plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
 | rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
+| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
+| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
+| float16_zeros_and_nans.parquet | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below |
+| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
 
 TODO: Document what each file is in the table above.
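The `concatenated_gzip_members.parquet` fixture pairs with the "[C++][Parquet] Support reading parquet files with multiple gzip members (#38272)" entry in the changelog above: a valid gzip stream may consist of several back-to-back members, and a reader must keep inflating past the end of the first member. A minimal sketch of that property using Python's standard `gzip` module (illustrative only, not the Parquet C++ reader itself):

```python
# Two concatenated gzip members form one valid gzip stream; a correct
# decoder decompresses across the member boundary instead of stopping
# after the first member. Python's gzip.decompress is documented to
# handle multi-member data.
import gzip

members = gzip.compress(b"first member ") + gzip.compress(b"second member")
assert gzip.decompress(members) == b"first member second member"
```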
@@ -56,14 +61,15 @@
 https://github.com/apache/parquet-format/blob/encryption/Encryption.md
 ```
 
-Following are the keys and key ids (when using key\_retriever) used to encrypt the encrypted columns and footer in the all the encrypted files:
+Following are the keys and key ids (when using key\_retriever) used to encrypt
+the encrypted columns and footer in all the encrypted files:
 * Encrypted/Signed Footer:
   * key: {0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5}
   * key_id: "kf"
-* Encrypted column named double_field:
+* Encrypted column named double_field (including column and offset index):
   * key: {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,0}
   * key_id: "kc1"
-* Encrypted column named float_field:
+* Encrypted column named float_field (including column and offset index):
   * key: {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,1}
   * key_id: "kc2"
@@ -72,10 +78,11 @@
 2. encrypt\_columns\_and\_footer\_aad.parquet.encrypted
 
-A sample that reads and checks these files can be found at the following tests:
+A sample that reads and checks these files can be found at the following tests
+in Parquet C++:
 ```
-cpp/src/parquet/encryption-read-configurations-test.cc
-cpp/src/parquet/test-encryption-util.h
+cpp/src/parquet/encryption/read-configurations-test.cc
+cpp/src/parquet/encryption/test-encryption-util.h
 ```
 
 The `external_key_material_java.parquet.encrypted` file was encrypted using parquet-mr with
@@ -91,7 +98,7 @@
 message m {
   required int32 a;
   required int32 b;
-}
+}
 ```
 
 The detailed structure for these files is as follows:
@@ -179,7 +186,7 @@
 metadata.row_group(0).column(0)
 # <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
 #   file_offset: 88
-#   file_path: 
+#   file_path:
 #   type: DOUBLE
 #   num_values: 2
 #   path_in_schema: x
@@ -202,3 +209,115 @@
 #   total_compressed_size: 84
 #   total_uncompressed_size: 80
 ```
+
+## Large string map
+
+The file `large_string_map.brotli.parquet` was generated with:
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
+arr = pa.chunked_array([arr, arr])
+tab = pa.table({ "arr": arr })
+
+pq.write_table(tab, "test.parquet", compression='BROTLI')
+```
+
+It is meant to exercise reading of structured data where each value
+is smaller than 2GB but the combined uncompressed column chunk size
+is greater than 2GB.
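Reading the file back is the interesting half of the exercise, since the decoded column chunk crosses the 2 GB boundary. A hedged sketch (it needs several GB of RAM, and the comments state expectations rather than verified output):

```python
# Read large_string_map.brotli.parquet and check that the decoded map
# column exceeds 2 GiB in total even though each individual string
# value stays below 2 GiB.
import pyarrow.parquet as pq

tab = pq.read_table("large_string_map.brotli.parquet")
print(tab.column("arr").type)  # map<string, int32>
print(tab.nbytes > 2**31)      # expected: True
```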
+
+## Float16 Files
+
+The files `float16_zeros_and_nans.parquet` and `float16_nonzeros_and_nans.parquet`
+are meant to exercise a variety of test cases regarding `Float16` columns (which
+are represented as 2-byte `FixedLenByteArray`s), including:
+* Basic binary representations of standard values, +/- zeros, and NaN
+* Comparisons between finite values
+* Exclusion of NaNs from statistics min/max
+* Normalizing min/max values when only zeros are present (i.e. `min` is always -0 and `max` is always +0)
+
+The aforementioned files were generated with:
+
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+import numpy as np
+
+t1 = pa.Table.from_arrays(
+    [pa.array([None,
+               np.float16(0.0),
+               np.float16(np.NaN)], type=pa.float16())],
+    names="x")
+t2 = pa.Table.from_arrays(
+    [pa.array([None,
+               np.float16(1.0),
+               np.float16(-2.0),
+               np.float16(np.NaN),
+               np.float16(0.0),
+               np.float16(-1.0),
+               np.float16(-0.0),
+               np.float16(2.0)],
+              type=pa.float16())],
+    names="x")
+
+pq.write_table(t1, "float16_zeros_and_nans.parquet", compression='none')
+pq.write_table(t2, "float16_nonzeros_and_nans.parquet", compression='none')
+
+m1 = pq.read_metadata("float16_zeros_and_nans.parquet")
+m2 = pq.read_metadata("float16_nonzeros_and_nans.parquet")
+
+print(m1.row_group(0).column(0))
+print(m2.row_group(0).column(0))
+# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f79e9a3d850>
+#   file_offset: 68
+#   file_path:
+#   physical_type: FIXED_LEN_BYTE_ARRAY
+#   num_values: 3
+#   path_in_schema: x
+#   is_stats_set: True
+#   statistics:
+#     <pyarrow._parquet.Statistics object at 0x7f79e9a3d940>
+#       has_min_max: True
+#       min: b'\x00\x80'
+#       max: b'\x00\x00'
+#       null_count: 1
+#       distinct_count: None
+#       num_values: 2
+#       physical_type: FIXED_LEN_BYTE_ARRAY
+#       logical_type: Float16
+#       converted_type (legacy): NONE
+#   compression: UNCOMPRESSED
+#   encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
+#   has_dictionary_page: True
+#   dictionary_page_offset: 4
+#   data_page_offset: 22
+#   total_compressed_size: 64
+#   total_uncompressed_size: 64
+# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f79ea003c40>
+#   file_offset: 80
+#   file_path:
+#   physical_type: FIXED_LEN_BYTE_ARRAY
+#   num_values: 8
+#   path_in_schema: x
+#   is_stats_set: True
+#   statistics:
+#     <pyarrow._parquet.Statistics object at 0x7f79e9a3d8a0>
+#       has_min_max: True
+#       min: b'\x00\xc0'
+#       max: b'\x00@'
+#       null_count: 1
+#       distinct_count: None
+#       num_values: 7
+#       physical_type: FIXED_LEN_BYTE_ARRAY
+#       logical_type: Float16
+#       converted_type (legacy): NONE
+#   compression: UNCOMPRESSED
+#   encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
+#   has_dictionary_page: True
+#   dictionary_page_offset: 4
+#   data_page_offset: 32
+#   total_compressed_size: 76
+#   total_uncompressed_size: 76
+```
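The `min`/`max` byte strings printed above are the raw little-endian Float16 statistics values; a short sketch decoding them with NumPy (the expected values follow from the IEEE 754 half-precision encoding):

```python
# Decode the statistics byte strings from the output above:
# 0x8000 is -0.0, 0x0000 is +0.0, 0xc000 is -2.0 and 0x4000 is 2.0
# in half precision ("<f2" = little-endian float16).
import numpy as np

for raw in (b"\x00\x80", b"\x00\x00", b"\x00\xc0", b"\x00@"):
    print(raw, "->", np.frombuffer(raw, dtype="<f2")[0])
```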
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/concatenated_gzip_members.parquet and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/concatenated_gzip_members.parquet differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/data_index_bloom_encoding_with_length.parquet and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/data_index_bloom_encoding_with_length.parquet differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/encrypt_columns_and_footer.parquet.encrypted and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/encrypt_columns_and_footer.parquet.encrypted differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/encrypt_columns_and_footer_aad.parquet.encrypted and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/encrypt_columns_and_footer_aad.parquet.encrypted differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/encrypt_columns_and_footer_ctr.parquet.encrypted and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/encrypt_columns_and_footer_ctr.parquet.encrypted differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/encrypt_columns_plaintext_footer.parquet.encrypted and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/encrypt_columns_plaintext_footer.parquet.encrypted differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/float16_nonzeros_and_nans.parquet and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/float16_nonzeros_and_nans.parquet differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/float16_zeros_and_nans.parquet and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/float16_zeros_and_nans.parquet differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/large_string_map.brotli.parquet and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/large_string_map.brotli.parquet differ
Binary files old/parquet-testing-b2e7cc755159196e3a068c8594f7acbaecfdaaac/data/uniform_encryption.parquet.encrypted and new/parquet-testing-d69d979223e883faef9dc6fe3cf573087243c28a/data/uniform_encryption.parquet.encrypted differ
