[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415233940

## File path: cpp/src/arrow/util/spaced.h ##

@@ -0,0 +1,266 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/util/bit_util.h"
+
+#ifdef ARROW_HAVE_AVX512
+#include <immintrin.h>
+#endif  // ARROW_HAVE_AVX512
+
+namespace arrow {
+namespace util {
+namespace internal {
+
+template <typename T>
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,

Review comment: might be worth a quick discussion on whether we should vendor this (or maybe just vendor the lookup tables).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
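For readers new to the PR's terminology: a "spaced" buffer keeps one slot per logical value, nulls included, while the encoded form packs only the valid values together. A minimal portable sketch of the two directions (illustrative only, not the PR's code; `CompactSpaced` and `ExpandSpaced` are made-up names standing in for PutSpaced and DecodeSpaced):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Read bit i of a little-endian validity bitmap (1 = valid, 0 = null).
static bool GetBit(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

// "PutSpaced" direction: drop the null slots, packing valid values to the
// front of `output`. Returns the number of valid values written.
template <typename T>
int CompactSpaced(const T* values, int num_values, const uint8_t* valid_bits,
                  T* output) {
  int num_valid = 0;
  for (int i = 0; i < num_values; i++) {
    if (GetBit(valid_bits, i)) output[num_valid++] = values[i];
  }
  return num_valid;
}

// "DecodeSpaced" direction: expand a packed buffer back out in place. We walk
// from the back so a source slot is never overwritten before it is copied.
template <typename T>
void ExpandSpaced(T* buffer, int num_values, int null_count,
                  const uint8_t* valid_bits) {
  int src = num_values - null_count - 1;  // index of the last packed value
  for (int i = num_values - 1; src >= 0 && src < i; i--) {
    if (GetBit(valid_bits, i)) buffer[i] = buffer[src--];
  }
}
```

The back-to-front walk is the same early-exit scheme described in the comments of `DecodeSpacedScalar`: once the source index catches up with the destination index, the remaining prefix is already in place.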
[GitHub] [arrow] emkornfield commented on pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on pull request #7029: URL: https://github.com/apache/arrow/pull/7029#issuecomment-619493366

Just curious if you see any impact on the parquet-arrow-reader-writer benchmarks? That is the ultimate goal of the speedup.
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415231929

## File path: cpp/src/arrow/util/spaced.h ##

+template <typename T>
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,
+                    int64_t valid_bits_offset, T* output) {
+  int num_valid_values = 0;
+  arrow::internal::BitmapReader valid_bits_reader(valid_bits, valid_bits_offset,
+                                                  num_values);
+  for (int32_t i = 0; i < num_values; i++) {
+    if (valid_bits_reader.IsSet()) {
+      output[num_valid_values++] = values[i];
+    }
+    valid_bits_reader.Next();
+  }
+  return num_valid_values;
+}
+
+template <typename T>
+int DecodeSpacedScalar(T* buffer, int num_values, int null_count,
+                       const uint8_t* valid_bits, int64_t valid_bits_offset) {
+  const int values_read = num_values - null_count;
+
+  // Depending on the number of nulls, some of the value slots in buffer may
+  // be uninitialized, and this will cause valgrind warnings / potentially UB
+  memset(static_cast<void*>(buffer + values_read), 0,
+         (num_values - values_read) * sizeof(T));
+
+  // Add spacing for null entries. As we have filled the buffer from the front,
+  // we need to add the spacing from the back.
+  int values_to_move = values_read - 1;
+  // We stop early on one of two conditions:
+  // 1. There are no more null values that need spacing. Note we infer this
+  //    backwards, when 'i' is equal to 'values_to_move' it indicates
+  //    all nulls have been consumed.
+  // 2. There are no more non-null values that need to move which indicates
+  //    all remaining slots are null, so their exact value doesn't matter.
+  for (int i = num_values - 1; (i > values_to_move) && (values_to_move >= 0); i--) {
+    if (BitUtil::GetBit(valid_bits, valid_bits_offset + i)) {
+      buffer[i] = buffer[values_to_move];
+      values_to_move--;
+    }
+  }
+  return num_values;
+}
+
+#if defined(ARROW_HAVE_AVX512)
+template <typename T>
+int PutSpacedAvx512Compress(const T* values, int num_values, const uint8_t* valid_bits,
+                            int64_t valid_bits_offset, T* output) {
+  assert(sizeof(T) == 4 || sizeof(T) == 8);  // Only support epi32 and epi64
+  constexpr int kBatchSize = sizeof(__m512i) / sizeof(T);
+  constexpr int kBatchValidBytes = kBatchSize / 8;
+  int num_valid_values = 0;
+  int idx_values = 0;
+  int64_t idx_valid_bits = valid_bits_offset;
+
+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); i++) {
+    if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+      output[num_valid_values] = values[idx_values];
+      num_valid_values++;
+    }
+    idx_values++;
+    idx_valid_bits++;
+  }
+
+  // The parts can fill into batches
+  uint8_t valid_count;
+  int64_t idx_valid_bytes = BitUtil::BytesForBits(idx_valid_bits + 1) - 1;
+  static const __m512i zero = _mm512_set_epi64(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0);
+  while (num_values - idx_values >= kBatchSize) {
+    // count the valid numbers of one batch.
+    valid_count = BitUtil::kBytePopcount[valid_bits[idx_valid_bytes]];
+    if (kBatchValidBytes > 1) {
+      valid_count += BitUtil::kBytePopcount[valid_bits[idx_valid_bytes + 1]];
+    }
+
+    // pack the data
+    if (valid_count > 0) {
+      __m512i src = _mm512_loadu_si512(values + idx_values);
+      __m512i result;
+      if (sizeof(T) == 4) {
+        // 16 float for one m512i block, two bytes in valid_bits
+        __mmask16 k = *(reinterpret_cast<const __mmask16*>(valid_bits + idx_valid_bytes));
+        result = _mm512_mask_compress_epi32(zero, k, src);
+      } else {
+        // 8 double for one m512i block, one byte in valid_bits
+        __mmask8 k = *(valid_bits + idx_valid_bytes);
+        result = _mm512_mask_compress_epi64(zero, k, src);
+      }
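For readers without the intrinsics guide at hand, the core trick in the batch loop above is AVX-512 mask compression. A portable scalar model of what `_mm512_mask_compress_epi32` computes for one 16-lane batch (a sketch of the semantics, not the intrinsic itself; `MaskCompress16` is a made-up name):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of _mm512_mask_compress_epi32 with a zeroed src operand:
// lanes whose mask bit is set are packed contiguously from lane 0; the
// remaining lanes take the src value (0 here, matching the PR's `zero`).
std::array<int32_t, 16> MaskCompress16(const std::array<int32_t, 16>& a,
                                       uint16_t mask) {
  std::array<int32_t, 16> out{};  // value-initialized to zeros
  int j = 0;
  for (int i = 0; i < 16; i++) {
    if ((mask >> i) & 1) out[j++] = a[i];
  }
  return out;
}
```

The kernel then advances the output position by the popcount of the corresponding validity bytes (`valid_count`), so only the packed prefix of the result is kept.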
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415231200

## File path: cpp/src/arrow/util/spaced.h ##
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415231122

## File path: cpp/src/arrow/util/spaced.h ##
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415230959

## File path: cpp/src/arrow/util/spaced.h ##

+  // First handle the front suffix
+  const int64_t offset_suffix_front = 8 - (valid_bits_offset % 8);
+  for (int64_t i = 0; (i < offset_suffix_front) && (idx_values < num_values); i++) {

Review comment: i < min(offset_suffix_front, num_values)? If you take the min outside of the loop, you also don't need to increment idx_values_* each time inside the loop (they can be calculated with a single addition). Also, by doing the calculation outside the loop, I think you can call the scalar function instead of duplicating code.
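One way to picture the suggestion (a hypothetical refactor sketch, not code from the PR; the struct and function names are invented): take the `min` once, and the running indices fall out of a single addition after the prefix is handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical reshaping of the front-prefix handling along the lines of the
// review comment: compute the prefix length up front instead of testing two
// conditions and bumping two counters on every iteration.
struct Prefix {
  int64_t length;          // values to process with the scalar path
  int64_t next_value;      // first value index for the batched loop
  int64_t next_valid_bit;  // first bitmap position after the prefix
};

Prefix FrontPrefix(int64_t valid_bits_offset, int64_t num_values) {
  const int64_t to_byte_boundary = 8 - (valid_bits_offset % 8);
  const int64_t len = std::min(to_byte_boundary, num_values);
  // The `len` leading values could now be handed to the scalar routine,
  // avoiding the duplicated bit-testing loop inside the AVX512 path.
  return {len, len, valid_bits_offset + len};
}
```

With the bound fixed in advance, `idx_values` and `idx_valid_bits` are derived once rather than incremented per iteration, which is the "single addition" the comment refers to.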
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415230291

## File path: cpp/src/arrow/util/spaced.h ##

+    if (BitUtil::GetBit(valid_bits, idx_valid_bits)) {
+      output[num_valid_values] = values[idx_values];
+      num_valid_values++;
+    }
+    idx_values++;

Review comment: nit: can't these be moved out of the loop?
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415229405

## File path: cpp/src/arrow/util/spaced.h ##

+template <typename T>
+int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits,

Review comment: please add docs describing what each method does. PutSpaced and DecodeSpaced aren't the most intuitive names, and someone just looking at the header would have to read the code to understand each one.
[GitHub] [arrow] emkornfield commented on a change in pull request #7029: ARROW-8579 [C++] Add AVX512 SIMD for spaced decoding and encoding.
emkornfield commented on a change in pull request #7029: URL: https://github.com/apache/arrow/pull/7029#discussion_r415229227 ## File path: cpp/src/arrow/util/spaced.h ## @@ -0,0 +1,266 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#pragma once + +#include "arrow/util/bit_util.h" + +#ifdef ARROW_HAVE_AVX512 +#include <immintrin.h> +#endif // ARROW_HAVE_AVX512 + +namespace arrow { +namespace util { +namespace internal { + +template <typename T> +int PutSpacedScalar(const T* values, int num_values, const uint8_t* valid_bits, Review comment: FWIW, Daniel Lemire has a SIMD library for this https://lemire.me/blog/2017/04/25/quickly-pruning-elements-in-simd-vectors-using-the-simdprune-library/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] tianchen92 commented on a change in pull request #6912: ARROW-8020: [Java] Implement vector validate functionality
tianchen92 commented on a change in pull request #6912: URL: https://github.com/apache/arrow/pull/6912#discussion_r415229312 ## File path: java/vector/src/main/java/org/apache/arrow/vector/complex/NonNullableStructVector.java ## @@ -320,6 +322,20 @@ public int hashCode(int index, ArrowBufHasher hasher) { return visitor.visit(this, value); } + @Override + public void validate() { +if (getValueCount() < 0) { + throw new RuntimeException("vector valueCount is negative"); +} + +if (getNullCount() > getValueCount()) { Review comment: Thanks, fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] tianchen92 commented on a change in pull request #6912: ARROW-8020: [Java] Implement vector validate functionality
tianchen92 commented on a change in pull request #6912: URL: https://github.com/apache/arrow/pull/6912#discussion_r415229163 ## File path: java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java ## @@ -283,4 +283,10 @@ * @return the name of the vector. */ String getName(); + + /** + * Validate the vector, will throw exception if validate fail. + */ + void validate(); Review comment: Thanks for all your comments; we already have a validate method in ValueVectorUtility, and I removed the API from ValueVector. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files
emkornfield commented on a change in pull request #6954: URL: https://github.com/apache/arrow/pull/6954#discussion_r415226340 ## File path: cpp/src/arrow/util/hash_util.h ## @@ -27,39 +27,27 @@ #include "arrow/util/logging.h" #include "arrow/util/macros.h" -#include "arrow/util/neon_util.h" -#include "arrow/util/sse_util.h" +#include "arrow/util/simd.h" -static inline uint32_t HW_crc32_u8(uint32_t crc, uint8_t v) { - DCHECK(false) << "Hardware CRC support is not enabled"; - return 0; -} - -static inline uint32_t HW_crc32_u16(uint32_t crc, uint16_t v) { - DCHECK(false) << "Hardware CRC support is not enabled"; - return 0; -} - -static inline uint32_t HW_crc32_u32(uint32_t crc, uint32_t v) { - DCHECK(false) << "Hardware CRC support is not enabled"; - return 0; -} - -static inline uint32_t HW_crc32_u64(uint32_t crc, uint64_t v) { +#ifdef ARROW_HAVE_SSE4_2 Review comment: I'd be concerned about putting too much in simd.h directly. If we expect these to be used in other places that don't depend on this header, we should create a specific header targeted at hardware digests/hashing. I think the way bit_util.h wraps special instructions is a reasonable path to follow. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files
emkornfield commented on a change in pull request #6954: URL: https://github.com/apache/arrow/pull/6954#discussion_r415228325 ## File path: cpp/src/arrow/util/simd.h ## @@ -17,6 +17,24 @@ #pragma once +#ifdef _MSC_VER +// MSVC x86_64/arm64 + +#if defined(_M_AMD64) || defined(_M_X64) +#include <intrin.h> +#elif defined(_M_ARM64) +#include <arm64_neon.h> +#endif + +#else +// gcc/clang (possibly others) + Review comment: Just an FYI: https://github.com/apache/arrow/pull/6985 relies on BMI2; do you think adding that here would be appropriate? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6954: ARROW-8440: [C++] Refine SIMD header files
emkornfield commented on a change in pull request #6954: URL: https://github.com/apache/arrow/pull/6954#discussion_r415226495 ## File path: cpp/src/arrow/util/hash_util.h ## @@ -27,39 +27,27 @@ #include "arrow/util/logging.h" #include "arrow/util/macros.h" -#include "arrow/util/neon_util.h" -#include "arrow/util/sse_util.h" +#include "arrow/util/simd.h" -static inline uint32_t HW_crc32_u8(uint32_t crc, uint8_t v) { - DCHECK(false) << "Hardware CRC support is not enabled"; - return 0; -} - -static inline uint32_t HW_crc32_u16(uint32_t crc, uint16_t v) { - DCHECK(false) << "Hardware CRC support is not enabled"; - return 0; -} - -static inline uint32_t HW_crc32_u32(uint32_t crc, uint32_t v) { - DCHECK(false) << "Hardware CRC support is not enabled"; - return 0; -} - -static inline uint32_t HW_crc32_u64(uint32_t crc, uint64_t v) { +#ifdef ARROW_HAVE_SSE4_2 +constexpr auto HW_crc32_u8 = _mm_crc32_u8; Review comment: style nits: if possible, let's spell out the type instead of using auto (it's not at all clear what the types are without looking up these exact functions). Since these are functions, it would be nice to use standard function casing for them, e.g. Crc32U16; even better might be to define something like: HardwareCrc(uint16) .. HardwareCrc(uint16) .. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield commented on pull request #6985: URL: https://github.com/apache/arrow/pull/6985#issuecomment-619488093 @pitrou I think I addressed your comments. One that went stale concerned the complexity of "AppendWord". I tried to remove parts that did not seem to affect performance on benchmarks for parquet column reading, but I spent some time trying to maximize word-level parallelism for the unaligned case, because I expect that case to be fairly common, at least for repeated fields. As a point of comparison, the AppendWord implementation I put in place for BigEndian shows much smaller improvements. Really this should use CopyBitmap, but I didn't want to start moving more code than I needed to in bit_util.h for this PR. I opened a [JIRA (ARROW-8595)](https://issues.apache.org/jira/browse/ARROW-8595) to track some improvements in that regard. If more comments are needed or you think there is a cleaner way of writing the code, I'm happy for input. @wesm do you want to look at the parquet-specific logic/comments to make sure I captured them correctly? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r415221052 ## File path: cpp/cmake_modules/SetupCxxFlags.cmake ## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86") set(CXX_SUPPORTS_SSE4_2 TRUE) else() set(ARROW_SSE4_2_FLAG "-msse4.2") -set(ARROW_AVX2_FLAG "-mavx2") +set(ARROW_AVX2_FLAG "-march=core-avx2") Review comment: done, seems to work. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r415221009 ## File path: cpp/src/parquet/level_conversion_test.cc ## @@ -0,0 +1,162 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "parquet/level_conversion.h" + +#include +#include + +#include + +#include "arrow/util/bit_util.h" + +namespace parquet { +namespace internal { + +using ::testing::ElementsAreArray; + +std::string ToString(const uint8_t* bitmap, int64_t bit_count) { + return arrow::internal::Bitmap(bitmap, /*offset*/ 0, /*length=*/bit_count).ToString(); +} + +std::string ToString(const std::vector<uint8_t>& bitmap, int64_t bit_count) { + return ToString(bitmap.data(), bit_count); +} + +TEST(TestGreaterThanBitmap, GeneratesExpectedBitmasks) { Review comment: There was a substantial refactoring; I tried to provide guards, but without a big-endian machine in CI it will be very hard to catch all of these issues. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] emkornfield commented on a change in pull request #6985: ARROW-8413: [C++][Parquet] Refactor Generating validity bitmap for values column
emkornfield commented on a change in pull request #6985: URL: https://github.com/apache/arrow/pull/6985#discussion_r414295562 ## File path: cpp/cmake_modules/SetupCxxFlags.cmake ## @@ -40,12 +40,13 @@ if(ARROW_CPU_FLAG STREQUAL "x86") set(CXX_SUPPORTS_SSE4_2 TRUE) else() set(ARROW_SSE4_2_FLAG "-msse4.2") -set(ARROW_AVX2_FLAG "-mavx2") +set(ARROW_AVX2_FLAG "-march=core-avx2") # skylake-avx512 consists of AVX512F,AVX512BW,AVX512VL,AVX512CD,AVX512DQ set(ARROW_AVX512_FLAG "-march=skylake-avx512") check_cxx_compiler_flag(${ARROW_SSE4_2_FLAG} CXX_SUPPORTS_SSE4_2) endif() check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2) + check_cxx_compiler_flag(${ARROW_AVX2_FLAG} CXX_SUPPORTS_AVX2) Review comment: should be removed now. ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, Review comment: I've moved all everything in the anonymous namespace to level_conversion.cc ## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +throw ParquetException("definition level exceeds maximum"); + } +} + +#if !defined(ARROW_HAVE_BMI2) + +inline void DefinitionLevelsToBitmapScalar( +const int16_t* def_levels, int64_t num_def_levels, const int16_t max_definition_level, +const int16_t max_repetition_level, int64_t* values_read, int64_t* null_count, +uint8_t* valid_bits, int64_t valid_bits_offset) { + // We assume 
here that valid_bits is large enough to accommodate the + // additional definition levels and the ones that have already been written + ::arrow::internal::BitmapWriter valid_bits_writer(valid_bits, valid_bits_offset, +num_def_levels); + + // TODO(itaiin): As an interim solution we are splitting the code path here + // between repeated+flat column reads, and non-repeated+nested reads. + // Those paths need to be merged in the future + for (int i = 0; i < num_def_levels; ++i) { +if (def_levels[i] == max_definition_level) { + valid_bits_writer.Set(); +} else if (max_repetition_level > 0) { + // repetition+flat case + if (def_levels[i] == (max_definition_level - 1)) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +continue; + } +} else { + // non-repeated+nested case + if (def_levels[i] < max_definition_level) { +valid_bits_writer.Clear(); +*null_count += 1; + } else { +throw ParquetException("definition level exceeds maximum"); + } +} + +valid_bits_writer.Next(); + } + valid_bits_writer.Finish(); + *values_read = valid_bits_writer.position(); +} +#endif + +template +void DefinitionLevelsToBitmapSimd(const int16_t* def_levels, int64_t num_def_levels, + const int16_t required_definition_level, + int64_t* values_read, int64_t* null_count, + uint8_t* valid_bits, int64_t valid_bits_offset) { + constexpr int64_t kBitMaskSize = 64; + int64_t set_count = 0; + *values_read = 0; + while (num_def_levels > 0) { +int64_t batch_size = std::min(num_def_levels, kBitMaskSize); +CheckLevelRange(def_levels, batch_size, required_definition_level); +uint64_t defined_bitmap = internal::GreaterThanBitmap(def_levels, batch_size, + required_definition_level - 1); +if (has_repeated_parent) { + // This is currently a specialized code path assuming only (nested) lists Review comment: done. 
## File path: cpp/src/parquet/column_reader.cc ## @@ -50,6 +51,140 @@ using arrow::internal::checked_cast; namespace parquet { +namespace { + +inline void CheckLevelRange(const int16_t* levels, int64_t num_levels, +const int16_t max_expected_level) { + int16_t min_level = std::numeric_limits::max(); + int16_t max_level = std::numeric_limits::min(); + for (int x = 0; x < num_levels; x++) { +min_level = std::min(levels[x], min_level); +max_level = std::max(levels[x], max_level); + } + if (ARROW_PREDICT_FALSE(num_levels > 0 && (min_level < 0 || max_level > max_level))) { +
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415203879 ## File path: java/memory/src/test/java/org/apache/arrow/memory/TestNettyAllocationManager.java ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.memory; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertNull; +import static org.junit.Assert.assertTrue; + +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +/** + * Test cases for {@link NettyAllocationManager}. + */ +public class TestNettyAllocationManager { + + private void readWriteArrowBuf(ArrowBuf buffer) { +// write buffer +for (long i = 0; i < buffer.capacity() / 8; i++) { + buffer.setLong(i * 8, i); +} + +// read buffer +for (long i = 0; i < buffer.capacity() / 8; i++) { + long val = buffer.getLong(i * 8); + assertEquals(i, val); +} + } + + /** + * Test the allocation strategy for small buffers.. 
+ */ + @Test + public void testSmallBufferAllocation() { +final long bufSize = 512L; +try (RootAllocator allocator = new RootAllocator(bufSize); + ArrowBuf buffer = allocator.buffer(bufSize)) { + // make sure the buffer is small enough, so we wil use the allocation strategy for small buffers + assertTrue(bufSize < NettyAllocationManager.DEFAULT_ALLOCATION_CUTOFF_VALUE); + + assertTrue(buffer.getReferenceManager() instanceof BufferLedger); + BufferLedger bufferLedger = (BufferLedger) buffer.getReferenceManager(); + + // make sure we are using netty allocation manager + AllocationManager allocMgr = bufferLedger.getAllocationManager(); + assertTrue(allocMgr instanceof NettyAllocationManager); + NettyAllocationManager nettyMgr = (NettyAllocationManager) allocMgr; + + // for the small buffer allocation strategy, the chunk is not null + assertNotNull(nettyMgr.getMemoryChunk()); + + readWriteArrowBuf(buffer); +} + } + + /** + * Test the allocation strategy for large buffers.. + */ + @Test + public void testLargeBufferAllocation() { +final long bufSize = 2048L; Review comment: Sounds good. Revised accordingly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415203851 ## File path: java/memory/src/test/java/org/apache/arrow/memory/TestNettyAllocationManager.java ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.memory; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertNull; +import static org.junit.Assert.assertTrue; + +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +/** + * Test cases for {@link NettyAllocationManager}. + */ +public class TestNettyAllocationManager { + + private void readWriteArrowBuf(ArrowBuf buffer) { +// write buffer +for (long i = 0; i < buffer.capacity() / 8; i++) { + buffer.setLong(i * 8, i); +} + +// read buffer +for (long i = 0; i < buffer.capacity() / 8; i++) { + long val = buffer.getLong(i * 8); + assertEquals(i, val); +} + } + + /** + * Test the allocation strategy for small buffers.. 
+ */ + @Test + public void testSmallBufferAllocation() { +final long bufSize = 512L; +try (RootAllocator allocator = new RootAllocator(bufSize); + ArrowBuf buffer = allocator.buffer(bufSize)) { + // make sure the buffer is small enough, so we wil use the allocation strategy for small buffers + assertTrue(bufSize < NettyAllocationManager.DEFAULT_ALLOCATION_CUTOFF_VALUE); + + assertTrue(buffer.getReferenceManager() instanceof BufferLedger); + BufferLedger bufferLedger = (BufferLedger) buffer.getReferenceManager(); + + // make sure we are using netty allocation manager + AllocationManager allocMgr = bufferLedger.getAllocationManager(); + assertTrue(allocMgr instanceof NettyAllocationManager); + NettyAllocationManager nettyMgr = (NettyAllocationManager) allocMgr; + + // for the small buffer allocation strategy, the chunk is not null + assertNotNull(nettyMgr.getMemoryChunk()); + + readWriteArrowBuf(buffer); +} + } + + /** + * Test the allocation strategy for large buffers.. + */ + @Test + public void testLargeBufferAllocation() { +final long bufSize = 2048L; +try (RootAllocator allocator = new RootAllocator(bufSize); + ArrowBuf buffer = allocator.buffer(bufSize)) { + // make sure the buffer is large enough, so we wil use the allocation strategy for large buffers Review comment: Sorry about the typo. This comment is removed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415203808 ## File path: java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java ## @@ -17,48 +17,97 @@ package org.apache.arrow.memory; -import org.apache.arrow.memory.util.LargeMemoryUtil; - import io.netty.buffer.PooledByteBufAllocatorL; import io.netty.buffer.UnsafeDirectLittleEndian; +import io.netty.util.internal.PlatformDependent; /** - * The default implementation of AllocationManagerBase. The implementation is responsible for managing when memory + * The default implementation of {@link AllocationManager}. The implementation is responsible for managing when memory * is allocated and returned to the Netty-based PooledByteBufAllocatorL. */ public class NettyAllocationManager extends AllocationManager { public static final Factory FACTORY = new Factory(); + /** + * The default cut-off value for switching allocation strategies. + * If the request size is not greater than the cut-off value, we will allocate memory by + * {@link PooledByteBufAllocatorL} APIs, + * otherwise, we will use {@link PlatformDependent} APIs. 
+ */ + public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE; + + public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = "default.allocation.cutoff.name"; + + static { +long cutOffValue; +try { + cutOffValue = Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME)); +} catch (Exception e) { + cutOffValue = Integer.MAX_VALUE; +} +DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue; + } + private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new PooledByteBufAllocatorL(); static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty; static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize(); - private final int allocatedSize; + private final long allocatedSize; private final UnsafeDirectLittleEndian memoryChunk; + private final long allocatedAddress; + + /** + * The cut-off value for switching allocation strategies. + */ + private final long allocationCutOffValue; - NettyAllocationManager(BaseAllocator accountingAllocator, int requestedSize) { + NettyAllocationManager(BaseAllocator accountingAllocator, long requestedSize, long allocationCutOffValue) { super(accountingAllocator); -this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize); -this.allocatedSize = memoryChunk.capacity(); +if (allocationCutOffValue > Integer.MAX_VALUE) { + throw new IllegalArgumentException("The cut-off value cannot be larger than Integer.MAX_VALUE"); +} +this.allocationCutOffValue = allocationCutOffValue; + +if (requestedSize > allocationCutOffValue) { + this.memoryChunk = null; + this.allocatedAddress = PlatformDependent.allocateMemory(requestedSize); + this.allocatedSize = requestedSize; +} else { + this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize); + this.allocatedAddress = memoryChunk.memoryAddress(); + this.allocatedSize = memoryChunk.capacity(); +} + } + + NettyAllocationManager(BaseAllocator accountingAllocator, long requestedSize) { +this(accountingAllocator, requestedSize, DEFAULT_ALLOCATION_CUTOFF_VALUE); } /** * Get the underlying 
memory chunk managed by this AllocationManager. - * @return buffer + * @return the underlying memory chunk if the request size is not greater than the + * {@link NettyAllocationManager#allocationCutOffValue}, or null otherwise. + * + * @deprecated this method will be removed in a future release. */ + @Deprecated UnsafeDirectLittleEndian getMemoryChunk() { -return memoryChunk; +return allocatedSize > allocationCutOffValue ? null : memoryChunk; Review comment: Sorry. This is not needed. Reverted. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415203665 ## File path: java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java ## @@ -17,48 +17,97 @@ package org.apache.arrow.memory; -import org.apache.arrow.memory.util.LargeMemoryUtil; - import io.netty.buffer.PooledByteBufAllocatorL; import io.netty.buffer.UnsafeDirectLittleEndian; +import io.netty.util.internal.PlatformDependent; /** - * The default implementation of AllocationManagerBase. The implementation is responsible for managing when memory + * The default implementation of {@link AllocationManager}. The implementation is responsible for managing when memory * is allocated and returned to the Netty-based PooledByteBufAllocatorL. */ public class NettyAllocationManager extends AllocationManager { public static final Factory FACTORY = new Factory(); + /** + * The default cut-off value for switching allocation strategies. + * If the request size is not greater than the cut-off value, we will allocate memory by + * {@link PooledByteBufAllocatorL} APIs, + * otherwise, we will use {@link PlatformDependent} APIs. + */ + public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE; + + public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = "default.allocation.cutoff.name"; + + static { +long cutOffValue; +try { + cutOffValue = Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME)); +} catch (Exception e) { + cutOffValue = Integer.MAX_VALUE; +} +DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue; + } + Review comment: Removed. Thank you for the good suggestion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415203349 ## File path: java/memory/src/test/java/org/apache/arrow/memory/TestLargeArrowBuf.java ## @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.memory; + +import static org.junit.Assert.assertEquals; + +import io.netty.buffer.ArrowBuf; + +/** + * Integration test for large (more than 2GB) {@link io.netty.buffer.ArrowBuf}. + * To run this test, please + *Make sure there are 4GB memory available in the system. + * + * Make sure the default allocation manager type is unsafe. + * This can be achieved by the environmental variable or system property. + * The details can be found in {@link DefaultAllocationManagerOption}. 
+ * + */ +public class TestLargeArrowBuf { + + private static void testLargeArrowBuf() { +final long bufSize = 4 * 1024 * 1024 * 1024L; +try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + ArrowBuf largeBuf = allocator.buffer(bufSize)) { + assertEquals(bufSize, largeBuf.capacity()); + System.out.println("Successfully allocated a buffer with capacity " + largeBuf.capacity()); + + for (long i = 0; i < bufSize / 8; i++) { +largeBuf.setLong(i * 8, i); + +if ((i + 1) % 1 == 0) { + System.out.println("Successfully written " + (i + 1) + " long words"); +} + } + System.out.println("Successfully written " + (bufSize / 8) + " long words"); + + for (long i = 0; i < bufSize / 8; i++) { +long val = largeBuf.getLong(i * 8); +assertEquals(i, val); + +if ((i + 1) % 1 == 0) { + System.out.println("Successfully read " + (i + 1) + " long words"); +} + } + System.out.println("Successfully read " + (bufSize / 8) + " long words"); +} +System.out.println("Successfully released the large buffer."); + } + + public static void main(String[] args) { Review comment: Yes, it needs to run manually. I have updated the javadoc accordingly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415202903 ## File path: java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java ## @@ -17,48 +17,97 @@ package org.apache.arrow.memory; -import org.apache.arrow.memory.util.LargeMemoryUtil; - import io.netty.buffer.PooledByteBufAllocatorL; import io.netty.buffer.UnsafeDirectLittleEndian; +import io.netty.util.internal.PlatformDependent; /** - * The default implementation of AllocationManagerBase. The implementation is responsible for managing when memory + * The default implementation of {@link AllocationManager}. The implementation is responsible for managing when memory * is allocated and returned to the Netty-based PooledByteBufAllocatorL. */ public class NettyAllocationManager extends AllocationManager { public static final Factory FACTORY = new Factory(); + /** + * The default cut-off value for switching allocation strategies. + * If the request size is not greater than the cut-off value, we will allocate memory by + * {@link PooledByteBufAllocatorL} APIs, + * otherwise, we will use {@link PlatformDependent} APIs. 
+ */ + public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE; + + public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = "default.allocation.cutoff.name"; + + static { +long cutOffValue; +try { + cutOffValue = Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME)); +} catch (Exception e) { + cutOffValue = Integer.MAX_VALUE; +} +DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue; + } + private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new PooledByteBufAllocatorL(); static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty; static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize(); - private final int allocatedSize; + private final long allocatedSize; private final UnsafeDirectLittleEndian memoryChunk; + private final long allocatedAddress; + + /** + * The cut-off value for switching allocation strategies. + */ + private final long allocationCutOffValue; Review comment: Changed to int. Thanks for the good suggestion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] liyafan82 commented on a change in pull request #6323: ARROW-7610: [Java] Finish support for 64 bit int allocations
liyafan82 commented on a change in pull request #6323: URL: https://github.com/apache/arrow/pull/6323#discussion_r415202953 ## File path: java/memory/src/main/java/org/apache/arrow/memory/NettyAllocationManager.java ## @@ -17,48 +17,97 @@ package org.apache.arrow.memory; -import org.apache.arrow.memory.util.LargeMemoryUtil; - import io.netty.buffer.PooledByteBufAllocatorL; import io.netty.buffer.UnsafeDirectLittleEndian; +import io.netty.util.internal.PlatformDependent; /** - * The default implementation of AllocationManagerBase. The implementation is responsible for managing when memory + * The default implementation of {@link AllocationManager}. The implementation is responsible for managing when memory * is allocated and returned to the Netty-based PooledByteBufAllocatorL. */ public class NettyAllocationManager extends AllocationManager { public static final Factory FACTORY = new Factory(); + /** + * The default cut-off value for switching allocation strategies. + * If the request size is not greater than the cut-off value, we will allocate memory by + * {@link PooledByteBufAllocatorL} APIs, + * otherwise, we will use {@link PlatformDependent} APIs. 
+ */ + public static final long DEFAULT_ALLOCATION_CUTOFF_VALUE; + + public static final String DEFAULT_ALLOCATION_CUTOFF_NAME = "default.allocation.cutoff.name"; + + static { +long cutOffValue; +try { + cutOffValue = Long.parseLong(System.getProperty(DEFAULT_ALLOCATION_CUTOFF_NAME)); +} catch (Exception e) { + cutOffValue = Integer.MAX_VALUE; +} +DEFAULT_ALLOCATION_CUTOFF_VALUE = cutOffValue; + } + private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new PooledByteBufAllocatorL(); static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty; static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize(); - private final int allocatedSize; + private final long allocatedSize; private final UnsafeDirectLittleEndian memoryChunk; + private final long allocatedAddress; + + /** + * The cut-off value for switching allocation strategies. + */ + private final long allocationCutOffValue; - NettyAllocationManager(BaseAllocator accountingAllocator, int requestedSize) { + NettyAllocationManager(BaseAllocator accountingAllocator, long requestedSize, long allocationCutOffValue) { super(accountingAllocator); -this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize); -this.allocatedSize = memoryChunk.capacity(); +if (allocationCutOffValue > Integer.MAX_VALUE) { + throw new IllegalArgumentException("The cut-off value cannot be larger than Integer.MAX_VALUE"); +} +this.allocationCutOffValue = allocationCutOffValue; + +if (requestedSize > allocationCutOffValue) { + this.memoryChunk = null; + this.allocatedAddress = PlatformDependent.allocateMemory(requestedSize); + this.allocatedSize = requestedSize; +} else { + this.memoryChunk = INNER_ALLOCATOR.allocate(requestedSize); + this.allocatedAddress = memoryChunk.memoryAddress(); + this.allocatedSize = memoryChunk.capacity(); +} + } + + NettyAllocationManager(BaseAllocator accountingAllocator, long requestedSize) { +this(accountingAllocator, requestedSize, DEFAULT_ALLOCATION_CUTOFF_VALUE); } /** * Get the underlying 
memory chunk managed by this AllocationManager. - * @return buffer + * @return the underlying memory chunk if the request size is not greater than the + * {@link NettyAllocationManager#allocationCutOffValue}, or null otherwise. + * + * @deprecated this method will be removed in a future release. */ + @Deprecated UnsafeDirectLittleEndian getMemoryChunk() { -return memoryChunk; +return allocatedSize > allocationCutOffValue ? null : memoryChunk; } @Override protected long memoryAddress() { -return memoryChunk.memoryAddress(); +return allocatedAddress; } @Override protected void release0() { -memoryChunk.release(); +if (allocatedSize > allocationCutOffValue) { Review comment: Revised accordingly. Thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
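The cut-off strategy discussed in this thread, reading a size threshold from a named property with a fallback when it is absent or malformed, then routing small requests to the pooled allocator and large ones to raw allocation, can be sketched as follows. This is a Python illustration of the Java logic only; the function names and the env-mapping configuration source are illustrative, not Arrow's API (the property name and the `Integer.MAX_VALUE` fallback are taken from the diff above):

```python
import os

# Property name mirroring DEFAULT_ALLOCATION_CUTOFF_NAME in the PR.
CUTOFF_PROPERTY = "default.allocation.cutoff.name"
DEFAULT_CUTOFF = 2**31 - 1  # Integer.MAX_VALUE, the fallback used by the Java code


def read_cutoff(env=os.environ):
    """Parse the cut-off from configuration, falling back on any parse error."""
    try:
        return int(env[CUTOFF_PROPERTY])
    except (KeyError, ValueError):
        return DEFAULT_CUTOFF


def choose_strategy(requested_size, cutoff):
    """Requests up to the cut-off use the pooled allocator; larger ones use raw allocation."""
    return "pooled" if requested_size <= cutoff else "unsafe"
```

With the default cut-off, any request that fits in a signed 32-bit size stays on the pooled (Netty) path, so existing small allocations keep their current behavior while >2GB requests become possible on the raw path.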
[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-619463693

I found the cause of the test failure: if the `batch_size` isn't aligned with the `chunk_size`, categorical columns will fail with the error:

```
pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays
```

I think this means categorical/DictionaryArray columns aren't supported by this method for now, unless you are able to align the `batch_size` with the `chunk_size`. Is it possible or even common that `chunk_size` might be variable within a file?

(The reason we were seeing the error in Python 3.5 and not in later Python versions is that I was selecting a subset of columns using indices, and the ordering of columns changed between Python versions, I think because of the change in dictionary ordering in 3.6+. I've instead moved to have the offending test run on all columns.)
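The alignment issue described above can be illustrated with a small sketch: slicing a chunked column into fixed-size batches only yields single-chunk batches when `batch_size` divides `chunk_size`; otherwise some batch straddles a chunk boundary, which is the case the reader cannot yet handle. The helper below is illustrative, not pyarrow's API:

```python
def batch_spans(num_rows, batch_size, chunk_size):
    """Yield (start, end, crosses) for each batch, where `crosses` is True
    when the span straddles a chunk boundary (i.e. the batch would need a
    multi-chunk array)."""
    for start in range(0, num_rows, batch_size):
        end = min(start + batch_size, num_rows)
        # A span stays inside one chunk iff start and end-1 land in the same chunk.
        crosses = (start // chunk_size) != ((end - 1) // chunk_size)
        yield start, end, crosses


# batch_size=500 divides chunk_size=1000: no batch crosses a chunk boundary.
aligned = [c for _, _, c in batch_spans(2000, 500, 1000)]
# batch_size=600 does not: some batches straddle two chunks.
unaligned = [c for _, _, c in batch_spans(2000, 600, 1000)]
```

Under this model, a variable `chunk_size` within a file would make it impossible to pick a `batch_size` that is guaranteed to stay aligned, which is why the question above matters.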
[GitHub] [arrow] github-actions[bot] commented on pull request #7041: ARROW-8584: [C++] Fix ORC link order
github-actions[bot] commented on pull request #7041: URL: https://github.com/apache/arrow/pull/7041#issuecomment-619457325 https://issues.apache.org/jira/browse/ARROW-8584
[GitHub] [arrow] github-actions[bot] commented on pull request #7041: ARROW-8584: [C++] Fix ORC link order
github-actions[bot] commented on pull request #7041: URL: https://github.com/apache/arrow/pull/7041#issuecomment-619456125

Revision: 3e57660bbcb5002a8c53146754146fc7c92b1ead

Submitted crossbow builds: [ursa-labs/crossbow @ actions-166](https://github.com/ursa-labs/crossbow/branches/all?query=actions-166)

|Task|Status|
|----|------|
|centos-6-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-centos-6-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-centos-6-amd64)|
|centos-7-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-centos-7-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-centos-7-amd64)|
|centos-8-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-centos-8-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-centos-8-amd64)|
|debian-buster-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-debian-buster-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-debian-buster-amd64)|
|debian-stretch-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-debian-stretch-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-debian-stretch-amd64)|
|ubuntu-bionic-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-bionic-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-bionic-amd64)|
|ubuntu-eoan-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-eoan-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-eoan-amd64)|
|ubuntu-focal-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-focal-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-focal-amd64)|
|ubuntu-xenial-amd64|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-166-github-ubuntu-xenial-amd64)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-166-github-ubuntu-xenial-amd64)|
[GitHub] [arrow] kou commented on pull request #7041: ARROW-8584: [C++] Fix ORC link order
kou commented on pull request #7041: URL: https://github.com/apache/arrow/pull/7041#issuecomment-619456003 @github-actions crossbow submit -g linux
[GitHub] [arrow] mayuropensource commented on pull request #7022: ARROW-8562: [C++] IO: Parameterize I/O Coalescing using S3 metrics
mayuropensource commented on pull request #7022: URL: https://github.com/apache/arrow/pull/7022#issuecomment-619456016

A better bandwidth calculation (removing TTFB from the total time) can be done with the following script:

```shell
curl --negotiate -u: -o /dev/null -w "total_time_sec=%{time_total} data_size_bytes=%{size_download} TTFB_sec=%{time_starttransfer}\n" $S3_DATA_URI | awk -F' ' '{ split($1, tt, "="); split($2, ds, "="); split($3, ttfb, "="); bw=ds[2] / (1024 * 1024. * (tt[2] - ttfb[2])); print("TTFB_millis =", ttfb[2] * 1000.); print("Bandwidth_MiB_per_sec =", bw); }'
```
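The arithmetic in the awk one-liner, subtracting time-to-first-byte from the total time and dividing the byte count by the remaining transfer time, can be restated as a small function. This is a sketch of the calculation only; curl still supplies the measurements:

```python
def bandwidth_mib_per_sec(data_size_bytes, total_time_sec, ttfb_sec):
    """Effective bandwidth once time-to-first-byte is excluded."""
    transfer_time = total_time_sec - ttfb_sec
    if transfer_time <= 0:
        raise ValueError("total time must exceed TTFB")
    return data_size_bytes / (1024 * 1024 * transfer_time)


# Example: 100 MiB transferred in 4 s total with a 2 s TTFB
# gives 50 MiB/s of effective transfer bandwidth.
rate = bandwidth_mib_per_sec(100 * 1024 * 1024, 4.0, 2.0)
```

Excluding TTFB matters for tuning I/O coalescing because the cut-off between "issue one large read" and "issue several coalesced reads" depends on per-request latency and steady-state bandwidth separately.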
[GitHub] [arrow] kou opened a new pull request #7041: ARROW-8584: [C++] Fix ORC link order
kou opened a new pull request #7041: URL: https://github.com/apache/arrow/pull/7041
[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-619437378

Two failing checks right now. For the linting one, it seems to be alarmed by some Rust code that I didn't touch. Am I missing something in that output? For the Python tests, getting [this error](https://github.com/apache/arrow/pull/6979/checks?check_run_id=618464479#step:7:2128):

```
pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays
```

When I run the same test locally, it succeeds.
[GitHub] [arrow] wjones1 commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
wjones1 commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r415130068 ## File path: python/pyarrow/tests/test_parquet.py ## @@ -179,6 +179,99 @@ def alltypes_sample(size=1, seed=0, categorical=False): @pytest.mark.pandas +def test_iter_batches_columns_reader(tempdir): +df = alltypes_sample(size=1, categorical=True) + +filename = tempdir / 'pandas_roundtrip.parquet' +arrow_table = pa.Table.from_pandas(df) +_write_table(arrow_table, filename, version="2.0", + coerce_timestamps='ms', chunk_size=1000) + +columns = df.columns[4:15] + +file_ = pq.ParquetFile(filename) + +batches = file_.iter_batches( +batch_size=500, +columns=columns +) + +tm.assert_frame_equal( +next(batches).to_pandas(), +df.iloc[:500, :].loc[:, columns] +) + + +@pytest.mark.pandas +@pytest.mark.parametrize('chunk_size', [1000]) +def test_iter_batches_reader(tempdir, chunk_size): Review comment: Strangely, after I merged the latest changes from master, I am no longer seeing this issue with dictionary arrays. I definitely saw it in the original fork, so I think it may have actually been fixed (though not sure where). I've removed the dictionary array correction in the test and hopefully the CI should confirm what I am seeing. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7040: ARROW-8505: [Release][C#] "sourcelink test" is failed by Apache.ArrowAssemblyInfo.cs
github-actions[bot] commented on pull request #7040: URL: https://github.com/apache/arrow/pull/7040#issuecomment-619408251 https://issues.apache.org/jira/browse/ARROW-8505
[GitHub] [arrow] eerhardt opened a new pull request #7040: ARROW-8505: [Release][C#] "sourcelink test" is failed by Apache.ArrowAssemblyInfo.cs
eerhardt opened a new pull request #7040: URL: https://github.com/apache/arrow/pull/7040 Works around https://github.com/dotnet/sourcelink/issues/572 by explicitly embedding the AssemblyAttributes file into the PDB.
[GitHub] [arrow] eerhardt commented on pull request #6121: ARROW-6603: [C#] - Nullable Array Support
eerhardt commented on pull request #6121: URL: https://github.com/apache/arrow/pull/6121#issuecomment-619396952 Thank you for this contribution, @abbotware. However, my opinion is that #7032 is more in line with how null support should be designed for the builder APIs. It also more closely models how the other languages do it. I think we should move forward with #7032's design and implementation to resolve ARROW-6603.
[GitHub] [arrow] eerhardt commented on pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
eerhardt commented on pull request #7032: URL: https://github.com/apache/arrow/pull/7032#issuecomment-619396229

> ARROW-5634 by properly setting the readonly value for NullCount, which previously was hardcoded to -1.

I don't believe this change addresses the issue correctly. Can we remove that one from this PR? We can submit a separate PR for ARROW-5634, since it would involve more work and I think this PR should be targeted to just the ArrayBuilder API to support appending `null`.
[GitHub] [arrow] eerhardt commented on a change in pull request #7032: ARROW-6603, ARROW-5708, ARROW-5634: [C#] Adds ArrayBuilder API to support writing null values + BooleanArray null support
eerhardt commented on a change in pull request #7032: URL: https://github.com/apache/arrow/pull/7032#discussion_r415072524 ## File path: csharp/src/Apache.Arrow/Apache.Arrow.csproj ## @@ -4,7 +4,7 @@ netstandard1.3;netcoreapp2.1 true $(DefineConstants);UNSAFE_BYTEBUFFER;BYTEBUFFER_NO_BOUNDS_CHECK;ENABLE_SPAN_T - + Review comment: (nit) Can you revert this change? That way it doesn't pollute version history. ## File path: csharp/Directory.Build.props ## @@ -21,6 +21,12 @@ $(CSharpDir)ApacheArrow.snk + + Review comment: I don't see where this is necessary. I've removed it and it still builds and tests pass on my machine. Can this be removed? In general, in the http://github.com/dotnet/runtime tests, we try to avoid using `InternalsVisibleTo` for testing, and instead write the tests in the same way a consumer of the library would use it. ## File path: csharp/src/Apache.Arrow/Arrays/ArrayData.cs ## @@ -84,7 +84,7 @@ public ArrayData Slice(int offset, int length) length = Math.Min(Length - offset, length); offset += Offset; -return new ArrayData(DataType, length, -1, offset, Buffers, Children); +return new ArrayData(DataType, length, NullCount, offset, Buffers, Children); Review comment: This isn't correct because we are slicing a subset of the `ArrayData`. If the total `ArrayData` has values: `0`, `null`, `2`, `3`, `4` `NullCount` for the outer `ArrayData` will be `1`. But if we are slicing from `offset = 2` and `length = 3` and creating a new `ArrayData` from the above, the new `ArrayData` won't have any nulls in it. But with this change, it will have `NullCount = 1` from its parent. `NullCount = -1` in the original code means "we don't know yet, it needs to be computed". That was done so we don't have to count the nulls in the sliced values until it is necessary. Since there are cases where it may not be required. 
## File path: csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj ## @@ -4,6 +4,8 @@ netcoreapp2.1 true +true Review comment: If we can remove the `InternalsVisibleTo` attribute, we can revert this change as well. ## File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs ## @@ -99,43 +105,63 @@ public abstract class PrimitiveArrayBuilder : IArrowArrayBu { protected TBuilder Instance => this as TBuilder; protected ArrowBuffer.Builder ValueBuffer { get; } +protected BooleanArray.Builder ValidityBuffer { get; } public int Length => ValueBuffer.Length; +protected int NullCount { get; set; } + // TODO: Implement support for null values (null bitmaps) internal PrimitiveArrayBuilder() { ValueBuffer = new ArrowBuffer.Builder(); +ValidityBuffer = new BooleanArray.Builder(); } public TBuilder Resize(int length) { ValueBuffer.Resize(length); +ValidityBuffer.Resize(length + 1); return Instance; } public TBuilder Reserve(int capacity) { ValueBuffer.Reserve(capacity); +ValidityBuffer.Reserve(capacity + 1); return Instance; } public TBuilder Append(T value) { ValueBuffer.Append(value); +ValidityBuffer.Append(true); return Instance; } public TBuilder Append(ReadOnlySpan span) { ValueBuffer.Append(span); +ValidityBuffer.Append(span != Span.Empty); Review comment: This needs to be similar to `AppendRange` below and append a `true` for each element in the `span`. ## File path: csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs ## @@ -162,8 +188,8 @@ public TBuilder Swap(int i, int j) public TArray Build(MemoryAllocator allocator = default) { return Build( -ValueBuffer.Build(allocator), ArrowBuffer.Empty, -ValueBuffer.Length, 0, 0); +ValueBuffer.Build(allocator), ValidityBuffer.Build(allocator).ValueBuffer, Review comment: How do you feel about making the `ValidityBuffer` lazy? That is, if no `null` values were ever appended to the builder, then the `ValidityBuffer` stays empty. 
And only upon the first `null` value being appended do we allocate a `ValidityBuffer` and append all `true`s for all the values already appended before the first `null`. The advantage is that when a buffer doesn't contain any nulls, we don't need to allocate memory here, and it will be smaller when transferring the Arrow bytes across a network. ## File path: csharp/test/Apache.Arrow.Tests/ArrayBuilderTests.cs ## @@ -81,7 +110,9 @@ L
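The slicing point raised in the `ArrayData.Slice` review above can be sketched as follows: a slice must either recount nulls inside its own window or mark the count as unknown for lazy computation, rather than inherit its parent's count. This is a Python illustration of the C# semantics, not the actual API; `-1` is the "not computed yet" sentinel described in the review:

```python
UNKNOWN = -1  # sentinel meaning "null count not computed yet"


class ArrayData:
    def __init__(self, values, null_count=UNKNOWN):
        self.values = values          # None marks a null slot
        self._null_count = null_count

    @property
    def null_count(self):
        # Lazily count nulls the first time the value is requested.
        if self._null_count == UNKNOWN:
            self._null_count = sum(v is None for v in self.values)
        return self._null_count

    def slice(self, offset, length):
        # Propagating self's count here would be wrong: the slice may contain
        # a different number of nulls than its parent, so mark it UNKNOWN.
        return ArrayData(self.values[offset:offset + length], UNKNOWN)


parent = ArrayData([0, None, 2, 3, 4])
child = parent.slice(2, 3)   # [2, 3, 4]: no nulls, despite parent having one
```

This matches the reviewer's example: the parent has one null, but a slice from offset 2 of length 3 contains none, so copying the parent's count into the slice would report a spurious null.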
[GitHub] [arrow] github-actions[bot] commented on pull request #7039: ARROW-8513: [Python] Expose Take with Table input in Python
github-actions[bot] commented on pull request #7039: URL: https://github.com/apache/arrow/pull/7039#issuecomment-619367004 https://issues.apache.org/jira/browse/ARROW-8513
[GitHub] [arrow] gramirezespinoza opened a new pull request #7039: ARROW-8513: [Python] Expose Take with Table input in Python
gramirezespinoza opened a new pull request #7039: URL: https://github.com/apache/arrow/pull/7039
[GitHub] [arrow] rollokb commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader
rollokb commented on a change in pull request #6979: URL: https://github.com/apache/arrow/pull/6979#discussion_r415027375 ## File path: python/pyarrow/tests/test_parquet.py ## @@ -179,6 +179,99 @@ def alltypes_sample(size=1, seed=0, categorical=False): @pytest.mark.pandas +def test_iter_batches_columns_reader(tempdir): +df = alltypes_sample(size=1, categorical=True) + +filename = tempdir / 'pandas_roundtrip.parquet' +arrow_table = pa.Table.from_pandas(df) +_write_table(arrow_table, filename, version="2.0", + coerce_timestamps='ms', chunk_size=1000) + +columns = df.columns[4:15] + +file_ = pq.ParquetFile(filename) + +batches = file_.iter_batches( +batch_size=500, +columns=columns +) + +tm.assert_frame_equal( +next(batches).to_pandas(), +df.iloc[:500, :].loc[:, columns] +) + + +@pytest.mark.pandas +@pytest.mark.parametrize('chunk_size', [1000]) +def test_iter_batches_reader(tempdir, chunk_size): Review comment: @wjones1 thank you very much for taking over my PR. Unfortunately one of the reasons I abandoned this (and just built a wheel for my own company's project), was not knowing what to do about the problem this test highlights. The problem with dictionary arrays is far beyond my skill. I was wondering what someone with more architectural vision thinks of this issue. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7038: ARROW-8593: [C++][Parquet] Fix build with musl libc
github-actions[bot] commented on pull request #7038: URL: https://github.com/apache/arrow/pull/7038#issuecomment-619346808 https://issues.apache.org/jira/browse/ARROW-8593
[GitHub] [arrow] tobim opened a new pull request #7038: ARROW-8593: [C++][Parquet] Fix build with musl libc
tobim opened a new pull request #7038: URL: https://github.com/apache/arrow/pull/7038 Converts local constants in `file_serialize_test.cc` to snake_case. Fixes a conflict with the `PAGE_SIZE` macro declared in the `limits.h` header that is shipped with musl libc.