[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-31 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r480201923



##
File path: cpp/src/arrow/array/builder_dict.h
##
@@ -93,6 +93,10 @@ class ARROW_EXPORT DictionaryMemoTable {
   std::unique_ptr impl_;
 };
 
+/// \brief Check array's value type by DCHECK
+ARROW_EXPORT void CheckArrayType(const std::shared_ptr& 
expected_type,
+ const Array& array, const std::string& 
message);

Review comment:
   I'll make this a protected method of `ArrayBuilder`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-31 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r480195899



##
File path: cpp/src/arrow/array/array_test.cc
##
@@ -2306,6 +2319,19 @@ TEST_F(TestAdaptiveUIntBuilder, TestAppendNulls) {
   }
 }
 
+TEST(TestAdaptiveUIntBuilderWithStartIntSize, TestReset) {
+  auto builder = std::make_shared(
+  static_cast(sizeof(uint16_t)), default_memory_pool());
+  ASSERT_TRUE(uint16()->Equals(*builder->type()));

Review comment:
   I'm sorry for missing this frequently.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-27 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r478785713



##
File path: cpp/src/arrow/array/array_dict_test.cc
##
@@ -904,6 +904,67 @@ TEST(TestDecimalDictionaryBuilder, DoubleTableSize) {
   ASSERT_TRUE(expected.Equals(result));
 }
 
+TEST(TestNullDictionaryBuilder, Basic) {
+  // MakeBuilder
+  auto dict_type = dictionary(int8(), null());
+  std::unique_ptr boxed_builder;
+  ASSERT_OK(MakeBuilder(default_memory_pool(), dict_type, _builder));
+  auto& builder = checked_cast&>(*boxed_builder);
+
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_EQ(3, builder.length());
+  ASSERT_EQ(3, builder.null_count());
+
+  ASSERT_OK(builder.AppendNulls(4));
+  ASSERT_EQ(7, builder.length());
+  ASSERT_EQ(7, builder.null_count());
+
+  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0, null]");
+  ASSERT_OK(builder.AppendArray(*int_array));

Review comment:
   @pitrou I see, will do it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-26 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r478143508



##
File path: cpp/src/arrow/array/array_dict_test.cc
##
@@ -904,6 +904,67 @@ TEST(TestDecimalDictionaryBuilder, DoubleTableSize) {
   ASSERT_TRUE(expected.Equals(result));
 }
 
+TEST(TestNullDictionaryBuilder, Basic) {
+  // MakeBuilder
+  auto dict_type = dictionary(int8(), null());
+  std::unique_ptr boxed_builder;
+  ASSERT_OK(MakeBuilder(default_memory_pool(), dict_type, _builder));
+  auto& builder = checked_cast&>(*boxed_builder);
+
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_EQ(3, builder.length());
+  ASSERT_EQ(3, builder.null_count());
+
+  ASSERT_OK(builder.AppendNulls(4));
+  ASSERT_EQ(7, builder.length());
+  ASSERT_EQ(7, builder.null_count());
+
+  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0, null]");
+  ASSERT_OK(builder.AppendArray(*int_array));

Review comment:
   @pitrou I tried to directly insert `DCHECK` in `AppendArray`, but I 
found it is difficult because `AppendArray` is defined in `builder_dict.h`. 
That is a public header file` so we cannot use `DCHECK` there.
   
   Instead of directly inserting `DCHECK`, the candidate patch is below.
   
   ```diff
   diff --git a/cpp/src/arrow/array/builder_dict.cc 
b/cpp/src/arrow/array/builder_dict.cc
   index 54fd94856..dc239f268 100644
   --- a/cpp/src/arrow/array/builder_dict.cc
   +++ b/cpp/src/arrow/array/builder_dict.cc
   @@ -189,5 +189,11 @@ Status DictionaryMemoTable::InsertValues(const Array& 
array) {
   
int32_t DictionaryMemoTable::size() const { return impl_->size(); }
   
   +#ifndef NDEBUG
   +void CheckArrayType(const Type::type type_id, const Array& array, const 
std::string& message) {
   +  DCHECK_EQ(type_id, array.type_id()) << message;
   +}
   +#endif
   +
}  // namespace internal
}  // namespace arrow
   diff --git a/cpp/src/arrow/array/builder_dict.h 
b/cpp/src/arrow/array/builder_dict.h
   index f8c77a5b0..a0d02fd1b 100644
   --- a/cpp/src/arrow/array/builder_dict.h
   +++ b/cpp/src/arrow/array/builder_dict.h
   @@ -93,6 +93,10 @@ class ARROW_EXPORT DictionaryMemoTable {
  std::unique_ptr impl_;
};
   
   +#ifndef NDEBUG
   +void CheckArrayType(const Type::type type_id, const Array& array, const 
std::string& message);
   +#endif
   +
/// \brief Array builder for created encoded DictionaryArray from
/// dense array
///
   @@ -248,6 +252,10 @@ class DictionaryBuilderBase : public ArrayBuilder {
  const Array& array) {
using ArrayType = typename TypeTraits::ArrayType;
   
   +#ifndef NDEBUG
   +CheckArrayType(T::type_id, array, "Wrong value type of array to be 
appended");
   +#endif
   +
const auto& concrete_array = static_cast(array);
for (int64_t i = 0; i < array.length(); i++) {
  if (array.IsNull(i)) {
   @@ -406,6 +414,9 @@ class DictionaryBuilderBase : 
public ArrayBuilder {
   
  /// \brief Append a whole dense array to the builder
  Status AppendArray(const Array& array) {
   +#ifndef NDEBUG
   +CheckArrayType(NullType::type_id, array, "Wrong value type of array to 
be appended");
   +#endif
for (int64_t i = 0; i < array.length(); i++) {
  ARROW_RETURN_NOT_OK(AppendNull());
}
};
   
}  // namespace internal
   ```
   
   I don't think this approach is good.  Can I follow this way?
   
   By the way, I found that `AppendArray` for `DictionaryBuilderBase` of 
`FixedSizedBinaryType` checks the array type at the beginning of the function 
and return `Status::Invalid` when mismatching type types.  I guess if this kind 
of check is accepted for `FixedSizedBinaryType`, the same kind of check can be 
put in the other `AppendArray` functions.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-20 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r474160038



##
File path: cpp/src/arrow/array/array_dict_test.cc
##
@@ -904,6 +904,67 @@ TEST(TestDecimalDictionaryBuilder, DoubleTableSize) {
   ASSERT_TRUE(expected.Equals(result));
 }
 
+TEST(TestNullDictionaryBuilder, Basic) {
+  // MakeBuilder
+  auto dict_type = dictionary(int8(), null());
+  std::unique_ptr boxed_builder;
+  ASSERT_OK(MakeBuilder(default_memory_pool(), dict_type, _builder));
+  auto& builder = checked_cast&>(*boxed_builder);
+
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_EQ(3, builder.length());
+  ASSERT_EQ(3, builder.null_count());
+
+  ASSERT_OK(builder.AppendNulls(4));
+  ASSERT_EQ(7, builder.length());
+  ASSERT_EQ(7, builder.null_count());
+
+  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0, null]");
+  ASSERT_OK(builder.AppendArray(*int_array));

Review comment:
   @pitrou I replaced the `int_array` by `null_array`.
   How do you think inserting `DCHECK` in `AppendArray` to check the array type 
only in the debug build?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-14 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r470633675



##
File path: cpp/src/arrow/array/array_dict_test.cc
##
@@ -904,6 +904,67 @@ TEST(TestDecimalDictionaryBuilder, DoubleTableSize) {
   ASSERT_TRUE(expected.Equals(result));
 }
 
+TEST(TestNullDictionaryBuilder, Basic) {
+  // MakeBuilder
+  auto dict_type = dictionary(int8(), null());
+  std::unique_ptr boxed_builder;
+  ASSERT_OK(MakeBuilder(default_memory_pool(), dict_type, _builder));
+  auto& builder = checked_cast&>(*boxed_builder);
+
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_EQ(3, builder.length());
+  ASSERT_EQ(3, builder.null_count());
+
+  ASSERT_OK(builder.AppendNulls(4));
+  ASSERT_EQ(7, builder.length());
+  ASSERT_EQ(7, builder.null_count());
+
+  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0, null]");
+  ASSERT_OK(builder.AppendArray(*int_array));

Review comment:
   @pitrou How about this?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-13 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r469954523



##
File path: cpp/src/arrow/array/array_dict_test.cc
##
@@ -904,6 +904,67 @@ TEST(TestDecimalDictionaryBuilder, DoubleTableSize) {
   ASSERT_TRUE(expected.Equals(result));
 }
 
+TEST(TestNullDictionaryBuilder, Basic) {
+  // MakeBuilder
+  auto dict_type = dictionary(int8(), null());
+  std::unique_ptr boxed_builder;
+  ASSERT_OK(MakeBuilder(default_memory_pool(), dict_type, _builder));
+  auto& builder = checked_cast&>(*boxed_builder);
+
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_OK(builder.AppendNull());
+  ASSERT_EQ(3, builder.length());
+  ASSERT_EQ(3, builder.null_count());
+
+  ASSERT_OK(builder.AppendNulls(4));
+  ASSERT_EQ(7, builder.length());
+  ASSERT_EQ(7, builder.null_count());
+
+  auto int_array = ArrayFromJSON(int8(), "[0, 1, 0, null]");
+  ASSERT_OK(builder.AppendArray(*int_array));

Review comment:
   I think this is practically meaningless.  It is the same as 
`AppendNulls` as its implementation.  But, as  `AppendArray` function in 
`DictionaryBuilder` of `NullType` actually accepts an Array of any type, I 
believe we should test the behavior.
   
   We probably need to check the value type of the given `Array` in 
`AppendArray`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-13 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r469945583



##
File path: cpp/src/arrow/array/builder_dict.h
##
@@ -105,18 +105,46 @@ class DictionaryBuilderBase : public ArrayBuilder {
 
   // WARNING: the type given below is the value type, not the DictionaryType.
   // The DictionaryType is instantiated on the Finish() call.
-  template 
-  DictionaryBuilderBase(enable_if_t::value,
+  template 
+  DictionaryBuilderBase(uint8_t start_int_size,
+enable_if_t::value &&
+!is_fixed_size_binary_type::value,
 const std::shared_ptr&>
 value_type,
 MemoryPool* pool = default_memory_pool())
+  : ArrayBuilder(pool),
+memo_table_(new internal::DictionaryMemoTable(pool, value_type)),
+delta_offset_(0),
+byte_width_(-1),
+indices_builder_(start_int_size, pool),
+value_type_(value_type) {}
+
+  template 
+  DictionaryBuilderBase(

Review comment:
   This constructor is for the cases that BuilderType is not 
AdaptiveIntBuilder.  The constructor for the case of AdaptiveIntBuilder cannot 
be reused by this constructor because the constructor of a non-adaptive 
IntBuilder (i.e. `indices_builder_`) doesn't take the `start_int_size` argument.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-12 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r469298987



##
File path: cpp/src/arrow/array/builder_adaptive.h
##
@@ -122,9 +122,10 @@ class ARROW_EXPORT AdaptiveUIntBuilder : public 
internal::AdaptiveIntBuilderBase
 
   std::shared_ptr type() const override;
 
+  Status ExpandIntSize(uint8_t new_int_size);

Review comment:
   This function needn't be public anymore.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-12 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r469298490



##
File path: cpp/src/arrow/array/builder_dict.h
##
@@ -409,6 +409,10 @@ class DictionaryBuilder : public 
internal::DictionaryBuilderBase;
   using BASE::BASE;
 
+  Status ExpandIndexByteWidth(uint8_t new_index_byte_width) {
+return BASE::indices_builder_.ExpandIntSize(new_index_byte_width);

Review comment:
   @pitrou I've done to do this.  Could you please review this again?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-11 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r468657312



##
File path: cpp/src/arrow/array/builder_dict.h
##
@@ -409,6 +409,10 @@ class DictionaryBuilder : public 
internal::DictionaryBuilderBase;
   using BASE::BASE;
 
+  Status ExpandIndexByteWidth(uint8_t new_index_byte_width) {
+return BASE::indices_builder_.ExpandIntSize(new_index_byte_width);

Review comment:
   I see. I try to rewrite this way.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] mrkn commented on a change in pull request #7898: ARROW-9642: [C++] Let MakeBuilder refer DictionaryType's index_type for deciding the starting bit width of the indices

2020-08-11 Thread GitBox


mrkn commented on a change in pull request #7898:
URL: https://github.com/apache/arrow/pull/7898#discussion_r468656539



##
File path: cpp/src/arrow/array/array_dict_test.cc
##
@@ -22,6 +22,8 @@
 #include 
 #include 
 
+#include 

Review comment:
   This should be removed before committing but I forgot.
   Thanks for catching this!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org