This is an automated email from the ASF dual-hosted git repository.
kou pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 6ee7f7ecdc GH-48481: [Ruby] Correctly infer types for nested integer arrays (#48699)
6ee7f7ecdc is described below
commit 6ee7f7ecdc4b79b7882cdc8f47828964e3ba6a6f
Author: hypsakata <[email protected]>
AuthorDate: Sat Jan 3 09:58:34 2026 +0900
GH-48481: [Ruby] Correctly infer types for nested integer arrays (#48699)
### Rationale for this change
When an `Arrow::Table` is built from a Ruby Hash passed to
`Arrow::Table.new`, nested Integer arrays are always inferred as
`list<uint8>` or `list<int8>`, regardless of the values they contain. They
should instead be inferred as the narrowest list type that fits their values
(e.g., `list<int64>`, `list<uint64>`), as flat arrays already are. If a value
is out of range for every integer type, the array falls back to
`list<string>`.
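The intended inference rule can be illustrated with a minimal pure-Ruby sketch. It does not use red-arrow; `INT_RANGES`, `UINT_RANGES`, and `infer_item_type` are hypothetical names introduced here for illustration only, and the ranges are written out by hand rather than taken from `GLib`.

```ruby
# Hypothetical stand-in for the range checks; not part of red-arrow.
INT_RANGES = {
  int8:  -(2**7)..(2**7 - 1),
  int16: -(2**15)..(2**15 - 1),
  int32: -(2**31)..(2**31 - 1),
  int64: -(2**63)..(2**63 - 1),
}.freeze

UINT_RANGES = {
  uint8:  0..(2**8 - 1),
  uint16: 0..(2**16 - 1),
  uint32: 0..(2**32 - 1),
  uint64: 0..(2**64 - 1),
}.freeze

# Returns the narrowest item type (as a Symbol) that can hold every
# Integer in the nested array, or :string when no integer type fits.
def infer_item_type(nested)
  flat = nested.flatten
  return nil if flat.empty?

  min, max = flat.minmax
  found =
    if min < 0
      # Any negative value forces a signed type covering both bounds.
      INT_RANGES.find { |_, range| range.cover?(min) && range.cover?(max) }
    else
      UINT_RANGES.find { |_, range| range.cover?(max) }
    end
  found ? found.first : :string
end

infer_item_type([[0, 1, 2], [3, 4]])   # => :uint8
infer_item_type([[0, -129], [127]])    # => :int16
infer_item_type([[0, 2**63], [-1]])    # => :string
```

The sketch scans the whole nested array at once, whereas the actual change tracks the minimum and maximum incrementally while detecting builder information.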
### What changes are included in this PR?
This PR modifies the logic in `detect_builder_info()` to fix the inference
issue. Specifically:
- **Persist `sub_builder_info` across sub-array elements**: Previously,
`sub_builder_info` was recreated for each sub-array element in the Array. The
logic has been updated to accumulate and carry over the builder information
across elements to ensure correct type inference for the entire list.
- **Refactor Integer builder logic**: Following the pattern used for
`BigDecimal`, the logic for determining the Integer builder has been moved to
`create_builder()`. `detect_builder_info()` now calls this function.
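As a rough illustration of why the information has to persist across sub-arrays, here is a small pure-Ruby sketch modeled on the Integer branch of the diff. The helper names `accumulate_integer_info` and `scan_nested` are hypothetical and are not part of red-arrow.

```ruby
# Hypothetical helper modeled on the Integer branch of
# detect_builder_info(): widen min/max and remember signedness.
def accumulate_integer_info(info, value)
  info ||= {}
  min = [info[:min] || value, value].min
  max = [info[:max] || value, value].max
  # Once any negative value has been seen, the type stays signed.
  type = (info[:builder_type] == :int || value < 0) ? :int : :uint
  {builder_type: type, min: min, max: max}
end

# Carries one info Hash across every sub-array instead of resetting
# it to nil for each sub-array.
def scan_nested(rows)
  info = nil
  rows.each do |row|
    row.each { |value| info = accumulate_integer_info(info, value) }
  end
  info
end

scan_nested([[0, 300], [-1]])
# => {builder_type: :int, min: -1, max: 300}
```

If `info` were reset for each sub-array, the second sub-array would see only `-1`, and the `300` from the first sub-array would no longer influence the chosen width.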
**Note:**
- As a side effect of this refactoring, nested lists of `BigDecimal` (which
were previously inferred as `string`) may now have their types inferred.
However, comprehensive testing and verification for nested `BigDecimal` support
will be addressed in a separate issue to keep this PR focused.
- We stopped using `IntArrayBuilder` for inference logic to ensure
correctness. This results in a performance overhead (array building is
approximately 2x slower) as we can no longer rely on the specialized builder's
detection.
```text
                                     user     system      total        real
array_builder int32 100000       0.085867   0.000194   0.086061 (  0.086369)
int_array_builder int32 100000   0.042163   0.001033   0.043196 (  0.043268)
array_builder int64 100000       0.086799   0.000015   0.086814 (  0.086828)
int_array_builder int64 100000   0.044493   0.000973   0.045466 (  0.045469)
array_builder uint32 100000      0.085748   0.000009   0.085757 (  0.085768)
int_array_builder uint32 100000  0.044463   0.001034   0.045497 (  0.045498)
array_builder uint64 100000      0.084548   0.000987   0.085535 (  0.085537)
int_array_builder uint64 100000  0.044206   0.000017   0.044223 (  0.044225)
```
### Are these changes tested?
Yes. `ruby ruby/red-arrow/test/run-test.rb`
### Are there any user-facing changes?
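Yes. Nested integer arrays may now be inferred as a wider list type (e.g.,
`list<int64>`) instead of always `list<uint8>` or `list<int8>`.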
* GitHub Issue: #48481
Authored-by: hypsakata <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
---
ruby/red-arrow/lib/arrow/array-builder.rb | 53 +++-
ruby/red-arrow/test/test-array-builder.rb | 434 +++++++++++++++++++++++++++---
2 files changed, 443 insertions(+), 44 deletions(-)
diff --git a/ruby/red-arrow/lib/arrow/array-builder.rb b/ruby/red-arrow/lib/arrow/array-builder.rb
index 2ccf50f3c1..5bb1ee7456 100644
--- a/ruby/red-arrow/lib/arrow/array-builder.rb
+++ b/ruby/red-arrow/lib/arrow/array-builder.rb
@@ -74,14 +74,23 @@ module Arrow
detected: true,
}
when Integer
- if value < 0
+ builder_info ||= {}
+ min = builder_info[:min] || value
+ max = builder_info[:max] || value
+ min = value if value < min
+ max = value if value > max
+
+ if builder_info[:builder_type] == :int || value < 0
{
- builder: IntArrayBuilder.new,
- detected: true,
+ builder_type: :int,
+ min: min,
+ max: max,
}
else
{
- builder: UIntArrayBuilder.new,
+ builder_type: :uint,
+ min: min,
+ max: max,
}
end
when Time
@@ -150,18 +159,19 @@ module Arrow
end
end
when ::Array
- sub_builder_info = nil
+ sub_builder_info = builder_info && builder_info[:value_builder_info]
value.each do |sub_value|
sub_builder_info = detect_builder_info(sub_value, sub_builder_info)
break if sub_builder_info and sub_builder_info[:detected]
end
if sub_builder_info
- sub_builder = sub_builder_info[:builder]
- return builder_info unless sub_builder
+ sub_builder = sub_builder_info[:builder] || create_builder(sub_builder_info)
+ return sub_builder_info unless sub_builder
sub_value_data_type = sub_builder.value_data_type
field = Field.new("item", sub_value_data_type)
{
builder: ListArrayBuilder.new(ListDataType.new(field)),
+ value_builder_info: sub_builder_info,
detected: sub_builder_info[:detected],
}
else
@@ -186,6 +196,35 @@ module Arrow
data_type = Decimal256DataType.new(builder_info[:precision],
builder_info[:scale])
Decimal256ArrayBuilder.new(data_type)
+ when :int
+ min = builder_info[:min]
+ max = builder_info[:max]
+
+ if GLib::MININT8 <= min && max <= GLib::MAXINT8
+ Int8ArrayBuilder.new
+ elsif GLib::MININT16 <= min && max <= GLib::MAXINT16
+ Int16ArrayBuilder.new
+ elsif GLib::MININT32 <= min && max <= GLib::MAXINT32
+ Int32ArrayBuilder.new
+ elsif GLib::MININT64 <= min && max <= GLib::MAXINT64
+ Int64ArrayBuilder.new
+ else
+ StringArrayBuilder.new
+ end
+ when :uint
+ max = builder_info[:max]
+
+ if max <= GLib::MAXUINT8
+ UInt8ArrayBuilder.new
+ elsif max <= GLib::MAXUINT16
+ UInt16ArrayBuilder.new
+ elsif max <= GLib::MAXUINT32
+ UInt32ArrayBuilder.new
+ elsif max <= GLib::MAXUINT64
+ UInt64ArrayBuilder.new
+ else
+ StringArrayBuilder.new
+ end
else
nil
end
diff --git a/ruby/red-arrow/test/test-array-builder.rb b/ruby/red-arrow/test/test-array-builder.rb
index 7a2d42e54b..f629eec661 100644
--- a/ruby/red-arrow/test/test-array-builder.rb
+++ b/ruby/red-arrow/test/test-array-builder.rb
@@ -147,44 +147,404 @@ class ArrayBuilderTest < Test::Unit::TestCase
])
end
- test("list<uint>s") do
- values = [
- [0, 1, 2],
- [3, 4],
- ]
- array = Arrow::Array.new(values)
- data_type = Arrow::ListDataType.new(Arrow::UInt8DataType.new)
- assert_equal({
- data_type: data_type,
- values: [
- [0, 1, 2],
- [3, 4],
- ],
- },
- {
- data_type: array.value_data_type,
- values: array.to_a,
- })
- end
+ sub_test_case("nested integer list") do
+ test("list<uint8>s") do
+ values = [
+ [0, 1, 2],
+ [3, 4],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint8)
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, 1, 2],
+ [3, 4],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
- test("list<int>s") do
- values = [
- [0, -1, 2],
- [3, 4],
- ]
- array = Arrow::Array.new(values)
- data_type = Arrow::ListDataType.new(Arrow::Int8DataType.new)
- assert_equal({
- data_type: data_type,
- values: [
- [0, -1, 2],
- [3, 4],
- ],
- },
- {
- data_type: array.value_data_type,
- values: array.to_a,
- })
+ test("list<int8>s boundary") do
+ values = [
+ [0, GLib::MININT8],
+ [GLib::MAXINT8],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int8)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MININT8],
+ [GLib::MAXINT8],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int16>s inferred from int8 underflow") do
+ values = [
+ [0, GLib::MININT8 - 1],
+ [GLib::MAXINT8],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int16)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MININT8 - 1],
+ [GLib::MAXINT8],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int16>s inferred from int8 overflow") do
+ values = [
+ [0, GLib::MAXINT8 + 1],
+ [GLib::MININT8],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int16)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXINT8 + 1],
+ [GLib::MININT8],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int16>s boundary") do
+ values = [
+ [0, GLib::MININT16],
+ [GLib::MAXINT16],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int16)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MININT16],
+ [GLib::MAXINT16],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int32>s inferred from int16 underflow") do
+ values = [
+ [0, GLib::MININT16 - 1],
+ [GLib::MAXINT16],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int32)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MININT16 - 1],
+ [GLib::MAXINT16],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int32>s inferred from int16 overflow") do
+ values = [
+ [0, GLib::MAXINT16 + 1],
+ [GLib::MININT16],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int32)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXINT16 + 1],
+ [GLib::MININT16],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int32>s boundary") do
+ values = [
+ [0, GLib::MININT32],
+ [GLib::MAXINT32],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int32)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MININT32],
+ [GLib::MAXINT32],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int64>s inferred from int32 underflow") do
+ values = [
+ [0, GLib::MININT32 - 1],
+ [GLib::MAXINT32],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int64)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MININT32 - 1],
+ [GLib::MAXINT32],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<int64>s inferred from int32 overflow") do
+ values = [
+ [0, GLib::MAXINT32 + 1],
+ [GLib::MININT32],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:int64)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXINT32 + 1],
+ [GLib::MININT32],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("string fallback from nested int64 array overflow") do
+ values = [
+ [0, GLib::MAXINT64 + 1],
+ [GLib::MININT64],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:string)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ ["0", "#{GLib::MAXINT64 + 1}"],
+ ["#{GLib::MININT64}"],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("string fallback from nested int64 array underflow") do
+ values = [
+ [0, GLib::MININT64 - 1],
+ [GLib::MAXINT64],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:string)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ ["0", "#{GLib::MININT64 - 1}"],
+ ["#{GLib::MAXINT64}"],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<uint8>s boundary") do
+ values = [
+ [0, GLib::MAXUINT8],
+ [GLib::MAXUINT8],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint8)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXUINT8],
+ [GLib::MAXUINT8],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<uint16>s") do
+ values = [
+ [0, GLib::MAXUINT8 + 1],
+ [GLib::MAXUINT8],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint16)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXUINT8 + 1],
+ [GLib::MAXUINT8],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<uint16>s boundary") do
+ values = [
+ [0, GLib::MAXUINT16],
+ [GLib::MAXUINT16],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint16)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXUINT16],
+ [GLib::MAXUINT16],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<uint32>s") do
+ values = [
+ [0, GLib::MAXUINT16 + 1],
+ [GLib::MAXUINT16],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint32)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXUINT16 + 1],
+ [GLib::MAXUINT16],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<uint32>s boundary") do
+ values = [
+ [0, GLib::MAXUINT32],
+ [GLib::MAXUINT32],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint32)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXUINT32],
+ [GLib::MAXUINT32],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("list<uint64>s") do
+ values = [
+ [0, GLib::MAXUINT32 + 1],
+ [GLib::MAXUINT32],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:uint64)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ [0, GLib::MAXUINT32 + 1],
+ [GLib::MAXUINT32],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
+
+ test("string fallback from nested uint64 array overflow") do
+ values = [
+ [0, GLib::MAXUINT64 + 1],
+ [GLib::MAXUINT64],
+ ]
+ array = Arrow::Array.new(values)
+ data_type = Arrow::ListDataType.new(:string)
+
+ assert_equal({
+ data_type: data_type,
+ values: [
+ ["0", "#{GLib::MAXUINT64 + 1}"],
+ ["#{GLib::MAXUINT64}"],
+ ],
+ },
+ {
+ data_type: array.value_data_type,
+ values: array.to_a,
+ })
+ end
end
end