Re: [I] Language server unable to lint or analyze any classes imported by the `pyarrow.lib` commands, in the `__init__.py` [arrow]

2024-06-06 Thread via GitHub


AntBlo commented on issue #41736:
URL: https://github.com/apache/arrow/issues/41736#issuecomment-2153166074

   According to a developer over at LanceDB, imported Cython code doesn't have 
type hints. So I think this is expected. 
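
   A quick stdlib illustration of why static analyzers give up on compiled modules (this uses CPython's C-implemented `array` module as a stand-in for the Cython-built `pyarrow.lib`; it is not pyarrow-specific):

   ```python
   import array
   import inspect

   # Static analyzers rely on Python source (or .pyi stubs). For a class
   # implemented in C there is no source to read, so inspect fails the
   # same way a language server's analysis does.
   try:
       inspect.getsource(array.array)
       has_source = True
   except (OSError, TypeError):
       has_source = False

   print(has_source)  # False: no Python source behind a C-implemented class
   ```

   Shipping `.pyi` stub files alongside the compiled module is the usual workaround.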


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] feat(python) : Added support to create timestamp/date32/date64 Array from iterable [arrow-nanoarrow]

2024-06-06 Thread via GitHub


paleolimbot commented on code in PR #502:
URL: https://github.com/apache/arrow-nanoarrow/pull/502#discussion_r1630003589


##
python/tests/test_c_buffer.py:
##
@@ -362,3 +363,47 @@ def test_c_buffer_bitmap_from_iterable():
     builder.write_elements([True, False])
     with pytest.raises(NotImplementedError, match="Append to bitmap"):
         builder.write_elements([True])
+
+
+def test_c_buffer_from_timestamp_iterable():

Review Comment:
   I am not sure that one should be able to do `na.c_buffer([1, 2, 3], 
na.timestamp("ms"))` without getting an error (a timestamp is a valid *array* 
type but not really a valid *buffer* type). I think that the `buffer_format` 
vs. `storage_buffer_format` change will take care of this.
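
   The array-type vs. buffer-type distinction can be seen with nothing but the stdlib (a hedged illustration, not the nanoarrow API): the same 8 bytes are a plain int64 or a timestamp depending only on metadata carried at the array level, so the buffer itself has no timestamp-ness to construct.

   ```python
   import struct
   from datetime import datetime, timezone

   # The same 8-byte buffer can be read as a plain int64 or as a timestamp
   # in seconds since the epoch: the "timestamp" interpretation is
   # array-level metadata, while the buffer is just fixed-width integers.
   raw = struct.pack("<q", 504835200)       # one little-endian int64
   (as_int,) = struct.unpack("<q", raw)
   as_ts = datetime.fromtimestamp(as_int, tz=timezone.utc)
   print(as_int)      # 504835200
   print(as_ts.year)  # 1985
   ```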



##
python/tests/test_c_buffer.py:
##
@@ -259,8 +260,8 @@ def test_c_buffer_from_iterable():
 
     # An Arrow type whose storage type is not the same as its top-level
     # type will error.
-    with pytest.raises(ValueError, match="Can't create buffer"):
-        na.c_buffer([1, 2, 3], na.date32())
+    # with pytest.raises(ValueError, match="Can't create buffer"):
+    #     na.c_buffer([1, 2, 3], na.duration("s"))

Review Comment:
   ```suggestion
       with pytest.raises(ValueError, match="Can't create buffer"):
           na.c_buffer([1, 2, 3], na.dictionary(na.int32(), na.string()))
   ```
   
   (I think this should trigger the intended code path without requiring the 
new construction features!)



##
python/tests/test_c_array.py:
##
@@ -514,3 +516,71 @@ def test_c_array_from_buffers_validation():
         validation_level=validation_level,
     )
     assert c_array.length == 2
+
+
+def test_c_array_timestamp_seconds():
+    d1 = int(round(datetime(1970, 1, 1).timestamp()))
+    d2 = int(round(datetime(1985, 12, 31).timestamp()))
+    d3 = int(round(datetime(2005, 3, 4).timestamp()))
+    c_array = na.c_array([d1, d2, d3], na.timestamp("s"))

Review Comment:
   In addition to `na.c_array([d1, d2, d3], na.timestamp("s"))`, I believe that 
creating this from a pybuffer (e.g., numpy) should work too. Perhaps we should 
also check `na.c_array(array.array("i", [d1, d2, d3]), na.timestamp("s"))`?
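
   `array.array` exports the buffer protocol, which is what makes the suggested construction path plausible; a `memoryview` shows the typed, contiguous memory a zero-copy constructor would consume (stdlib-only illustration):

   ```python
   import array

   # array.array exposes the buffer protocol: typed, contiguous memory
   # that a zero-copy array constructor could consume directly.
   buf = array.array("i", [0, 504835200, 1109894400])
   mv = memoryview(buf)
   print(mv.format)    # 'i' (C int)
   print(mv.tolist())  # [0, 504835200, 1109894400]
   ```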



##
python/tests/test_array.py:
##
@@ -354,3 +355,64 @@ def test_array_inspect(capsys):
     array.inspect()
     captured = capsys.readouterr()
     assert captured.out.startswith("

Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41889:
URL: https://github.com/apache/arrow/pull/41889#discussion_r1629981539


##
python/pyarrow/tests/test_io.py:
##
@@ -669,6 +669,57 @@ def test_allocate_buffer_resizable():
     assert buf.size == 200
 
 
+def test_non_cpu_buffer(pickle_module):
+    cuda = pytest.importorskip("pyarrow.cuda")
+    ctx = cuda.Context(0)
+
+    arr = np.arange(4, dtype=np.int32)
+    cuda_buf = ctx.buffer_from_data(arr)
+
+    arr = pa.Array.from_buffers(pa.int32(), 4, [None, cuda_buf])
+    buf_on_gpu = arr.buffers()[1]
+
+    assert buf_on_gpu.size == cuda_buf.size
+    assert buf_on_gpu.address == cuda_buf.address
+    assert buf_on_gpu.is_cpu == cuda_buf.is_cpu
+
+    repr1 = ""
+    assert repr1 in repr(buf_on_gpu)
+    assert repr2 in repr(buf_on_gpu)
+
+    msg = "Implemented only for data on CPU device"
+    with pytest.raises(NotImplementedError, match=msg):
+        buf_on_gpu.hex()
+
+    with pytest.raises(NotImplementedError, match=msg):
+        cuda_buf.hex()
+
+    assert buf_on_gpu.is_mutable
+
+    with pytest.raises(NotImplementedError, match=msg):
+        buf_on_gpu[1]
+
+    with pytest.raises(NotImplementedError, match=msg):
+        cuda_buf[1]
+
+    with pytest.raises(NotImplementedError, match=msg):
+        pickle_module.dumps(buf_on_gpu, protocol=4)
+
+    buf = pa.py_buffer(b'testing')
+    arr = pa.FixedSizeBinaryArray.from_buffers(pa.binary(7), 1, [None, buf])
+    buf_on_gpu = arr.buffers()[1]
+    buf_on_gpu_sliced = buf_on_gpu.slice(2)
+
+    buf = pa.py_buffer(b'sting')
+    arr = pa.FixedSizeBinaryArray.from_buffers(pa.binary(5), 1, [None, buf])
+    buf_on_gpu_expected = arr.buffers()[1]
+
+    assert buf_on_gpu_sliced.equals(buf_on_gpu_expected)
+    assert buf.equals(buf_on_gpu_expected)

Review Comment:
   Checking equality between `CudaBuffer` and `Buffer` on GPU works. But I get a segfault when checking equality between two `Buffer`s on GPU (`buf_on_gpu_sliced` and `buf_on_gpu_expected`): https://github.com/apache/arrow/pull/41889/commits/958af940b3020f5656e86a08ee9f5d47d9857660






Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41889:
URL: https://github.com/apache/arrow/pull/41889#discussion_r1629970741


##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Review Comment:
   Yes, not working:
   
   ```python
   >>> arr = np.array([b'testing'])
   >>> cuda_buf = ctx.buffer_from_data(arr)
   >>> arr = pa.FixedSizeBinaryArray.from_buffers(pa.binary(7), 1, [None, cuda_buf])
   >>> buf_on_gpu = arr.buffers()[1]
   
   >>> cuda_buf.to_pybytes()
   b'testing'
   >>> buf_on_gpu.to_pybytes()
   b'\x15\x00\x00\x00\x00\x00\x00'
   ```






Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41889:
URL: https://github.com/apache/arrow/pull/41889#discussion_r1629965069


##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Review Comment:
   Yes, the case with integer data is wrong:
   
   ```python
   >>> from pyarrow import cuda
   >>> ctx = cuda.Context(0)
   >>> import numpy as np
   >>> import pyarrow as pa
   
   >>> arr = np.arange(4, dtype=np.int32)
   >>> cuda_buf = ctx.buffer_from_data(arr)
   
   >>> arr = pa.Array.from_buffers(pa.int32(), 4, [None, cuda_buf])
   >>> buf_on_gpu = arr.buffers()[1]
   
   >>> buf_on_gpu.to_pybytes()
   b'\xe4\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
   >>> cuda_buf.to_pybytes()
   b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'
   ```
   
   and I think the binary case would be wrong as well. I just noticed I am starting from a `pyarrow.Buffer` in this case, not a CUDA buffer! Need to correct that.
   
   Will also need to look into why `to_pybytes` is wrong, and whether it only affects non-CPU data; I would guess it worked correctly for CPU data until now.






Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41889:
URL: https://github.com/apache/arrow/pull/41889#discussion_r1629952428


##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Review Comment:
   Actually I think it was failing with integer data. Will check.






Re: [I] [C++] kernel.cc: Avoid default in switches to let the compiler warn us about missing cases [arrow]

2024-06-06 Thread via GitHub


felipecrv commented on issue #41994:
URL: https://github.com/apache/arrow/issues/41994#issuecomment-2153045863

   Issue resolved by pull request 41995
   https://github.com/apache/arrow/pull/41995





Re: [PR] GH-41994 [C++]: kernel.cc: Remove defaults on switch so that compiler can check full enum coverage for us [arrow]

2024-06-06 Thread via GitHub


felipecrv merged PR #41995:
URL: https://github.com/apache/arrow/pull/41995





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41889:
URL: https://github.com/apache/arrow/pull/41889#discussion_r1629937919


##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Review Comment:
   Ah yes, it was failing and I thought it was something I did wrong.
   Will look into it.






Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2153026667

   I think we need to review it more to solve the problem.





[PR] GH-42002: [Java] Update Unit Tests for Vector Module [arrow]

2024-06-06 Thread via GitHub


llama90 opened a new pull request, #42019:
URL: https://github.com/apache/arrow/pull/42019

   
   
   ### Rationale for this change
   
   Update package from JUnit 4(`org.junit`) to JUnit 5(`org.junit.jupiter`).
   
   
   
   ### What changes are included in this PR?
   
   - [x] Replacing `org.junit` with `org.junit.jupiter.api`.
   - [x] Updating `Assertions.assertXXX` to `assertXXX` using static imports.
   - [x] Updating annotations such as `@Before`, `@BeforeClass`, `@After`, `@AfterClass`.
     - `@Before` -> `@BeforeEach`
     - `@BeforeClass` -> `@BeforeAll`
     - `@After` -> `@AfterEach`
     - `@AfterClass` -> `@AfterAll`
     - `@Test` -> `@Test` with `org.junit.jupiter`
   - [x] Removing unused `@Rule` annotation
   - [x] Updating `Parameterized` tests
   - [ ] Doing self review
   
   
   
   ### Are these changes tested?
   
   Yes, existing tests have passed.
   
   
   
   ### Are there any user-facing changes?
   
   No.
   
   
   
   
   
   
   
   





Re: [PR] GH-42002: [Java] Update Unit Tests for Vector Module [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42019:
URL: https://github.com/apache/arrow/pull/42019#issuecomment-2153024301

   :warning: GitHub issue #42002 **has been automatically assigned in GitHub** 
to PR creator.





Re: [PR] GH-41301: [C++] Extract the kernel loops used for PrimitiveTakeExec and generalize to any fixed-width type [arrow]

2024-06-06 Thread via GitHub


felipecrv commented on code in PR #41373:
URL: https://github.com/apache/arrow/pull/41373#discussion_r1629919692


##
cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc:
##
@@ -324,261 +326,109 @@ namespace {
 using TakeState = OptionsWrapper;
 
 // ----------------------------------------------------------------------
-// Implement optimized take for primitive types from boolean to 1/2/4/8/16/32-byte
-// C-type based types. Use common implementation for every byte width and only
-// generate code for unsigned integer indices, since after boundschecking to
-// check for negative numbers in the indices we can safely reinterpret_cast
-// signed integers as unsigned.
-
-/// \brief The Take implementation for primitive (fixed-width) types does not
-/// use the logical Arrow type but rather the physical C type. This way we
-/// only generate one take function for each byte width.
+// Implement optimized take for primitive types from boolean to
+// 1/2/4/8/16/32-byte C-type based types and fixed-size binary (0 or more
+// bytes).
+//
+// Use one specialization for each of these primitive byte-widths so the
+// compiler can specialize the memcpy to dedicated CPU instructions and for
+// fixed-width binary use the 1-byte specialization but pass WithFactor=true
+// that makes the kernel consider the factor parameter provided at runtime.
+//
+// Only unsigned index types need to be instantiated since after
+// boundschecking to check for negative numbers in the indices we can safely
+// reinterpret_cast signed integers as unsigned.
+
+/// \brief The Take implementation for primitive types and fixed-width binary.
 ///
 /// Also note that this function can also handle fixed-size-list arrays if
 /// they fit the criteria described in fixed_width_internal.h, so use the
 /// function defined in that file to access values and destination pointers
 /// and DO NOT ASSUME `values.type()` is a primitive type.
 ///
 /// \pre the indices have been boundschecked
-template 
-struct PrimitiveTakeImpl {
-  static constexpr int kValueWidth = ValueWidthConstant::value;
-
-  static void Exec(const ArraySpan& values, const ArraySpan& indices,
-   ArrayData* out_arr) {
-DCHECK_EQ(util::FixedWidthInBytes(*values.type), kValueWidth);
-const auto* values_data = util::OffsetPointerOfFixedWidthValues(values);
-const uint8_t* values_is_valid = values.buffers[0].data;
-auto values_offset = values.offset;
-
-const auto* indices_data = indices.GetValues(1);
-const uint8_t* indices_is_valid = indices.buffers[0].data;
-auto indices_offset = indices.offset;
-
-DCHECK_EQ(out_arr->offset, 0);
-auto* out = util::MutableFixedWidthValuesPointer(out_arr);
-auto out_is_valid = out_arr->buffers[0]->mutable_data();
-
-// If either the values or indices have nulls, we preemptively zero out the
-// out validity bitmap so that we don't have to use ClearBit in each
-// iteration for nulls.
-if (values.null_count != 0 || indices.null_count != 0) {
-  bit_util::SetBitsTo(out_is_valid, 0, indices.length, false);
-}
-
-auto WriteValue = [&](int64_t position) {
-  memcpy(out + position * kValueWidth,
- values_data + indices_data[position] * kValueWidth, kValueWidth);
-};
-
-auto WriteZero = [&](int64_t position) {
-  memset(out + position * kValueWidth, 0, kValueWidth);
-};
-
-auto WriteZeroSegment = [&](int64_t position, int64_t length) {
-  memset(out + position * kValueWidth, 0, kValueWidth * length);
-};
-
-OptionalBitBlockCounter indices_bit_counter(indices_is_valid, indices_offset,
-indices.length);
-int64_t position = 0;
-int64_t valid_count = 0;
-while (position < indices.length) {
-  BitBlockCount block = indices_bit_counter.NextBlock();
-  if (values.null_count == 0) {
-// Values are never null, so things are easier
-valid_count += block.popcount;
-if (block.popcount == block.length) {
-  // Fastest path: neither values nor index nulls
-  bit_util::SetBitsTo(out_is_valid, position, block.length, true);
-  for (int64_t i = 0; i < block.length; ++i) {
-WriteValue(position);
-++position;
-  }
-} else if (block.popcount > 0) {
-  // Slow path: some indices but not all are null
-  for (int64_t i = 0; i < block.length; ++i) {
-if (bit_util::GetBit(indices_is_valid, indices_offset + position)) {
-  // index is not null
-  bit_util::SetBit(out_is_valid, position);
-  WriteValue(position);
-} else {
-  WriteZero(position);
-}
-++position;
-  }
-} else {
-  WriteZeroSegment(position, block.length);
-  position += block.length;
-}
-  } else {
-// Values have nulls, so we must do random 

Re: [PR] GH-41190: [C++] support for single threaded joins [arrow]

2024-06-06 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #41125:
URL: https://github.com/apache/arrow/pull/41125#issuecomment-2152971907

   After merging your PR, Conbench analyzed the 7 benchmarking runs that have 
been run so far on merge-commit a2453bd50fa7bf90b92850d4fddc779289499e96.
   
   There was 1 benchmark result indicating a performance regression:
   
   - Commit Run on `ursa-i9-9960x` at [2024-06-06 
13:39:17Z](https://conbench.ursa.dev/compare/runs/58fcf2f4dd5b45259eb371db6f7c006e...4c94506d5a7e4d78af9b4189bec0690f/)
 - [`dataset-serialize` (Python) with dataset=nyctaxi_multi_ipc_s3, 
format=csv, 
selectivity=100pc](https://conbench.ursa.dev/compare/benchmarks/06661a3ae8cf7c2e8000a232f47b0a2e...06661d09a2d57f4f800022961f4f3dfe)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/25904303159) 
has more details. It also includes information about 37 possible false 
positives for unstable benchmarks that are known to sometimes produce them.





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on code in PR #41889:
URL: https://github.com/apache/arrow/pull/41889#discussion_r1629849391


##
python/pyarrow/io.pxi:
##
@@ -1311,7 +1311,10 @@ cdef class Buffer(_Weakrefable):
         -------
         : bytes
         """
-        return self.buffer.get().ToHexString()
+        if self.is_cpu:
+            return self.buffer.get().ToHexString()
+        else:
+            raise NotImplementedError("Implemented only for data on CPU device")

Review Comment:
   ```suggestion
           if not self.is_cpu:
               raise NotImplementedError("Implemented only for data on CPU device")
           return self.buffer.get().ToHexString()
   ```
   
   (in this case your code is also perfectly fine, but I have a slight preference for always using the same pattern of checking if not CPU. Although it is only two lines and we only need it in three places for now, it could also be factored out into a helper method like `self._assert_cpu()`)
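
   A minimal sketch of the guard-clause pattern suggested here (class, attribute, and method names are illustrative stand-ins, not pyarrow's actual implementation):

   ```python
   class BufferSketch:
       """Toy stand-in for a buffer that may live on a non-CPU device."""

       def __init__(self, data: bytes, is_cpu: bool):
           self._data = data
           self.is_cpu = is_cpu

       def _assert_cpu(self):
           # Single shared guard, as the review comment suggests factoring out.
           if not self.is_cpu:
               raise NotImplementedError("Implemented only for data on CPU device")

       def hex(self) -> bytes:
           self._assert_cpu()
           return self._data.hex().encode()

   cpu_buf = BufferSketch(b"\x01\x02", is_cpu=True)
   gpu_buf = BufferSketch(b"\x01\x02", is_cpu=False)
   print(cpu_buf.hex())  # b'0102'
   try:
       gpu_buf.hex()
   except NotImplementedError as e:
       print(e)  # Implemented only for data on CPU device
   ```

   Every device-sensitive method then starts with the same one-line guard instead of repeating the if/else.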



##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Review Comment:
   Could you also check `buf_on_gpu.to_pybytes()` against a hardcoded bytes result? Because right now, if both left and right return the same but wrong result, this will pass.
   
   (looking at the `to_pybytes` implementation, I would actually expect this to not work properly)



##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Review Comment:
   Similarly, can you again add a version testing this with `cuda_buf` as well?



##
python/pyarrow/tests/test_io.py:
##
(same `test_non_cpu_buffer` hunk as quoted above)

Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #41889:
URL: https://github.com/apache/arrow/pull/41889#issuecomment-2152964695

   Revision: 3f56e121dc46af4d7a95c7154224d18f1a0a535f
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-b43dfa1819](https://github.com/ursacomputing/crossbow/branches/all?query=actions-b43dfa1819)
   
   |Task|Status|
   ||--|
   |test-cuda-python|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b43dfa1819-github-test-cuda-python)](https://github.com/ursacomputing/crossbow/actions/runs/9404673437/job/25904094964)|





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on PR #41889:
URL: https://github.com/apache/arrow/pull/41889#issuecomment-2152960331

   @github-actions crossbow submit test-cuda-python





Re: [PR] GH-39759: [Docs] Update pydata-sphinx-theme to 0.15.x [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on PR #39879:
URL: https://github.com/apache/arrow/pull/39879#issuecomment-2152924768

   @Divyansh200102 would you have time to rebase this PR? In the meantime, 
pydata-sphinx-theme 0.15.3 was released, so it would be good to test it with 
the latest version.





Re: [I] [Python][Doc] Clarify docstring of FixedSizeListArray.values that it ignores the offset [arrow]

2024-06-06 Thread via GitHub


rok commented on issue #41672:
URL: https://github.com/apache/arrow/issues/41672#issuecomment-2152909955

   @PeopleMakeCulture the documentation PR is welcome, feel free to open a PR.





Re: [PR] GH-41862: [C++][Dataset] Try out possible fix number 1 [arrow]

2024-06-06 Thread via GitHub


pitrou closed pull request #41875: GH-41862: [C++][Dataset] Try out possible 
fix number 1
URL: https://github.com/apache/arrow/pull/41875





Re: [PR] GH-41862: [C++][Dataset] Try out possible fix number 1 [arrow]

2024-06-06 Thread via GitHub


pitrou commented on PR #41875:
URL: https://github.com/apache/arrow/pull/41875#issuecomment-2152896122

   According to @icexelloss this didn't fix the original issue, so I'm going to 
close this.





Re: [PR] GH-41862: [C++][S3] Fix potential deadlock when closing output stream [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #41876:
URL: https://github.com/apache/arrow/pull/41876#issuecomment-2152855945

   Revision: 51d073831d1a6c52af1ca778d9ba2ad9df520179
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-b40dadb2f5](https://github.com/ursacomputing/crossbow/branches/all?query=actions-b40dadb2f5)
   
   |Task|Status|
   ||--|
   |test-alpine-linux-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-alpine-linux-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/9403910551/job/25901570430)|
   |test-build-cpp-fuzz|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/9403910712/job/25901570943)|
   |test-conda-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/9403910578/job/25901570744)|
   |test-conda-cpp-valgrind|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/actions/runs/9403910849/job/25901571631)|
   |test-cuda-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-cuda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/9403910758/job/25901571124)|
   |test-debian-12-cpp-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-debian-12-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9403910742/job/25901571033)|
   |test-debian-12-cpp-i386|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-debian-12-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/9403910870/job/25901571718)|
   |test-fedora-39-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-fedora-39-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/9403910472/job/25901569786)|
   |test-ubuntu-20.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/9403910547/job/25901570425)|
   |test-ubuntu-20.04-cpp-bundled|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/9403910515/job/25901569799)|
   |test-ubuntu-20.04-cpp-minimal-with-formats|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-20.04-cpp-minimal-with-formats)](https://github.com/ursacomputing/crossbow/actions/runs/9403910590/job/25901570732)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions/runs/9403910783/job/25901571299)|
   |test-ubuntu-22.04-cpp|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/9403910438/job/25901569783)|
   |test-ubuntu-22.04-cpp-20|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/9403910528/job/25901570032)|
   |test-ubuntu-22.04-cpp-emscripten|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-22.04-cpp-emscripten)](https://github.com/ursacomputing/crossbow/actions/runs/9403910769/job/25901571260)|
   |test-ubuntu-22.04-cpp-no-threading|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-b40dadb2f5-github-test-ubuntu-22.04-cpp-no-threading)](https://github.com/ursacomputing/crossbow/actions/runs/9403910839/job/25901571609)|
   |test-ubuntu-24.04-cpp|[![GitHub 

Re: [PR] GH-41764: [Parquet][C++] Support future logical types in the Parquet reader [arrow]

2024-06-06 Thread via GitHub


paleolimbot commented on code in PR #41765:
URL: https://github.com/apache/arrow/pull/41765#discussion_r1629777943


##
cpp/src/parquet/types.cc:
##
@@ -464,7 +464,10 @@ std::shared_ptr LogicalType::FromThrift(
   } else if (type.__isset.FLOAT16) {
 return Float16LogicalType::Make();
   } else {
-throw ParquetException("Metadata contains Thrift LogicalType that is not 
recognized");
+// Or provide some mechanism to optionally error here?

Review Comment:
   > If we don't throw here at the very least we need to ignore statistics for 
the logical type
   
   I did some investigation and I believe statistics are already ignored for 
`UndefinedLogicalType`:
   
   
https://github.com/apache/arrow/blob/93712bfc71a5013231b950b2b655d77b14f83fa7/cpp/src/parquet/types.cc#L324-L332
   
   
https://github.com/apache/arrow/blob/93712bfc71a5013231b950b2b655d77b14f83fa7/cpp/src/parquet/metadata.cc#L254-L272
   
(i.e., `ColumnChunkMetadata::statistics()` will return `nullptr` if 
`logical_type->is_valid()` returns `false`)
   
   I still have to figure out how to write a full-on file so that this is 
tested properly! (I figured out how to test the schema conversion but not how 
to check a full-on file yet).






Re: [PR] GH-41862: [C++][S3] Fix potential deadlock when closing output stream [arrow]

2024-06-06 Thread via GitHub


pitrou commented on PR #41876:
URL: https://github.com/apache/arrow/pull/41876#issuecomment-2152850191

   @github-actions crossbow submit -g cpp





Re: [I] [Docs] List the logical types in Columnar.rst for searchability [arrow]

2024-06-06 Thread via GitHub


pitrou commented on issue #14752:
URL: https://github.com/apache/arrow/issues/14752#issuecomment-2152841961

   This was fixed in #41958, after inspiration from @AlenkaF 's own PR. Thanks!





Re: [I] [Format][Docs] Clarify (remove?) usage of the term "logical types" [arrow]

2024-06-06 Thread via GitHub


pitrou commented on issue #41691:
URL: https://github.com/apache/arrow/issues/41691#issuecomment-2152840556

   Issue resolved by pull request 41958
   https://github.com/apache/arrow/pull/41958





Re: [PR] GH-41691: [Doc] Remove notion of "logical type" [arrow]

2024-06-06 Thread via GitHub


pitrou merged PR #41958:
URL: https://github.com/apache/arrow/pull/41958





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152827445

   I don't think running the other extra CI is required





Re: [PR] MINOR: [JS] Bump @typescript-eslint/eslint-plugin from 7.11.0 to 7.12.0 in /js [arrow]

2024-06-06 Thread via GitHub


domoritz merged PR #41950:
URL: https://github.com/apache/arrow/pull/41950





Re: [I] [Python][Doc] Clarify docstring of FixedSizeListArray.values that it ignores the offset [arrow]

2024-06-06 Thread via GitHub


PeopleMakeCulture commented on issue #41672:
URL: https://github.com/apache/arrow/issues/41672#issuecomment-2152774710

   I would like to claim this issue if it's still available.





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152774403

   > Umm... the error message is changed. It seems that the `js` tests failed.
   
   I've re-run the job in case it was a fluke; otherwise we might need to open 
a different issue for that test.





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152761379

   Umm... the error message is changed. It seems that the `js` tests failed. 
   
   
   
   ```
   # FAILURES #
   FAILED TEST: primitive_zerolength Go producing,  JS consuming
   : Command failed: /arrow/js/bin/integration.ts -a 
/tmp/tmp5mhokyv3/dfcb2e35_generated_primitive_zerolength.consumer_stream_as_file
 -j /tmp/arrow-integration-ima0onkg/generated_primitive_zerolength.json --mode 
VALIDATE
   With output:
   --
   
   
   #
   # Fatal error in , line 0
   # unreachable code
   #
   #
   #
   #FailureMessage Object: 0x7fdeb7fb6a10
   - Native stack trace -
   
1: 0xd38851  [node]
2: 0x216e041 V8_Fatal(char const*, ...) [node]
3: 0x1566cec int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
4: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
5: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
6: 0x1566d5b int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
7: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
8: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
9: 0x1566d5b int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
   10: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
   11: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
   12: 0x1566d5b int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
   13: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
   14: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
   15: 0x1566d5b int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
   16: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
   17: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
   18: 0x1566d5b int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
   19: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
   20: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
   21: 0x1566d5b int 
v8::internal::Deserializer::ReadSingleBytecodeData(unsigned
 char, v8::internal::SlotAccessorForHeapObject) [node]
   22: 0x1567e91 
v8::internal::Deserializer::ReadData(v8::internal::Handle,
 int, int) [node]
   23: 0x1568158 
v8::internal::Deserializer::ReadObject(v8::internal::SnapshotSpace)
 [node]
   24: 0x1568439 int 
v8::internal::Deserializer::ReadSingleBytecodeData
 >(unsigned char, v8::internal::SlotAccessorForHandle) 
[node]
   25: 0x1568d3f 
v8::internal::Deserializer::ReadObject() [node]
   26: 0x156d8d2 v8::internal::ObjectDeserializer::Deserialize() [node]
   27: 0x156dae4 
v8::internal::ObjectDeserializer::DeserializeSharedFunctionInfo(v8::internal::Isolate*,
 v8::internal::SerializedCodeData const*, 
v8::internal::Handle) [node]
   28: 0x1560822 
v8::internal::CodeSerializer::Deserialize(v8::internal::Isolate*, 
v8::internal::AlignedCachedData*, v8::internal::Handle, 
v8::ScriptOriginOptions, v8::internal::MaybeHandle) [node]
   29: 0xfb9ccb  [node]
   30: 0xfb9fad 
v8::internal::Compiler::GetSharedFunctionInfoForScriptWithCachedData(v8::internal::Isolate*,
 v8::internal::Handle, v8::internal::ScriptDetails 
const&, v8::internal::AlignedCachedData*, v8::ScriptCompiler::CompileOptions, 
v8::ScriptCompiler::NoCacheReason, v8::internal::NativesFlag) [node]
   31: 0xf19dbc  [node]
   32: 0xf19ec8 v8::ScriptCompiler::CompileUnboundScript(v8::Isolate*, 
v8::ScriptCompiler::Source*, v8::ScriptCompiler::CompileOptions, 
v8::ScriptCompiler::NoCacheReason) [node]
   33: 0xca8421 
node::contextify::ContextifyScript::New(v8::FunctionCallbackInfo 
const&) [node]
   34: 0xf57eaf 
v8::internal::FunctionCallbackArguments::Call(v8::internal::CallHandlerInfo) 
[node]
   35: 0xf58465  [node]
   36: 0xf58b83 v8::internal::Builtin_HandleApiCall(int, unsigned long*, 
v8::internal::Isolate*) [node]
   37: 0x1963df6  [node]
   
   --
   
   
   1 failures, 200 skips
   Failed to verify release candidate. See /tmp/arrow-HEAD.cpSRw for details.
   Error: docker run --rm -e VERIFY_VERSION= -e VERIFY_RC= -e TEST_DEFAULT=0 -e 
TEST_INTEGRATION=1 --shm-size 1073741824 -e CCACHE_COMPILERCHECK=content -e 
CCACHE_COMPRESS=1 -e CCACHE_COMPRESSLEVEL=6 -e CCACHE_DIR=/ccache -e 

Re: [PR] GH-25118: [Python] Make NumPy an optional runtime dependency [arrow]

2024-06-06 Thread via GitHub


pitrou commented on code in PR #41904:
URL: https://github.com/apache/arrow/pull/41904#discussion_r1629699221


##
python/pyarrow/src/arrow/python/init.cc:
##
@@ -21,4 +21,15 @@
 #include "arrow/python/init.h"
 #include "arrow/python/numpy_interop.h"
 
-int arrow_init_numpy() { return arrow::py::import_numpy(); }
+bool numpy_imported = false;
+
+int arrow_init_numpy(bool import_numpy) {

Review Comment:
   Does it make sense to add this argument? The function does nothing if it is 
false, so it could just be skipped from the Python side.
   
   Or perhaps we want to make sure that this function is always called before 
PyArrow is used?



##
.github/workflows/python.yml:
##
@@ -84,6 +85,11 @@ jobs:
 title: AMD64 Conda Python 3.10 Pandas latest
 python: "3.10"
 pandas: latest
+  - name: conda-python-3.10-no-numpy
+cache: conda-python-3.10
+image: conda-python-no-numpy
+title: AMD64 Conda Python 3.10 Testing without numpy

Review Comment:
   ```suggestion
   title: AMD64 Conda Python 3.10 without NumPy
   ```



##
python/pyarrow/tests/conftest.py:
##
@@ -25,8 +25,14 @@
 
 import pytest
 import hypothesis as h
+try:
+import numpy as np

Review Comment:
   Why is it needed to import NumPy here?



##
python/pyarrow/tests/conftest.py:
##
@@ -116,6 +122,30 @@ def pytest_runtest_setup(item):
 item.config.pyarrow.apply_mark(mark)
 
 
+@pytest.fixture
+def all_array_types():

Review Comment:
   Is this needed here?



##
python/pyarrow/types.pxi:
##
@@ -149,14 +154,14 @@ def _is_primitive(Type type):
 
 def _get_pandas_type(arrow_type, coerce_to_ns=False):
 cdef Type type_id = arrow_type.id
-if type_id not in _pandas_type_map:
+if type_id not in _get_pandas_type_map():

Review Comment:
   Will this recreate the dict every time this function is called? We don't want 
this.
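   A common way to address this concern is to build the table once and cache it 
at module level. A minimal sketch of that pattern, with hypothetical contents 
(pyarrow's real map is keyed by Arrow type ids, not strings):

   ```python
   # Hypothetical sketch of lazy, build-once caching for a lookup table.
   _pandas_type_map = None

   def _get_pandas_type_map():
       global _pandas_type_map
       if _pandas_type_map is None:
           # Built only on the first call; later calls reuse the same dict.
           _pandas_type_map = {"int64": "int64", "string": "object"}
       return _pandas_type_map
   ```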



##
ci/scripts/python_test.sh:
##
@@ -69,4 +69,8 @@ export PYARROW_TEST_PARQUET_ENCRYPTION
 export PYARROW_TEST_S3
 
 # Testing PyArrow
-pytest -r s ${PYTEST_ARGS} --pyargs pyarrow
+if [ -z "${TEST_COMMAND}" ]; then

Review Comment:
   Is this needed? If so, perhaps use a more specific variable name, such as 
`PYARROW_TEST_COMMAND`.



##
python/pyarrow/src/arrow/python/init.h:
##
@@ -22,5 +22,6 @@
 
 extern "C" {
 ARROW_PYTHON_EXPORT
-int arrow_init_numpy();
+int arrow_init_numpy(bool import_numpy);
+bool get_numpy_imported();
 }

Review Comment:
   I think the `extern "C"` is unwarranted, we can just use regular C++ 
namespacing:
   ```c++
   namespace arrow::py {
   
   ARROW_PYTHON_EXPORT
   int init_numpy(bool has_numpy);
   bool has_numpy();
   
   }  // namespace arrow::py
   ```
   (you'll need to update the corresponding Cython declarations)



##
python/pyarrow/tests/strategies.py:
##
@@ -35,7 +38,10 @@
 import tzdata  # noqa:F401
 except ImportError:
 zoneinfo = None
-import numpy as np
+try:
+import numpy as np
+except ImportError:
+pass

Review Comment:
   Is the import actually used below?



##
python/pyarrow/lib.pyx:
##
@@ -32,8 +35,17 @@ from pyarrow.includes.common cimport PyObject_to_object
 cimport pyarrow.includes.libarrow_python as libarrow_python
 cimport cpython as cp
 
-# Initialize NumPy C API
-arrow_init_numpy()
+# Initialize NumPy C API only if numpy was able to be imported
+
+
+def initialize_numpy():
+if "numpy" in sys.modules:
+arrow_init_numpy(True)
+else:
+arrow_init_numpy(False)
+
+
+initialize_numpy()

Review Comment:
   Could be as simple as:
   ```python
   if np is not None:
   arrow_init_numpy()
   ```



##
python/pyarrow/builder.pxi:
##
@@ -42,7 +42,7 @@ cdef class StringBuilder(_Weakrefable):
 value : string/bytes or np.nan/None
 The value to append to the string array builder.
 """
-if value is None or value is np.nan:
+if value is None or ('numpy' in sys.modules and value is np.nan):

Review Comment:
   Perhaps this
   ```suggestion
   if value is None or math.isnan(value):
   ```
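   For context on why a value-based check might be preferred here: an identity 
comparison against `np.nan` only matches that exact object, while NaN is never 
equal to itself, so `math.isnan` is the stdlib way to detect any float NaN. A 
small illustration:

   ```python
   import math

   nan_value = float("nan")

   # NaN is never equal to itself, so equality checks cannot detect it...
   print(nan_value == nan_value)   # False
   # ...but math.isnan performs a value-based check on any float NaN.
   print(math.isnan(nan_value))    # True
   ```

   Note that `math.isnan` raises `TypeError` for non-numeric input, so a builder 
that also accepts strings would need an `isinstance(value, float)` guard first.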



##
python/pyarrow/lib.pyx:
##
@@ -21,7 +21,10 @@
 
 import datetime
 import decimal as _pydecimal
-import numpy as np
+try:
+import numpy as np
+except ImportError:
+pass

Review Comment:
   If you change this thusly, 
   ```suggestion
   np = None
   ```
   
   Then you should later be able to write `np is not None` to check that NumPy 
is present.
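   The fallback pattern being suggested can be sketched as follows (illustrative 
only, with a hypothetical helper, not the actual pyarrow code):

   ```python
   # Optional-dependency import: bind the module name to None when absent,
   # then guard NumPy-specific branches with an explicit None check.
   try:
       import numpy as np
   except ImportError:
       np = None

   def to_pylist(values):
       # Hypothetical helper: accept a NumPy array or any plain sequence.
       if np is not None and isinstance(values, np.ndarray):
           return values.tolist()
       return list(values)
   ```

   This keeps a single code path working whether or not NumPy is installed.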



##
python/pyarrow/tests/test_compute.py:
##
@@ -,30 +1093,31 @@ def test_binary_join_element_wise():
 'a', 'b', null, options=replace).as_py() is None
 
 
-@pytest.mark.parametrize(('ty', 'values'), all_array_types)

Review Comment:
   I've no strong opinion on whether we want to keep these tests parameterized 
or not. @jorisvandenbossche 

Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152755738

   Revision: f39efac916631e99a485952c5ee531866ab10f86
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-ac4eb35fba](https://github.com/ursacomputing/crossbow/branches/all?query=actions-ac4eb35fba)
   
   |Task|Status|
   ||--|
   |verify-rc-source-python-linux-ubuntu-20.04-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-ac4eb35fba-github-verify-rc-source-python-linux-ubuntu-20.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9403182844/job/25899106756)|





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152750379

   @github-actions crossbow submit 
verify-rc-source-python-linux-ubuntu-20.04-amd64





Re: [I] [Java] ListViewVector Implementation for `UnionListViewReader` [arrow]

2024-06-06 Thread via GitHub


vibhatha commented on issue #41569:
URL: https://github.com/apache/arrow/issues/41569#issuecomment-2152716369

   Started working on this issue...





Re: [PR] GH-23576: [JS] Support DictionaryMessage replacement [arrow]

2024-06-06 Thread via GitHub


domoritz commented on code in PR #41965:
URL: https://github.com/apache/arrow/pull/41965#discussion_r1629685248


##
js/src/ipc/writer.ts:
##
@@ -294,9 +294,9 @@ export class RecordBatchWriter 
extends ReadableInterop<
 if (!prevDictionary || prevDictionary.data[0] !== chunks[0]) {
 // * If `index > 0`, then `isDelta` is true.
 // * If `index = 0`, then `isDelta` is false, because this is 
either the initial or a replacement DictionaryMessage.
-chunks.forEach((chunk, index) => 
this._writeDictionaryBatch(chunk, id, index > 0));
+for (const [index, chunk] of chunks.entries()) 
this._writeDictionaryBatch(chunk, id, index > 0);

Review Comment:
   just confirming that this is what you want






Re: [PR] MINOR: [Java] Bump org.jacoco:jacoco-maven-plugin from 0.8.11 to 0.8.12 in /java [arrow]

2024-06-06 Thread via GitHub


vibhatha commented on PR #41516:
URL: https://github.com/apache/arrow/pull/41516#issuecomment-2152710186

   The two minor default CI failures are due to an already reported issue. 





Re: [PR] MINOR: [Java] Bump org.jacoco:jacoco-maven-plugin from 0.8.11 to 0.8.12 in /java [arrow]

2024-06-06 Thread via GitHub


vibhatha commented on PR #41516:
URL: https://github.com/apache/arrow/pull/41516#issuecomment-2152709066

   @lidavidm LGTM! All the CIs are passing. 





Re: [PR] GH-41102: [Packaging][Release] Create unique git tags for release candidates (e.g. apache-arrow-{MAJOR}.{MINOR}.{PATCH}-rc{RC_NUM}) [arrow]

2024-06-06 Thread via GitHub


sgilmore10 commented on PR #41131:
URL: https://github.com/apache/arrow/pull/41131#issuecomment-2152696891

   > The MATLAB check crashed during the `Run Tests` step on Windows after I 
rebased on main to deal with merge conflicts. I'm fairly certain that this 
crash is unrelated to the changes in this PR because the same crash occurred in 
the mathworks/arrow fork after syncing the branch to be up to date with 
apache/arrow: 
https://github.com/mathworks/arrow/actions/runs/9401909013/job/25894734169
   
   I've created an issue for this crash: #42015





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on code in PR #42007:
URL: https://github.com/apache/arrow/pull/42007#discussion_r1629672108


##
dev/release/verify-release-candidate.sh:
##
@@ -788,7 +788,7 @@ test_python() {
   pushd python
 
   # Build pyarrow
-  python setup.py build_ext --inplace
+  python -m pip install .

Review Comment:
   ```suggestion
 python -m pip install -e .
   ```
   
   (if we want to keep the rest of the code below as is, it needs to be an 
inplace install)






Re: [PR] Make RowSelection's from_consecutive_ranges public [arrow-rs]

2024-06-06 Thread via GitHub


advancedxy commented on code in PR #5848:
URL: https://github.com/apache/arrow-rs/pull/5848#discussion_r1629661641


##
parquet/src/arrow/arrow_reader/selection.rs:
##
@@ -111,13 +118,13 @@ impl RowSelection {
 SlicesIterator::new(filter).map(move |(start, end)| start + 
offset..end + offset)
 });
 
-Self::from_consecutive_ranges(iter, total_rows)
+Self::from_consecutive_ranges(iter, Some(total_rows))
 }
 
 /// Creates a [`RowSelection`] from an iterator of consecutive ranges to 
keep
-pub(crate) fn from_consecutive_ranges>>(
+pub fn from_consecutive_ranges>>(
 ranges: I,
-total_rows: usize,
+total_rows_opt: Option,

Review Comment:
   I see, I thought the final skip is optional based on the trim function. I 
will revert the API change later.  I am traveling now, so it might take some 
days for me to get back with access to a computer.






Re: [PR] GH-41102: [Packaging][Release] Create unique git tags for release candidates (e.g. apache-arrow-{MAJOR}.{MINOR}.{PATCH}-rc{RC_NUM}) [arrow]

2024-06-06 Thread via GitHub


sgilmore10 commented on PR #41131:
URL: https://github.com/apache/arrow/pull/41131#issuecomment-2152661335

   The MATLAB check crashed during the `Run Tests` step on Windows after I 
rebased on main to deal with merge conflicts. I'm fairly certain that this 
crash is unrelated to the changes in this PR because the same crash occurred in 
the mathworks/arrow fork after syncing the branch to be up to date with 
apache/arrow: 
https://github.com/mathworks/arrow/actions/runs/9401909013/job/25894734169





Re: [I] [Format][Docs] Move data type (parameter) descriptions from Schema.fbs to Columnar.rst format docs [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on issue #42011:
URL: https://github.com/apache/arrow/issues/42011#issuecomment-2152643853

   To avoid the `Columnar.rst` page becoming too unwieldy and long, we should 
maybe at the same time separate the IPC specification into its own file: 
https://github.com/apache/arrow/issues/41671





Re: [PR] GH-41691: [Doc] Remove notion of "logical type" [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on PR #41958:
URL: https://github.com/apache/arrow/pull/41958#issuecomment-2152640047

   > I agree that it makes things a bit intimidating; also, reading a fbs file 
just to find prose documentation is not really a pleasant experience. Perhaps 
open a new issue for this?
   
   Opened https://github.com/apache/arrow/issues/42011 about moving a lot of 
the prose content from Schema.fbs to Columnar.rst





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152628508

   Revision: f098baf9b403b14895e8973ffdd013efa7ffea6e
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-75636dfc2c](https://github.com/ursacomputing/crossbow/branches/all?query=actions-75636dfc2c)
   
   |Task|Status|
   ||--|
   |verify-rc-source-python-linux-ubuntu-20.04-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-75636dfc2c-github-verify-rc-source-python-linux-ubuntu-20.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9402306063/job/25896126778)|





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152623132

   @github-actions crossbow submit 
verify-rc-source-python-linux-ubuntu-20.04-amd64





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152616441

   The `verify-rc-source-python-linux-almalinux-8-amd64` failures are unrelated; 
they occur in some Python Gandiva tests.





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152610813

   I haven't tested locally but the verification script just uses
   ```
 LD_LIBRARY_PATH=$ARROW_CPP_EXE_PATH:$LD_LIBRARY_PATH archery integration \
   --run-ipc --run-flight --run-c-data \
   --with-cpp=${TEST_INTEGRATION_CPP} \
   --with-java=${TEST_INTEGRATION_JAVA} \
   --with-js=${TEST_INTEGRATION_JS} \
   --with-go=${TEST_INTEGRATION_GO} \
   $INTEGRATION_TEST_ARGS
   ```
   This is the relevant piece: 
https://github.com/apache/arrow/blob/main/dev/release/verify-release-candidate.sh#L986-L1013





Re: [PR] GH-41691: [Doc] Remove notion of "logical type" [arrow]

2024-06-06 Thread via GitHub


pitrou commented on code in PR #41958:
URL: https://github.com/apache/arrow/pull/41958#discussion_r1629607803


##
docs/source/format/Columnar.rst:
##
@@ -70,21 +70,131 @@ concepts, here is a small glossary to help disambiguate.
   without taking into account any value semantics. For example, a
   32-bit signed integer array and 32-bit floating point array have the
   same layout.
-* **Parent** and **child arrays**: names to express relationships
-  between physical value arrays in a nested type structure. For
-  example, a ``List<T>``-type parent array has a T-type array as its
-  child (see more on lists below).
+* **Data type**: An application-facing semantic value type that is
+  implemented using some physical layout. For example, Decimal128
+  values are stored as 16 bytes in a fixed-size binary
+  layout. A timestamp may be stored as 64-bit fixed-size layout.
 * **Primitive type**: a data type having no child types. This includes
   such types as fixed bit-width, variable-size binary, and null types.
 * **Nested type**: a data type whose full structure depends on one or
   more other child types. Two fully-specified nested types are equal
  if and only if their child types are equal. For example, ``List<U>``
  is distinct from ``List<V>`` iff U and V are different types.
-* **Logical type**: An application-facing semantic value type that is
-  implemented using some physical layout. For example, Decimal
-  values are stored as 16 bytes in a fixed-size binary
-  layout. Similarly, strings can be stored as ``List<1-byte>``. A
-  timestamp may be stored as 64-bit fixed-size layout.
+* **Parent** and **child arrays**: names to express relationships
+  between physical value arrays in a nested type structure. For
+  example, a ``List<T>``-type parent array has a T-type array as its
+  child (see more on lists below).
+* **Parametric type**: a type which requires additional parameters
+  for full determination of its semantics. For example, all nested types
+  are parametric by construction. A timestamp is also parametric as it needs
+  a unit (such as microseconds) and a timezone.
+
+Data Types
+==
+
+The `Schema.fbs`_ defines built-in data types supported by the

Review Comment:
   Good catch!






Re: [PR] GH-41691: [Doc] Remove notion of "logical type" [arrow]

2024-06-06 Thread via GitHub


pitrou commented on code in PR #41958:
URL: https://github.com/apache/arrow/pull/41958#discussion_r1629607303


##
docs/source/format/Columnar.rst:
##
@@ -70,21 +70,131 @@ concepts, here is a small glossary to help disambiguate.
   without taking into account any value semantics. For example, a
   32-bit signed integer array and 32-bit floating point array have the
   same layout.
-* **Parent** and **child arrays**: names to express relationships
-  between physical value arrays in a nested type structure. For
-  example, a ``List<T>``-type parent array has a T-type array as its
-  child (see more on lists below).
+* **Data type**: An application-facing semantic value type that is
+  implemented using some physical layout. For example, Decimal128
+  values are stored as 16 bytes in a fixed-size binary
+  layout. A timestamp may be stored as 64-bit fixed-size layout.
 * **Primitive type**: a data type having no child types. This includes
   such types as fixed bit-width, variable-size binary, and null types.
 * **Nested type**: a data type whose full structure depends on one or
   more other child types. Two fully-specified nested types are equal
  if and only if their child types are equal. For example, ``List<U>``
  is distinct from ``List<V>`` iff U and V are different types.
-* **Logical type**: An application-facing semantic value type that is
-  implemented using some physical layout. For example, Decimal
-  values are stored as 16 bytes in a fixed-size binary
-  layout. Similarly, strings can be stored as ``List<1-byte>``. A
-  timestamp may be stored as 64-bit fixed-size layout.
+* **Parent** and **child arrays**: names to express relationships
+  between physical value arrays in a nested type structure. For
+  example, a ``List<T>``-type parent array has a T-type array as its
+  child (see more on lists below).
+* **Parametric type**: a type which requires additional parameters
+  for full determination of its semantics. For example, all nested types
+  are parametric by construction. A timestamp is also parametric as it needs
+  a unit (such as microseconds) and a timezone.
+
+Data Types
+==
+
+The `Schema.fbs`_ defines built-in data types supported by the
+Arrow columnar format. Each data type uses a well-defined physical layout.
+
+`Schema.fbs`_ is the authoritative source for the description of the
+standard Arrow data types. However, we also provide the below table for
+convenience:
+
++----------------+--------------------------+------------------------------+
+| Type           | Type Parameters *(1)*    | Physical Memory Layout       |
++================+==========================+==============================+
+| Null           |                          | Null                         |
++----------------+--------------------------+------------------------------+
+| Boolean        |                          | Fixed-size Primitive         |
++----------------+--------------------------+------------------------------+
+| Int            | * bit width              | *" (same as above)*          |
+|                | * signedness             |                              |
++----------------+--------------------------+------------------------------+
+| Floating Point | * precision              | *"*                          |
++----------------+--------------------------+------------------------------+
+| Decimal        | * bit width              | *"*                          |
+|                | * scale                  |                              |
+|                | * precision              |                              |
++----------------+--------------------------+------------------------------+
+| Date           | * unit                   | *"*                          |
++----------------+--------------------------+------------------------------+
+| Time           | * bit width *(2)*        | *"*                          |
+|                | * unit                   |                              |
++----------------+--------------------------+------------------------------+
+| Timestamp      | * unit                   | *"*                          |

Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152609899

   Revision: e5a8205abcf6501d9a1da01d6b61b3ea0820b5c2
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-758971](https://github.com/ursacomputing/crossbow/branches/all?query=actions-758971)
   
   |Task|Status|
   ||--|
   |verify-rc-source-integration-linux-ubuntu-20.04-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-758971-github-verify-rc-source-integration-linux-ubuntu-20.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9402185872/job/25895690026)|





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152601333

   @github-actions crossbow submit 
verify-rc-source-integration-linux-ubuntu-20.04-amd64





Re: [PR] GH-41664: [C++][Python] PrettyPrint non-cpu data by copying slice to default CPU device [arrow]

2024-06-06 Thread via GitHub


pitrou commented on PR #42010:
URL: https://github.com/apache/arrow/pull/42010#issuecomment-2152600218

   Yes, it's certainly useful on the CPU side as well.





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152594859

   @raulcd Could you please run the command one more time?
   
   Also, I understand that [PR](https://github.com/apache/arrow/pull/41455) was 
aimed at improving documentation. However, I'm not sure why it included updates 
to the integration paths for Java and JS.
   
   If these integration paths are not related to improving documentation, I 
think the code needs to be rolled back.
   
   + Is there any way to run the archery integration test commands on macOS?
   





Re: [I] [C++] Thread deadlock in ObjectOutputStream [arrow]

2024-06-06 Thread via GitHub


pitrou commented on issue #41862:
URL: https://github.com/apache/arrow/issues/41862#issuecomment-2152596970

   Cool, thanks!





Re: [PR] GH-41664: [C++][Python] PrettyPrint non-cpu data by copying slice to default CPU device [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on PR #42010:
URL: https://github.com/apache/arrow/pull/42010#issuecomment-2152565513

   @pitrou I started looking into this on the C++ side, but with the current 
approach of slice+copy+concat, I could also easily do this just on the Python 
side before passing the data to `PrettyPrint`. But do you think this is useful 
to keep on the C++ side?





Re: [PR] GH-41664: [C++][Python] PrettyPrint non-cpu data by copying slice to default CPU device [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42010:
URL: https://github.com/apache/arrow/pull/42010#issuecomment-2152563172

   :warning: GitHub issue #41664 **has been automatically assigned in GitHub** 
to PR creator.





[PR] GH-41664: [C++][Python] PrettyPrint non-cpu data by copying slice to default CPU device [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche opened a new pull request, #42010:
URL: https://github.com/apache/arrow/pull/42010

   ### Rationale for this change
   
   The various Python reprs or the C++ `PrettyPrint` functions will currently 
just segfault when passing an object that has its data on a non-CPU device. In 
python, getting a segfault while displaying the object is very annoying, and so 
we should make this at least not crash.
   
   ### What changes are included in this PR?
   
   When we detect data on a non-CPU device passed to `PrettyPrint`, we copy the 
necessary part (window size of start and end of array) to the default CPU 
device, and then use the existing print utilities as is on this copied subset.
   
   Note that with the current version for nested data we might still be copying 
too much data (e.g. for a list array, there is also a `container_window` to 
limit the number of elements to show in a single element).
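
   The slice + copy + concat idea can be sketched roughly as follows (hedged Python pseudocode; `copy_to_cpu` and `repr_window` are hypothetical names, not the actual C++ or PyArrow API):

```python
# Hypothetical sketch of printing a non-CPU array by copying only the
# window edges to the CPU; `copy_to_cpu` stands in for a device-to-CPU copy.
def repr_window(arr, copy_to_cpu, window=10):
    n = len(arr)
    if n <= 2 * window:
        # Small enough: copy everything and print as usual.
        return repr(copy_to_cpu(arr))
    head = copy_to_cpu(arr[:window])    # first `window` values
    tail = copy_to_cpu(arr[-window:])   # last `window` values
    return f"{head!r} ... {tail!r} ({n} values total)"
```

   Only the head and tail windows ever leave the device, which keeps the copy small for large arrays.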
   
   For now just testing with Array, need to generalize to the other objects 
(ChunkedArray, RecordBatch, Table).
   
   ### Are these changes tested?
   
   TODO
   
   ### Are there any user-facing changes?
   
   No





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152536998

   It seems that the `java` tests have passed, but as expected, there is an 
error with the `js` tests. I will fix it too.





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on PR #41889:
URL: https://github.com/apache/arrow/pull/41889#issuecomment-2152528984

   Anyway, the reason this fails is, I think, that pickling is not expected to 
work with non-CPU data. Even if it does not segfault, we should not allow 
calling it if it simply gives garbage results (it seems all null in the second 
case). I would add a similar error to `__reduce_ex__` to avoid getting those 
confusing errors or wrong results.
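
   A minimal sketch of that kind of guard (illustrative only; the `Buffer` class and attribute names here are hypothetical, not PyArrow's actual implementation):

```python
# Hypothetical sketch: raise early in __reduce_ex__ for non-CPU buffers
# instead of pickling garbage bytes.
class Buffer:
    def __init__(self, data, is_cpu=True):
        self._data = data
        self.is_cpu = is_cpu

    def __reduce_ex__(self, protocol):
        # Refuse to pickle device buffers rather than produce wrong bytes.
        if not self.is_cpu:
            raise NotImplementedError(
                "pickling a non-CPU buffer is not supported; "
                "copy it to the CPU first")
        return (type(self), (self._data,))
```

   CPU-backed buffers pickle as before, while device buffers fail with a clear message instead of garbage results.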





Re: [PR] GH-41834: [R] Better error handling in dplyr code [arrow]

2024-06-06 Thread via GitHub


conbench-apache-arrow[bot] commented on PR #41576:
URL: https://github.com/apache/arrow/pull/41576#issuecomment-2152523051

   After merging your PR, Conbench analyzed the 7 benchmarking runs that have 
been run so far on merge-commit 774ee0f2fe38e7f6023f60a686b38ec7dc31722f.
   
   There were 8 benchmark results indicating a performance regression:
   
   - Commit Run on `test-mac-arm` at [2024-05-29 
18:14:19Z](https://conbench.ursa.dev/compare/runs/88a72cad81414a5eb7e851b58a373dd4...bc06ecd10d1d4d158db3931b30f43fec/)
 - [`tpch` (R) with engine=arrow, format=parquet, language=R, 
memory_map=False, query_id=TPCH-15, 
scale_factor=1](https://conbench.ursa.dev/compare/benchmarks/06657567cd7a75e58000bf8739c169ab...0665771b6ca97ba38000c122601f7c6f)
 - [`tpch` (R) with engine=arrow, format=parquet, language=R, 
memory_map=False, query_id=TPCH-18, 
scale_factor=1](https://conbench.ursa.dev/compare/benchmarks/0665756bad4b716d800042de2faf5ccf...0665771f65df7f2f8000bb5fa0559966)
   - and 6 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/25894658283) 
has more details. It also includes information about 9 possible false positives 
for unstable benchmarks that are known to sometimes produce them.





Re: [I] [C++] Thread deadlock in ObjectOutputStream [arrow]

2024-06-06 Thread via GitHub


icexelloss commented on issue #41862:
URL: https://github.com/apache/arrow/issues/41862#issuecomment-2152512598

   @pitrou I ran tests with #41876 for two days and didn't see a deadlock. So I 
think #41876 fixes the issue that I was seeing.





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on PR #41889:
URL: https://github.com/apache/arrow/pull/41889#issuecomment-2152495640

   > Can you try specifying `protocol=4` when calling `dumps()`?
   
   Passes! I get an assertion error for:
   
   ```python
   >   assert memoryview(result) == memoryview(buf_on_gpu)
   
   pyarrow/tests/test_io.py:693: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _
   
   >   assert buffer.len == 0
   E   AssertionError
   
   pyarrow/io.pxi:1444: AssertionError
   ```
   
   and
   
   ```python
   result = pickle_module.loads(pickle_module.dumps(buf_on_gpu, 
protocol=4))
   assert len(result) == len(buf_on_gpu)
   # assert memoryview(result) == memoryview(buf_on_gpu)
   >   assert result.to_pybytes() == buf_on_gpu.to_pybytes()
   E   AssertionError: assert b'\x10R\x9c\x...\xc5U\x00\x00' == 
b'\x00\x00\x0...0\x00\x00\x00'
   E 
   E At index 0 diff: b'\x10' != b'\x00'
   E Use -v to get more diff
   
   pyarrow/tests/test_io.py:694: AssertionError
   ```





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on PR #41889:
URL: https://github.com/apache/arrow/pull/41889#issuecomment-2152475109

   > It doesn't look like it segfaults.
   
   Can you try specifying `protocol=4` when calling `dumps()`?





Re: [PR] GH-41662: [Python] Ensure Buffer methods don't crash with non-CPU data [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on PR #41889:
URL: https://github.com/apache/arrow/pull/41889#issuecomment-215243

   > Some more suggestions for the tests:
   > 
   > * test the repr
   > * does the `equals()` method work? (both for cuda == cuda or cuda == cpu)
   
   I have added all together with the other suggestions.
   
   > * pickling might also segfault?
   
   It doesn't look like it segfaults. I get this error though:
   
   ```python
   >   result = pickle_module.loads(pickle_module.dumps(buf_on_gpu))
   
   pyarrow/tests/test_io.py:692: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
../../envs/arrow-dev/lib/python3.11/site-packages/cloudpickle/cloudpickle.py:1479:
 in dumps
   cp.dump(obj)
   
../../envs/arrow-dev/lib/python3.11/site-packages/cloudpickle/cloudpickle.py:1245:
 in dump
   return super().dump(obj)
   pyarrow/io.pxi:1413: in pyarrow.lib.Buffer.__reduce_ex__
   bufobj = pickle.PickleBuffer(self)
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   assert buffer.len == 0
   E   AssertionError
   
   pyarrow/io.pxi:1444: AssertionError
   ```
   
   Haven't looked into it yet. Will do.
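
   For context, `pickle.PickleBuffer` (which `__reduce_ex__` uses for protocol 5, per the traceback above) requires the wrapped object to expose the buffer protocol, which is where a device buffer can trip up; a small CPU-only illustration:

```python
import pickle

# PickleBuffer wraps any object exposing the buffer protocol;
# a CPU-backed bytes object works, a device buffer would not.
pb = pickle.PickleBuffer(b"abcd")
view = pb.raw()  # memoryview over the underlying contiguous bytes
assert bytes(view) == b"abcd"
```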





Re: [PR] GH-41481: [CI] Update how extra environment variables are specified for the integration test docker job [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42009:
URL: https://github.com/apache/arrow/pull/42009#issuecomment-2152400499

   :warning: GitHub issue #41481 **has been automatically assigned in GitHub** 
to PR creator.





[PR] GH-41481: [CI] Update how extra environment variables are specified for the integration test docker job [arrow]

2024-06-06 Thread via GitHub


paleolimbot opened a new pull request, #42009:
URL: https://github.com/apache/arrow/pull/42009

   ### Rationale for this change
   
   Currently, nanoarrow and Rust are not being included in the integration 
tests. The command issued by archery includes multiple environment variable 
definitions and the rightmost ones disable the extra environment variables.
   
   
https://github.com/apache/arrow/actions/runs/9397807525/job/25881776553#step:9:353
   
   ```
   DEBUG:archery:Executing `['docker', 'run', '--rm', '-e', 
'ARCHERY_DEFAULT_BRANCH=main', '-e', 'ARCHERY_INTEGRATION_WITH_NANOARROW=1', 
'-e', 'ARCHERY_INTEGRATION_WITH_RUST=1', '-e', 
'ARCHERY_INTEGRATION_WITH_NANOARROW=0', '-e', 
'ARCHERY_INTEGRATION_WITH_RUST=0', '-e', 'ARROW_CPP_EXE_PATH=/build/cpp/debug', 
'-e', 'ARROW_NANOARROW_PATH=/build/nanoarrow', '-e', 
'ARROW_RUST_EXE_PATH=/build/rust/debug', '-e', 'CCACHE_COMPILERCHECK=content', 
'-e', 'CCACHE_COMPRESS=1', '-e', 'CCACHE_COMPRESSLEVEL=6', '-e', 
'CCACHE_DIR=/ccache', '-e', 'CCACHE_MAXSIZE=1G', '-e', 'GITHUB_ACTIONS=true', 
'-v', '/home/runner/work/arrow/arrow:/arrow', '-v', 
'/home/runner/work/arrow/arrow/.docker/conda-ccache:/ccache', 
'apache/arrow-dev:amd64-conda-integration', 
'/arrow/ci/scripts/integration_arrow_build.sh /arrow /build && 
/arrow/ci/scripts/integration_arrow.sh /arrow /build']`
   # ...
   + /arrow/ci/scripts/rust_build.sh /arrow /build
   =
   Not building the Rust implementation.
   =
   + /arrow/ci/scripts/nanoarrow_build.sh /arrow /build
   =
   Not building nanoarrow
   =
   ```
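
   The duplicated `-e` flags matter because Docker applies them left to right, so the rightmost definition of a variable wins; a small sketch of that last-write-wins merge (plain Python, just to illustrate the behavior seen in the log above):

```python
def effective_env(pairs):
    # Later definitions overwrite earlier ones, like repeated -e flags.
    env = {}
    for name, value in pairs:
        env[name] = value
    return env

flags = [
    ("ARCHERY_INTEGRATION_WITH_NANOARROW", "1"),
    ("ARCHERY_INTEGRATION_WITH_RUST", "1"),
    ("ARCHERY_INTEGRATION_WITH_NANOARROW", "0"),
    ("ARCHERY_INTEGRATION_WITH_RUST", "0"),
]
# Both variables end up "0", matching the "Not building" output above.
```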
   
   ### What changes are included in this PR?
   
   This PR updates how environment variables are specified such that the 
intended value is passed to the docker build.
   
   ### Are these changes tested?
   
   Yes
   
   ### Are there any user-facing changes?
   
   No





Re: [I] Write parquet statistics for `IntervalDayTimeArray` , `IntervalMonthDayNanoArray` and `IntervalYearMonthArray` [arrow-rs]

2024-06-06 Thread via GitHub


alamb commented on issue #5847:
URL: https://github.com/apache/arrow-rs/issues/5847#issuecomment-2152334490

   @marvinlanhenke also points out that completely supporting this ticket 
probably requires completing https://github.com/apache/arrow-rs/issues/5849 
first.





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152311167

   @kou Could you please let me know if there was another intention behind the 
PR you wrote?
   
   From my review of the code, it seems that in this file, the path needs to go 
up 4 levels to correctly refer to the `java` directory.
   





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152306975

   Revision: 849707f93144bb4a6f8cd8358e58bb4a9ac2a4de
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-0e1aa06ee4](https://github.com/ursacomputing/crossbow/branches/all?query=actions-0e1aa06ee4)
   
   |Task|Status|
   ||--|
   |verify-rc-source-integration-linux-ubuntu-20.04-amd64|[![GitHub 
Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-0e1aa06ee4-github-verify-rc-source-integration-linux-ubuntu-20.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9401009463/job/25891842942)|





Re: [I] Support writing `IntervalMonthDayNanoArray` to parquet via Arrow Writer [arrow-rs]

2024-06-06 Thread via GitHub


alamb commented on issue #5849:
URL: https://github.com/apache/arrow-rs/issues/5849#issuecomment-2152306145

   I updated the title of this ticket to reflect the end behavior I think it is 
addressing
   
   Specifically, I think the gap identified by @marvinlanhenke  above is that 
trying to write an `IntervalMonthDayNano` array to parquet via 
https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html
 will not work
   
   To proceed with this ticket, the first thing would probably be to make a 
small test case to verify that `IntervalMonthDayNanoArray` cannot be written 
to Parquet
   





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152302021

   @github-actions crossbow submit 
verify-rc-source-integration-linux-ubuntu-20.04-amd64





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152289327

   @raulcd Could you please execute the command as you mentioned?





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152280187

   I verified that the path to find the `java` directory needs to be set to 4 
levels up. If this resolves the issue, we may also need to update the paths for 
`js` to ensure consistency.
   
   ```bash
   # from /Users/lama/workspace/arrow-new/dev/archery/archery/utils
   ARROW_ROOT: /Users/lama/workspace/arrow-new
   # from /Users/lama/workspace/arrow-new/dev/archery/archery/integration
   ARROW_BUILD_ROOT: /Users/lama/workspace
   ```
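
   The "4 levels up" arithmetic can be sketched with `pathlib` (the file path 
below is the illustrative one from the comment above, not a guaranteed layout):

   ```python
   from pathlib import Path

   # Illustrative path: tester_java.py lives under dev/archery/archery/integration/,
   # so the repository root is parents[4], and parents[5] overshoots to workspace.
   p = Path("/Users/lama/workspace/arrow-new/dev/archery/archery/integration/tester_java.py")
   assert p.parents[3] == Path("/Users/lama/workspace/arrow-new/dev")
   assert p.parents[4] == Path("/Users/lama/workspace/arrow-new")  # ARROW_ROOT
   assert p.parents[5] == Path("/Users/lama/workspace")            # one level too far
   ```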
   





Re: [PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to find pom.xml [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42008:
URL: https://github.com/apache/arrow/pull/42008#issuecomment-2152275730

   :warning: GitHub issue #42005 **has been automatically assigned in GitHub** 
to PR creator.





[PR] GH-42005: [Java][Integration][CI] Fix ARROW_BUILD_ROOT Path to Find pom.xml [arrow]

2024-06-06 Thread via GitHub


llama90 opened a new pull request, #42008:
URL: https://github.com/apache/arrow/pull/42008

   
   
   ### Rationale for this change
   
   This PR aims to fix the issue where the integration tests are failing due to 
the missing `/java/pom.xml` file. It appears that the current code incorrectly 
determines the path to `ARROW_BUILD_ROOT`, leading to the failure in locating 
the `pom.xml` file.
   
   
   
   
   ### What changes are included in this PR?
   
   - Updating the `ARROW_BUILD_ROOT` path determination logic in 
`tester_java.py` to correctly reference the project root.
   
   
   
   ### Are these changes tested?
   
   Maybe, Yes.
   
   
   
   ### Are there any user-facing changes?
   
   No.
   
   
   
   
   
   
   
   





Re: [I] [Python][Dev] Source install yields unexpected keyword argument: 'version_file' [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on issue #41992:
URL: https://github.com/apache/arrow/issues/41992#issuecomment-2152270006

   It's also quite likely that you see that error just because of having a too 
old setuptools_scm in your environment (that's the error I got in another env 
where I tried to build inplace with `--no-build-isolation`)





Re: [PR] GH-41691: [Doc] Remove notion of "logical type" [arrow]

2024-06-06 Thread via GitHub


alamb commented on code in PR #41958:
URL: https://github.com/apache/arrow/pull/41958#discussion_r1629412427


##
docs/source/format/Columnar.rst:
##
@@ -70,21 +70,131 @@ concepts, here is a small glossary to help disambiguate.
   without taking into account any value semantics. For example, a
   32-bit signed integer array and 32-bit floating point array have the
   same layout.
-* **Parent** and **child arrays**: names to express relationships
-  between physical value arrays in a nested type structure. For
-  example, a ``List<T>``-type parent array has a T-type array as its
-  child (see more on lists below).
+* **Data type**: An application-facing semantic value type that is
+  implemented using some physical layout. For example, Decimal128
+  values are stored as 16 bytes in a fixed-size binary
+  layout. A timestamp may be stored as 64-bit fixed-size layout.
 * **Primitive type**: a data type having no child types. This includes
   such types as fixed bit-width, variable-size binary, and null types.
 * **Nested type**: a data type whose full structure depends on one or
   more other child types. Two fully-specified nested types are equal
  if and only if their child types are equal. For example, ``List<U>``
  is distinct from ``List<V>`` iff U and V are different types.
-* **Logical type**: An application-facing semantic value type that is
-  implemented using some physical layout. For example, Decimal
-  values are stored as 16 bytes in a fixed-size binary
-  layout. Similarly, strings can be stored as ``List<1-byte>``. A
-  timestamp may be stored as 64-bit fixed-size layout.
+* **Parent** and **child arrays**: names to express relationships
+  between physical value arrays in a nested type structure. For
+  example, a ``List<T>``-type parent array has a T-type array as its
+  child (see more on lists below).
+* **Parametric type**: a type which requires additional parameters
+  for full determination of its semantics. For example, all nested types
+  are parametric by construction. A timestamp is also parametric as it needs
+  a unit (such as microseconds) and a timezone.
+
+Data Types
+==
+
+The `Schema.fbs`_ defines built-in data types supported by the
+Arrow columnar format. Each data type uses a well-defined physical layout.
+
+`Schema.fbs`_ is the authoritative source for the description of the
+standard Arrow data types. However, we also provide the below table for
+convenience:
+
+++----------------+-----------------------+------------------------+
+| Type           | Type Parameters *(1)* | Physical Memory Layout |
+++================+=======================+========================+
+| Null           |                       | Null                   |
+++----------------+-----------------------+------------------------+
+| Boolean        |                       | Fixed-size Primitive   |
+++----------------+-----------------------+------------------------+
+| Int            | * bit width           | *" (same as above)*    |
+|                | * signedness          |                        |
+++----------------+-----------------------+------------------------+
+| Floating Point | * precision           | *"*                    |
+++----------------+-----------------------+------------------------+
+| Decimal        | * bit width           | *"*                    |
+|                | * scale               |                        |
+|                | * precision           |                        |
+++----------------+-----------------------+------------------------+
+| Date           | * unit                | *"*                    |
+++----------------+-----------------------+------------------------+
+| Time           | * bit width *(2)*     | *"*                    |
+|                | * unit                |                        |
+++----------------+-----------------------+------------------------+
+| Timestamp      | * unit                | *"*

Re: [PR] GH-41970: [C++] Misc changes making code around list-like types and list-view types behave the same way [arrow]

2024-06-06 Thread via GitHub


pitrou commented on code in PR #41971:
URL: https://github.com/apache/arrow/pull/41971#discussion_r1629413512


##
docs/source/cpp/compute.rst:
##
@@ -1433,14 +1435,20 @@ null input value is converted into a null output value.
 
 * \(3) The list offsets are unchanged, the list values are cast from the
   input value type to the output value type (if a conversion is
-  available).
+  available). If the output type is (Large)ListView, then sizes are
+  derived from the offsets.
+
+* \(4) If output type is list-like, offsets might have to be rebuilt to be
+  sorted and spaced correctly. If output type is a list-view type, the offsets

Review Comment:
   The values also need to be rewritten for the views to be contiguous, right?
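
   As a side note, the "sizes are derived from the offsets" step quoted in the 
diff above can be sketched in a few lines (a toy Python model, not the Arrow 
C++ kernel):

   ```python
   # A list array with offsets [0, 2, 2, 5] holds three lists; list i spans
   # offsets[i]..offsets[i+1], so its list-view size is the difference.
   offsets = [0, 2, 2, 5]
   sizes = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
   assert sizes == [2, 0, 3]
   ```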



##
cpp/src/arrow/compute/kernels/vector_hash.cc:
##
@@ -698,13 +698,12 @@ void AddHashKernels(VectorFunction* func, VectorKernel base, OutputType out_ty)
 DCHECK_OK(func->AddKernel(base));
   }
 
-  // Example parametric types that we want to match only on Type::type
-  auto parametric_types = {time32(TimeUnit::SECOND), time64(TimeUnit::MICRO),
-   timestamp(TimeUnit::SECOND), 
duration(TimeUnit::SECOND),
-   fixed_size_binary(0)};
-  for (const auto& ty : parametric_types) {
-base.init = GetHashInit(ty->id());
-base.signature = KernelSignature::Make({ty->id()}, out_ty);
+  // Parametric types that we want matching to be depededent only on type id
+  auto parametric_types = {Type::TIME32, Type::TIME64, Type::TIMESTAMP, 
Type::DURATION,
+   Type::FIXED_SIZE_BINARY};
+  for (const auto& type_id : parametric_types) {
+base.init = GetHashInit(type_id);
+base.signature = KernelSignature::Make({type_id}, out_ty);

Review Comment:
   Sounds good to me.



##
cpp/src/arrow/compute/kernels/vector_hash.cc:
##
@@ -698,13 +698,12 @@ void AddHashKernels(VectorFunction* func, VectorKernel base, OutputType out_ty)
 DCHECK_OK(func->AddKernel(base));
   }
 
-  // Example parametric types that we want to match only on Type::type
-  auto parametric_types = {time32(TimeUnit::SECOND), time64(TimeUnit::MICRO),
-   timestamp(TimeUnit::SECOND), 
duration(TimeUnit::SECOND),
-   fixed_size_binary(0)};
-  for (const auto& ty : parametric_types) {
-base.init = GetHashInit(ty->id());
-base.signature = KernelSignature::Make({ty->id()}, out_ty);
+  // Parametric types that we want matching to be depededent only on type id

Review Comment:
   ```suggestion
 // Parametric types that we want matching to be dependent only on type id
   ```






Re: [PR] GH-41970: [C++] Misc changes making code around list-like types and list-view types behave the same way [arrow]

2024-06-06 Thread via GitHub


pitrou commented on code in PR #41971:
URL: https://github.com/apache/arrow/pull/41971#discussion_r1629406589


##
cpp/src/arrow/type_fwd.h:
##
@@ -525,19 +525,19 @@ std::shared_ptr<DataType> decimal256(int32_t precision, int32_t scale);
 
 /// \brief Create a ListType instance from its child Field type
 ARROW_EXPORT
-std::shared_ptr<DataType> list(const std::shared_ptr<Field>& value_type);
+std::shared_ptr<DataType> list(std::shared_ptr<Field> value_type);

Review Comment:
   We've already made such changes in the past and nobody complained. This 
doesn't morally change the API (any existing usage should still work properly) 
and we don't make any promises about the ABI.






Re: [PR] GH-41960: Expose new S3 option check_directory_existence_before_creation [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on code in PR #41972:
URL: https://github.com/apache/arrow/pull/41972#discussion_r1629377802


##
python/pyarrow/_s3fs.pyx:
##
@@ -237,11 +237,20 @@ cdef class S3FileSystem(FileSystem):
 'port': 8020, 'username': 'username',
 'password': 'password'})
 allow_bucket_creation : bool, default False
-Whether to allow CreateDir at the bucket-level. This option may also be
+Whether to allow directory creation at the bucket-level. This option 
may also be
 passed in a URI query parameter.
 allow_bucket_deletion : bool, default False
-Whether to allow DeleteDir at the bucket-level. This option may also be
+Whether to allow directory deletion at the bucket-level. This option 
may also be
 passed in a URI query parameter.
+check_directory_existence_before_creation : bool, default false
+Whether to check the directory existence before creating it.
+if false, when creating a directory the code will not check if it 
already
+exists or not. It's an optimization to try directory creation and 
catch the error,
+  rather than issue two dependent I/O calls.

Review Comment:
   ```suggestion
   If false, when creating a directory the code will not check if it 
already
   exists or not. It's an optimization to try directory creation and 
catch the error,
   rather than issue two dependent I/O calls.
   ```



##
python/pyarrow/_s3fs.pyx:
##
@@ -237,11 +237,20 @@ cdef class S3FileSystem(FileSystem):
 'port': 8020, 'username': 'username',
 'password': 'password'})
 allow_bucket_creation : bool, default False
-Whether to allow CreateDir at the bucket-level. This option may also be
+Whether to allow directory creation at the bucket-level. This option 
may also be
 passed in a URI query parameter.
 allow_bucket_deletion : bool, default False
-Whether to allow DeleteDir at the bucket-level. This option may also be
+Whether to allow directory deletion at the bucket-level. This option 
may also be
 passed in a URI query parameter.
+check_directory_existence_before_creation : bool, default false
+Whether to check the directory existence before creating it.
+if false, when creating a directory the code will not check if it 
already
+exists or not. It's an optimization to try directory creation and 
catch the error,
+  rather than issue two dependent I/O calls.
+if true, when creating a directory the code will only create the 
directory when necessary

Review Comment:
   ```suggestion
   If true, when creating a directory the code will only create the 
directory when necessary
   ```






Re: [PR] GH-41955: [C++] Follow up of adding null_bitmap to MapArray::FromArrays [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41956:
URL: https://github.com/apache/arrow/pull/41956#discussion_r1629372989


##
cpp/src/arrow/array/array_nested.cc:
##
@@ -847,14 +847,13 @@ Result<std::shared_ptr<Array>> MapArray::FromArraysInternal(
   const auto& typed_offsets = checked_cast<const OffsetArrayType&>(*offsets);
 
   BufferVector buffers;
-  int64_t null_count;
-  if (null_bitmap != nullptr) {
-buffers = BufferVector({std::move(null_bitmap), typed_offsets.values()});
-null_count = null_bitmap->size();
-  } else {
-buffers = BufferVector({null_bitmap, typed_offsets.values()});
-null_count = 0;
+  buffers.resize(2);
+  int64_t null_count = 0;
+  if (null_bitmap) {
+buffers[0] = std::move(null_bitmap);

Review Comment:
   Done: 
https://github.com/apache/arrow/pull/41956/commits/a8586cabb4749069a6b67c335fede996b1ed65eb






Re: [PR] Minor: refine row selection example more [arrow-rs]

2024-06-06 Thread via GitHub


alamb commented on code in PR #5850:
URL: https://github.com/apache/arrow-rs/pull/5850#discussion_r1629364784


##
parquet/src/arrow/arrow_reader/mod.rs:
##
@@ -155,24 +155,43 @@ impl ArrowReaderBuilder {
 ///
 /// # Example
 ///
-/// Given a parquet file with 3 row groups, and a row group filter of
-/// `[0, 2]`, in order to only scan rows 50-100 in row group 2:
+/// Given a parquet file with 4 row groups, and a row group filter of `[0,
+/// 2, 3]`, in order to scan rows 50-100 in row group 2 and rows 200-300 in
+/// row group 3:
 ///
 /// ```text
 ///   Row Group 0, 1000 rows (selected)
 ///   Row Group 1, 1000 rows (skipped)
 ///   Row Group 2, 1000 rows (selected, but want to only scan rows 50-100)
+///   Row Group 3, 1000 rows (selected, but want to only scan rows 200-300)
 /// ```
 ///
-/// You would pass the following [`RowSelection`]:
+/// You could pass the following [`RowSelection`]:
 ///
 /// ```text
 ///  Select 1000(scan all rows in row group 0)
-///  Select 50-100 (scan rows 50-100 in row group 2)
+///  Skip 50(skip the first 50 rows in row group 2)
+///  Select 50  (scan rows 50-100 in row group 2)
+///  Skip 900   (skip the remaining rows in row group 2)

Review Comment:
   Not showing this "skip remaining rows in row group" I think was part of what 
was confusing
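
   For readers following along, the skip/select arithmetic from the example 
can be modeled in a few lines of Python (a toy model, not the parquet crate's 
`RowSelection` API):

   ```python
   # Toy model of applying skip/select runs to the concatenated rows of the
   # *selected* row groups [0, 2, 3] (1000 rows each -> 3000 rows total).
   def apply_selection(total_rows, runs):
       """runs: list of ("select" | "skip", count); returns selected indices."""
       out, pos = [], 0
       for kind, n in runs:
           if kind == "select":
               out.extend(range(pos, pos + n))
           pos += n
       assert pos <= total_rows
       return out

   runs = [("select", 1000),                              # all of row group 0
           ("skip", 50), ("select", 50), ("skip", 900),   # rows 50-100 of RG 2
           ("skip", 200), ("select", 100), ("skip", 700)] # rows 200-300 of RG 3
   rows = apply_selection(3000, runs)
   assert len(rows) == 1150
   assert rows[1000] == 1050  # first selected row inside row group 2
   ```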






[PR] Minor: refine row selection example more [arrow-rs]

2024-06-06 Thread via GitHub


alamb opened a new pull request, #5850:
URL: https://github.com/apache/arrow-rs/pull/5850

   # Which issue does this PR close?
   
   Follow on to https://github.com/apache/arrow-rs/pull/5824
   
   # Rationale for this change

   In discussing how row selections interacted with the row group selection 
with @advancedxy  on   https://github.com/apache/datafusion/pull/10738 (see 
https://github.com/apache/datafusion/pull/10738/files#r1628074742) it became 
clear that the example could be improved to better explain what is going on 
   
   # What changes are included in this PR?
   
   Refine the example added in https://github.com/apache/arrow-rs/pull/5824 to 
better show how skip / select works at the end of a row group.
   
   # Are there any user-facing changes?
   Moar docs, no functional change
   
   
   
   
   





Re: [PR] GH-41684: [C++][Python] Add optional null_bitmap to MapArray::FromArrays [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41757:
URL: https://github.com/apache/arrow/pull/41757#discussion_r1629363233


##
cpp/src/arrow/array/array_nested.cc:
##
@@ -836,24 +845,32 @@ Result<std::shared_ptr<Array>> MapArray::FromArraysInternal(
 
   using OffsetArrayType = typename TypeTraits<OffsetArrowType>::ArrayType;
   const auto& typed_offsets = checked_cast<const OffsetArrayType&>(*offsets);
-  auto buffers = BufferVector({nullptr, typed_offsets.values()});
+
+  BufferVector buffers;
+  int64_t null_count;
+  if (null_bitmap != nullptr) {
+buffers = BufferVector({std::move(null_bitmap), typed_offsets.values()});
+null_count = null_bitmap->size();

Review Comment:
   Done:
   
https://github.com/apache/arrow/pull/41956/commits/5b6ec5498f3e170d56f64a425dd462a2c8a268e2
   
https://github.com/apache/arrow/pull/41956/commits/83adabeb1edf62315a95f9adac398177e3d1c9d3






Re: [PR] GH-41955: [C++] Follow up of adding null_bitmap to MapArray::FromArrays [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41956:
URL: https://github.com/apache/arrow/pull/41956#discussion_r1629360321


##
cpp/src/arrow/array/array_nested.cc:
##
@@ -847,14 +847,13 @@ Result<std::shared_ptr<Array>> MapArray::FromArraysInternal(
   const auto& typed_offsets = checked_cast<const OffsetArrayType&>(*offsets);
 
   BufferVector buffers;
-  int64_t null_count;
-  if (null_bitmap != nullptr) {
-buffers = BufferVector({std::move(null_bitmap), typed_offsets.values()});
-null_count = null_bitmap->size();
-  } else {
-buffers = BufferVector({null_bitmap, typed_offsets.values()});
-null_count = 0;
+  buffers.resize(2);
+  int64_t null_count = 0;
+  if (null_bitmap) {
+buffers[0] = std::move(null_bitmap);

Review Comment:
   I might be missing something, but after removing `const` from `null_bitmap` 
in `FromArraysInternal` and both `MapArray::FromArrays` overloads, I get the 
following error:
   
   ```
   /Users/alenkafrim/repos/arrow/cpp/src/arrow/array/array_nested.h:540:32: 
error: non-const lvalue reference to type 'std::shared_ptr' cannot bind 
to a temporary of type 'std::nullptr_t'
 std::shared_ptr& null_bitmap = NULLPTR);
  ^ ~~~
   /Users/alenkafrim/repos/arrow/cpp/src/arrow/array/array_nested.h:540:32: 
note: passing argument to parameter 'null_bitmap' here
   /Users/alenkafrim/repos/arrow/cpp/src/arrow/array/array_nested.h:546:32: 
error: non-const lvalue reference to type 'std::shared_ptr' cannot bind 
to a temporary of type 'std::nullptr_t'
 std::shared_ptr& null_bitmap = NULLPTR);
  ^ ~~~
   /Users/alenkafrim/repos/arrow/cpp/src/arrow/array/array_nested.h:546:32: 
note: passing argument to parameter 'null_bitmap' here
   /Users/alenkafrim/repos/arrow/cpp/src/arrow/array/array_nested.h:566:50: 
error: non-const lvalue reference to type 'std::shared_ptr' cannot bind 
to a temporary of type 'std::nullptr_t'
 MemoryPool* pool, std::shared_ptr& null_bitmap = NULLPTR);
   ```
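   
   The error boils down to a core C++ rule, sketched here in isolation (the 
names are illustrative, not from the Arrow codebase): a non-const lvalue 
reference parameter cannot take a temporary, such as a `std::shared_ptr` 
materialized from `NULLPTR`, as its default argument, while a const reference 
can.
   
   ```cpp
   #include <cassert>
   #include <memory>

   // This overload would NOT compile: the default argument creates a temporary
   // shared_ptr, which cannot bind to a non-const lvalue reference.
   //   bool has_value(std::shared_ptr<int>& p = nullptr);  // error
   // The const-reference form, as in the existing FromArrays signatures, is fine:
   bool has_value(const std::shared_ptr<int>& p = nullptr) { return p != nullptr; }

   int main() {
     assert(!has_value());                         // default nullptr binds OK
     assert(has_value(std::make_shared<int>(1)));
     return 0;
   }
   ```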
   
   the diff is:
   
   ```
   diff --git a/cpp/src/arrow/array/array_nested.cc 
b/cpp/src/arrow/array/array_nested.cc
   index fb2efe9ae..49f67a180 100644
   --- a/cpp/src/arrow/array/array_nested.cc
   +++ b/cpp/src/arrow/array/array_nested.cc
   @@ -807,7 +807,7 @@ MapArray::MapArray(const std::shared_ptr& 
type, int64_t length,
Result> MapArray::FromArraysInternal(
std::shared_ptr type, const std::shared_ptr& offsets,
const std::shared_ptr& keys, const std::shared_ptr& items,
   -MemoryPool* pool, const std::shared_ptr& null_bitmap) {
   +MemoryPool* pool, std::shared_ptr& null_bitmap) {
  using offset_type = typename MapType::offset_type;
  using OffsetArrowType = typename CTypeTraits::ArrowType;

   @@ -861,7 +861,7 @@ Result> 
MapArray::FromArraysInternal(
Result> MapArray::FromArrays(
const std::shared_ptr& offsets, const std::shared_ptr& 
keys,
const std::shared_ptr& items, MemoryPool* pool,
   -const std::shared_ptr& null_bitmap) {
   +std::shared_ptr& null_bitmap) {
  return FromArraysInternal(std::make_shared(keys->type(), 
items->type()),
offsets, keys, items, pool, null_bitmap);
}
   @@ -869,7 +869,7 @@ Result> MapArray::FromArrays(
Result> MapArray::FromArrays(
std::shared_ptr type, const std::shared_ptr& offsets,
const std::shared_ptr& keys, const std::shared_ptr& items,
   -MemoryPool* pool, const std::shared_ptr& null_bitmap) {
   +MemoryPool* pool, std::shared_ptr& null_bitmap) {
  if (type->id() != Type::MAP) {
return Status::TypeError("Expected map type, got ", type->ToString());
  }
   diff --git a/cpp/src/arrow/array/array_nested.h 
b/cpp/src/arrow/array/array_nested.h
   index f96b6bd3b..173147f55 100644
   --- a/cpp/src/arrow/array/array_nested.h
   +++ b/cpp/src/arrow/array/array_nested.h
   @@ -537,13 +537,13 @@ class ARROW_EXPORT MapArray : public ListArray {
  static Result> FromArrays(
  const std::shared_ptr& offsets, const std::shared_ptr& 
keys,
  const std::shared_ptr& items, MemoryPool* pool = 
default_memory_pool(),
   -  const std::shared_ptr& null_bitmap = NULLPTR);
   +  std::shared_ptr& null_bitmap = NULLPTR);

  static Result> FromArrays(
  std::shared_ptr type, const std::shared_ptr& offsets,
  const std::shared_ptr& keys, const std::shared_ptr& 
items,
  MemoryPool* pool = default_memory_pool(),
   -  const std::shared_ptr& null_bitmap = NULLPTR);
   +  std::shared_ptr& null_bitmap = NULLPTR);

  const MapType* map_type() const { return map_type_; }

   @@ -563,7 +563,7 @@ class ARROW_EXPORT MapArray : public ListArray {
  static Result> FromArraysInternal(
  std::shared_ptr type, const std::shared_ptr& offsets,
  const std::shared_ptr& keys, const std::shared_ptr& 
items,
   -  MemoryPool* pool, const std::shared_ptr& null_bitmap = 

Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


jorisvandenbossche commented on code in PR #42007:
URL: https://github.com/apache/arrow/pull/42007#discussion_r1629359962


##
dev/release/verify-release-candidate.sh:
##
@@ -756,7 +756,7 @@ test_python() {
   show_header "Build and test Python libraries"
 
   # Build and test Python
-  maybe_setup_virtualenv "cython>=0.29.31" numpy "setuptools_scm<8.0.0" 
setuptools
+  maybe_setup_virtualenv "cython>=0.29.31" numpy "setuptools_scm>=8.0.0" 
"setuptools>=64"

Review Comment:
   FWIW, for _release_ verification I think we should actually remove this line 
entirely, and rely on the fact that we specify those dependencies in 
pyproject.toml (i.e. replacing the `python setup.py build_ext --inplace` below 
with `python -m pip install (-e) .`).






Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152155842

   Revision: ec8ce7e0c5c5cd1f0b6236f253819bb979417a55
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-d9b2b774f4](https://github.com/ursacomputing/crossbow/branches/all?query=actions-d9b2b774f4)
   
   |Task|Status|
   |----|------|
   |verify-rc-source-python-linux-almalinux-8-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-linux-almalinux-8-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177093/job/25889180414)|
   |verify-rc-source-python-linux-conda-latest-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-linux-conda-latest-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177094/job/25889180605)|
   |verify-rc-source-python-linux-ubuntu-20.04-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-linux-ubuntu-20.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177082/job/25889180454)|
   |verify-rc-source-python-linux-ubuntu-22.04-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-linux-ubuntu-22.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177046/job/25889180389)|
   |verify-rc-source-python-macos-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-macos-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177138/job/25889180850)|
   |verify-rc-source-python-macos-arm64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-macos-arm64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177135/job/25889180892)|
   |verify-rc-source-python-macos-conda-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-d9b2b774f4-github-verify-rc-source-python-macos-conda-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9400177050/job/25889180426)|





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152147600

   I was missing dev tags on my fork that's why the jobs failed to generate the 
correct version, I've pushed them and will re-run





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152147789

   @github-actions crossbow submit verify-rc-source-python-*





Re: [I] [Java][Integration][CI] Integration tests are failing due to missing `/java/pom.xml` [arrow]

2024-06-06 Thread via GitHub


raulcd commented on issue #42005:
URL: https://github.com/apache/arrow/issues/42005#issuecomment-2152138578

   yes, of course, we can run the tests with our bot commenting with 
`@github-actions crossbow submit verify-rc-source-integration-*` on the PR





Re: [PR] Refine documentation for `unary_mut` and `binary_mut` [arrow-rs]

2024-06-06 Thread via GitHub


alamb commented on code in PR #5798:
URL: https://github.com/apache/arrow-rs/pull/5798#discussion_r1629342361


##
arrow-arith/src/arity.rs:
##
@@ -207,23 +227,46 @@ where
 Ok(PrimitiveArray::new(buffer.into(), nulls))
 }
 
-/// Given two arrays of length `len`, calls `op(a[i], b[i])` for `i` in 
`0..len`, mutating
-/// the mutable [`PrimitiveArray`] `a`. If any index is null in either `a` or 
`b`, the
-/// corresponding index in the result will also be null.
+/// Applies a binary and infallible function to values in two arrays, replacing
+/// the values in the first array in place.
 ///
-/// Mutable primitive array means that the buffer is not shared with other 
arrays.
-/// As a result, this mutates the buffer directly without allocating new 
buffer.
+/// # Details
+///
+/// Given two arrays of length `len`, calls `op(a[i], b[i])` for `i` in
+/// `0..len`, modifying the [`PrimitiveArray`] `a` in place, if possible.
+///
+/// If any index is null in either `a` or `b`, the corresponding index in the
+/// result will also be null.
+///
+/// # Buffer Reuse
+///
+/// If the underlying buffers in `a` are not shared with other arrays,  mutates
+/// the underlying buffer in place, without allocating.
 ///
 /// Like [`unary`] the provided function is evaluated for every index, 
ignoring validity. This
 /// is beneficial when the cost of the operation is low compared to the cost 
of branching, and
 /// especially when the operation can be vectorised, however, requires `op` to 
be infallible
 /// for all possible values of its inputs
 ///
-/// # Error
+/// # Errors
 ///
-/// This function gives error if the arrays have different lengths.
-/// This function gives error of original [`PrimitiveArray`] `a` if it is not 
a mutable
-/// primitive array.
+/// * if the arrays have different lengths.
+///
+/// # See Also
+///
+/// * Documentation on [`PrimitiveArray::unary_mut`] for operating on 
[`ArrayRef`].
+///
+/// # Example
+/// ```
+/// # use arrow_arith::arity::binary_mut;
+/// # use arrow_array::Float32Array;
+/// # use arrow_array::types::Int32Type;
+/// let a = Float32Array::from(vec![Some(5.1f32), None, Some(6.8)]);
+/// let b = Float32Array::from(vec![Some(1.0f32), None, Some(2.0)]);
+/// // compute a + b, updating the value in a in place if possible
+/// let a = binary_mut(a, &b, |a, b| a + b).unwrap().unwrap();

Review Comment:
   Update is that @viirya  fixed this in 
https://github.com/apache/arrow-rs/pull/5833






Re: [PR] Refine documentation for `unary_mut` and `binary_mut` [arrow-rs]

2024-06-06 Thread via GitHub


alamb merged PR #5798:
URL: https://github.com/apache/arrow-rs/pull/5798





Re: [I] [Java][Integration][CI] Integration tests are failing due to missing `/java/pom.xml` [arrow]

2024-06-06 Thread via GitHub


llama90 commented on issue #42005:
URL: https://github.com/apache/arrow/issues/42005#issuecomment-2152112834

   If my understanding is correct, the path for `ARROW_BUILD_ROOT` seems to be wrong. If I submit a pull request, can it be tested?





Re: [PR] GH-41955: [C++] Follow up of adding null_bitmap to MapArray::FromArrays [arrow]

2024-06-06 Thread via GitHub


AlenkaF commented on code in PR #41956:
URL: https://github.com/apache/arrow/pull/41956#discussion_r1629334388


##
cpp/src/arrow/array/array_nested.cc:
##
   @@ -847,14 +847,13 @@ Result<std::shared_ptr<Array>> MapArray::FromArraysInternal(
      const auto& typed_offsets = checked_cast<const Int32Array&>(*offsets);
    
      BufferVector buffers;
   -  int64_t null_count;
   -  if (null_bitmap != nullptr) {
   -    buffers = BufferVector({std::move(null_bitmap), typed_offsets.values()});
   -    null_count = null_bitmap->size();
   -  } else {
   -    buffers = BufferVector({null_bitmap, typed_offsets.values()});
   -    null_count = 0;
   +  buffers.resize(2);
   +  int64_t null_count = 0;
   +  if (null_bitmap) {
   +    buffers[0] = std::move(null_bitmap);
   +    null_count = -1;

Review Comment:
   Done: 
https://github.com/apache/arrow/pull/41956/commits/5b6ec5498f3e170d56f64a425dd462a2c8a268e2



##
cpp/src/arrow/array/array_nested.cc:
##
   @@ -847,14 +847,13 @@ Result<std::shared_ptr<Array>> MapArray::FromArraysInternal(
      const auto& typed_offsets = checked_cast<const Int32Array&>(*offsets);
    
      BufferVector buffers;
   -  int64_t null_count;
   -  if (null_bitmap != nullptr) {
   -    buffers = BufferVector({std::move(null_bitmap), typed_offsets.values()});
   -    null_count = null_bitmap->size();
   -  } else {
   -    buffers = BufferVector({null_bitmap, typed_offsets.values()});
   -    null_count = 0;
   +  buffers.resize(2);
   +  int64_t null_count = 0;
   +  if (null_bitmap) {
   +    buffers[0] = std::move(null_bitmap);

Review Comment:
   Thank you for such a detailed response and for taking time to educate us! ❤️ 
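
   For readers following along from Python: the top-level validity that the `null_bitmap` argument carries in the C++ signature is visible as the first buffer of a `MapArray`. A minimal sketch (assuming a reasonably recent `pyarrow`; the array is built here from a Python list rather than via `from_arrays`):

   ```python
   import pyarrow as pa

   # Build a map array with one null entry; the null is recorded in the
   # array's top-level validity bitmap (the first buffer), which is the
   # information the null_bitmap argument to MapArray::FromArrays carries in C++.
   ty = pa.map_(pa.string(), pa.int32())
   arr = pa.array([[("a", 1), ("b", 2)], None, [("c", 3)]], type=ty)

   print(arr.null_count)                # 1
   print(arr.buffers()[0] is not None)  # True: validity bitmap is present
   ```

   The second entry round-trips back to Python as `None`, confirming the null lives at the map level rather than in the keys/items children.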






Re: [PR] GH-41923: [C++] Maybe fix ExecuteScalar with chunked_array [arrow]

2024-06-06 Thread via GitHub


ZhangHuiGui commented on code in PR #41925:
URL: https://github.com/apache/arrow/pull/41925#discussion_r1628700236


##
cpp/src/arrow/compute/expression_test.cc:
##
@@ -909,6 +909,39 @@ TEST(Expression, ExecuteCallWithNoArguments) {
   EXPECT_EQ(actual.length(), kCount);
 }
 
+TEST(Expression, ExecuteChunkedArray) {

Review Comment:
   > The implication from `COMPUTED_NO_PREALLOCATE` should be regarded as an unstable internal implementation choice
   
   Ah, I don't quite understand the phrase `unstable internal implementation choice` here. Do you mean that the `min_element_wise` kernel needs to consider whether it should use `COMPUTED_NO_PREALLOCATE`?
   
   Some functions have kernel output types that are nested types or dictionary types, so they cannot pre-allocate the validity bitmap before actual execution and can only set `NullHandling` to `COMPUTED_NO_PREALLOCATE`.
   
   In that case, if the input of such a function is a chunked array, then under the existing logic of `ScalarExecutor` the nested-type or dictionary-type output must also be a chunked array:
   
   1. `preallocating_all_buffers_` must be `false` on this kind of kernel's execution path, so execution goes into `ExecuteNonSpans`.
   2. `ExecuteNonSpans` emplaces each chunk's array-data result into the listener:
https://github.com/apache/arrow/blob/9ee6ea701e20d1b47934f977d87811624061d597/cpp/src/arrow/compute/exec.cc#L897
   3. The listener has accumulated multiple arrays and returns a chunked array in the logic below:
https://github.com/apache/arrow/blob/9ee6ea701e20d1b47934f977d87811624061d597/cpp/src/arrow/compute/exec.cc#L814-L824
   
   If `ExecBatch` supports chunked arrays, should such output also be tested (to ensure that the expression system works properly in this scenario)?
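
   As a quick illustration of the per-chunk execution path described above, here is how a chunked input surfaces from Python (a sketch with `pyarrow`; whether `min_element_wise` itself ends up using `COMPUTED_NO_PREALLOCATE` is exactly the open question in this thread):

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   # A scalar kernel applied to a chunked array produces a chunked result:
   # each chunk is executed separately and the listener collects the pieces.
   chunked = pa.chunked_array([[1, 5], [3, 0]])
   result = pc.min_element_wise(chunked, 2)

   print(type(result).__name__)  # ChunkedArray
   print(result.to_pylist())     # [1, 2, 2, 0]
   ```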
   
   






Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


github-actions[bot] commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152080574

   Revision: ec8ce7e0c5c5cd1f0b6236f253819bb979417a55
   
   Submitted crossbow builds: [ursacomputing/crossbow @ 
actions-9d14dfc7fd](https://github.com/ursacomputing/crossbow/branches/all?query=actions-9d14dfc7fd)
   
   |Task|Status|
   |----|------|
   |verify-rc-source-python-linux-almalinux-8-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-linux-almalinux-8-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905754/job/25888327222)|
   |verify-rc-source-python-linux-conda-latest-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-linux-conda-latest-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905733/job/25888326965)|
   |verify-rc-source-python-linux-ubuntu-20.04-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-linux-ubuntu-20.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905803/job/25888327637)|
   |verify-rc-source-python-linux-ubuntu-22.04-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-linux-ubuntu-22.04-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905698/job/25888326975)|
   |verify-rc-source-python-macos-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-macos-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905729/job/25888326971)|
   |verify-rc-source-python-macos-arm64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-macos-arm64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905751/job/25888327032)|
   |verify-rc-source-python-macos-conda-amd64|[![GitHub Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-9d14dfc7fd-github-verify-rc-source-python-macos-conda-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/9399905757/job/25888327267)|





Re: [PR] GH-42006: [CI][Python] Install required setuptools_scm>=8.0.0 and setuptools>=64 for Python verification [arrow]

2024-06-06 Thread via GitHub


raulcd commented on PR #42007:
URL: https://github.com/apache/arrow/pull/42007#issuecomment-2152072710

   @github-actions crossbow submit verify-rc-source-python-*




