tmaxwell-anthropic opened a new issue, #39332:
URL: https://github.com/apache/arrow/issues/39332
### Describe the bug, including details regarding any error messages, version, and platform.
This Python script produces a segmentation fault in the `join()` call:
```python
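import pyarrow as pa

# One 8 MiB string value ("xyzw" is 4 bytes, repeated 2 Mi times).
eight_mib = "xyzw" * (2048 * 1024)
# 128 such values: 1 GiB of string data in a single column.
gib = pa.array((eight_mib for i in range(128)), pa.string())
keys = pa.array(range(128), pa.int64())
left = pa.Table.from_pydict({"keys": keys, "gib": gib})
# Each key appears 4 times on the right, so every left row matches 4 rows.
right_keys = pa.array(list(range(128)) * 4, pa.int64())
right = pa.Table.from_pydict({"keys": right_keys})
print("joining...")
left.join(right, "keys")  # segfaults here
print("joined.")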
```
The C++ call stack is:
```
#0 0x00007ffff43aba6a in
arrow::compute::ExecBatchBuilder::AppendSelected(std::shared_ptr<arrow::ArrayData>
const&, arrow::compute::ResizableArrayData*, int, unsigned short const*,
arrow::MemoryPool*) () from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#1 0x00007ffff43acfa7 in
arrow::compute::ExecBatchBuilder::AppendSelected(arrow::MemoryPool*,
arrow::compute::ExecBatch const&, int, unsigned short const*, int, int const*)
() from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#2 0x00007ffff67cbe44 in
arrow::acero::JoinResultMaterialize::Append(arrow::compute::ExecBatch const&,
int, unsigned short const*, unsigned int const*, unsigned int const*, int*) ()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#3 0x00007ffff67e0423 in
arrow::acero::JoinProbeProcessor::OnNextBatch(long, arrow::compute::ExecBatch
const&, arrow::util::TempVectorStack*,
std::vector<arrow::compute::KeyColumnArray,
std::allocator<arrow::compute::KeyColumnArray> >*) ()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#4 0x00007ffff6802721 in arrow::acero::SwissJoin::ProbeSingleBatch(unsigned
long, arrow::compute::ExecBatch) ()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#5 0x00007ffff6825c07 in std::_Function_handler<arrow::Status (unsigned
long, long), arrow::acero::HashJoinNode::Init()::{lambda(unsigned long,
long)#8}>::_M_invoke(std::_Any_data const&, unsigned long&&, long&&) () from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#6 0x00007ffff67be225 in
arrow::acero::TaskSchedulerImpl::ExecuteTask(unsigned long, int, long, bool*) ()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#7 0x00007ffff67d5814 in std::_Function_handler<arrow::Status (unsigned
long), arrow::acero::TaskSchedulerImpl::ScheduleMore(unsigned long,
int)::{lambda(unsigned long)#1}>::_M_invoke(std::_Any_data const&, unsigned
long&&) ()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#8 0x00007ffff67baf47 in std::_Function_handler<arrow::Status (),
arrow::acero::QueryContext::ScheduleTask(std::function<arrow::Status (unsigned
long)>, std::basic_string_view<char, std::char_traits<char>
>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#9 0x00007ffff67f9260 in arrow::internal::FnOnce<void
()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture
(arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)>
>::invoke() () from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow_acero.so.1200
#10 0x00007ffff44d9505 in arrow::internal::FnOnce<void ()>::operator()() &&
()
from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#11 0x00007ffff44d5c38 in
std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}>
> >::_M_run() () from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#12 0x00007ffff543b4a0 in execute_native_thread_routine () from
/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/pyarrow/libarrow.so.1200
#13 0x00007ffff7850ac3 in start_thread (arg=<optimized out>) at
./nptl/pthread_create.c:442
#14 0x00007ffff78e2660 in clone3 () at
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
```
Software versions: PyArrow 12.0.1, Python 3.11.6, Ubuntu 22.04.3.
If I change `pa.string()` to `pa.large_string()`, the join completes without crashing.
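
One plausible explanation, offered as an assumption rather than a confirmed diagnosis: `pa.string()` stores value offsets as 32-bit signed integers, so a single array can address just under 2 GiB of character data, while `pa.large_string()` uses 64-bit offsets. The arithmetic below uses only the sizes from the script above and shows that the input column fits within that limit but the joined output would not. It is a back-of-the-envelope sketch and does not inspect what `AppendSelected` actually does.

```python
# Size estimate for the reproduction script (pure Python, no pyarrow needed).
# Assumption: the crash is related to 32-bit string offsets; this is not
# confirmed by the stack trace alone.
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed offset can hold

value_bytes = len("xyzw") * (2048 * 1024)  # one string value: 8 MiB
left_rows = 128                            # rows in the left table
matches_per_key = 4                        # each key appears 4 times on the right

input_bytes = value_bytes * left_rows          # string data in the left "gib" column
output_bytes = input_bytes * matches_per_key   # string data the join would emit

print(input_bytes, input_bytes <= INT32_MAX)    # 1073741824 True
print(output_bytes, output_bytes <= INT32_MAX)  # 4294967296 False
```

If this is the cause, the joined `gib` column would need 4 GiB of string data, which would also be consistent with `pa.large_string()` avoiding the crash.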
### Component(s)
C++, Python