mattaubury opened a new issue, #39862:
URL: https://github.com/apache/arrow/issues/39862

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   In Arrow-15.0.0 when using a threaded dataset scan I sometimes see the 
following crash at exit:
   ```
   #1  0x00007ffff691e351 in arrow::Status arrow::internal::Executor::Spawn<arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool)::{lambda()#1}>(arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool)::{lambda()#1}&&) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #2  0x00007ffff691e6ee in arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #3  0x00007ffff691e99d in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #4  0x00007ffff68c7454 in void arrow::Future<arrow::internal::Empty>::MarkFinished<arrow::internal::Empty, void>(arrow::Status) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #5  0x00007ffff691bf1e in arrow::internal::FnOnce<void (arrow::FutureImpl const&)>::FnImpl<arrow::Future<arrow::internal::Empty>::WrapStatusyOnComplete::Callback<arrow::AllComplete(std::vector<arrow::Future<arrow::internal::Empty>, std::allocator<arrow::Future<arrow::internal::Empty> > > const&)::{lambda(arrow::Status const&)#1}> >::invoke(arrow::FutureImpl const&) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #6  0x00007ffff691e65c in arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #7  0x00007ffff691e99d in arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #8  0x00007ffff68c0506 in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<std::shared_ptr<arrow::Buffer> >, arrow::io::RandomAccessFile::ReadAsync(arrow::io::IOContext const&, long, long)::{lambda()#1})> >::invoke() () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #9  0x00007ffff694d6c9 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::{lambda()#1}> > >::_M_run() () from /jump/software/rhel8/apache-arrow-15.0.0-cxx20-gcc10/lib64/libarrow.so.1500
   #10 0x00007ffff5190640 in std::execute_native_thread_routine (__p=0x777df0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:80
   #11 0x00007ffff17021cf in start_thread () from /lib64/libpthread.so.0
   #12 0x00007ffff4794dd3 in clone () from /lib64/libc.so.6
   ```
   
   The crash is non-deterministic and happens around 50% of the time on the 
machine I'm testing on. It only appears when the program terminates 
immediately after the end of the scan, so my guess is that the worker threads 
are not being cancelled or joined correctly and are still running when the 
program terminates.
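
As a standard-library-only illustration of the shutdown ordering this guess refers to, the sketch below joins every worker before the state its callbacks touch goes out of scope. It is not Arrow's `ThreadPool` code and says nothing definitive about the root cause; `JoiningWorkers` and `RunAndJoin` are hypothetical names introduced only for this example.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a worker pool (NOT Arrow's ThreadPool).
// Its destructor joins every worker, so no task can outlive the pool.
class JoiningWorkers {
 public:
  // Start a worker thread that runs `task`.
  void Spawn(std::function<void()> task) {
    workers_.emplace_back(std::move(task));
  }

  // Block until every spawned worker has finished.
  void JoinAll() {
    for (auto& worker : workers_) {
      if (worker.joinable()) worker.join();
    }
    workers_.clear();
  }

  ~JoiningWorkers() { JoinAll(); }

 private:
  std::vector<std::thread> workers_;
};

// Runs `n` background tasks and returns how many completed before returning.
// Because the pool is destroyed (and joined) before `completed` is read and
// destroyed, no task can touch `completed` after it is gone.
int RunAndJoin(int n) {
  assert(n >= 0);
  std::atomic<int> completed{0};
  {
    JoiningWorkers workers;
    for (int i = 0; i < n; ++i) {
      workers.Spawn([&completed] { completed.fetch_add(1); });
    }
  }  // ~JoiningWorkers() joins all workers here
  return completed.load();
}
```

Calling `RunAndJoin(8)` returns 8 on every run because the join happens before the result is read. If the workers were detached instead, a callback could still be running while the objects it references were being torn down, which is the kind of race suspected here.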
   
   
   To create the input data:
   ```
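   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Download the file, round-trip it through pandas, and rewrite it locally.
   url = "http://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
   pq.write_table(pa.Table.from_pandas(pd.read_parquet(url)), "taxi.parquet")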
   ```
   For some reason, using the downloaded Parquet directly does NOT show the 
problem.
   
   Then to produce the crash:
   ```
   #include <arrow/dataset/api.h>
   #include <arrow/filesystem/api.h>
   
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>
   
   int
   main ()
   {
       const auto format = std::make_shared<arrow::dataset::ParquetFileFormat> ();
       const auto options = arrow::dataset::FileSystemFactoryOptions {};
       const std::shared_ptr<arrow::fs::FileSystem> filesystem =
           std::make_shared<arrow::fs::LocalFileSystem> ();
   
       std::vector<std::string> object_ids { "taxi.parquet" };
   
       auto factory = arrow::dataset::FileSystemDatasetFactory::Make (
                          filesystem, std::move (object_ids), format, options)
                          .ValueOrDie ();
       const auto full_dataset = factory->Finish ().ValueOrDie ();
   
       auto builder = full_dataset->NewScan ().ValueOrDie ();
       (void)builder->UseThreads (true);
       (void)builder->Project ({ "passenger_count", "trip_distance" });
       auto scanner = builder->Finish ().ValueOrDie ();
       (void)scanner->Head (10);
   }
   ```
   I compiled this with:
   ```
   g++ -std=c++20 arrow_bug.cpp $(pkg-config arrow --cflags --libs) $(pkg-config arrow-dataset --cflags --libs)
   ```
   
   I've also seen this crash with 14.0.1, but can't reproduce on 12.0.0, so I 
imagine something changed between them.
   
   
   ### Component(s)
   
   C++, Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
