Jonathan Keane created ARROW-15664:
--------------------------------------

             Summary: [C++] parquet reader Segfaults with illegal SIMD 
instruction 
                 Key: ARROW-15664
                 URL: https://issues.apache.org/jira/browse/ARROW-15664
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 7.0.0
            Reporter: Jonathan Keane
             Fix For: 7.0.1, 8.0.0


When compiling with {{-Os}} (or with release type {{MinRelSize}}), and we run 
parquet tests (in R at least, though I imagine the pyarrow and C++ will have 
the same issues!) we get a segfault with an illegal opcode on systems that 
don't have BMI2 available when trying to read parquet files. (It turns out, the 
github runners for macos don't have BMI2, so this is easily testable there!)

Somehow in the optimization combined with the way our runtime detection code 
works, the runtime detection we normally use for this fails (though it works 
just fine with {{-O2}}, {{-O3}}, etc.).

When diagnosing this, I created a branch + PR that runs our R tests after 
installing from brew which can reliably cause this to happen: 
https://github.com/apache/arrow/pull/12364 other test suites that exercise 
parquet reading would probably have the same issue (or even C++ tests built 
with {{-Os}}.

Here's a coredump:
{code}
2491 Thread_829819
+ 2491 thread_start  (in libsystem_pthread.dylib) + 15  [0x7ff801c3e00f]
+   2491 _pthread_start  (in libsystem_pthread.dylib) + 125  [0x7ff801c424f4]
+     2491 void* 
std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct,
 std::__1::default_delete<std::__1::__thread_struct> >, 
arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_3> >(void*)  (in 
arrow.so) + 380  [0x109203749]
+       2491 arrow::internal::FnOnce<void ()>::operator()() &&  (in arrow.so) + 
26  [0x109201f30]
+         2491 arrow::internal::FnOnce<void 
()>::FnImpl<std::__1::__bind<arrow::detail::ContinueFuture, 
arrow::Future<std::__1::shared_ptr<arrow::ChunkedArray> >&, 
parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<parquet::arrow::(anonymous
 namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int> > 
const&, std::__1::vector<int, std::__1::allocator<int> > const&, 
arrow::internal::Executor*)::$_4&, unsigned long&, 
std::__1::shared_ptr<parquet::arrow::ColumnReaderImpl> > >::invoke()  (in 
arrow.so) + 98  [0x108f125c2]
+           2491 parquet::arrow::(anonymous 
namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<parquet::arrow::(anonymous
 namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int> > 
const&, std::__1::vector<int, std::__1::allocator<int> > const&, 
arrow::internal::Executor*)::$_4::operator()(unsigned long, 
std::__1::shared_ptr<parquet::arrow::ColumnReaderImpl>) const  (in arrow.so) + 
47  [0x108f11ed5]
+             2491 parquet::arrow::(anonymous 
namespace)::FileReaderImpl::ReadColumn(int, std::__1::vector<int, 
std::__1::allocator<int> > const&, parquet::arrow::ColumnReader*, 
std::__1::shared_ptr<arrow::ChunkedArray>*)  (in arrow.so) + 273  [0x108f0c037]
+               2491 parquet::arrow::ColumnReaderImpl::NextBatch(long long, 
std::__1::shared_ptr<arrow::ChunkedArray>*)  (in arrow.so) + 39  [0x108f0733b]
+                 2491 parquet::arrow::(anonymous 
namespace)::LeafReader::LoadBatch(long long)  (in arrow.so) + 137  [0x108f0794b]
+                   2491 parquet::internal::(anonymous 
namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)1> 
>::ReadRecords(long long)  (in arrow.so) + 442  [0x108f4f53e]
+                     2491 parquet::internal::(anonymous 
namespace)::TypedRecordReader<parquet::PhysicalType<(parquet::Type::type)1> 
>::ReadRecordData(long long)  (in arrow.so) + 471  [0x108f50503]
+                       2491 void 
parquet::internal::standard::DefLevelsToBitmapSimd<false>(short const*, long 
long, parquet::internal::LevelInfo, 
parquet::internal::ValidityBitmapInputOutput*)  (in arrow.so) + 250  
[0x108fc2a5a]
+                         2491 long long 
parquet::internal::standard::DefLevelsBatchToBitmap<false>(short const*, long 
long, long long, parquet::internal::LevelInfo, 
arrow::internal::FirstTimeBitmapWriter*)  (in arrow.so) + 63  [0x108fc34da]
+                           2491 ???  (in <unknown binary>)  [0x600001354518]
+                             2491 _sigtramp  (in libsystem_platform.dylib) + 
29  [0x7ff801c57e2d]
+                               2491 sigactionSegv  (in libR.dylib) + 649  
[0x1042598c9]  main.c:625
+                                 2491 Rstd_ReadConsole  (in libR.dylib) + 2042 
 [0x10435160a]  sys-std.c:1044
+                                   2491 R_SelectEx  (in libR.dylib) + 308  
[0x104350854]  sys-std.c:178
+                                     2491 __select  (in 
libsystem_kernel.dylib) + 10  [0x7ff801c0de4a]
{code}

And then a disassembly (where you can see a SHLX that shouldn't be there):

{code}
Dump of assembler code from 0x13ac6db00 to 0x13ac6db99ff:
 ...
--Type <RET> for more, q to quit, c to continue without paging--
   0x000000013ac6db82:  mov    $0x8,%ecx
   0x000000013ac6db87:  sub    %rax,%rcx
   0x000000013ac6db8a:  lea    0xf1520b(%rip),%rdi        # 0x13bb82d9c
   0x000000013ac6db91:  movzbl (%rcx,%rdi,1),%edi
   0x000000013ac6db95:  mov    %esi,%ebx
   0x000000013ac6db97:  and    %edi,%ebx
=> 0x000000013ac6db99:  shlx   %rax,%rbx,%rax
   0x000000013ac6db9e:  or     0x18(%r15),%al
   0x000000013ac6dba2:  mov    %al,0x18(%r15)
   0x000000013ac6dba6:  cmp    %rdx,%rcx
   0x000000013ac6dba9:  jg     0x13ac6dbf5
   0x000000013ac6dbab:  mov    %al,(%r14)
   0x000000013ac6dbae:  inc    %r14
   0x000000013ac6dbb1:  shrx   %rcx,%rsi,%rax
   0x000000013ac6dbb6:  mov    %rax,-0x20(%rbp)
   0x000000013ac6dbba:  sub    %rcx,%rdx
   0x000000013ac6dbbd:  mov    %rdx,%rbx
   0x000000013ac6dbc0:  sar    $0x3,%rbx
   0x000000013ac6dbc4:  and    $0x7,%edx
   0x000000013ac6dbc7:  cmp    $0x1,%rdx
   0x000000013ac6dbcb:  sbb    $0xffffffffffffffff,%rbx
   0x000000013ac6dbcf:  lea    -0x20(%rbp),%rsi
   0x000000013ac6dbd3:  mov    %r14,%rdi
...
{code}

We discovered this because homebrew alters the default build flags and uses 
{{-Os}}, though we should include a test that tests this in our CI as well (at 
least as a nightly) to catch it earlier: 
https://github.com/Homebrew/homebrew-core/issues/94724



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to