thisisnic opened a new issue, #36807:
URL: https://github.com/apache/arrow/issues/36807
### Describe the bug, including details regarding any error messages, version, and platform.
I'm using the Arrow R package version 12.0.1.1 and am getting a segfault when
trying to read a Parquet file with `open_dataset()`. Here's the output with the
debugger attached:
```
> library(fs)
library(arrow)
library(dplyr)
[New Thread 0x7ffff33ff640 (LWP 480350)]
[New Thread 0x7fffe99ff640 (LWP 480356)]
Some features are not enabled in this build of Arrow. Run `arrow_info()` for
more information.
Attaching package: ‘arrow’
The following object is masked from ‘package:utils’:
timestamp
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
> all_files <- dir_ls("/data/nyc-taxi", recurse=TRUE)
parquet_files <- all_files[endsWith(all_files, "parquet")]
> parquet_files[86]
/data/nyc-taxi/year=2016/month=10/part-0.parquet
> ds <- open_dataset(parquet_files[86]) %>% head(6) %>% collect()
[New Thread 0x7fffe9007640 (LWP 480358)]
[New Thread 0x7fffe8806640 (LWP 480359)]
[New Thread 0x7fffd7b7f640 (LWP 480360)]
[New Thread 0x7fffd6b7f640 (LWP 480361)]
[New Thread 0x7fffd637e640 (LWP 480362)]
[New Thread 0x7fffd5b7d640 (LWP 480363)]
[New Thread 0x7fffd537c640 (LWP 480364)]
[New Thread 0x7fffd4b7b640 (LWP 480365)]
[New Thread 0x7fffcd7ff640 (LWP 480366)]
[New Thread 0x7fffccffe640 (LWP 480367)]
[New Thread 0x7fffb3fff640 (LWP 480368)]
[New Thread 0x7fffb37fe640 (LWP 480369)]
[New Thread 0x7fffb2ffd640 (LWP 480370)]
> nrow(ds)
[1] 6
> parquet_files[87]
/data/nyc-taxi/year=2016/month=11/part-0.parquet
> ds <- open_dataset(parquet_files[87]) %>% head(6) %>% collect()
>
Thread 13 "R" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffccffe640 (LWP 480367)]
0x00007ffff00fbf38 in
arrow::internal::Executor::Submit<parquet::arrow::(anonymous
namespace)::FileReaderImpl::DecodeRowGroups(std::shared_ptr<parquet::arrow::(anonymous
namespace)::FileReaderImpl>, const std::vector<int>&, const std::vector<int>&,
arrow::internal::Executor*)::<lambda(size_t,
std::shared_ptr<parquet::arrow::ColumnReaderImpl>)>&, long unsigned int&,
std::shared_ptr<parquet::arrow::ColumnReaderImpl> >(arrow::internal::TaskHints,
arrow::StopToken, struct {...} &) (this=0x2e1c00000008, hints=...,
stop_token=..., func=...) at
/home/nic2/arrow/cpp/src/arrow/util/thread_pool.h:159
159 ARROW_RETURN_NOT_OK(SpawnReal(hints, std::move(task),
std::move(stop_token),
```
If I read the same file with `read_parquet()`, it loads fine. I'm happy to
supply the file if necessary, though I wasn't sure whether it's possible or
desirable to attach a 150 MB file to an issue.
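For reference, the two code paths can be compared side by side. This is a minimal sketch that assumes the same file path as in the session above; `path` and `tbl` are just illustrative variable names, and the comments describe what I observe on my machine rather than guaranteed behaviour.

```r
library(arrow)
library(dplyr)

# Path taken from the debugger session; adjust to wherever the file lives.
path <- "/data/nyc-taxi/year=2016/month=11/part-0.parquet"

# Single-file reader: on my machine this loads without problems.
tbl <- read_parquet(path)
nrow(tbl)

# Dataset API: on my machine this is the call that triggers the segfault.
ds <- open_dataset(path) %>% head(6) %>% collect()
```

Running the `read_parquet()` call first makes it easy to confirm the file itself is readable before hitting the crashing call.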
### Component(s)
C++