mrd0ll4r opened a new issue, #41813:
URL: https://github.com/apache/arrow/issues/41813

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello!
   I've been using arrow with R for a while now with great success.
   Recently, I re-opened an old project (managed with renv, so I'm fairly confident the package versions were unchanged).
   I may have upgraded the OS and/or OS packages in the meantime.
   Now, some of my queries on a gzip-compressed dataset of Parquet files cause a segfault:
   
   ```
    *** caught segfault ***
   address 0x7f54ce520898, cause 'memory not mapped'
   
   Traceback:
    1: Table__from_ExecPlanReader(self)
    2: x$read_table()
    3: as_arrow_table.RecordBatchReader(reader)
    4: as_arrow_table(reader)
    5: as_arrow_table.arrow_dplyr_query(x)
    6: as_arrow_table(x)
    7: doTryCatch(return(expr), name, parentenv, handler)
    8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
    9: tryCatchList(expr, classes, parentenv, handlers)
   10: tryCatch(as_arrow_table(x), error = function(e, call = caller_env(n = 4)) {    augment_io_error_msg(e, call, schema = schema())})
   11: compute.arrow_dplyr_query(x)
   12: collect.arrow_dplyr_query(.)
   13: collect(.)
   14: d_redacted %>% group_by(year, month, cid) %>% summarize(n = n()) %>%     collect()
   ```
   
   I have a core dump from that session, but it's 46GB.
   I'm not experienced at analyzing core dumps, but this is what I got from gdb:
   ```
   Core was generated by `/usr/lib/R/bin/exec/R'.
   Program terminated with signal SIGSEGV, Segmentation fault.
   #0  0x00007f612d4ea3b0 in arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   [Current thread is 1 (Thread 0x7f6093fff640 (LWP 2273813))]
   (gdb) bt
   #0  0x00007f612d4ea3b0 in arrow::compute::KeyCompare::CompareBinaryColumnToRow_avx2(bool, unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #1  0x00007f612d4d7093 in void arrow::compute::KeyCompare::CompareBinaryColumnToRow<true>(unsigned int, unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, arrow::compute::KeyColumnArray const&, arrow::compute::RowTableImpl const&, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #2  0x00007f612d4d6278 in arrow::compute::KeyCompare::CompareColumnsToRows(unsigned int, unsigned short const*, unsigned int const*, arrow::compute::LightContext*, unsigned int*, unsigned short*, std::vector<arrow::compute::KeyColumnArray, std::allocator<arrow::compute::KeyColumnArray> > const&, arrow::compute::RowTableImpl const&, bool, unsigned char*) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #3  0x00007f612d4d896e in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #4  0x00007f612d3a98e6 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #5  0x00007f612d3ab154 in arrow::compute::SwissTable::find(int, unsigned int const*, unsigned char*, unsigned char const*, unsigned int*, arrow::util::TempVectorStack*, std::function<void (int, unsigned short const*, unsigned int const*, unsigned int*, unsigned short*, void*)> const&, void*) const () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #6  0x00007f612d4df2d0 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #7  0x00007f612d4dfb73 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #8  0x00007f612cf8da83 in arrow::acero::aggregate::GroupByNode::Merge() () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #9  0x00007f612cf8f8a3 in arrow::acero::aggregate::GroupByNode::OutputResult(bool) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #10 0x00007f612cf941f6 in arrow::acero::aggregate::GroupByNode::InputReceived(arrow::acero::ExecNode*, arrow::compute::ExecBatch) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #11 0x00007f612cef3f1b in arrow::acero::MapNode::InputReceived(arrow::acero::ExecNode*, arrow::compute::ExecBatch) () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #12 0x00007f612cf25dd2 in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #13 0x00007f612cf05a7e in arrow::internal::FnOnce<void ()>::FnImpl<std::_Bind<arrow::detail::ContinueFuture (arrow::Future<arrow::internal::Empty>, std::function<arrow::Status ()>)> >::invoke() () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #14 0x00007f612d290a9d in ?? () from /home/leo/.cache/R/renv/cache/v5/R-4.3/x86_64-pc-linux-gnu/arrow/15.0.1/85c24dd7844977e4a680ba28f576125c/arrow/libs/arrow.so
   #15 0x00007f6136f87253 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
   #16 0x00007f61396a9ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
   #17 0x00007f613973b850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
   ```
   
   
   I've tried:
   - Updating all the dependencies. I'm now at `15.0.1` from RSPM; the crash above is from this version.
   - Re-writing the dataset. The raw data is a set of CSV files, which I read, mutate, and write to Parquet.
   - Checking whether simple queries (`dataset %>% summarize(n=n())`) work, which they do.
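
   For reference, the read, mutate, write step can be sketched roughly as below. This is a minimal sketch and not the actual pipeline: the directory names, the tiny stand-in CSV, the derived `year`/`month` columns, and the partitioning by `year` and `month` are all assumptions made for illustration, and gzip support is assumed to be compiled into the installed arrow build.

   ```r
   library(arrow)
   library(dplyr)

   # Hypothetical locations; the real paths are not part of this report.
   csv_dir <- file.path(tempdir(), "raw_csv")
   parquet_dir <- file.path(tempdir(), "parquet_gzip")
   dir.create(csv_dir, showWarnings = FALSE)

   # Tiny stand-in for the raw CSV files (made-up values, not the real data).
   write.csv(
     data.frame(
       cid = c("a", "b", "a"),
       date = c("2023-01-05", "2023-01-06", "2023-02-01")
     ),
     file.path(csv_dir, "part-0.csv"),
     row.names = FALSE,
     quote = FALSE
   )

   # read -> mutate -> write to gzip-compressed Parquet.
   # Assumption: year/month are derived from the date column and used as
   # partition keys; the report does not say how they were produced.
   open_dataset(csv_dir, format = "csv") %>%
     mutate(year = year(date), month = month(date)) %>%
     write_dataset(
       parquet_dir,
       format = "parquet",
       compression = "gzip",
       partitioning = c("year", "month")
     )

   # Re-open the rewritten dataset the same way the queries do.
   d_rewritten <- open_dataset(parquet_dir)
   ```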
   
   Specifically, this query works:
   ```
   d_redacted %>% group_by(year, month) %>% summarize(n=n()) %>% collect()
   ```
   and this doesn't:
   ```
   d_redacted %>% group_by(year, month, cid) %>% summarize(n=n()) %>% collect()
   ```
   
   The dataset looks like this:
   ```
   > d_redacted
   FileSystemDataset with 1342 Parquet files
   peer: string
   address: string
   asn: string
   geolocation: string
   cid: string
   entry_type: string
   date: date32[day]
   monitor: string
   year: int32
   month: int32
   ```
   
   Unfortunately, I cannot share the dataset publicly as it contains sensitive information.
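
   Since the real data cannot be shared, a synthetic stand-in might help others try to reproduce the crash. The sketch below only mirrors the column names and types of `cid`, `year`, and `month` from the schema above; the row count, the number of distinct `cid` values, the output path, and the partitioning are arbitrary assumptions, and it has not been confirmed to trigger the segfault.

   ```r
   library(arrow)
   library(dplyr)

   set.seed(1)
   n_rows <- 1e5  # arbitrary; the real dataset is much larger

   # Made-up rows: only the column names and types follow the reported schema.
   synthetic <- data.frame(
     cid = paste0("cid-", sample.int(5e4, n_rows, replace = TRUE)),
     year = sample(2022:2023, n_rows, replace = TRUE),
     month = sample(1:12, n_rows, replace = TRUE),
     stringsAsFactors = FALSE
   )

   # Hypothetical output location for the synthetic dataset.
   synthetic_dir <- file.path(tempdir(), "synthetic_parquet")
   write_dataset(
     synthetic,
     synthetic_dir,
     format = "parquet",
     compression = "gzip",
     partitioning = c("year", "month")
   )

   d_synthetic <- open_dataset(synthetic_dir)

   # Same shape as the failing query: group on a string key plus two int keys.
   res <- d_synthetic %>%
     group_by(year, month, cid) %>%
     summarize(n = n()) %>%
     collect()
   ```

   If this runs cleanly, increasing `n_rows` or the length of the `cid` strings would be the next things to vary, since the real keys are likely longer and more numerous.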
   
   Overall, I'm pretty lost now.
   The system is running Ubuntu 22.04, kernel:
   ```
   5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
   ```
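
   The rest of the environment can be captured from within R. The snippet below is a small sketch of how that information might be gathered; it only prints what the installed arrow package reports about itself and makes no claim about what the output looks like on this machine.

   ```r
   library(arrow)

   info <- arrow_info()
   print(info)               # package version, capabilities, build details
   print(info$runtime_info)  # SIMD level in use and SIMD level detected on this CPU
   print(sessionInfo())      # R version, OS, and attached packages
   ```

   Because the crashing frame is an `_avx2` routine, one possible diagnostic is to start R with the `ARROW_USER_SIMD_LEVEL` environment variable set to a lower level (for example `SSE4_2`) before arrow is loaded. Arrow C++ documents this variable as a cap on runtime SIMD dispatch. Whether that avoids this particular crash is untested here.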
   
   Hope that helps somehow...
   
   
   ### Component(s)
   
   R

