[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976841#comment-16976841 ]

Ben Kietzman commented on ARROW-7156:
-------------------------------------

I can reproduce this failure with
{code}
arrow-file-to-stream SingleBatch_String_85000_Rows.arrow > /dev/null
{code}
I can confirm that the buffer length is already negative when we read it from flatbuffers: https://github.com/apache/arrow/blob/bef9a1c/cpp/src/arrow/ipc/message.cc#L159

The C# writer does seem to be producing an invalid file.

> [R] [C++] Large Batches Cause Error / Crashes
> ---------------------------------------------
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 0.14.1, 0.15.1
> Reporter: Anthony Abate
> Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file causes get_batch to fail; all other batches load fine. In 0.14.1 reading that batch returns an error; in 0.15.1 the batch loads but crashes RStudio when it is used.
>
> *0.14.1*
> {code:java}
> > rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) :
>   Invalid: negative malloc size
> {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works
> df <- as.data.frame(rbn)        # crashes RStudio
> {code}
>
> Update
> I put the data from that batch into a separate file. The file size is over 2 GB.
> Using 0.15.1, loading this entire file via read_arrow also fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow")
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>   Invalid: negative malloc size
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
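For readers following along, the kind of reader-side check implied by the comment above can be sketched roughly as follows. This is not the Arrow C++ code; it assumes, for illustration only, that a buffer length is stored as an 8-byte little-endian signed integer, and the function name `read_buffer_length` is hypothetical.

```python
import struct


def read_buffer_length(raw: bytes) -> int:
    """Decode a length field and reject negative values.

    Assumption (illustrative): the field is an 8-byte little-endian
    signed integer. A negative value means the writer produced
    invalid metadata, so fail early instead of trying to allocate.
    """
    (length,) = struct.unpack("<q", raw)
    if length < 0:
        raise ValueError(f"invalid buffer length: {length}")
    return length
```

Failing at this point would turn a crash further down the line into a plain error message that names the bad field.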
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976738#comment-16976738 ]

Ben Kietzman commented on ARROW-7156:
-------------------------------------

If it's suspected that the C# writer is emitting invalid record batches, could you share the code that generates your test files?
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973769#comment-16973769 ]

Neal Richardson commented on ARROW-7156:
----------------------------------------

https://arrow.apache.org/docs/format/Columnar.html doesn't say that RecordBatches are limited to 2 GB, but either way it sounds like there's an issue in the C# writer if it's (presumably) overflowing int32 and reporting negative offsets in the file you're trying to read.
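The presumed int32 overflow can be illustrated with a small sketch. It only shows the arithmetic; it is not taken from the C# writer, and the figure of 30,000 bytes per string is a made-up number chosen so that the total exceeds 2 GB. The helper `as_int32` is hypothetical.

```python
import struct

INT32_MAX = 2**31 - 1


def as_int32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


# Hypothetical buffer: 85,000 strings of 30,000 bytes each (about 2.55 GB).
true_length = 85_000 * 30_000

# Stored in a signed 32-bit field, the length wraps around and turns negative.
wrapped = as_int32(true_length)
```

Any length between 2^31 and 2^32 - 1 wraps to a negative number in this way, which would be consistent with a reader later reporting a negative size.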
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973723#comment-16973723 ]

Anthony Abate commented on ARROW-7156:
--------------------------------------

I uploaded some test files. They are deceptively small when compressed, but 2 GB uncompressed.

I have a workaround for now: I just make sure my batches are smaller than 2 GB.
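One way to apply this workaround is to cap the payload of each batch before writing. The sketch below is a plain-Python illustration, not the C# code used to generate the files; `split_by_size` and `max_bytes` are hypothetical names, and it counts only the value bytes, ignoring offset and validity buffers, so a real limit should leave some headroom.

```python
from typing import Iterable, Iterator, List

INT32_MAX = 2**31 - 1


def split_by_size(values: Iterable[bytes],
                  max_bytes: int = INT32_MAX) -> Iterator[List[bytes]]:
    """Group values into batches whose total byte size stays within max_bytes."""
    batch: List[bytes] = []
    batch_bytes = 0
    for value in values:
        size = len(value)
        if size > max_bytes:
            raise ValueError("single value exceeds the batch size limit")
        # Start a new batch if adding this value would cross the limit.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(value)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded list can then be written as its own record batch, so no single batch has to describe more than `max_bytes` of string data.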
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973717#comment-16973717 ]

Anthony Abate commented on ARROW-7156:
--------------------------------------

From Event Viewer:

Faulting application name: rsession.exe, version: 1.2.1335.0, time stamp: 0x5c9d0154
Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 0x5dc40022
Exception code: 0xc005
Fault offset: 0x003e4c05
Faulting process id: 0x8ec
Faulting application start time: 0x01d59a59ff052a76
Faulting application path: C:\software\RStudio\bin\rsession.exe
Faulting module path: C:\Users\aabate\Documents\R\win-library\3.6\arrow\libs\x64\arrow.dll
Report Id: db7e29f8-54ba-40fc-a104-75d3b6f75d0e
Faulting package full name:
Faulting package-relative application ID:
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973714#comment-16973714 ]

Anthony Abate commented on ARROW-7156:
--------------------------------------

[~npr] - "Crashes RStudio" means just that: RStudio itself crashes instead of showing an error message.

!image-2019-11-13-16-27-30-641.png!
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973713#comment-16973713 ]

Anthony Abate commented on ARROW-7156:
--------------------------------------

[~npr] - do you know whether an individual RecordBatch can exceed 2 GB (int32 max)?

This might not be an Arrow C++ issue, but another bug in the C# library that I used to generate the file.
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973607#comment-16973607 ]

Neal Richardson commented on ARROW-7156:
----------------------------------------

Searching for the error message, I see https://github.com/apache/arrow/blob/a33bd3acae41f89972c71ad5bd559a3cecf3e197/cpp/src/arrow/memory_pool.cc#L285 which suggests an overflow somewhere.

Could you please clarify what "Crashes R Studio!" means?

https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20fixVersion%20%3D%201.0.0%20AND%20component%20%3D%20R%20AND%20text%20~%20%222gb%22 finds 3 known issues about large string columns. Could that be involved?