[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-18 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976841#comment-16976841
 ] 

Ben Kietzman commented on ARROW-7156:
-

I can reproduce this failure with
{code}
arrow-file-to-stream SingleBatch_String_85000_Rows.arrow > /dev/null
{code}

I can confirm that the buffer length is negative as we read it from flatbuffers:
https://github.com/apache/arrow/blob/bef9a1c/cpp/src/arrow/ipc/message.cc#L159

The C# writer does seem to be producing an invalid file.
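As a sketch of the suspected failure mode (in Python, not the actual writer or reader code): a buffer length larger than INT32_MAX that gets narrowed to a signed 32-bit integer anywhere along the write/read path wraps around to a negative value, which would match the negative length observed here.

```python
INT32_MAX = 2**31 - 1

def narrow_to_int32(n):
    """Emulate two's-complement truncation of n to a signed 32-bit integer."""
    n &= 0xFFFFFFFF                       # keep only the low 32 bits
    return n - 2**32 if n > INT32_MAX else n

# A ~2.2 GB buffer length overflows int32 and comes out negative:
print(narrow_to_int32(2_200_000_000))     # -> -2094967296
```

A negative length then fails validation (or triggers "negative malloc size") on the reading side.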

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
> {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)   # works!
> df <- as.data.frame(rbn)        # crashes RStudio!
> {code}
>  
> Update:
> I put the data from the batch into a separate file. The file size is over 
> 2 GB.
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-18 Thread Ben Kietzman (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976738#comment-16976738
 ] 

Ben Kietzman commented on ARROW-7156:
-

If it's suspected that the C# writer is emitting invalid record batches, could 
you share the code which generates your test files?



[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973769#comment-16973769
 ] 

Neal Richardson commented on ARROW-7156:


https://arrow.apache.org/docs/format/Columnar.html doesn't say that 
RecordBatches are limited to 2 GB, but either way it sounds like there's an 
issue in the C# writer if it's (presumably) overflowing int32 and reporting 
negative offsets in the file you're trying to read.
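One way such an overflow can arise: Arrow's (non-large) variable-length string arrays store int32 offsets into the character data, so the cumulative string bytes in one array are capped near 2 GB. A rough sketch of the arithmetic (the per-row size is hypothetical, chosen only for illustration):

```python
INT32_MAX = 2**31 - 1

def rows_until_offset_overflow(bytes_per_row):
    """Smallest row count whose cumulative int32 string offset exceeds INT32_MAX."""
    return INT32_MAX // bytes_per_row + 1

# Hypothetically ~30 KB of string data per row: the offsets overflow
# well before 85,000 rows, while a small file stays safely below.
print(rows_until_offset_overflow(30_000))   # -> 71583
```

A writer that doesn't check this bound would silently emit wrapped (negative) offsets; LargeString with int64 offsets avoids the cap.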



[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973723#comment-16973723
 ] 

Anthony Abate commented on ARROW-7156:
--

I uploaded some test files. They are deceptively small compressed, but 2 GB 
uncompressed.

I have a workaround for now: just make sure my batches are less than 2 GB.
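A minimal sketch of that workaround (a hypothetical helper, not part of any Arrow API): greedily group rows so that no batch's estimated byte size exceeds a budget such as 2 GB.

```python
def split_into_batches(row_sizes, max_bytes=2**31 - 1):
    """Greedily group row indices so each batch's total byte size stays under max_bytes."""
    batches, current, current_bytes = [], [], 0
    for i, size in enumerate(row_sizes):
        # Flush the current batch before it would exceed the budget.
        if current and current_bytes + size > max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

With per-row size estimates in hand, writing one record batch per group keeps every batch under the int32 limit.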



[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973717#comment-16973717
 ] 

Anthony Abate commented on ARROW-7156:
--

From Event Viewer:

 

Faulting application name: rsession.exe, version: 1.2.1335.0, time stamp: 
0x5c9d0154
Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 0x5dc40022
Exception code: 0xc005
Fault offset: 0x003e4c05
Faulting process id: 0x8ec
Faulting application start time: 0x01d59a59ff052a76
Faulting application path: C:\software\RStudio\bin\rsession.exe
Faulting module path: 
C:\Users\aabate\Documents\R\win-library\3.6\arrow\libs\x64\arrow.dll
Report Id: db7e29f8-54ba-40fc-a104-75d3b6f75d0e
Faulting package full name: 
Faulting package-relative application ID:



[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973714#comment-16973714
 ] 

Anthony Abate commented on ARROW-7156:
--

[~npr] - "crashes RStudio" means just that: the R session terminates instead of raising an error message.

 

!image-2019-11-13-16-27-30-641.png!



[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973713#comment-16973713
 ] 

Anthony Abate commented on ARROW-7156:
--

[~npr] - do you know if an individual RecordBatch can exceed 2 GB (int32 max)?

This might not be an Arrow C++ issue, but another bug in the C# library that I 
used to generate the file.



[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973607#comment-16973607
 ] 

Neal Richardson commented on ARROW-7156:


Searching for the error message, I see 
https://github.com/apache/arrow/blob/a33bd3acae41f89972c71ad5bd559a3cecf3e197/cpp/src/arrow/memory_pool.cc#L285

which suggests an overflow somewhere.

Could you please clarify what "Crashes R Studio!" means? 

https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20fixVersion%20%3D%201.0.0%20AND%20component%20%3D%20R%20AND%20text%20~%20%222gb%22
 finds 3 known issues about large string columns; could that be involved?
