KyleGrains opened a new pull request #1044:
URL: https://github.com/apache/orc/pull/1044


   ### What changes were proposed in this pull request?
   Let each string-type column use its own data buffer before its values are 
written to ORC, so that when one column calls buffer.resize(), the invalidated 
buffer.data() pointer does not affect other columns that point to it.
   
   ### Why are the changes needed?
   When using the csv-import tool to write an ORC file with a schema like 
"struct<a:string,b:binary>", if a value in column "b" is very long (over 4MB), 
the process can hit a segmentation fault, or the exported data in column "a" 
can become an empty string.
   
   Following the code in CSVFileImport.cc, when writing an ORC file, all 
string-type columns share one data buffer inside the function 
fillStringValues(). If a value is larger than the buffer, the buffer is 
resized. The resize() operation invalidates every pointer previously obtained 
from buffer.data().
   
   In this case, after field "a" has finished writing its data into the 
buffer, field "b" starts writing and resizes the buffer, which invalidates the 
earlier buffer.data(), so field "a"'s stringBatch no longer refers to valid 
memory.
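
   The hazard described above can be reproduced outside ORC. The following is a minimal sketch that uses std::vector<char> as a stand-in for the shared data buffer; it is not code from CSVFileImport.cc, and the function name storageMovedAfterResize is hypothetical. It records the storage address as an integer before and after growing the buffer, so the stale pointer is never dereferenced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reports whether growing a buffer moved its storage.
// std::vector<char> stands in for the shared data buffer (an assumption).
bool storageMovedAfterResize(std::size_t initialSize, std::size_t grownSize) {
  std::vector<char> buffer(initialSize);
  // Precondition: the new size must exceed the current capacity,
  // otherwise no reallocation is required.
  assert(grownSize > buffer.capacity());

  // Save the address as an integer so the old pointer is never used again.
  const auto before = reinterpret_cast<std::uintptr_t>(buffer.data());
  buffer.resize(grownSize);  // reallocates because capacity is exceeded
  const auto after = reinterpret_cast<std::uintptr_t>(buffer.data());

  return before != after;
}
```

   Any char* saved from data() before the resize would point at the old, freed storage, which matches the situation of field "a" once field "b" grows the buffer.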
   
   One workaround is to use a separate data buffer for each string-type 
column, although this requires allocating 4MB of memory for each. 
Alternatively, keep one shared data buffer, but re-point every earlier 
stringBatch to the buffer's new address after a resize() happens.
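
   The second alternative (one shared buffer, re-pointed after each resize) could look roughly like the sketch below. It is an illustration only, not the change made in this pull request: ColumnBatch and SharedBufferWriter are hypothetical types, and std::vector<char> stands in for the data buffer. Each value's offset is stored next to its pointer, so pointers can be recomputed from the new base address after the buffer grows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for one string column's batch.
struct ColumnBatch {
  std::vector<char*> data;          // per-row pointer into the shared buffer
  std::vector<std::size_t> offset;  // same position, kept as an offset
  std::vector<std::size_t> length;  // per-row length in bytes
};

// Hypothetical writer that owns one buffer shared by all string columns.
struct SharedBufferWriter {
  std::vector<char> bytes;
  std::size_t used = 0;
  std::vector<ColumnBatch> columns;

  SharedBufferWriter(std::size_t columnCount, std::size_t initialSize)
      : bytes(initialSize), columns(columnCount) {}

  // Recompute every stored pointer from its offset after the buffer moved.
  void repoint() {
    for (ColumnBatch& col : columns) {
      for (std::size_t i = 0; i < col.data.size(); ++i) {
        col.data[i] = bytes.data() + col.offset[i];
      }
    }
  }

  void append(std::size_t col, const std::string& value) {
    assert(col < columns.size());
    if (used + value.size() > bytes.size()) {
      // Grow, then fix up all pointers handed out before the resize.
      bytes.resize(std::max(bytes.size() * 2, used + value.size()));
      repoint();
    }
    if (!value.empty()) {
      std::memcpy(bytes.data() + used, value.data(), value.size());
    }
    columns[col].offset.push_back(used);
    columns[col].data.push_back(bytes.data() + used);
    columns[col].length.push_back(value.size());
    used += value.size();
  }

  // Read a value back through the stored pointer, as a consumer would.
  std::string read(std::size_t col, std::size_t row) const {
    return std::string(columns[col].data[row], columns[col].length[row]);
  }
};
```

   The trade-off is extra bookkeeping (one offset per value and a pass over all rows on each resize) in exchange for not allocating a separate 4MB buffer per column.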
   
   ### How was this patch tested?
   Prepare a CSV file with two columns, where the second column holds a value 
larger than 4MB, like:
   _str1,longbinaryabcd.........._
   Run the following command, which finishes successfully:
   _$ csv-import "struct<a:string,b:binary>" Testbinary.csv /tmp/test.orc_
   Run the following command, which shows the data correctly:
   _$ orc-contents /tmp/test.orc --columns=0_
   _{"a": "str1"}_

