[
https://issues.apache.org/jira/browse/ORC-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
William Hyun closed ORC-1116.
-----------------------------
> Csv-import tool exported field become empty
> -------------------------------------------
>
> Key: ORC-1116
> URL: https://issues.apache.org/jira/browse/ORC-1116
> Project: ORC
> Issue Type: Bug
> Components: tools
> Affects Versions: 1.7.3
> Reporter: kyle
> Assignee: kyle
> Priority: Minor
> Fix For: 1.7.4
>
> Attachments: CSVFileImport.dif
>
>
> When exporting an ORC file with a schema like "struct<a:string,b:binary>", if
> the data in column "b" is very long (over 4 MB), the process can hit a
> segmentation fault, or the exported data in column "a" becomes an empty string.
> Here is my attempt to explain the code; it may not be entirely correct, so
> please bear with me.
> Following the code in CSVFileImport.cc: when writing an ORC file, all
> string-type columns share a single data buffer inside the function
> fillStringValues(). When a value is longer than the buffer, the buffer is
> resized. The resize() operation invalidates all references and iterators
> into buffer.data().
> In this case, after field "a" has finished writing its data into the buffer,
> field "b" starts writing and resizes the buffer, invalidating the previous
> buffer.data(); field "a"'s stringBatch pointers into buffer.data() are
> therefore no longer valid.
> One workaround is to use a separate data buffer for each string-type column,
> though that requires allocating 4 MB per column (as in the attached file).
> Alternatively, all previously stored stringBatch pointers could be re-pointed
> to the new data buffer's address.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)