[ 
https://issues.apache.org/jira/browse/ORC-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726191#comment-15726191
 ] 

Scott Wells commented on ORC-115:
---------------------------------

Thanks for the help, Owen.  So does this mean that the resulting ORC files 
should be valid but just can't currently be read by Apache ORC, or are they 
invalid when they contain string values?  Ultimately I'm just trying to find a 
convenient way to convert some data that's currently in CSV format into ORC 
format as I evaluate AWS Athena (i.e., PrestoDB/Hive), and I was hoping to have 
a simple little standalone tool to do that instead of having to resort to 
Hadoop/Spark.

> Unable to write string data into ORC file (or at least read it back)
> --------------------------------------------------------------------
>
>                 Key: ORC-115
>                 URL: https://issues.apache.org/jira/browse/ORC-115
>             Project: Orc
>          Issue Type: Bug
>          Components: Java
>    Affects Versions: 1.2.2
>            Reporter: Scott Wells
>
> I'm trying to create a little utility to convert CSV files into ORC files.  
> I've noticed that the resulting ORC files don't seem quite correct, though.  
> In an effort to create a simple reproducible test case, I just changed the 
> "Writing/Reading ORC Files" examples here:
> https://orc.apache.org/docs/core-java.html
> to create a file based on a pair of strings instead of integers.  Basically I 
> changed the loop as follows:
> {code}
>         BytesColumnVector first = (BytesColumnVector) writeBatch.cols[0];
>         BytesColumnVector last = (BytesColumnVector) writeBatch.cols[1];
>         for (int r = 0; r < 10; ++r)
>         {
>             String firstName = ("First-" + r).intern();
>             String lastName = ("Last-" + (r * 3)).intern();
>             ...
>         }
> {code}
> The file writes without errors, and if I write it with no compression, I can 
> see the data using {{strings my-file.orc}}.  However, when I then try to read 
> the data back from the file and print out the resulting batches to the 
> console, I get the following:
> {noformat}
> ["       ", "      "]
> ["       ", "      "]
> ["       ", "      "]
> ["       ", "      "]
> ["       ", "       "]
> ["       ", "       "]
> ["       ", "       "]
> ["       ", "       "]
> ["       ", "       "]
> ["       ", "       "]
> {noformat}
> I've been completely unable to find any documentation or example code that 
> would help me to understand why this isn't working.  Any insights about what 
> I may be doing wrong here would be greatly appreciated!
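
One detail in the output above may help narrow this down: the blank strings have the expected widths (7 and 6 characters for the first four rows, 7 and 7 afterwards, matching "First-" + r and "Last-" + (r * 3)). That suggests the lengths survive the round trip while the bytes do not. The following is a minimal standalone Java sketch, with no ORC dependency, of decoding a value from a shared buffer by offset and length. BytesColumnVector exposes each row in that form through its vector, start, and length arrays. The class BytesSliceDemo and its decode helper are hypothetical names used only for illustration, and this is not a diagnosis of the bug itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in, not part of ORC: shows how a string value is
// recovered from a (buffer, start, length) triple.
class BytesSliceDemo {

    // Decode `length` bytes of `buffer`, beginning at `start`, as UTF-8.
    static String decode(byte[] buffer, int start, int length) {
        return new String(buffer, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two values packed back to back in one shared buffer.
        byte[] shared = "First-3Last-9".getBytes(StandardCharsets.UTF_8);
        int[] start = {0, 7};
        int[] length = {7, 6};

        for (int i = 0; i < start.length; i++) {
            System.out.println("\"" + decode(shared, start[i], length[i]) + "\"");
        }

        // Same lengths, but read from a buffer that holds only spaces:
        // the widths are right while the content is lost.
        byte[] blank = "             ".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < start.length; i++) {
            System.out.println("\"" + decode(blank, start[i], length[i]) + "\"");
        }
    }
}
```

When printing a batch by hand, it may be worth checking that the code reads each row with all three pieces (the byte array, the start offset, and the length) rather than assuming every value begins at offset 0 of its own array.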


