vuule opened a new issue, #1475:
URL: https://github.com/apache/orc/issues/1475

   When writing a file with a string column and multiple row groups, the 
resulting file has incorrect row index streams. 
   The string column uses direct encoding. The file footer lists the LENGTH 
(kind 2) stream before the DATA (kind 1) stream. However, the row index 
appears to contain the index data for the DATA stream before that of the 
LENGTH stream. Swapping the order in which we read the row index data for 
the two streams fixes the issue, and the file can then be read correctly.
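
   To make the mismatch concrete, here is a minimal sketch of how a reader 
might split one row index entry across the two streams. This is not the 
actual reader code: `split_positions`, the per-stream position counts 
(assumed here to be one for DATA and two for LENGTH in an uncompressed file) 
and the numeric values are all hypothetical and only illustrate the ordering 
problem.

```python
# Hypothetical illustration only: names, position counts and values are
# assumptions, not taken from the ORC code base or from the attached file.
LENGTH, DATA = "LENGTH", "DATA"

# Assumed number of row index positions per stream (uncompressed file).
POSITIONS_PER_STREAM = {DATA: 1, LENGTH: 2}


def split_positions(positions, stream_order):
    """Assign a flat list of row index positions to streams, in order."""
    result = {}
    cursor = 0
    for kind in stream_order:
        count = POSITIONS_PER_STREAM[kind]
        result[kind] = positions[cursor:cursor + count]
        cursor += count
    if cursor != len(positions):
        raise ValueError("position count does not match the stream layout")
    return result


# Made-up entry, laid out as DATA positions followed by LENGTH positions.
entry = [10000, 20, 16]

# Reading in footer order (LENGTH, then DATA) misassigns the positions.
footer_order = split_positions(entry, [LENGTH, DATA])

# Reading with the two streams swapped matches how the entry was laid out.
swapped_order = split_positions(entry, [DATA, LENGTH])
```

   With the footer order, the DATA byte offset is taken as a LENGTH position 
and the seek lands in the wrong place. With the swapped order, each stream 
receives its own positions.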
   
   Isolation:
   Only observing this behavior with string columns. Other types with multiple 
streams look correct in this regard.
   Behavior looks unrelated to string content in the column.
   No info on dictionary-encoded string columns - the writer seemingly defaults to 
direct encoding.
   
   See the attached repro file. It contains a single string column whose values are 
`["*"] * 10001`, i.e. 10001 copies of the string `"*"`.
   
[10001_strings.zip](https://github.com/apache/orc/files/11299279/10001_strings.zip)
   
   
   

