vuule opened a new issue, #1475: URL: https://github.com/apache/orc/issues/1475
When writing a file with a string column and multiple row groups, the resulting file has incorrect row index streams.

The string column is encoded using direct encoding. The file footer lists the LENGTH (kind 2) stream before the DATA (kind 1) stream. However, the row index appears to contain the index data for the DATA stream before that of the LENGTH stream. Swapping the order in which we read the row index streams fixes the issue, and the file can then be read correctly.

Isolation:
- Only observed with string columns. Other types with multiple streams look correct in this regard.
- The behavior appears unrelated to the string content of the column.
- No information on dictionary-encoded string columns; the writer seemingly defaults to direct encoding.

See the attached repro file. It contains a single string column with the values `["*"] * 10001`.

[10001_strings.zip](https://github.com/apache/orc/files/11299279/10001_strings.zip)
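To make the mismatch concrete, here is a minimal, self-contained sketch of how a reader might split one row index entry's flat position list across streams. It is not the ORC library's code: `split_positions`, the sample position values, and the per-stream position counts (2 for DATA, 3 for LENGTH) are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def split_positions(
    positions: Sequence[int],
    stream_layout: Sequence[Tuple[str, int]],
) -> Dict[str, List[int]]:
    """Assign a flat position list to streams, consuming it in layout order.

    stream_layout is a sequence of (stream_name, position_count) pairs.
    """
    assigned: Dict[str, List[int]] = {}
    cursor = 0
    for name, count in stream_layout:
        assigned[name] = list(positions[cursor:cursor + count])
        cursor += count
    if cursor != len(positions):
        raise ValueError("layout does not consume every position")
    return assigned


# Hypothetical row index entry, laid out as the report describes:
# DATA positions first, then LENGTH positions.
entry_positions = [100, 7, 20, 3, 1]

# Order in which the footer lists the streams (LENGTH before DATA).
footer_order = [("LENGTH", 3), ("DATA", 2)]
# Order that matches what the row index seems to contain.
observed_order = [("DATA", 2), ("LENGTH", 3)]

by_footer = split_positions(entry_positions, footer_order)
by_observed = split_positions(entry_positions, observed_order)

print("footer order:  ", by_footer)
print("observed order:", by_observed)
```

A reader that follows the footer order hands the DATA positions to the LENGTH stream and the reverse, which is consistent with the workaround of swapping the read order.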
