[
https://issues.apache.org/jira/browse/ORC-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481338#comment-16481338
]
Owen O'Malley commented on ORC-362:
-----------------------------------
For an uncompressed file with a direct-encoded string column, which I assume
this is, the positions "78,16,0,541,0,11" break down as:
present:
  bytes: 78
  byte rle: 16
  bit: 0
data:
  bytes: 541
length:
  bytes: 0
  int rle: 11
Since the previous entry has 2 elements, it makes sense that the length
position moved up from 9 to 11. The 11 is a number of items, not a number of
bytes, so I'd guess the 13 bytes is right.
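
The reading above can be checked against the row group entries in the dump
quoted below. This is only a sketch: `label_positions` and `check_entries` are
hypothetical helper names, not part of ORC, and the check assumes that `count`
is the number of non-null values, that `sum` is the total string length, and
that the all-null last entry has a sum of 0 (the dump prints none). It relies on
the length stream's byte offset being 0 in every entry, so the "int rle" value
counts items from the start of the stream.

```python
# Labels for the six positions of a direct-encoded string column in an
# uncompressed file, in the order given in the comment above.
FIELDS = [
    ("present", "bytes"),
    ("present", "byte rle"),
    ("present", "bit"),
    ("data", "bytes"),
    ("length", "bytes"),
    ("length", "int rle"),
]


def label_positions(positions):
    """Map a flat positions list onto {stream: {field: value}}."""
    if len(positions) != len(FIELDS):
        raise ValueError("expected %d positions" % len(FIELDS))
    labeled = {}
    for (stream, field), value in zip(FIELDS, positions):
        labeled.setdefault(stream, {})[field] = value
    return labeled


# Row group entries copied from the orcfiledump output in the issue:
# (count, sum, positions). The sum of 0 for the last entry is an assumption.
ENTRIES = [
    (4, 157, [0, 0, 0, 0, 0, 0]),
    (5, 314, [26, 111, 0, 157, 0, 4]),
    (2, 70, [52, 62, 0, 471, 0, 9]),
    (0, 0, [78, 16, 0, 541, 0, 11]),
]


def check_entries(entries):
    """Return True if every entry starts where the previous ones ended."""
    values_seen = 0  # items written to the length stream so far
    bytes_seen = 0   # bytes written to the data stream so far
    for count, total_length, positions in entries:
        labeled = label_positions(positions)
        if labeled["length"]["int rle"] != values_seen:
            return False
        if labeled["data"]["bytes"] != bytes_seen:
            return False
        values_seen += count
        bytes_seen += total_length
    return True
```

Under those assumptions `check_entries(ENTRIES)` returns `True`: the length
positions 0, 4, 9, 11 are the running totals of the counts 4, 5, 2, which is
consistent with 11 being an item count rather than a byte offset.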
> String direct length streams gets some values even if data is null
> ------------------------------------------------------------------
>
> Key: ORC-362
> URL: https://issues.apache.org/jira/browse/ORC-362
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.4.3
> Reporter: Prasanth Jayachandran
> Priority: Major
>
> Observed this in one of the orc files recently.
> Looking at the orcfiledump (compression is NONE) something looks odd
> {code:java}
> Stream: column 2 section PRESENT start: 13976 length 80
> Stream: column 2 section DATA start: 14056 length 541
> Stream: column 2 section LENGTH start: 14597 length 13
> ..
> ..
> ..
> Row group indices for column 2:
> Entry 0: count: 4 hasNull: true min: a max: z sum: 157 positions: 0,0,0,0,0,0
> Entry 1: count: 5 hasNull: true min: a max: z sum: 314 positions: 26,111,0,157,0,4
> Entry 2: count: 2 hasNull: true min: a max: z sum: 70 positions: 52,62,0,471,0,9
> Entry 3: count: 0 hasNull: true positions: 78,16,0,541,0,11
> {code}
> If we look at Entry 3 (the last entry) and relate it to the stream positions,
> the last entry is all nulls and the corresponding data stream ended at offset
> 541 (which is the same as its length). The data stream looks correct. But if
> we look at the length stream, the position is recorded as 11 in the last entry
> while the stream length is actually 13 (these last 2 bytes are not expected).
> If there is no data, the length stream is not supposed to record anything. If
> the data is null, only the isPresent stream is expected to have an entry. It
> looks like the ORC writer is writing entries to the length stream even if the
> data is null (probably recording 0 lengths).