[ 
https://issues.apache.org/jira/browse/ORC-362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16481338#comment-16481338
 ] 

Owen O'Malley commented on ORC-362:
-----------------------------------

Here is how I read the "78,16,0,541,0,11" positions for an uncompressed file 
with a direct-encoded string column, which I assume this is:

present:
  bytes: 78
  byte rle: 16
  bit: 0
data:
  bytes: 541
length:
  bytes: 0
  int rle: 11

Since the previous entry has 2 elements, it makes sense that the length 
position moved up from 9 to 11. The 11 is a number of items, not a number of 
bytes, so I'd guess the 13-byte stream length is right.
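
As a rough sanity check of that reading, here is a small Python sketch. It 
does not use any ORC API; it only labels the six positions using the breakdown 
above and checks them against the counts and sums in the dump. The 
label_positions helper and the field names are made up for illustration, and 
the sketch assumes that "count" is the number of non-null values and that 
"sum" is the total string length in each row group.

```python
from itertools import accumulate

# Labels follow the breakdown above. They assume an uncompressed file and a
# direct-encoded string column that has a PRESENT stream.
FIELDS = [
    ("present", "bytes"),
    ("present", "byte rle"),
    ("present", "bit"),
    ("data", "bytes"),
    ("length", "bytes"),
    ("length", "int rle"),
]


def label_positions(positions):
    """Pair each recorded position with its (stream, field) label."""
    if len(positions) != len(FIELDS):
        raise ValueError("expected %d positions, got %d"
                         % (len(FIELDS), len(positions)))
    return dict(zip(FIELDS, positions))


# Row group index entries copied from the orcfiledump output in the issue.
entries = [
    {"count": 4, "sum": 157, "positions": [0, 0, 0, 0, 0, 0]},
    {"count": 5, "sum": 314, "positions": [26, 111, 0, 157, 0, 4]},
    {"count": 2, "sum": 70, "positions": [52, 62, 0, 471, 0, 9]},
    {"count": 0, "sum": 0, "positions": [78, 16, 0, 541, 0, 11]},
]

labeled = [label_positions(e["positions"]) for e in entries]

# Position recorded in the LENGTH stream's int RLE field for each entry.
length_items = [p[("length", "int rle")] for p in labeled]
# Byte offset recorded in the DATA stream for each entry.
data_offsets = [p[("data", "bytes")] for p in labeled]

# Running totals of values and string bytes written before each entry starts.
values_before = [0] + list(accumulate(e["count"] for e in entries))[:-1]
bytes_before = [0] + list(accumulate(e["sum"] for e in entries))[:-1]

print(length_items == values_before)  # length position tracks item counts
print(data_offsets == bytes_before)   # data position tracks byte offsets
```

Under those assumptions, the last field lines up with the running count of 
values (0, 4, 9, 11) while the data field lines up with the running byte 
total (0, 157, 471, 541), so comparing the 11 directly with the 13-byte 
stream length mixes two different units.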

> String direct length streams gets some values even if data is null
> ------------------------------------------------------------------
>
>                 Key: ORC-362
>                 URL: https://issues.apache.org/jira/browse/ORC-362
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.4.3
>            Reporter: Prasanth Jayachandran
>            Priority: Major
>
> Observed this in one of the orc files recently.
> Looking at the orcfiledump (compression is NONE) something looks odd
> {code:java}
>     Stream: column 2 section PRESENT start: 13976 length 80
>     Stream: column 2 section DATA start: 14056 length 541
>     Stream: column 2 section LENGTH start: 14597 length 13
> ..
> ..
> ..
>     Row group indices for column 2:
>       Entry 0: count: 4 hasNull: true min: a max: z sum: 157 positions: 
> 0,0,0,0,0,0
>       Entry 1: count: 5 hasNull: true min: a max: z sum: 314 positions: 
> 26,111,0,157,0,4
>       Entry 2: count: 2 hasNull: true min: a max: z sum: 70 positions: 
> 52,62,0,471,0,9
>       Entry 3: count: 0 hasNull: true positions: 78,16,0,541,0,11
> {code}
> If we look at Entry 3 (the last entry) and relate it to the stream 
> positions: the last entry is all nulls, and the corresponding data stream 
> ended at offset 541 (which is the same as its length). The data stream looks 
> correct. But if we look at the length stream, the position is recorded as 11 
> in the last entry while the length is actually 13 (these last 2 bytes are 
> not expected). If there is no data, the length stream is not supposed to 
> record anything. If the data is null, only the isPresent stream is expected 
> to have an entry. It looks like the ORC writer is writing entries to the 
> length stream even if the data is null (probably recording 0 lengths).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
