[ 
https://issues.apache.org/jira/browse/ORC-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated ORC-362:
--------------------------------------
    Description: 
Observed this in one of the ORC files recently.

Looking at the orcfiledump output (compression is NONE), something looks odd:
{code:java}
    Stream: column 2 section PRESENT start: 13976 length 80
    Stream: column 2 section DATA start: 14056 length 541
    Stream: column 2 section LENGTH start: 14597 length 13
..
..
..
    Row group indices for column 2:
      Entry 0: count: 4 hasNull: true min: a max: z sum: 157 positions: 0,0,0,0,0,0
      Entry 1: count: 5 hasNull: true min: a max: z sum: 314 positions: 26,111,0,157,0,4
      Entry 2: count: 2 hasNull: true min: a max: z sum: 70 positions: 52,62,0,471,0,9
      Entry 3: count: 0 hasNull: true positions: 78,16,0,541,0,11
{code}
If we look at Entry 3 (the last entry) and relate it to the stream positions, 
the last entry is all nulls, and the corresponding DATA stream ends at offset 
541, which is the same as the stream length. The DATA stream looks correct. In 
the LENGTH stream, however, the position recorded in the last entry is 11, but 
the stream length is actually 13 (these last 2 bytes are not expected). If there 
is no data, the LENGTH stream is not supposed to record anything. If the data is 
null, only the isPresent stream is expected to have an entry. It looks like the 
ORC writer is writing entries to the LENGTH stream even if the data is null 
(probably recording 0 lengths).
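
To make the expectation concrete, here is a minimal sketch of the behaviour described above. It is a toy model, not the ORC writer: the function name `encode_string_direct` is hypothetical, and the streams are plain Python lists rather than RLE-encoded bytes. A null value adds an entry to the PRESENT stream only, so trailing nulls should leave the DATA and LENGTH streams untouched.

```python
from typing import List, Optional, Tuple


def encode_string_direct(
    values: List[Optional[str]],
) -> Tuple[List[bool], bytes, List[int]]:
    """Toy model of a string column with direct encoding.

    Returns (present, data, lengths). This illustrates the expected
    behaviour only; real ORC streams are RLE-encoded bytes.
    """
    present: List[bool] = []
    data = bytearray()
    lengths: List[int] = []
    for value in values:
        present.append(value is not None)
        if value is None:
            # Null: only the PRESENT stream gets an entry.
            continue
        encoded = value.encode("utf-8")
        data.extend(encoded)
        lengths.append(len(encoded))
    return present, bytes(data), lengths


# A row group that ends in nulls adds nothing to DATA or LENGTH.
present, data, lengths = encode_string_direct(["a", None, "xyz", None, None])
print(present)  # one entry per row
print(lengths)  # one entry per non-null row
```

Under this model the number of LENGTH entries always equals the number of non-null rows, and the sum of the lengths equals the DATA size. A writer that emits 0 lengths for nulls would break the first of these two checks.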


> String direct length streams gets some values even if data is null
> ------------------------------------------------------------------
>
>                 Key: ORC-362
>                 URL: https://issues.apache.org/jira/browse/ORC-362
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.4.3
>            Reporter: Prasanth Jayachandran
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
