[ 
https://issues.apache.org/jira/browse/ARROW-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ramakrishna Prabhu updated ARROW-10635:
---------------------------------------
    Description: 
The ORC file contains a single column of boolean type. Starting at row 
`20000`, the values read back do not match the expected values.

 

As far as I can tell, the writer used for this ORC file assumes that RLE is 
aligned with row index boundaries. That is, no two row groups share a byte, 
and a row group never starts at an offset within a byte. I think pyarrow 
instead treats the leftover bits of the partial byte at the end of a row 
group as data, which shifts the values that follow.
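
The suspected misalignment can be illustrated with a toy model in Python. 
This is only a sketch of the hypothesis above, not ORC's actual boolean 
encoding or pyarrow's reader: the helper names (`pack_aligned`, 
`read_aligned`, `read_contiguous`) and the six-row group size are made up for 
illustration. Each row group is bit-packed and padded to a whole byte; one 
reader skips the padding, the other treats it as data.

```python
def pack_aligned(row_groups):
    """Bit-pack each row group MSB-first, zero-padding each group to a whole byte."""
    out = bytearray()
    for group in row_groups:
        for start in range(0, len(group), 8):
            byte = 0
            for i, value in enumerate(group[start:start + 8]):
                if value:
                    byte |= 1 << (7 - i)
            out.append(byte)
    return bytes(out)


def unpack_bits(data):
    """Expand bytes into a flat list of booleans, MSB first."""
    return [bool((byte >> (7 - i)) & 1) for byte in data for i in range(8)]


def read_aligned(data, group_sizes):
    """Reader that skips the padding bits at the end of every row group."""
    bits = unpack_bits(data)
    rows, pos = [], 0
    for size in group_sizes:
        rows.extend(bits[pos:pos + size])
        pos += -(-size // 8) * 8  # advance to the next byte boundary
    return rows


def read_contiguous(data, total_rows):
    """Reader that (hypothetically) treats padding bits as ordinary data."""
    return unpack_bits(data)[:total_rows]


# Two hypothetical row groups of six rows each: two padding bits per group.
groups = [[True] * 6, [True, False, True, True, False, True]]
packed = pack_aligned(groups)
expected = groups[0] + groups[1]

aligned = read_aligned(packed, [6, 6])
shifted = read_contiguous(packed, len(expected))

print(aligned == expected)
print(shifted[6:8])
print(shifted[8:] == expected[6:10])
```

With six rows per group, each group leaves two padding bits, so the 
contiguous reader returns two spurious values at the group boundary and every 
later row appears two positions late.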

 

I have attached a Parquet file with the same data for reference. You will 
notice that the ORC reader treats the last two bits of the partial byte as 
data, shifting the values by two rows.

 
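{code:python}
import pandas as pd
from pyarrow import orc

# Read the same data from the ORC file and from the reference Parquet file.
f = orc.ORCFile('broken_bool.orc')
pdf_orc = f.read().to_pandas()
pdf_pq = pd.read_parquet("bool_pq.parquet")

# Select the non-null ORC values that differ from the Parquet values.
orc_vals = pdf_orc.col_bool.dropna()
pq_vals = pdf_pq.col_bool.dropna()
orc_vals[orc_vals != pq_vals]

# Output (row index, ORC value):
# 20002 False
# 20004 False
# 20005 True
# 20007 False
# 20014 True
# ...
# 21973 False
# 21974 False
# 21985 True
# 21988 True
# 21993 False
{code}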
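
The "shifted by two rows" claim can be checked more directly than by listing 
mismatches. The following is a minimal sketch in plain Python; the helper 
names `first_mismatch` and `matches_with_shift` are hypothetical, and the 
sample lists are synthetic rather than taken from the attached files.

```python
def first_mismatch(actual, expected):
    """Return the first index where the two sequences differ, or None."""
    for index, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return index
    return None


def matches_with_shift(actual, expected, start, shift):
    """True if actual[start + shift:] equals expected[start:] over their overlap."""
    shifted_actual = actual[start + shift:]
    reference = expected[start:start + len(shifted_actual)]
    return len(shifted_actual) > 0 and shifted_actual == reference


# Synthetic data: two spurious False values inserted at index 3.
expected = [True, False, True, True, False, True, False, False]
actual = expected[:3] + [False, False] + expected[3:6]

print(first_mismatch(actual, expected))
print(matches_with_shift(actual, expected, start=3, shift=2))
print(matches_with_shift(actual, expected, start=3, shift=0))
```

For the data in this report, one would presumably pass the two columns as 
lists (for example via `.tolist()`) with `start=20000` and `shift=2`; that 
usage is an assumption and has not been run against the attached files.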
 

 

  was:
The ORC file contains a single column of boolean type. Starting at row 
`20000`, the values read back do not match the expected values.

 

As far as I can tell, the writer used for this ORC file assumes that RLE is 
aligned with row index boundaries. That is, no two row groups share a byte, 
and a row group never starts at an offset within a byte. I think pyarrow 
instead treats the leftover bits of the partial byte at the end of a row 
group as data, which shifts the values that follow.

 

I have attached a Parquet file with the same data for reference. You will 
notice that Parquet treats the last two bits of the partial byte as data, 
shifting the values by two rows.

 
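{code:python}
import pandas as pd
from pyarrow import orc

# Read the same data from the ORC file and from the reference Parquet file.
f = orc.ORCFile('broken_bool.orc')
pdf_orc = f.read().to_pandas()
pdf_pq = pd.read_parquet("bool_pq.parquet")

# Select the non-null ORC values that differ from the Parquet values.
orc_vals = pdf_orc.col_bool.dropna()
pq_vals = pdf_pq.col_bool.dropna()
orc_vals[orc_vals != pq_vals]

# Output (row index, ORC value):
# 20002 False
# 20004 False
# 20005 True
# 20007 False
# 20014 True
# ...
# 21973 False
# 21974 False
# 21985 True
# 21988 True
# 21993 False
{code}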
 

 


> [C++] ORC reader issue with bool column
> ---------------------------------------
>
>                 Key: ARROW-10635
>                 URL: https://issues.apache.org/jira/browse/ARROW-10635
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 1.0.1
>            Reporter: Ramakrishna Prabhu
>            Priority: Minor
>              Labels: orc
>         Attachments: bool_pq.parquet, broken_bool.zip
>
>
> The ORC file contains a single column of boolean type. Starting at row 
> `20000`, the values read back do not match the expected values.
>  
> As far as I can tell, the writer used for this ORC file assumes that RLE is 
> aligned with row index boundaries. That is, no two row groups share a byte, 
> and a row group never starts at an offset within a byte. I think pyarrow 
> instead treats the leftover bits of the partial byte at the end of a row 
> group as data, which shifts the values that follow.
>  
> I have attached a Parquet file with the same data for reference. You will 
> notice that the ORC reader treats the last two bits of the partial byte as 
> data, shifting the values by two rows.
>  
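> {code:python}
> import pandas as pd
> from pyarrow import orc
>
> # Read the same data from the ORC file and from the reference Parquet file.
> f = orc.ORCFile('broken_bool.orc')
> pdf_orc = f.read().to_pandas()
> pdf_pq = pd.read_parquet("bool_pq.parquet")
>
> # Select the non-null ORC values that differ from the Parquet values.
> orc_vals = pdf_orc.col_bool.dropna()
> pq_vals = pdf_pq.col_bool.dropna()
> orc_vals[orc_vals != pq_vals]
>
> # Output (row index, ORC value):
> # 20002 False
> # 20004 False
> # 20005 True
> # 20007 False
> # 20014 True
> # ...
> # 21973 False
> # 21974 False
> # 21985 True
> # 21988 True
> # 21993 False
> {code}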
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
