[
https://issues.apache.org/jira/browse/ARROW-10635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-10635:
------------------------------------------
Summary: [C++] ORC reader issue with bool column (was: ORC reader issue
with bool column.)
> [C++] ORC reader issue with bool column
> ---------------------------------------
>
> Key: ARROW-10635
> URL: https://issues.apache.org/jira/browse/ARROW-10635
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 1.0.1
> Reporter: Ramakrishna Prabhu
> Priority: Minor
> Labels: orc
> Attachments: bool_pq.parquet, broken_bool.zip
>
>
> The ORC file contains a single column of boolean type. Starting at row number
> `20000`, the values read by pyarrow no longer match the expected values.
>
> As far as I can tell, the writer used for this ORC file assumes that RLE runs
> are aligned with row index boundaries. That means no two row groups share the
> same byte, and a row group never starts at a bit offset within a byte. I think
> pyarrow, however, treats the leftover bits of the partial byte at the end of a
> row group as data, which causes the shift in the values.
>
> I have attached a Parquet file with the same data for reference. You will
> notice that Parquet considers the last two bits of the partial byte and shifts
> the data by two rows.
>
> {code:python}
> import pandas as pd
> from pyarrow import orc
>
> f = orc.ORCFile('broken_bool.orc')
> pdf_orc = f.read().to_pandas()
> pdf_pq = pd.read_parquet("bool_pq.parquet")
> # Rows where the ORC values differ from the Parquet reference
> orc_bool = pdf_orc.col_bool.dropna()
> orc_bool[orc_bool != pdf_pq.col_bool.dropna()]
> 20002 False
> 20004 False
> 20005 True
> 20007 False
> 20014 True
> ...
> 21973 False
> 21974 False
> 21985 True
> 21988 True
> 21993 False
> {code}
>
>
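The misalignment suspected in the description can be illustrated with a small, self-contained sketch. It is not the ORC or Arrow implementation: the helpers `pack_bits` and `unpack_bits`, the constant `GROUP_SIZE`, and the toy data are all hypothetical, and the writer behaviour (each row group starts on a fresh byte) is the reporter's assumption, not something confirmed here. The sketch only shows how decoding padding bits as data shifts every later value.

```python
from typing import List


def pack_bits(values: List[bool]) -> bytes:
    """Pack booleans MSB-first, zero-padding the final byte."""
    out = bytearray()
    for start in range(0, len(values), 8):
        byte = 0
        for i, v in enumerate(values[start:start + 8]):
            if v:
                byte |= 1 << (7 - i)
        out.append(byte)
    return bytes(out)


def unpack_bits(data: bytes, count: int) -> List[bool]:
    """Read `count` booleans MSB-first from a contiguous bit stream."""
    return [bool((data[i // 8] >> (7 - i % 8)) & 1) for i in range(count)]


# Hypothetical rows per row group, deliberately not a multiple of 8.
GROUP_SIZE = 6
BYTES_PER_GROUP = (GROUP_SIZE + 7) // 8

rows = [True, False, True, True, False, True,   # row group 0
        True, True, False, False, True, False]  # row group 1

# Writer (assumed behaviour): every row group starts on a fresh byte,
# so the last byte of each group carries padding bits.
groups = [rows[i:i + GROUP_SIZE] for i in range(0, len(rows), GROUP_SIZE)]
stream = b"".join(pack_bits(g) for g in groups)

# Reader A: treats the stream as one continuous bit sequence, so the
# padding bits at the end of row group 0 are decoded as data.
naive = unpack_bits(stream, len(rows))

# Reader B: jumps to the next byte boundary at every row group.
aligned: List[bool] = []
for n, g in enumerate(groups):
    offset = n * BYTES_PER_GROUP
    aligned += unpack_bits(stream[offset:offset + BYTES_PER_GROUP], len(g))

pad = 8 * BYTES_PER_GROUP - GROUP_SIZE  # padding bits per row group
print("padding bits per group:", pad)
print("aligned matches input: ", aligned == rows)
print("naive matches input:   ", naive == rows)
```

In this toy case the first row group decodes correctly under both readers, while the naive reader inserts the padding bits before the second row group and pushes its values back by that many rows. Comparing the real ORC and Parquet columns for a constant offset after the first mismatching row would be one way to test whether the same thing happens in `broken_bool.orc`.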
--
This message was sent by Atlassian Jira
(v8.3.4#803005)