[ https://issues.apache.org/jira/browse/ARROW-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289504#comment-16289504 ]
ASF GitHub Bot commented on ARROW-1920:
---------------------------------------
jcrist commented on issue #1418: ARROW-1920 [C++/Python] Add ORC Reader
URL: https://github.com/apache/arrow/pull/1418#issuecomment-351448881
Demo:
```python
In [1]: from pyarrow import orc
In [2]: f = orc.ORCFile('/Users/jcrist/Code/orc/examples/demo-11-none.orc')
In [3]: f.nrows
Out[3]: 1920800
In [4]: f.nstripes
Out[4]: 385
In [5]: f.read_stripe(0).to_pandas().head()
Out[5]:
   _col0 _col1 _col2    _col3  _col4 _col5  _col6  _col7  _col8
0      1     M     M  Primary    500  Good      0      0      0
1      2     F     M  Primary    500  Good      0      0      0
2      3     M     S  Primary    500  Good      0      0      0
3      4     F     S  Primary    500  Good      0      0      0
4      5     M     D  Primary    500  Good      0      0      0
In [6]: f2 = orc.ORCFile('/Users/jcrist/Code/orc/examples/TestOrcFile.test1.orc')
In [7]: f2.schema # A nested schema
Out[7]:
boolean1: bool
byte1: int8
short1: int16
int1: int32
long1: int64
float1: float
double1: double
bytes1: binary
string1: string
middle: struct<list: list<item: struct<int1: int32, string1: string>>>
  child 0, list: list<item: struct<int1: int32, string1: string>>
    child 0, item: struct<int1: int32, string1: string>
      child 0, int1: int32
      child 1, string1: string
list: list<item: struct<int1: int32, string1: string>>
  child 0, item: struct<int1: int32, string1: string>
    child 0, int1: int32
    child 1, string1: string
map: list<item: struct<key: string, value: struct<int1: int32, string1: string>>>
  child 0, item: struct<key: string, value: struct<int1: int32, string1: string>>
    child 0, key: string
    child 1, value: struct<int1: int32, string1: string>
      child 0, int1: int32
      child 1, string1: string
In [8]: f2.read(columns=['boolean1', 'middle.list.int1']).to_pydict()  # subselect nested fields
Out[8]:
OrderedDict([('boolean1', [False, True]),
             ('middle',
              [{'list': [{'int1': 1}, {'int1': 2}]},
               {'list': [{'int1': 1}, {'int1': 2}]}])])
```
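Since the demo exposes `nstripes` and `read_stripe`, one way to avoid loading a large file at once might be to walk it stripe by stripe. The following is only a sketch under that assumption: `iter_stripes` and `FakeORCFile` are hypothetical names, not part of the PR or of pyarrow, and the stand-in class exists only so the snippet runs without an ORC file on disk.

```python
def iter_stripes(orc_file):
    """Yield the result of read_stripe(i) for each stripe, in order.

    Relies only on the two members shown in the demo above:
    `nstripes` and `read_stripe(i)`.
    """
    for i in range(orc_file.nstripes):
        yield orc_file.read_stripe(i)


class FakeORCFile:
    """Hypothetical stand-in that mimics the two members used above."""

    def __init__(self, stripes):
        self._stripes = stripes

    @property
    def nstripes(self):
        return len(self._stripes)

    def read_stripe(self, i):
        return self._stripes[i]


# Made-up data: three "stripes" of different sizes.
fake = FakeORCFile([[1, 2], [3], [4, 5, 6]])
row_counts = [len(batch) for batch in iter_stripes(fake)]
print(row_counts)
```

With the `f` from the demo, the same loop would presumably be `for batch in iter_stripes(f): ...`, handling each stripe's rows before moving on to the next.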
> Add support for reading ORC files
> ---------------------------------
>
> Key: ARROW-1920
> URL: https://issues.apache.org/jira/browse/ARROW-1920
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Jim Crist
> Labels: pull-request-available
>
> Would be nice to be able to read ORC files in pyarrow, similar to the already
> existing parquet support.