[jira] [Commented] (ARROW-1920) Add support for reading ORC files

ASF GitHub Bot (JIRA) Wed, 13 Dec 2017 08:43:22 -0800

    [ https://issues.apache.org/jira/browse/ARROW-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289504#comment-16289504 ]


ASF GitHub Bot commented on ARROW-1920:
---------------------------------------

jcrist commented on issue #1418: ARROW-1920 [C++/Python] Add ORC Reader
URL: https://github.com/apache/arrow/pull/1418#issuecomment-351448881
 
 
   Demo:
   
   ```python
   In [1]: from pyarrow import orc
   
   In [2]: f = orc.ORCFile('/Users/jcrist/Code/orc/examples/demo-11-none.orc')
   
   In [3]: f.nrows
   Out[3]: 1920800
   
   In [4]: f.nstripes
   Out[4]: 385
   
   In [5]: f.read_stripe(0).to_pandas().head()
   Out[5]:
      _col0 _col1 _col2    _col3  _col4 _col5  _col6  _col7  _col8
   0      1     M     M  Primary    500  Good      0      0      0
   1      2     F     M  Primary    500  Good      0      0      0
   2      3     M     S  Primary    500  Good      0      0      0
   3      4     F     S  Primary    500  Good      0      0      0
   4      5     M     D  Primary    500  Good      0      0      0
   
   In [6]: f2 = orc.ORCFile('/Users/jcrist/Code/orc/examples/TestOrcFile.test1.orc')
   
   In [7]: f2.schema  # A nested schema
   Out[7]:
   boolean1: bool
   byte1: int8
   short1: int16
   int1: int32
   long1: int64
   float1: float
   double1: double
   bytes1: binary
   string1: string
   middle: struct<list: list<item: struct<int1: int32, string1: string>>>
     child 0, list: list<item: struct<int1: int32, string1: string>>
         child 0, item: struct<int1: int32, string1: string>
             child 0, int1: int32
             child 1, string1: string
   list: list<item: struct<int1: int32, string1: string>>
     child 0, item: struct<int1: int32, string1: string>
         child 0, int1: int32
         child 1, string1: string
   map: list<item: struct<key: string, value: struct<int1: int32, string1: string>>>
     child 0, item: struct<key: string, value: struct<int1: int32, string1: string>>
         child 0, key: string
         child 1, value: struct<int1: int32, string1: string>
             child 0, int1: int32
             child 1, string1: string
   
   In [8]: f2.read(columns=['boolean1', 'middle.list.int1']).to_pydict()  # subselect nested fields
   Out[8]:
   OrderedDict([('boolean1', [False, True]),
                ('middle',
                 [{'list': [{'int1': 1}, {'int1': 2}]},
                  {'list': [{'int1': 1}, {'int1': 2}]}])])
   ```
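
The dotted path in `In [8]` (`'middle.list.int1'`) names a field inside a struct, inside a list, inside another struct. As a rough illustration of what that selection means, here is a minimal pure-Python sketch over plain dicts and lists. It is not how the ORC reader in this PR is implemented: `subselect` and `select_path` are hypothetical helpers, and the `string1` values in the sample data are placeholders, not values taken from `TestOrcFile.test1.orc`.

```python
def select_path(value, parts):
    """Keep only the nested field named by `parts`.

    Lists are traversed element by element; dicts (structs) are
    narrowed to the single key named by the next path component.
    """
    if not parts:
        return value
    if isinstance(value, list):
        return [select_path(item, parts) for item in value]
    if isinstance(value, dict):
        head, rest = parts[0], parts[1:]
        return {head: select_path(value[head], rest)}
    raise TypeError(f"cannot descend into {type(value).__name__} with {parts!r}")


def subselect(table, columns):
    """Select dotted column paths from a dict of column name -> row values.

    Simplification: two paths under the same top-level column would
    overwrite each other here rather than being merged.
    """
    out = {}
    for col in columns:
        top, *rest = col.split(".")
        out[top] = select_path(table[top], rest)
    return out


# Sample data shaped like two columns of the nested schema above.
# The string1 values are made-up placeholders.
row = {"list": [{"int1": 1, "string1": "x"}, {"int1": 2, "string1": "y"}]}
table = {
    "boolean1": [False, True],
    "middle": [row, row],
    "string1": ["a", "b"],
}

result = subselect(table, ["boolean1", "middle.list.int1"])
```

With this sample data, `result` has the same shape as `Out[8]`: `boolean1` is returned whole, and `middle` keeps only `int1` inside each list item.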

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Add support for reading ORC files
> ---------------------------------
>
>                 Key: ARROW-1920
>                 URL: https://issues.apache.org/jira/browse/ARROW-1920
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: Jim Crist
>              Labels: pull-request-available
>
> It would be nice to be able to read ORC files in pyarrow, similar to the 
> existing Parquet support.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
