[jira] [Commented] (ARROW-10344) [Python] Get all columns names (or schema) from Feather file, before loading whole Feather file

Gert Hulselmans (Jira) Mon, 01 Feb 2021 05:34:07 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276321#comment-17276321
 ]


Gert Hulselmans commented on ARROW-10344:
-----------------------------------------

[~weldingwelding] The first 4 bytes (Feather v1) or 6 bytes (Feather v2) of the 
file, and likewise its last 4 or 6 bytes, tell you which version it is. For 
example, you can check the leading bytes with `hexdump`.

```
❯ hexdump -C -n 8 feather_version1.feather
00000000  46 45 41 31 00 00 00 00                           |FEA1....|
00000008

❯ hexdump -C -n 8 feather_version2.feather
00000000  41 52 52 4f 57 31 00 00                           |ARROW1..|
00000008
```

```python
def feather_v1_or_v2(feather_file):
    """Return 1 or 2 for a Feather v1/v2 file, or None if neither matches."""
    with open(feather_file, 'rb') as fh_feather:
        # Feather v1 starts and ends with the 4-byte magic string b'FEA1'.
        fh_feather.seek(0, 0)
        feather_v1_magic_bytes_header = fh_feather.read(4)
        fh_feather.seek(-4, 2)
        feather_v1_magic_bytes_footer = fh_feather.read(4)

        if feather_v1_magic_bytes_header == feather_v1_magic_bytes_footer == b'FEA1':
            return 1

        # Feather v2 (Arrow IPC file format) starts and ends with b'ARROW1'.
        fh_feather.seek(0, 0)
        feather_v2_magic_bytes_header = fh_feather.read(6)
        fh_feather.seek(-6, 2)
        feather_v2_magic_bytes_footer = fh_feather.read(6)

        if feather_v2_magic_bytes_header == feather_v2_magic_bytes_footer == b'ARROW1':
            return 2

        return None
```
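
A slightly more defensive variant of the same check might look like the sketch 
below. It reads the head and tail of the file once and never seeks before the 
start of the file, so an empty or truncated file yields None instead of raising 
an error. The names `FEATHER_MAGIC` and `detect_feather_version` are 
hypothetical; only the magic byte strings come from the comment above.

```python
import os

# Magic byte strings per Feather version (the mapping name is hypothetical).
FEATHER_MAGIC = {1: b'FEA1', 2: b'ARROW1'}


def detect_feather_version(path):
    """Return 1, 2, or None based on the leading and trailing magic bytes."""
    longest = max(len(magic) for magic in FEATHER_MAGIC.values())
    size = os.path.getsize(path)
    with open(path, 'rb') as fh:
        head = fh.read(longest)
        # Clamp the offset so tiny files do not trigger a negative seek.
        fh.seek(max(size - longest, 0))
        tail = fh.read(longest)
    for version, magic in FEATHER_MAGIC.items():
        if head.startswith(magic) and tail.endswith(magic):
            return version
    return None
```

Like the function above, this only inspects the magic bytes; it does not check 
that the rest of the file is a valid Feather file.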

[~jorisvandenbossche] Now that 
https://issues.apache.org/jira/browse/ARROW-10056 is resolved, Feather v1 
support is less critical, so the IPC and dataset API workarounds are now useful 
for me. It would still be good to have Feather v1 support and to expose the 
column names directly in the feather submodule.


       


> [Python]  Get all columns names (or schema) from Feather file, before loading 
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10344
>                 URL: https://issues.apache.org/jira/browse/ARROW-10344
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Gert Hulselmans
>            Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before 
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are 
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
