[
https://issues.apache.org/jira/browse/ARROW-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276321#comment-17276321
]
Gert Hulselmans commented on ARROW-10344:
-----------------------------------------
[~weldingwelding] The first 4 bytes (Feather v1) or 6 bytes (Feather v2) of the
file, which are repeated as its last 4 or 6 bytes, tell you the version. For
example, you can check them with `hexdump`.
```
❯ hexdump -C -n 8 feather_version1.feather
00000000 46 45 41 31 00 00 00 00 |FEA1....|
00000008
❯ hexdump -C -n 8 feather_version2.feather
00000000 41 52 52 4f 57 31 00 00 |ARROW1..|
00000008
```
```python
import os


def feather_v1_or_v2(feather_file):
    """Return 1 or 2 for a Feather v1 or v2 file, or None if neither matches."""
    with open(feather_file, 'rb') as fh_feather:
        # Feather v1 files start and end with the 4 magic bytes b'FEA1'.
        fh_feather.seek(0, os.SEEK_SET)
        feather_v1_magic_bytes_header = fh_feather.read(4)
        fh_feather.seek(-4, os.SEEK_END)
        feather_v1_magic_bytes_footer = fh_feather.read(4)

        if feather_v1_magic_bytes_header == feather_v1_magic_bytes_footer == b'FEA1':
            return 1

        # Feather v2 files start and end with the 6 magic bytes b'ARROW1'.
        fh_feather.seek(0, os.SEEK_SET)
        feather_v2_magic_bytes_header = fh_feather.read(6)
        fh_feather.seek(-6, os.SEEK_END)
        feather_v2_magic_bytes_footer = fh_feather.read(6)

        if feather_v2_magic_bytes_header == feather_v2_magic_bytes_footer == b'ARROW1':
            return 2

    return None
```
[~jorisvandenbossche] Now that
https://issues.apache.org/jira/browse/ARROW-10056 is resolved, Feather v1
support is less critical, so the IPC and dataset API workarounds are now useful
for me. It would still be good to have Feather v1 support and to have the
column names exposed directly in the feather submodule.
> [Python] Get all columns names (or schema) from Feather file, before loading
> whole Feather file
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-10344
> URL: https://issues.apache.org/jira/browse/ARROW-10344
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Affects Versions: 1.0.1
> Reporter: Gert Hulselmans
> Priority: Major
>
> Is there a way to get all column names (or schema) from a Feather file before
> loading the full Feather file?
> My Feather files are big (like 100GB) and the names of the columns are
> different per analysis and can't be hard coded.
> {code:python}
> import pyarrow.feather as feather
> # Code here to check which columns are in the feather file.
> ...
> my_columns = ...
> # Result is pandas.DataFrame
> read_df = feather.read_feather('/path/to/file', columns=my_columns)
> # Result is pyarrow.Table
> read_arrow = feather.read_table('/path/to/file', columns=my_columns)
> {code}