[
https://issues.apache.org/jira/browse/ARROW-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217586#comment-17217586
]
Joris Van den Bossche commented on ARROW-10056:
-----------------------------------------------
The pandas metadata is required for a faithful roundtrip that covers the many
corner cases. But if you have a relatively simple dataframe (e.g. only columns
with floats), the pandas metadata should not be required.
> [from ARROW-10344] Is this pandas metadata very useful to have in my case? My
> feather files just contain one string column (row indices) and for the rest I
> have just columns of int16, int32, float32 (all other columns have the same
> type in one feather file).
For those cases, the pandas metadata shouldn't be very important. The only
difference is that, with the metadata present, the conversion to pandas
automatically sets the string column as the index again. But this is of course
something you can easily do yourself as well.
> [Python] PyArrow writes invalid Feather v2 file: OSError: Verification of
> flatbuffer-encoded Footer failed.
> -----------------------------------------------------------------------------------------------------------
>
> Key: ARROW-10056
> URL: https://issues.apache.org/jira/browse/ARROW-10056
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.1
> Environment: CentOS7
> conda environment with pyarrow 1.0.1, numpy 1.19.1 and pandas 1.1.1
> Reporter: Gert Hulselmans
> Priority: Major
> Fix For: 3.0.0
>
>
> pyarrow writes an invalid Feather v2 file, which it can't read afterwards.
> {code:java}
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> The following code reproduces the problem for me:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.feather as pf
> nbr_regions = 1223024
> nbr_motifs = 4891
> # Create (big) dataframe.
> df = pd.DataFrame(
>     np.arange(nbr_regions * nbr_motifs,
>               dtype=np.float32).reshape((nbr_regions, nbr_motifs)),
>     index=pd.Index(['region' + str(i) for i in range(nbr_regions)],
>                    name='regions'),
>     columns=pd.Index(['motif' + str(i) for i in range(nbr_motifs)],
>                      name='motifs')
> )
> # Transpose dataframe
> df_transposed = df.transpose()
> # Write transposed dataframe to Feather v2 format.
> pf.write_feather(df_transposed, 'df_transposed.feather')
> # Trying to read the transposed dataframe from Feather v2 format results in
> # this error:
> df_transposed_read = pf.read_feather('df_transposed.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError Traceback (most recent call last)
> <ipython-input-64-b41ad5157e77> in <module>
> ----> 1 df_transposed_read = pf.read_feather('df_transposed.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
> in read_feather(source, columns, use_threads, memory_map)
> 213 """
> 214 _check_pandas_version()
> --> 215 return (read_table(source, columns=columns, memory_map=memory_map)
> 216 .to_pandas(use_threads=use_threads))
> 217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
> in read_table(source, columns, memory_map)
> 235 """
> 236 reader = ext.FeatherReader()
> --> 237 reader.open(source, use_memory_map=memory_map)
> 238
> 239 if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi
> in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
> in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
> in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Later I discovered that it happens also if the original dataframe is created
> in the transposed order:
> {code:python}
> # Create (big) dataframe.
> df_without_transpose = pd.DataFrame(
>     np.arange(nbr_motifs * nbr_regions,
>               dtype=np.float32).reshape((nbr_motifs, nbr_regions)),
>     index=pd.Index(['motif' + str(i) for i in range(nbr_motifs)],
>                    name='motifs'),
>     columns=pd.Index(['region' + str(i) for i in range(nbr_regions)],
>                      name='regions'),
> )
> pf.write_feather(df_without_transpose, 'df_without_transpose.feather')
> df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> {code}
> {code:python}
> ---------------------------------------------------------------------------
> OSError Traceback (most recent call last)
> <ipython-input-91-3cdad1d58c35> in <module>
> ----> 1 df_without_transpose_read = pf.read_feather('df_without_transpose.feather')
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
> in read_feather(source, columns, use_threads, memory_map)
> 213 """
> 214 _check_pandas_version()
> --> 215 return (read_table(source, columns=columns, memory_map=memory_map)
> 216 .to_pandas(use_threads=use_threads))
> 217
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.py
> in read_table(source, columns, memory_map)
> 235 """
> 236 reader = ext.FeatherReader()
> --> 237 reader.open(source, use_memory_map=memory_map)
> 238
> 239 if columns is None:
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/feather.pxi
> in pyarrow.lib.FeatherReader.open()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
> in pyarrow.lib.pyarrow_internal_check_status()
> /software/miniconda3/envs/pyarrow/lib/python3.8/site-packages/pyarrow/error.pxi
> in pyarrow.lib.check_status()
> OSError: Verification of flatbuffer-encoded Footer failed.
> {code}
> Writing to Feather v1 format works:
> {code:python}
> pf.write_feather(df_transposed, 'df_transposed.v1.feather', version=1)
> df_transposed_read_v1 = pf.read_feather('df_transposed.v1.feather')
> # Now do the same, but also save the index in the Feather v1 file.
> df_transposed_reset_index = df_transposed.reset_index()
> pf.write_feather(df_transposed_reset_index,
>                  'df_transposed_reset_index.v1.feather', version=1)
> df_transposed_reset_index_read_v1 = pf.read_feather('df_transposed_reset_index.v1.feather')
> # Returns True
> df_transposed_reset_index_read_v1.equals(df_transposed_reset_index)
> {code}