[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows
Tom Augspurger created ARROW-8462:
----------------------------------

Summary: Crash in lib.concat_tables on Windows
Key: ARROW-8462
URL: https://issues.apache.org/jira/browse/ARROW-8462
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.16.0
Reporter: Tom Augspurger

This crashes for me with pyarrow 0.16 on my Windows VM:

{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])
print('done')
{code}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug info on Windows, unfortunately. With `python -X faulthandler` I see:

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
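For reference, the same fault handler can also be enabled from inside the script with the stdlib {{faulthandler}} module, which avoids changing how the interpreter is launched (a generic debugging aid, not specific to this bug):

```python
import faulthandler

# Dump a Python-level traceback on fatal errors (segfaults, access
# violations) without launching the interpreter with -X faulthandler.
faulthandler.enable()
print(faulthandler.is_enabled())  # True
```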
[jira] [Created] (ARROW-7102) Make filesystem wrappers compatible with fsspec
Tom Augspurger created ARROW-7102:
----------------------------------

Summary: Make filesystem wrappers compatible with fsspec
Key: ARROW-7102
URL: https://issues.apache.org/jira/browse/ARROW-7102
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Tom Augspurger

[fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a common API for a variety of filesystem implementations. I'm proposing an FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec implementation.

Right now, pyarrow has a pyarrow.filesystem.S3FSWrapper, which is specific to s3fs: [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320]. This implementation could be removed entirely once an FSSpecWrapper is done, or kept as an alias if it's part of the public API.

This is related to ARROW-3717, which requested a GCSFSWrapper for working with Google Cloud Storage.
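A minimal sketch of what such a wrapper could look like, assuming the fsspec {{AbstractFileSystem}} method names ({{open}}, {{ls}}, {{isdir}}, {{isfile}}, {{rm}}); the class name and the exact method set here are illustrative, not a finished design:

```python
class FSSpecWrapper:
    """Illustrative adapter: exposes a pyarrow-filesystem-style API on
    top of any object implementing the fsspec AbstractFileSystem methods."""

    def __init__(self, fs):
        self.fs = fs  # any fsspec-compatible filesystem instance

    def open(self, path, mode="rb"):
        # fsspec filesystems all provide open(path, mode)
        return self.fs.open(path, mode)

    def ls(self, path):
        return self.fs.ls(path)

    def isdir(self, path):
        return self.fs.isdir(path)

    def isfile(self, path):
        return self.fs.isfile(path)

    def delete(self, path, recursive=False):
        # fsspec spells removal as rm()
        self.fs.rm(path, recursive=recursive)
```

With fsspec installed this would wrap, e.g., {{fsspec.filesystem("memory")}} or an s3fs instance, so a single adapter would also cover ARROW-3717's GCS use case.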
[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals
Tom Augspurger created ARROW-1897:
----------------------------------

Summary: Incorrect numpy_type for pandas metadata of Categoricals
Key: ARROW-1897
URL: https://issues.apache.org/jira/browse/ARROW-1897
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.8.0
Reporter: Tom Augspurger
Fix For: 0.9.0

If I'm reading http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format correctly, the "numpy_type" field of a `Categorical` should be the storage type used for the *codes*. It looks like pyarrow is just using 'object' always.

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
   ...:
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns] and for categorical, it may be any of the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
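For reference, the dtype the metadata should report is visible directly on the categorical's codes; this is a plain pandas check, independent of pyarrow:

```python
import pandas as pd

# pandas stores categorical codes in the smallest suitable integer type;
# str(codes.dtype) is the value the 'numpy_type' field should carry.
idx = pd.CategoricalIndex(['one', 'two'], name='idx')
print(str(idx.codes.dtype))  # int8 for a two-category categorical
```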
[jira] [Created] (ARROW-1593) [PYTHON] serialize_pandas should pass through the preserve_index keyword
Tom Augspurger created ARROW-1593:
----------------------------------

Summary: [PYTHON] serialize_pandas should pass through the preserve_index keyword
Key: ARROW-1593
URL: https://issues.apache.org/jira/browse/ARROW-1593
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Assignee: Tom Augspurger
Priority: Minor
Fix For: 0.8.0

I'm doing some benchmarking of Arrow serialization for dask.distributed, to serialize dataframes. Overall things look good compared to the current implementation (using pickle). The biggest difference was pickle's ability to use pandas' RangeIndex to avoid serializing the entire Index of values when possible. I suspect that a "range type" isn't in scope for Arrow, but in the meantime applications using Arrow could detect the {{RangeIndex}} and pass

{code:python}
pyarrow.serialize_pandas(df, preserve_index=False)
{code}
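The detection described above could be sketched like this ({{should_preserve_index}} is a hypothetical application-side helper, not pyarrow API):

```python
import pandas as pd

def should_preserve_index(df):
    # Hypothetical helper: a default RangeIndex carries no information
    # beyond the row count, so it is safe to drop during serialization.
    idx = df.index
    return not (isinstance(idx, pd.RangeIndex)
                and idx.start == 0 and idx.step == 1)

print(should_preserve_index(pd.DataFrame({"A": [1, 2]})))          # False
print(should_preserve_index(pd.DataFrame({"A": [1]}, index=[5])))  # True
```

The result would then feed the keyword, e.g. {{pyarrow.serialize_pandas(df, preserve_index=should_preserve_index(df))}}.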
[jira] [Created] (ARROW-1586) [PYTHON] serialize_pandas roundtrip loses columns name
Tom Augspurger created ARROW-1586:
----------------------------------

Summary: [PYTHON] serialize_pandas roundtrip loses columns name
Key: ARROW-1586
URL: https://issues.apache.org/jira/browse/ARROW-1586
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
Fix For: 0.8.0

The serialize / deserialize roundtrip loses {{df.columns.name}}:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], name='col_name'))

In [4]: df.columns.name
Out[4]: 'col_name'

In [5]: pa.deserialize_pandas(pa.serialize_pandas(df)).columns.name
{code}

Is this in scope for pyarrow? I suspect it would require an update to the pandas section of the Schema metadata.
[jira] [Created] (ARROW-1585) serialize_pandas round trip fails on integer columns
Tom Augspurger created ARROW-1585:
----------------------------------

Summary: serialize_pandas round trip fails on integer columns
Key: ARROW-1585
URL: https://issues.apache.org/jira/browse/ARROW-1585
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
Fix For: 0.8.0

This roundtrip fails, since the integer column isn't converted back from a string after deserializing:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({"0": [1, 2]}))).columns
Out[3]: Index(['0'], dtype='object')
{code}

That should be an {{Int64Index([0])}} for the columns.
[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length
Tom Augspurger created ARROW-1557:
----------------------------------

Summary: pyarrow.Table.from_arrays doesn't validate names length
Key: ARROW-1557
URL: https://issues.apache.org/jira/browse/ARROW-1557
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor

{{pa.Table.from_arrays}} doesn't validate that the lengths of {{arrays}} and {{names}} match. I think this should raise a {{ValueError}}:

{code:python}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}

(This is my first time using JIRA, hopefully I didn't mess up too badly)
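The requested check amounts to a length comparison before the table is built; a sketch of the behavior in plain Python ({{validate_names}} is a hypothetical helper, not pyarrow API):

```python
def validate_names(arrays, names):
    # Hypothetical check mirroring the requested from_arrays behavior:
    # reject a names list whose length doesn't match the arrays, instead
    # of silently dropping the extra names.
    if names is not None and len(names) != len(arrays):
        raise ValueError(
            f"Expected {len(arrays)} names, got {len(names)}")

validate_names([[1, 2], [3, 4]], ['a', 'b'])  # OK, lengths match
try:
    validate_names([[1, 2], [3, 4]], ['a', 'b', 'c'])
except ValueError as exc:
    print(exc)  # Expected 2 names, got 3
```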