[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows

2020-04-14 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-8462:
-

 Summary: Crash in lib.concat_tables on Windows
 Key: ARROW-8462
 URL: https://issues.apache.org/jira/browse/ARROW-8462
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.16.0
Reporter: Tom Augspurger


This crashes for me with pyarrow 0.16 on my Windows VM


{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])

print('done')
{code}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug
info on Windows, unfortunately. With `python -X faulthandler` I see

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}
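For reference, the fault handler can also be enabled programmatically, so the repro script prints a Python traceback on a native crash without needing the {{-X}} flag (a minimal standard-library sketch, not part of the repro itself):

```python
import faulthandler

# Enable the fault handler so a hard crash inside a C extension
# (such as this access violation) still dumps the Python traceback
# to stderr before the process dies.
faulthandler.enable()
print(faulthandler.is_enabled())  # True
```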



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7102) Make filesystem wrappers compatible with fsspec

2019-11-08 Thread Tom Augspurger (Jira)
Tom Augspurger created ARROW-7102:
-

 Summary: Make filesystem wrappers compatible with fsspec
 Key: ARROW-7102
 URL: https://issues.apache.org/jira/browse/ARROW-7102
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Tom Augspurger


[fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a
common API for a variety of filesystem implementations. I'm proposing an
FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec
implementation.

 

Right now, pyarrow has a pyarrow.filesystems.S3FSWrapper, which is specific to
s3fs:
[https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
This implementation could be removed entirely once an FSSpecWrapper is done,
or kept as an alias if it's part of the public API.

 

This is related to ARROW-3717, which requested a GCSFSWrapper for working with
Google Cloud Storage.
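A rough sketch of what such a wrapper could look like, just to make the idea concrete. The class and method names here are hypothetical illustrations, not pyarrow or fsspec code; the point is that fsspec's duck-typed API only needs thin delegation:

```python
class FSSpecWrapper:
    """Hypothetical adapter giving any fsspec-compatible filesystem a
    pyarrow-filesystem-like surface (illustrative names only)."""

    def __init__(self, fs):
        # fs can be any object implementing the fsspec method names
        # (s3fs, gcsfs, a local filesystem, ...).
        self.fs = fs

    def isdir(self, path):
        return self.fs.isdir(path)

    def isfile(self, path):
        return self.fs.isfile(path)

    def ls(self, path):
        return self.fs.ls(path)

    def open(self, path, mode="rb"):
        return self.fs.open(path, mode)
```

Because fsspec implementations share these method names, one wrapper would cover s3fs, gcsfs, and the rest, instead of one bespoke wrapper per backend.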



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals

2017-12-07 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1897:
-

 Summary: Incorrect numpy_type for pandas metadata of Categoricals
 Key: ARROW-1897
 URL: https://issues.apache.org/jira/browse/ARROW-1897
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Tom Augspurger
 Fix For: 0.9.0


If I'm reading
http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format
correctly, the "numpy_type" field of a `Categorical` should be the storage
type used for the *codes*. It looks like pyarrow always uses 'object'.

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:   index=pd.CategoricalIndex(['one', 'two'], name='idx'))

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

> The numpy_type is the physical storage type of the column, which is the 
> result of str(dtype) for the underlying NumPy array that holds the data. So 
> for datetimetz this is datetime64[ns] and for categorical, it may be any of 
> the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
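For comparison, pandas itself stores the codes of a small categorical as int8, which is the value one would expect in 'numpy_type' here (a quick check, assuming pandas is installed):

```python
import pandas as pd

idx = pd.CategoricalIndex(['one', 'two'], name='idx')
# The physical storage of a categorical is its integer codes;
# str(codes.dtype) is what the spec says 'numpy_type' should hold.
print(str(idx.codes.dtype))  # int8
```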



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1593) [PYTHON] serialize_pandas should pass through the preserve_index keyword

2017-09-21 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1593:
-

 Summary: [PYTHON] serialize_pandas should pass through the 
preserve_index keyword
 Key: ARROW-1593
 URL: https://issues.apache.org/jira/browse/ARROW-1593
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Assignee: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


I'm doing some benchmarking of Arrow serialization for dask.distributed to 
serialize dataframes.

Overall things look good compared to the current implementation (using pickle). 
The biggest difference was pickle's ability to use pandas' RangeIndex to avoid 
serializing the entire Index of values when possible.

I suspect that a "range type" isn't in scope for Arrow, but in the meantime
applications using Arrow could detect the `RangeIndex` and call
{{pyarrow.serialize_pandas(df, preserve_index=False)}}, once serialize_pandas
passes the {{preserve_index}} keyword through.
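The detection itself is cheap on the application side. A sketch of the check (the helper name is mine, not a dask.distributed or pyarrow API; it assumes only pandas):

```python
import pandas as pd

def preserve_index_for(df):
    """Return False when the index is a trivial, unnamed RangeIndex
    that can be reconstructed for free, so it need not be serialized."""
    idx = df.index
    return not (isinstance(idx, pd.RangeIndex)
                and idx.start == 0
                and idx.step == 1
                and idx.name is None)

print(preserve_index_for(pd.DataFrame({"A": [1, 2]})))                    # False
print(preserve_index_for(pd.DataFrame({"A": [1, 2]}, index=['x', 'y'])))  # True
```

The boolean can then be forwarded as the {{preserve_index}} argument once serialize_pandas accepts it.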





[jira] [Created] (ARROW-1586) [PYTHON] serialize_pandas roundtrip loses columns name

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1586:
-

 Summary: [PYTHON] serialize_pandas roundtrip loses columns name
 Key: ARROW-1586
 URL: https://issues.apache.org/jira/browse/ARROW-1586
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


The serialize / deserialize roundtrip loses {{df.columns.name}}:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], 
name='col_name'))

In [4]: df.columns.name
Out[4]: 'col_name'

In [5]: pa.deserialize_pandas(pa.serialize_pandas(df)).columns.name
{code}

Is this in scope for pyarrow? I suspect it would require an update to the 
pandas section of the Schema metadata.
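Until the metadata carries it, a caller can round-trip the name by hand. A sketch of the workaround, with the pyarrow serialize/deserialize step stood in by a plain reconstruction (pandas only):

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], name='col_name'))

# Capture the columns name before serializing, since the pandas
# metadata block does not currently record it.
saved_name = df.columns.name

# Stand-in for pa.deserialize_pandas(pa.serialize_pandas(df)),
# which drops the name:
roundtripped = pd.DataFrame(df.to_numpy(), columns=list(df.columns))

# Restore the name on the deserialized frame.
roundtripped.columns = roundtripped.columns.rename(saved_name)
print(roundtripped.columns.name)  # col_name
```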





[jira] [Created] (ARROW-1585) serialize_pandas round trip fails on integer columns

2017-09-20 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1585:
-

 Summary: serialize_pandas round trip fails on integer columns
 Key: ARROW-1585
 URL: https://issues.apache.org/jira/browse/ARROW-1585
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor
 Fix For: 0.8.0


This roundtrip fails, since the integer column label isn't converted back
from a string after deserializing:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({0: [1, 2]}))).columns
Out[3]: Index(['0'], dtype='object')
{code}

That should be an {{ Int64Index([0]) }} for the columns.
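As a stopgap, a caller who knows the labels were integers can cast them back after deserializing. A small pandas-only sketch of that workaround (the stringified frame below simulates what comes out of the round trip):

```python
import pandas as pd

# Simulate the lossy round trip: the integer column label 0 comes
# back as the string '0'.
df = pd.DataFrame({"0": [1, 2]})

# Workaround: convert the labels back to integers when they are
# known to have been integers originally.
df.columns = [int(c) for c in df.columns]
print(list(df.columns))  # [0]
```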





[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length

2017-09-19 Thread Tom Augspurger (JIRA)
Tom Augspurger created ARROW-1557:
-

 Summary: pyarrow.Table.from_arrays doesn't validate names length
 Key: ARROW-1557
 URL: https://issues.apache.org/jira/browse/ARROW-1557
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.0
Reporter: Tom Augspurger
Priority: Minor


pa.Table.from_arrays doesn't validate that the lengths of {{arrays}} and
{{names}} match. I think this should raise a {{ValueError}}:

{code:python}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a',
'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}
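A plain length check would be enough to catch this. A sketch of the intended behavior (this helper is illustrative, not pyarrow's actual code):

```python
def validate_from_arrays(arrays, names):
    """Length check that pa.Table.from_arrays arguably should perform
    before building the table."""
    if names is not None and len(names) != len(arrays):
        raise ValueError(
            "got %d arrays but %d names" % (len(arrays), len(names)))

validate_from_arrays([[1, 2], [3, 4]], ['a', 'b'])  # ok, lengths match
try:
    validate_from_arrays([[1, 2], [3, 4]], ['a', 'b', 'c'])
except ValueError as exc:
    print(exc)  # got 2 arrays but 3 names
```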

(This is my first time using JIRA, hopefully I didn't mess up too badly)


