[jira] [Created] (ARROW-1883) [Python] BUG: Table.to_pandas metadata checking fails if columns are not present

2017-12-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-1883:


 Summary: [Python] BUG: Table.to_pandas metadata checking fails if 
columns are not present
 Key: ARROW-1883
 URL: https://issues.apache.org/jira/browse/ARROW-1883
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.7.1
Reporter: Joris Van den Bossche


Found this bug in the example in the pandas documentation, which does:

```
df = pd.DataFrame({'a': list('abc'),
   'b': list(range(1, 4)),
   'c': np.arange(3, 6).astype('u1'),
   'd': np.arange(4.0, 7.0, dtype='float64'),
   'e': [True, False, True],
   'f': pd.date_range('20130101', periods=3),
   'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})

df.to_parquet('example_pa.parquet', engine='pyarrow')

pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
```

and reading a subset of columns in the last line raises:

```
...
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
    357     for i, col_meta in enumerate(pandas_metadata['columns']):
    358         if col_meta['pandas_type'] == 'datetimetz':
--> 359             col = table[i]
    360             converted = col.to_pandas()
    361             tz = col_meta['metadata']['timezone']

table.pxi in pyarrow.lib.Table.__getitem__()

table.pxi in pyarrow.lib.Table.column()

IndexError: Table column index 6 is out of range
```

This is due to the `pandas_metadata` being checked for all columns (in this case 
trying to handle a datetime-with-tz column), while in practice not all columns 
are present (a 'mismatch' between the pandas metadata and the actual schema). 
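
A possible direction for a fix, as a minimal sketch (hypothetical; it assumes the column lookup in `_add_any_metadata` is done by name via `Schema.get_field_index` instead of by position):

```
# Sketch: look each metadata column up by name instead of by position,
# and skip columns that are not present in the (subsetted) table.
for col_meta in pandas_metadata['columns']:
    if col_meta['pandas_type'] == 'datetimetz':
        idx = table.schema.get_field_index(col_meta['name'])
        if idx == -1:
            continue  # column not in this table (e.g. a column subset was read)
        col = table[idx]
        converted = col.to_pandas()
        tz = col_meta['metadata']['timezone']
```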

A smaller example without parquet:

```
In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", 
periods=3, tz='Europe/Brussels')})

In [39]: table = pyarrow.Table.from_pandas(df)

In [40]: table
Out[40]: 
pyarrow.Table
a: int64
b: timestamp[ns, tz=Europe/Brussels]
__index_level_0__: int64
metadata

{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}

In [41]: table.to_pandas()
Out[41]: 
   a                          b
0  1 2017-01-01 00:00:00+01:00
1  2 2017-01-02 00:00:00+01:00
2  3 2017-01-03 00:00:00+01:00

In [44]: table_without_tz = table.remove_column(1)

In [45]: table_without_tz
Out[45]: 
pyarrow.Table
a: int64
__index_level_0__: int64
metadata

{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}

In [46]: table_without_tz.to_pandas()  # <-- wrong output !
Out[46]: 
                               a
1970-01-01 01:00:00+01:00      1
1970-01-01 01:00:00.1+01:00    2
1970-01-01 01:00:00.2+01:00    3

In [47]: table_without_tz2 = table_without_tz.remove_column(1)

In [48]: table_without_tz2
Out[48]: 
pyarrow.Table
a: int64
metadata

{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
b': "0.22.0.dev0+277.gd61f411"}'}

In [49]: table_without_tz2.to_pandas() # <-- error !
---
IndexError                                Traceback (most recent call last)
 in ()
----> 1 table_without_tz2.to_pandas()

table.pxi in pyarrow.lib.Table.to_pandas()

/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
    289
```

[jira] [Created] (ARROW-3953) Pandas MultiIndex renamed labels to codes (pd 0.24)

2018-12-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-3953:


 Summary: Pandas MultiIndex renamed labels to codes (pd 0.24)
 Key: ARROW-3953
 URL: https://issues.apache.org/jira/browse/ARROW-3953
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


Pandas deprecated `MultiIndex.labels` in favor of `MultiIndex.codes` 
(https://github.com/pandas-dev/pandas/pull/23752). In the pandas 
parquet/feather tests, we are now seeing warnings about this (and I assume 
there will be warnings in the pyarrow tests as well when running on pandas master).
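
A minimal compatibility shim could look like this (a sketch, assuming read-only access to the codes/labels is all that is needed):

{code:python}
def _get_multiindex_codes(index):
    # pandas >= 0.24 renames MultiIndex.labels to MultiIndex.codes
    return index.codes if hasattr(index, 'codes') else index.labels
{code}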





[jira] [Created] (ARROW-5514) [C++] Printer for uint64 shows wrong values

2019-06-05 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5514:


 Summary: [C++] Printer for uint64 shows wrong values
 Key: ARROW-5514
 URL: https://issues.apache.org/jira/browse/ARROW-5514
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche


From the example in ARROW-5430:

{code}
In [16]: pa.array([14989096668145380166, 15869664087396458664], type=pa.uint64())
Out[16]:
[
  -3457647405564171450,
  -2577079986313092952
]
{code}

I _think_ the actual conversion is correct and it's only the printer that 
goes wrong, as {{to_numpy}} gives the correct values:

{code}
In [17]: pa.array([14989096668145380166, 15869664087396458664], type=pa.uint64()).to_numpy()
Out[17]: array([14989096668145380166, 15869664087396458664], dtype=uint64)
{code}
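
The printed values are exactly the unsigned bit patterns reinterpreted as signed 64-bit integers (two's complement), which suggests the printer casts to a signed type somewhere. A quick check with numpy:

{code:python}
import numpy as np

vals = np.array([14989096668145380166, 15869664087396458664], dtype='uint64')
# reinterpret the same bits as signed int64
print(vals.view('int64'))  # [-3457647405564171450 -2577079986313092952]
{code}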





[jira] [Created] (ARROW-5436) [Python] expose filters argument in parquet.read_table

2019-05-29 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5436:


 Summary: [Python] expose filters argument in parquet.read_table
 Key: ARROW-5436
 URL: https://issues.apache.org/jira/browse/ARROW-5436
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


Currently, the {{parquet.read_table}} function can be used both for reading a 
single file (an interface to ParquetFile) and for reading a directory (an 
interface to ParquetDataset). 

ParquetDataset has some extra keywords, such as {{filters}}, that would be nice 
to expose through {{read_table}} as well.

Of course one can always use {{ParquetDataset}} directly if you need its full 
power, but for pandas wrapping pyarrow it is easier to pass keywords through a 
single {{parquet.read_table}} call than to choose between {{read_table}} and 
{{ParquetDataset}}. Context: https://github.com/pandas-dev/pandas/issues/26551
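
A sketch of the proposed usage (hypothetical until implemented; {{filters}} would simply be forwarded to ParquetDataset):

{code:python}
import pyarrow.parquet as pq

# proposed: read_table forwards dataset keywords such as filters
table = pq.read_table('dataset_root', filters=[('year', '=', 2019)])

# roughly equivalent to what one has to write today:
table = pq.ParquetDataset('dataset_root', filters=[('year', '=', 2019)]).read()
{code}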





[jira] [Created] (ARROW-5572) [Python] raise error message when passing invalid filter in parquet reading

2019-06-12 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5572:


 Summary: [Python] raise error message when passing invalid filter 
in parquet reading
 Key: ARROW-5572
 URL: https://issues.apache.org/jira/browse/ARROW-5572
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche


From https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset

For example, when the filter specifies a normal column that is not a key in 
your partitioned folder hierarchy, the filter gets silently ignored. It would 
be nice to get an error message for this.  
Reproducible example:

{code:python}
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [1, 2, 3, 4]})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 'test_parquet_row_filters', partition_cols=['a'])
# filter on 'a' (partition column) -> works
pq.read_table('test_parquet_row_filters', filters=[('a', '=', 1)]).to_pandas()
# filter on a normal column (in the future this could do row group
# filtering) -> silently does nothing
pq.read_table('test_parquet_row_filters', filters=[('b', '=', 1)]).to_pandas()
{code}
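
A sketch of the kind of validation that could be added (hypothetical; it assumes the set of partition keys is known while parsing the filters):

{code:python}
partition_keys = {'a'}  # derived from the partitioned folder hierarchy
filters = [('b', '=', 1)]

for column, op, value in filters:
    if column not in partition_keys:
        raise ValueError(
            "filter column '{}' is not a partition key and would be "
            "silently ignored".format(column))
{code}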





[jira] [Created] (ARROW-5606) [Python] pandas.RangeIndex._start/_stop/_step are deprecated

2019-06-14 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5606:


 Summary: [Python] pandas.RangeIndex._start/_stop/_step are 
deprecated
 Key: ARROW-5606
 URL: https://issues.apache.org/jira/browse/ARROW-5606
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.14.0


Public attributes {{RangeIndex.start/stop/step}} have been added, and the 
private {{_start/_stop/_step}} are deprecated. See 
https://github.com/pandas-dev/pandas/pull/26581
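
A small compatibility helper could cover both pandas versions (sketch):

{code:python}
def _range_index_attrs(index):
    # pandas >= 0.25 exposes public start/stop/step; older versions only
    # have the (now deprecated) private attributes
    if hasattr(index, 'start'):
        return index.start, index.stop, index.step
    return index._start, index._stop, index._step
{code}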





[jira] [Created] (ARROW-5655) [Python] Table.from_pydict/from_arrays not using types in specified schema correctly

2019-06-19 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5655:


 Summary: [Python] Table.from_pydict/from_arrays not using types in 
specified schema correctly 
 Key: ARROW-5655
 URL: https://issues.apache.org/jira/browse/ARROW-5655
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Example with {{from_pydict}} (from 
https://github.com/apache/arrow/pull/4601#issuecomment-503676534):

{code:python}
In [15]: table = pa.Table.from_pydict(
...: {'a': [1, 2, 3], 'b': [3, 4, 5]},
...: schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))

In [16]: table
Out[16]: 
pyarrow.Table
a: int64
c: int32

In [17]: table.to_pandas()
Out[17]: 
   a  c
0  1  3
1  2  0
2  3  4
{code}

Note that the specified schema 1) has different column names and 2) has a 
non-default type (int32 vs int64), which leads to corrupted values.

This is partly due to {{Table.from_pydict}} not using the type information in 
the schema to convert the dictionary items to pyarrow arrays. But 
{{Table.from_arrays}} is also not correctly casting the arrays to another 
dtype when the schema specifies one.

An additional question for {{Table.from_pydict}} is whether it actually should 
override the 'b' key from the dictionary as column 'c' as defined in the schema 
(this behaviour depends on the order of the dictionary, which is not guaranteed 
before Python 3.6).
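
A sketch of stricter behaviour (a hypothetical helper, not the actual implementation): convert each value with the type from the schema, and fail loudly on name mismatches:

{code:python}
def from_pydict_with_schema(mapping, schema):
    arrays = []
    for field in schema:
        if field.name not in mapping:
            raise KeyError(
                "schema field {!r} not found in mapping".format(field.name))
        # use the schema's type for the conversion
        arrays.append(pa.array(mapping[field.name], type=field.type))
    return pa.Table.from_arrays(arrays, schema=schema)
{code}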






[jira] [Created] (ARROW-5654) [C++] ChunkedArray should validate the types of the arrays

2019-06-19 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5654:


 Summary: [C++] ChunkedArray should validate the types of the arrays
 Key: ARROW-5654
 URL: https://issues.apache.org/jira/browse/ARROW-5654
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Example from Python, showing that you can currently create a ChunkedArray with 
incompatible types:

{code:python}
In [8]: a1 = pa.array([1, 2])

In [9]: a2 = pa.array(['a', 'b'])

In [10]: pa.chunked_array([a1, a2])
Out[10]:

[
  [
1,
2
  ],
  [
"a",
"b"
  ]
]
{code}

So a {{ChunkedArray::Validate}} method can be implemented (which should 
probably also be called by default upon creation?).
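
In Python terms, the check would amount to something like this (a sketch; the real implementation would live in C++):

{code:python}
def validate_chunks(chunks):
    if not chunks:
        return
    expected = chunks[0].type
    for chunk in chunks[1:]:
        if chunk.type != expected:
            raise pa.ArrowInvalid(
                "chunk type {} does not match {}".format(chunk.type, expected))
{code}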





[jira] [Created] (ARROW-5295) [Python] accept pyarrow values / scalars in constructor functions ?

2019-05-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5295:


 Summary: [Python] accept pyarrow values / scalars in constructor 
functions ?
 Key: ARROW-5295
 URL: https://issues.apache.org/jira/browse/ARROW-5295
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, functions like {{pyarrow.array}} don't accept pyarrow Arrays, nor 
pyarrow scalars:

{code}
In [42]: arr = pa.array([1, 2, 3])

In [43]: pa.array(arr)
...
ArrowInvalid: Could not convert 1 with type pyarrow.lib.Int64Value: did not 
recognize Python value type when inferring an Arrow data type

In [44]: pa.array(list(arr))
...
ArrowInvalid: Could not convert 1 with type pyarrow.lib.Int64Value: did not 
recognize Python value type when inferring an Arrow data type
{code}

Do we want to allow / recognize those here? (The first case could even have a 
fastpath, as we don't need to do the conversion element by element.)
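
The fastpath could look roughly like this (a sketch; a hypothetical wrapper around {{pa.array}}):

{code:python}
def array_accepting_arrow(obj, type=None):
    if isinstance(obj, pa.Array):
        # fastpath: no element-by-element inference needed
        return obj.cast(type) if type is not None else obj
    return pa.array(obj, type=type)
{code}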

Also scalars are not supported:

{code}
In [46]: type(arr.sum())
Out[46]: pyarrow.lib.Int64Scalar

In [47]: pa.array([arr.sum()])
...
ArrowInvalid: Could not convert 6 with type pyarrow.lib.Int64Scalar: did not 
recognize Python value type when inferring an Arrow data type
{code}

Other functions also don't accept arrow scalars / values:

{code}
In [48]: string = pa.array(['a'])[0]

In [49]: type(string)
Out[49]: pyarrow.lib.StringValue

In [50]: pa.field(string, pa.int64())
...
TypeError: expected bytes, pyarrow.lib.StringValue found
{code}
 

 





[jira] [Created] (ARROW-5291) [Python] Add wrapper for "take" kernel on Array

2019-05-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5291:


 Summary: [Python] Add wrapper for "take" kernel on Array 
 Key: ARROW-5291
 URL: https://issues.apache.org/jira/browse/ARROW-5291
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


Expose the {{take}} kernel (for primitive types, ARROW-2102) on the python 
{{Array}} class. Part of ARROW-2667.
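
Expected usage once exposed (a sketch; the method name is assumed to follow the C++ kernel):

{code:python}
arr = pa.array([10, 20, 30])
indices = pa.array([2, 0])
arr.take(indices)  # -> [30, 10]
{code}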





[jira] [Created] (ARROW-5293) [C++] Take kernel on DictionaryArray does not preserve ordered flag

2019-05-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5293:


 Summary: [C++] Take kernel on DictionaryArray does not preserve 
ordered flag
 Key: ARROW-5293
 URL: https://issues.apache.org/jira/browse/ARROW-5293
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


In the Python tests I was adding, this was failing for an ordered 
DictionaryArray: 
https://github.com/apache/arrow/pull/4281/commits/1f65936e1a06ae415647af7d5c7f54c5937861f6#diff-01b63f189a63c0d4016f2f91370e08fcR92





[jira] [Created] (ARROW-5301) [Python] parquet documentation outdated on nthreads argument

2019-05-11 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5301:


 Summary: [Python] parquet documentation outdated on nthreads 
argument
 Key: ARROW-5301
 URL: https://issues.apache.org/jira/browse/ARROW-5301
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


[https://arrow.apache.org/docs/python/parquet.html#multithreaded-reads] still 
mentions {{nthreads}} instead of {{use_threads}}.
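
For reference, the current keyword is {{use_threads}}:

{code:python}
import pyarrow.parquet as pq

table = pq.read_table('example.parquet', use_threads=True)
{code}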

 

From https://github.com/pandas-dev/pandas/issues/26340





[jira] [Created] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel

2019-05-13 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5311:


 Summary: [C++] Return more specific invalid Status in Take kernel
 Key: ARROW-5311
 URL: https://issues.apache.org/jira/browse/ARROW-5311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


Currently the {{Take}} kernel returns a generic Invalid Status for certain 
cases that could use a more specific error (sketched from the Python side below):

- indices of the wrong type (eg floats) -> TypeError instead of Invalid?
- out-of-bounds index -> a new IndexError?
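
In Python terms, the expected behaviour would be (a sketch; error types as proposed above):

{code:python}
arr = pa.array([1, 2, 3])
arr.take(pa.array([0.5]))  # float indices: should raise a TypeError
arr.take(pa.array([10]))   # out-of-bounds index: should raise an IndexError
{code}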

From review in https://github.com/apache/arrow/pull/4281

cc [~bkietz]





[jira] [Created] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory

2019-05-13 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5310:


 Summary: [Python] better error message on creating ParquetDataset 
from empty directory
 Key: ARROW-5310
 URL: https://issues.apache.org/jira/browse/ARROW-5310
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Currently, when {{path}} is an existing but empty directory, you get:

{code:python}
>>> dataset = pq.ParquetDataset(path)
---
IndexError                                Traceback (most recent call last)
 in 
----> 1 dataset = pq.ParquetDataset(path)

~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths, 
filesystem, schema, metadata, split_row_groups, validate_schema, filters, 
metadata_nthreads, memory_map)
989 
990 if validate_schema:
--> 991 self.validate_schemas()
992 
993 if filters is not None:

~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self)
   1025 self.schema = self.common_metadata.schema
   1026 else:
-> 1027 self.schema = self.pieces[0].get_metadata().schema
   1028 elif self.schema is None:
   1029 self.schema = self.metadata.schema

IndexError: list index out of range
{code}

That could be a nicer error message. 

Unless we actually want to allow this? (although I am not sure there are good 
use cases of empty directories to support this, because from an empty directory 
we cannot get any schema or metadata information?) 
It is only failing when validating the schemas, so with 
{{validate_schema=False}} it actually returns a ParquetDataset object, just 
with an empty list for {{pieces}} and no schema. So it would be easy to not 
error when validating the schemas as well for this empty-directory case.
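
A sketch of a clearer check (hypothetical; attribute names as in the traceback above):

{code:python}
# hypothetical check inside ParquetDataset.validate_schemas, before
# indexing self.pieces[0]
if not self.pieces and self.metadata is None and self.common_metadata is None:
    raise ValueError(
        "ParquetDataset could not determine a schema: the directory "
        "contains no parquet files or metadata")
{code}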





[jira] [Created] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas

2019-05-20 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5379:


 Summary: [Python] support pandas' nullable Integer type in 
from_pandas
 Key: ARROW-5379
 URL: https://issues.apache.org/jira/browse/ARROW-5379
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/issues/4168. We should add support for 
pandas' nullable Integer extension dtypes, as those could map nicely to Arrow's 
integer types. 

Ideally this happens in a generic way though, and not specifically for this 
extension type; that is discussed in ARROW-5271.
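
A manual workaround today goes through object dtype (a sketch; it assumes missing values surface as NaN in the object array):

{code:python}
import pandas as pd

s = pd.Series([1, 2, None], dtype='Int64')
# pa.array(s) currently fails; converting via objects works:
pa.array(s.astype(object), from_pandas=True)  # -> int64 array with a null
{code}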





[jira] [Created] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData

2019-05-16 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5349:


 Summary: [Python/C++] Provide a way to specify the file path in 
parquet ColumnChunkMetaData
 Key: ARROW-5349
 URL: https://issues.apache.org/jira/browse/ARROW-5349
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now possible 
to collect the file metadata while writing different files (how to then write 
that combined metadata was not yet addressed there; see the original issue ARROW-1983).

However, currently the {{file_path}} information in the ColumnChunkMetaData 
object is not set. This is, I think, expected / correct for the metadata as 
included within the single file; but for using the metadata in the combined 
dataset `_metadata`, a file path needs to be set.

So if you want to use this metadata for a partitioned dataset, there needs to 
be a way to specify this file path. 
Ideas I am currently thinking of: either we could specify a file path to be 
used when writing, or we could expose the `set_file_path` method on the Python 
side so you can create an updated version of the metadata after collecting it.
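
A sketch of the second idea (`set_file_path` exposed on the collected FileMetaData objects; the `metadata_collector` keyword is an assumption here, following ARROW-5258):

{code:python}
collector = []
pq.write_to_dataset(table, 'dataset_root', partition_cols=['year'],
                    metadata_collector=collector)
# afterwards, update each collected FileMetaData with its path
# relative to the dataset root (path chosen by the user)
collector[0].set_file_path('year=2019/part-0.parquet')
{code}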

cc [~pearu] [~mdurant]





[jira] [Created] (ARROW-5237) [Python] pandas_version key in pandas metadata no longer populated

2019-04-29 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5237:


 Summary: [Python] pandas_version key in pandas metadata no longer 
populated
 Key: ARROW-5237
 URL: https://issues.apache.org/jira/browse/ARROW-5237
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.14.0


While looking at the pandas metadata, I noticed that the {{pandas_version}} 
field now is None. I suppose this is due to the recent refactoring of the 
pandas api compat (https://github.com/apache/arrow/pull/3893). PR coming.





[jira] [Created] (ARROW-5271) [Python] Interface for converting pandas ExtensionArray / other custom array objects to pyarrow Array

2019-05-06 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5271:


 Summary: [Python] Interface for converting pandas ExtensionArray / 
other custom array objects to pyarrow Array
 Key: ARROW-5271
 URL: https://issues.apache.org/jira/browse/ARROW-5271
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Related to ARROW-2428, which describes the issue to convert back to an 
ExtensionArray in {{to_pandas}}.

To start supporting to convert custom ExtensionArrays (eg the nullable 
Int64Dtype in pandas, or the arrow-backed fletcher arrays, ...) to arrow Arrays 
(eg in {{pyarrow.array(..)}}), I think it would be good to define an interface 
or hook that external projects can implement and that pyarrow will call if 
available. 
This would allow external projects to define how they can be converted to arrow 
arrays, without the need that pyarrow itself starts to gather a lot of special 
cased code for certain types (like pandas' nullable Int64).

This could be similar to how numpy looks for the {{__array__}} method, so we 
might call it {{__arrow_array__}}.
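
A sketch of what the protocol could look like on the producing side (a hypothetical class implementing the suggested {{__arrow_array__}} hook):

{code:python}
class MyCustomArray:
    def __init__(self, values):
        self._values = list(values)

    def __arrow_array__(self, type=None):
        # pyarrow.array() would call this hook when it is present
        return pa.array(self._values, type=type)
{code}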

See also https://github.com/pandas-dev/pandas/issues/20612 for an issue 
discussing this on the pandas side.





[jira] [Created] (ARROW-5248) [Python] support dateutil timezones

2019-05-02 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5248:


 Summary: [Python] support dateutil timezones
 Key: ARROW-5248
 URL: https://issues.apache.org/jira/browse/ARROW-5248
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The {{dateutil}} package also provides a set of timezone objects 
(https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. In 
pyarrow, we only support pytz timezones (and the stdlib datetime.timezone fixed 
offset):

{code}
In [2]: import dateutil.tz

In [3]: import pyarrow as pa

In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels'))
...
~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in 
pyarrow.lib.tzinfo_to_string()

ValueError: Unable to convert timezone 
`tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string
{code}

But pandas also supports dateutil timezones. As a consequence, when you have a 
pandas DataFrame that uses a dateutil timezone, you get an error when 
converting it to an arrow table.
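
One possible approach for {{tzinfo_to_string}} is to derive the IANA name from the tzfile's path (a sketch; {{_filename}} is a dateutil internal attribute):

{code:python}
import dateutil.tz

def dateutil_tz_to_string(tz):
    if isinstance(tz, dateutil.tz.tzfile):
        # '/usr/share/zoneinfo/Europe/Brussels' -> 'Europe/Brussels'
        return '/'.join(tz._filename.split('/')[-2:])
    raise ValueError("Unable to convert timezone {!r} to string".format(tz))
{code}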





[jira] [Created] (ARROW-5287) [Python] automatic type inference for arrays of tuples

2019-05-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5287:


 Summary: [Python] automatic type inference for arrays of tuples
 Key: ARROW-5287
 URL: https://issues.apache.org/jira/browse/ARROW-5287
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Arrays of tuples can be converted to either ListArray or StructArray, if you 
specify the type explicitly:

{code}
In [6]: pa.array([(1, 2), (3, 4, 5)], type=pa.list_(pa.int64()))
Out[6]:
[
  [
    1,
    2
  ],
  [
    3,
    4,
    5
  ]
]

In [7]: pa.array([(1, 2), (3, 4)], type=pa.struct([('a', pa.int64()), ('b', pa.int64())]))
Out[7]:
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    3
  ]
-- child 1 type: int64
  [
    2,
    4
  ]
{code}

But not when no type is specified:

{code}
In [8]: pa.array([(1, 2), (3, 4)])
---
ArrowInvalid                              Traceback (most recent call last)
 in 
----> 1 pa.array([(1, 2), (3, 4)])

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert (1, 2) with type tuple: did not recognize 
Python value type when inferring an Arrow data type
{code}

Do we want to do automatic type inference for tuples as well (defaulting to 
the ListArray case, just as arrays of python lists are supported)? 
Or was there a specific reason not to support this by default?





[jira] [Created] (ARROW-5857) [Python] converting multidimensional numpy arrays to nested list type

2019-07-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5857:


 Summary: [Python] converting multidimensional numpy arrays to 
nested list type
 Key: ARROW-5857
 URL: https://issues.apache.org/jira/browse/ARROW-5857
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently we only support 1-dimensional numpy arrays:

{code:python}
In [28]: pa.array([np.array([[1, 2], [3, 4]])], type=pa.list_(pa.list_(pa.int64())))
...
ArrowInvalid: Can only convert 1-dimensional array values
{code}

So to create a nested list array, you currently have to do that with lists of 
lists, or with object-dtype numpy arrays that have arrays as elements. We could 
expand this support to multi-dimensional numpy arrays.

I am not sure we should do type inference by default for this case, but at 
least when a nested ListType is specified explicitly, this would be nice. 

It can be an alternative way to have some support for tensors, next to an 
ExtensionType (ARROW-1614 / ARROW-5819)

Related discussions: 
https://lists.apache.org/thread.html/9b142c1709aa37dc35f1ce8db4e1ced94fcc4cdd96cc72b5772b373b@%3Cdev.arrow.apache.org%3E,
 https://github.com/apache/arrow/issues/4802





[jira] [Created] (ARROW-5858) [Doc] Better document the Tensor classes in the prose documentation

2019-07-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5858:


 Summary: [Doc] Better document the Tensor classes in the prose 
documentation
 Key: ARROW-5858
 URL: https://issues.apache.org/jira/browse/ARROW-5858
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation, Python
Reporter: Joris Van den Bossche


From a comment from [~wesmckinn] in ARROW-2714:

{quote}The Tensor classes are independent from the columnar data structures, 
though they reuse pieces of metadata, metadata serialization, memory 
management, and IPC.

The purpose of adding these to the library was to have in-memory data 
structures for handling Tensor/ndarray data and metadata that "plug in" to the 
rest of the Arrow C++ system (Plasma store, IO subsystem, memory pools, 
buffers, etc.).

Theoretically you could return a Tensor when creating a non-contiguous slice of 
an Array; in light of the above, I don't think that would be intuitive.

When we started the project, our focus was creating an open standard for 
in-memory columnar data, a hitherto unsolved problem. The project's scope has 
expanded into peripheral problems in the same domain in the meantime (with the 
mantra of creating interoperable components, a use-what-you-need development 
platform for system developers). I think this aspect of the project could be 
better documented / advertised, since the project's initial focus on the 
columnar standard has given some the mistaken impression that we are not 
interested in any work outside of that.
{quote}





[jira] [Created] (ARROW-5853) [Python] Expose boolean filter kernel on Array

2019-07-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5853:


 Summary: [Python] Expose boolean filter kernel on Array
 Key: ARROW-5853
 URL: https://issues.apache.org/jira/browse/ARROW-5853
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Expose the filter kernel (https://issues.apache.org/jira/browse/ARROW-1558) on 
the python Array class.
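
Expected usage once exposed (a sketch; the method name is assumed analogous to {{take}}):

{code:python}
arr = pa.array([1, 2, 3])
mask = pa.array([True, False, True])
arr.filter(mask)  # -> [1, 3]
{code}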





[jira] [Created] (ARROW-5855) [Python] Add support for Duration type

2019-07-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5855:


 Summary: [Python] Add support for Duration type
 Key: ARROW-5855
 URL: https://issues.apache.org/jira/browse/ARROW-5855
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Add support for the Duration type (added in C++: ARROW-835, ARROW-5261)

- add DurationType and DurationArray wrappers
- add inference support for datetime.timedelta / np.timedelta64
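
Expected usage once implemented (a sketch; {{pa.duration}} as the presumed factory name):

{code:python}
import datetime

pa.array([datetime.timedelta(seconds=1), None], type=pa.duration('us'))
{code}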





[jira] [Created] (ARROW-5859) [Python] Support ExtensionType on conversion to numpy/pandas

2019-07-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5859:


 Summary: [Python] Support ExtensionType on conversion to 
numpy/pandas
 Key: ARROW-5859
 URL: https://issues.apache.org/jira/browse/ARROW-5859
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently converting a Table of RecordBatch with an ExtensionType array to 
pandas gives:

{code}
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type extension is known.
{code}

And similarly converting the array itself to a python object (to_pandas or 
to_pylist) gives an ArrowNotImplementedError or a "KeyError: 28"

Initial support could be to fall back to the storage type.
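
That fallback would amount to something like this (a Python sketch; {{ExtensionArray.storage}} is the underlying storage-typed array):

{code:python}
def extension_array_to_pandas(arr):
    # fall back to converting the underlying storage array
    return arr.storage.to_pandas()
{code}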





[jira] [Created] (ARROW-5854) [Python] Expose compare kernels on Array class

2019-07-04 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5854:


 Summary: [Python] Expose compare kernels on Array class
 Key: ARROW-5854
 URL: https://issues.apache.org/jira/browse/ARROW-5854
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Expose the compare kernel for comparing with scalar or array 
(https://issues.apache.org/jira/browse/ARROW-3087, 
https://issues.apache.org/jira/browse/ARROW-4990) on the python Array class.

This can implement the {{__eq__}} et al. dunder methods on the Array class.
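
The intended behaviour, sketched:

{code:python}
arr = pa.array([1, 2, 3])
arr == 2                    # -> boolean array [false, true, false]
arr == pa.array([1, 0, 3])  # -> elementwise [true, false, true]
{code}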






[jira] [Created] (ARROW-5864) [Python] simplify cython wrapping of Result

2019-07-05 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5864:


 Summary: [Python] simplify cython wrapping of Result
 Key: ARROW-5864
 URL: https://issues.apache.org/jira/browse/ARROW-5864
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


See answer in https://github.com/cython/cython/issues/3018





[jira] [Created] (ARROW-5915) [C++] [Python] Set up testing for backwards compatibility of the parquet reader

2019-07-11 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5915:


 Summary: [C++] [Python] Set up testing for backwards compatibility 
of the parquet reader
 Key: ARROW-5915
 URL: https://issues.apache.org/jira/browse/ARROW-5915
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Python
Reporter: Joris Van den Bossche


Given the recent parquet compat problems, we should have better testing for 
this.

For easy testing of backwards compatibility, we could add some files (with 
different types) written with older versions to pyarrow/tests/data/parquet 
(we already have some files from 0.7 there) and ensure they are read correctly 
with the current version.

Similar to what Kartothek is doing: 
https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat
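
Such a test could be as simple as (a sketch; the file layout is assumed):

{code:python}
import glob
import pyarrow.parquet as pq

def test_read_files_written_by_old_versions():
    for path in glob.glob('pyarrow/tests/data/parquet/*.parquet'):
        pq.read_table(path)  # should not raise
{code}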







[jira] [Created] (ARROW-5905) [Python] support conversion to decimal type from floats?

2019-07-10 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5905:


 Summary: [Python] support conversion to decimal type from floats?
 Key: ARROW-5905
 URL: https://issues.apache.org/jira/browse/ARROW-5905
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


We currently allow constructing a decimal array from decimal.Decimal objects or 
from ints:

{code}
In [14]: pa.array([1, 0], type=pa.decimal128(2))
Out[14]:

[
  1,
  0
]

In [31]: pa.array([decimal.Decimal('0.1'), decimal.Decimal('0.2')], pa.decimal128(2, 1))
Out[31]:

[
  0.1,
  0.2
]
{code}

but not from floats (or strings):

{code}
In [18]: pa.array([0.1, 0.2], pa.decimal128(2))
...
ArrowTypeError: int or Decimal object expected, got float
{code}

Is this something we would like to support?

There are for sure precision issues you can run into, but if the decimal type 
is fully specified, it seems clear what the user wants. In general, since 
decimal objects are not that easy to work with in pandas, many people might 
have plain float columns that they want to convert to decimal. 
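
A workaround today is to go through {{decimal.Decimal}} explicitly (a sketch; note the string round-trip to avoid binary-float artifacts):

{code:python}
import decimal

floats = [0.1, 0.2]
pa.array([decimal.Decimal(str(f)) for f in floats], pa.decimal128(2, 1))
{code}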





[jira] [Created] (ARROW-5201) [Python] Import ABCs from collections is deprecated in Python 3.7

2019-04-23 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5201:


 Summary: [Python] Import ABCs from collections is deprecated in 
Python 3.7
 Key: ARROW-5201
 URL: https://issues.apache.org/jira/browse/ARROW-5201
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From running the tests, I see a few deprecation warnings related to this: on 
Python 3, abstract base classes should be imported from `collections.abc` 
instead of `collections`:

{code:none}
pyarrow/tests/test_array.py:808
  /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_array.py:808: 
DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
from 'collections.abc' is deprecated, and in 3.8 it will stop working
    pa.struct([pa.field('a', pa.int64()), pa.field('b', pa.string())]))

pyarrow/tests/test_table.py:18
  /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_table.py:18: 
DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import OrderedDict, Iterable

pyarrow/tests/test_feather.py::TestFeatherReader::test_non_string_columns
  /home/joris/scipy/repos/arrow/python/pyarrow/pandas_compat.py:294: 
DeprecationWarning: Using or importing the ABCs from 'collections' instead of 
from 'collections.abc' is deprecated, and in 3.8 it will stop working
    elif isinstance(name, collections.Sequence):{code}

Those could be imported, depending on Python 2/3, in the ``pyarrow.compat`` 
module.
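
For example (a sketch of what could go into ``pyarrow.compat``):

{code:python}
import sys

if sys.version_info >= (3,):
    from collections.abc import Iterable, Sequence
else:
    from collections import Iterable, Sequence
{code}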





[jira] [Created] (ARROW-5210) [Python] editable install (pip install -e .) is failing

2019-04-24 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5210:


 Summary: [Python] editable install (pip install -e .) is failing 
 Key: ARROW-5210
 URL: https://issues.apache.org/jira/browse/ARROW-5210
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Following the python development documentation on building arrow and pyarrow 
(https://arrow.apache.org/docs/developers/python.html#build-and-test), 
building pyarrow inplace with {{python setup.py build_ext --inplace}} works 
fine.

 

But if you also want to install this inplace version in the current python 
environment (editable install / development install) using pip ({{pip install 
-e .}}), this fails during the {{build_ext}} / cmake phase:
{code:none}
 
-- Looking for python3.7m
    -- Found Python lib /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
    CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
  NumPy import failure:

  Traceback (most recent call last):

    File "", line 1, in 

  ModuleNotFoundError: No module named 'numpy'

    Call Stack (most recent call first):
  CMakeLists.txt:186 (find_package)


    -- Configuring incomplete, errors occurred!
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
    See also 
"/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
    error: command 'cmake' failed with exit status 1
Cleaning up...
{code}
 

Alternatively, doing `python setup.py develop` to achieve the same does work.

 





[jira] [Created] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas

2019-04-26 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5220:


 Summary: [Python] index / unknown columns in specified schema in 
Table.from_pandas
 Key: ARROW-5220
 URL: https://issues.apache.org/jira/browse/ARROW-5220
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The {{Table.from_pandas}} method allows to specify a schema ("This can be used 
to indicate the type of columns if we cannot infer it automatically.").

But, if you also want to specify the type of the index, you get an error:

{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')

my_schema = pa.schema([('index', pa.string()),
                       ('a', pa.int64()),
                       ('b', pa.float64()),
                       ])

table = pa.Table.from_pandas(df, schema=my_schema)
{code}

gives {{KeyError: 'index'}} (because it tries to look up the "column names" 
from the schema in the dataframe, and thus does not find column 'index').

This also has the consequence that re-using the schema does not work: {{table1 
= pa.Table.from_pandas(df1);  table2 = pa.Table.from_pandas(df2, 
schema=table1.schema)}}

Extra note: also unknown columns in general give this error (column specified 
in the schema that are not in the dataframe).

At least in pyarrow 0.11 this did not give an error (e.g. noticed from the 
example code in ARROW-3861). So before, unknown columns in the specified 
schema were ignored, while now they raise an error. Was this a conscious 
change?  
(Specifying the index in the schema also "worked" before, in the sense that it 
didn't raise an error, but it was ignored, so it didn't actually do what you 
would expect.)

Questions:

- I think we should support specifying the index in the passed {{schema}}, so 
that the example above works (although this might be complicated with 
RangeIndex, which is no longer serialized).
- But what to do in general with additional columns in the schema that are not 
in the DataFrame? Are we fine with keeping the error as it is now (the error 
message could be improved)? Or do we want to ignore them again? (Or we could 
actually add them to the table as all-null columns.)





[jira] [Created] (ARROW-6321) [Python] Ability to create ExtensionBlock on conversion to pandas

2019-08-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6321:


 Summary: [Python] Ability to create ExtensionBlock on conversion 
to pandas
 Key: ARROW-6321
 URL: https://issues.apache.org/jira/browse/ARROW-6321
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


To be able to create a pandas DataFrame in {{to_pandas()}} that holds 
ExtensionArrays (e.g. towards ARROW-2428, registering a conversion), we first 
need to add to the {{table_to_blockmanager}} / {{ConvertTableToPandas}} 
conversion utilities the ability to create a pandas {{ExtensionBlock}} that 
can hold a pandas {{ExtensionArray}}.





[jira] [Created] (ARROW-6305) [Python] scalar pd.NaT incorrectly parsed in conversion from Python

2019-08-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6305:


 Summary: [Python] scalar pd.NaT incorrectly parsed in conversion 
from Python
 Key: ARROW-6305
 URL: https://issues.apache.org/jira/browse/ARROW-6305
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


When converting from scalar values, using {{pd.NaT}} (the missing value 
indicator that pandas uses for datetime64 data) results in an incorrect 
timestamp:

{code}
In [6]: pa.array([pd.Timestamp("2012-01-01"), pd.NaT]) 
Out[6]: 

[
  2012-01-01 00:00:00.00,
  0001-01-01 00:00:00.00
]
{code}

where {{pd.NaT}} is converted to "0001-01-01", which is strange, as that does 
not even correspond to the integer value of pd.NaT. 
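
For reference, pd.NaT's underlying integer value is the int64 minimum, the same sentinel numpy uses:

{code:python}
import numpy as np
import pandas as pd

pd.NaT.value                          # -9223372036854775808
np.datetime64('NaT').astype('int64')  # -9223372036854775808
{code}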

Numpy's version ({{np.datetime64('NaT')}}) is handled correctly, which also 
means that a pandas Series holding pd.NaT is handled correctly (since the 
conversion goes through numpy's NaT).

Related to ARROW-842.





[jira] [Created] (ARROW-6325) [Python] wrong conversion of DataFrame with boolean values

2019-08-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6325:


 Summary: [Python] wrong conversion of DataFrame with boolean values
 Key: ARROW-6325
 URL: https://issues.apache.org/jira/browse/ARROW-6325
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Joris Van den Bossche
 Fix For: 0.15.0


From https://github.com/pandas-dev/pandas/issues/28090

{code}
In [19]: df = pd.DataFrame(np.ones((5, 2), dtype=bool), columns=['a', 'b']) 

In [20]: df  
Out[20]: 
  a b
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

In [21]: table = pa.table(df) 

In [23]: table.column(0)
Out[23]: 

[
  [
true,
false,
false,
false,
false
  ]
]
{code}

The resulting table has False values while the original DataFrame had only True 
values. 
It seems this has to do with there being multiple columns; with a single 
column it converts correctly.





[jira] [Created] (ARROW-6548) [Python] consistently handle conversion of all-NaN arrays across types

2019-09-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6548:


 Summary: [Python] consistently handle conversion of all-NaN arrays 
across types
 Key: ARROW-6548
 URL: https://issues.apache.org/jira/browse/ARROW-6548
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


In ARROW-5682 (https://github.com/apache/arrow/pull/5333), next to fixing 
actual conversion bugs, I added the ability to convert all-NaN float arrays 
when converting to string type (and only with {{from_pandas=True}}). So this 
now works:

{code}
>>> pa.array(np.array([np.nan, np.nan], dtype=float), type=pa.string())

[
  null,
  null
]
{code}

However, I only added this for string type (and it already works for float and 
int types). If we are happy with this behaviour, we should also add it for 
other types.
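
For comparison, the int case, which per the above already works:

{code}
>>> pa.array(np.array([np.nan, np.nan], dtype=float), type=pa.int64(),
...          from_pandas=True)
[
  null,
  null
]
{code}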






[jira] [Created] (ARROW-6492) [Python] file written with latest fastparquet cannot be read with latest pyarrow

2019-09-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6492:


 Summary: [Python] file written with latest fastparquet cannot be 
read with latest pyarrow
 Key: ARROW-6492
 URL: https://issues.apache.org/jira/browse/ARROW-6492
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From a report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252

With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), 
a file written with pandas using the fastparquet engine cannot be read with 
the pyarrow engine:

{code}
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_parquet("test.parquet", engine="fastparquet", compression=None)
pd.read_parquet("test.parquet", engine="pyarrow")
{code}

gives the following error when reading:

{code}
----> 1 pd.read_parquet("test.parquet", engine="pyarrow")

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in 
read_parquet(path, engine, columns, **kwargs)
292 
293 impl = get_engine(engine)
--> 294 return impl.read(path, columns=columns, **kwargs)

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, 
path, columns, **kwargs)
123 kwargs["use_pandas_metadata"] = True
124 result = self.api.parquet.read_table(
--> 125 path, columns=columns, **kwargs
126 ).to_pandas()
127 if should_close:

~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in 
pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in 
pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
table_to_blockmanager(options, table, categories, ignore_metadata)
642 column_indexes = pandas_metadata.get('column_indexes', [])
643 index_descriptors = pandas_metadata['index_columns']
--> 644 table = _add_any_metadata(table, pandas_metadata)
645 table, index = _reconstruct_index(table, index_descriptors,
646   all_columns)

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
_add_any_metadata(table, pandas_metadata)
965 raw_name = 'None'
966 
--> 967 idx = schema.get_field_index(raw_name)
968 if idx != -1:
969 if col_meta['pandas_type'] == 'datetimetz':

~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in 
pyarrow.lib.Schema.get_field_index()

~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so
 in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, dict found
{code}





[jira] [Created] (ARROW-6529) [C++] Feather: slow writing of NullArray

2019-09-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6529:


 Summary: [C++] Feather: slow writing of NullArray
 Key: ARROW-6529
 URL: https://issues.apache.org/jira/browse/ARROW-6529
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none

A smaller example using just pyarrow: it seems that writing an array of nulls 
takes much longer than an array of, for example, ints, which seems a bit 
strange:

{code}
In [93]: arr = pa.array([1]*1000)  

In [94]: %%timeit 
...: w = pyarrow.feather.FeatherWriter('__test.feather') 
...: w.writer.write_array('x', arr) 
...: w.writer.close() 

31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)

In [95]: arr = pa.array([None]*1000)  

In [96]: arr
Out[96]: 

1000 nulls

In [97]: %%timeit 
...: w = pyarrow.feather.FeatherWriter('__test.feather') 
...: w.writer.write_array('x', arr) 
...: w.writer.close() 

3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}

So writing a NullArray of the same length takes about 100x more time.





[jira] [Created] (ARROW-6488) [Python] pyarrow.NULL equals to itself

2019-09-08 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6488:


 Summary: [Python] pyarrow.NULL equals to itself
 Key: ARROW-6488
 URL: https://issues.apache.org/jira/browse/ARROW-6488
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.15.0


Somewhat related to ARROW-6386 on the interpretation of nulls, we currently 
have the following behaviour:

{code}
In [28]: pa.NULL == pa.NULL
Out[28]: True
{code}

This is certainly unexpected for a null / missing value, I think. I still need 
to check what the array-level compare kernel does (NULL or False? Ideally NULL, 
I think), but we should follow that behaviour.





[jira] [Created] (ARROW-6506) [C++] Validation of ExtensionType with nested type fails

2019-09-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6506:


 Summary: [C++] Validation of ExtensionType with nested type fails
 Key: ARROW-6506
 URL: https://issues.apache.org/jira/browse/ARROW-6506
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 0.15.0


A reproducer using the Python ExtensionType:

{code}
class MyStructType(pa.ExtensionType): 

def __init__(self): 
storage_type = pa.struct([('a', pa.int64()), ('b', pa.int64())]) 
pa.ExtensionType.__init__(self, storage_type, 'my_struct_type') 

def __arrow_ext_serialize__(self): 
return b'' 

@classmethod 
def __arrow_ext_deserialize__(self, storage_type, serialized): 
return MyStructType() 

ty = MyStructType()
storage_array = pa.array([{'a': 1, 'b': 2}], ty.storage_type) 
arr = pa.ExtensionArray.from_storage(ty, storage_array) 
{code}

then validating this array fails because it expects no children (the extension 
array itself has no children, only the storage array):

{code}
In [8]: arr.validate()   
---
ArrowInvalid  Traceback (most recent call last)
 in 
----> 1 arr.validate()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.validate()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Expected 0 child arrays in array of type extension, got 2
{code}






[jira] [Created] (ARROW-6507) [C++] Add ExtensionArray::ExtensionValidate for custom validation?

2019-09-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6507:


 Summary: [C++] Add ExtensionArray::ExtensionValidate for custom 
validation?
 Key: ARROW-6507
 URL: https://issues.apache.org/jira/browse/ARROW-6507
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


From discussing ARROW-6506, [~bkietz] said: an extension type might place more 
constraints on an array than those implicit in its storage type, and users 
will probably expect to be able to plug those into {{Validate}}.

So we could have a {{ExtensionArray::ExtensionValidate}} that the visitor for 
{{ExtensionArray}} can call, similarly like there is also an 
{{ExtensionType::ExtensionEquals}} that the visitor calls when extension types 
are checked for equality.





[jira] [Created] (ARROW-6556) [Python] prepare on pandas release without SparseDataFrame

2019-09-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6556:


 Summary: [Python] prepare on pandas release without SparseDataFrame
 Key: ARROW-6556
 URL: https://issues.apache.org/jira/browse/ARROW-6556
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


We still have a few places where we use SparseDataFrame. An upcoming release of 
pandas will remove this class, so we should make sure the code already works 
without it.





[jira] [Created] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays

2019-08-05 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6132:


 Summary: [Python] ListArray.from_arrays does not check validity of 
input arrays
 Key: ARROW-6132
 URL: https://issues.apache.org/jira/browse/ARROW-6132
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


From https://github.com/apache/arrow/pull/4979#issuecomment-517593918.

When creating a ListArray from offsets and values in Python, there is no 
validation that the offsets start with 0 and end with the length of the values 
array. (But is that required? The docs seem to indicate so: 
https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type,
 "The first value in the offsets array is 0, and the last element is the 
length of the values array.")
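
For reference, a valid offsets array satisfies this invariant:

{code:python}
values = np.arange(5)
offsets = [0, 3, 5]  # first element 0, last element == len(values)
pa.ListArray.from_arrays(offsets, values).to_pylist()  # [[0, 1, 2], [3, 4]]
{code}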

The array you get "seems" ok (the repr), but on conversion to python or 
flattened arrays, things go wrong:

{code}
In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 

[
  [
1,
2
  ],
  [
3,
4
  ]
]

In [63]: a.flatten()
Out[63]: 

[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes extra garbage elements
{code}


Calling {{validate}} manually correctly raises:

{code}
In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5
{code}

In C++ the main constructors are not safe: as the caller, you need to ensure 
that the data is correct, or call a safe (slower) constructor. But do we want 
the unsafe / fast constructors without validation to be the default in Python 
as well? Or should we do a call to {{validate}} here?

A quick search seems to indicate that `pa.Array.from_buffers` does validation, 
but the other `from_arrays` methods don't seem to do this explicitly. 





[jira] [Created] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6159:


 Summary: [C++] PrettyPrint of arrow::Schema missing indentation for 
first line
 Key: ARROW-6159
 URL: https://issues.apache.org/jira/browse/ARROW-6159
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.14.1
Reporter: Joris Van den Bossche


Minor issue, but I noticed it when printing a Schema with indentation, like:

{code}
  std::shared_ptr<arrow::Field> field1 = arrow::field("column1", arrow::int32());
  std::shared_ptr<arrow::Field> field2 = arrow::field("column2", arrow::utf8());

  std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});

  arrow::PrettyPrintOptions options{4};
  arrow::PrettyPrint(*schema, options, &std::cout);
{code}

you get 

{code}
column1: int32
column2: string
{code}

so the indent is not applied to the first line.





[jira] [Created] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6157:


 Summary: [Python][C++] UnionArray with invalid data passes 
validation / leads to segfaults
 Key: ARROW-6157
 URL: https://issues.apache.org/jira/browse/ARROW-6157
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


From the Python side, you can create an "invalid" UnionArray:

{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary') 
int64 = pa.array([1, 2, 3], type='int64') 
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')  # <- value 2 is out of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')

a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}

Eg on conversion to python this leads to a segfault:

{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}

On the other hand, doing an explicit validation does not give an error:

{code}
In [8]: a.validate()
{code}

Should the validation raise errors for this case? (the C++ {{ValidateVisitor}} 
for UnionArray does nothing)






[jira] [Created] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types

2019-08-07 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6158:


 Summary: [Python] possible to create StructArray with type that 
conflicts with child array's types
 Key: ARROW-6158
 URL: https://issues.apache.org/jira/browse/ARROW-6158
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Using the Python interface as example. This creates a {{StructArray}} where the 
field types don't match the child array types:

{code}
a = pa.array([1, 2, 3], type=pa.int64())
b = pa.array(['a', 'b', 'c'], type=pa.string())
inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]

a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields) 
{code}

The above works fine. I didn't find anything that errors (eg conversion to 
pandas, slicing), also validation passes, but the type actually has the 
inconsistent child types:

{code}
In [2]: a
Out[2]: 

-- is_valid: all not null
-- child 0 type: int64
  [
1,
2,
3
  ]
-- child 1 type: string
  [
"a",
"b",
"c"
  ]

In [3]: a.type
Out[3]: StructType(struct<a: int32, b: double>)

In [4]: a.to_pandas()
Out[4]: 
array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
  dtype=object)

In [5]: a.validate() 
{code}

Shouldn't this be disallowed somehow? (it could be checked in the Python 
{{from_arrays}} method, but maybe also in {{StructArray::Make}} which already 
checks for the number of fields vs arrays and a consistent array length). 

Similarly to the discussion in ARROW-6132, I would also expect that 
{{ValidateArray}} catches this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6115) [Python] support LargeList, LargeString, LargeBinary in conversion to pandas

2019-08-02 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6115:


 Summary: [Python] support LargeList, LargeString, LargeBinary in 
conversion to pandas
 Key: ARROW-6115
 URL: https://issues.apache.org/jira/browse/ARROW-6115
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


General python support for those 3 new types has been added: ARROW-6000, 
ARROW-6084

However, one aspect that is not yet implemented is conversion to pandas (or 
numpy array):

{code}
In [67]: a = pa.array(['a', 'b', 'c'], pa.large_string()) 

In [68]: a.to_pandas() 
...
ArrowNotImplementedError: large_utf8

In [69]: pa.table({'a': a}).to_pandas()
...
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of 
type large_string is known.
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?

2019-08-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6179:


 Summary: [C++] ExtensionType subclass for "unknown" types?
 Key: ARROW-6179
 URL: https://issues.apache.org/jira/browse/ARROW-6179
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche


In C++, when receiving IPC with extension type metadata for a type that is 
unknown (the name is not registered), we currently fall back to returning the 
"raw" storage array. The custom metadata (extension name and metadata) is still 
available in the Field metadata.

Alternatively, we could also have a generic {{ExtensionType}} class that can 
hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or 
{{GenericExtensionType}}), keeping the extension name and metadata in the 
Array's type. 

This could be a single class where several instances can be created given a 
storage type, extension name and optionally extension metadata. It would be a 
way to have an unregistered extension type.
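A rough Python sketch of the idea (the proposal itself targets C++; the class name and the use of the Python ExtensionType API are purely illustrative):

{code}
import pyarrow as pa

class UnknownExtensionType(pa.ExtensionType):
    """A single generic class that can represent any unregistered
    extension type, keeping the name and metadata on the type."""

    def __init__(self, storage_type, extension_name, serialized=b''):
        self._serialized = serialized
        pa.ExtensionType.__init__(self, storage_type, extension_name)

    def __arrow_ext_serialize__(self):
        # round-trip the original metadata bytes unchanged
        return self._serialized

ty = UnknownExtensionType(pa.int64(), 'thirdparty.unknown', b'some metadata')
{code}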



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type

2019-08-08 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6176:


 Summary: [Python] Allow to subclass ExtensionArray to attach to 
custom extension type
 Key: ARROW-6176
 URL: https://issues.apache.org/jira/browse/ARROW-6176
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, you can define a custom extension type in Python with 

{code}
class UuidType(pa.ExtensionType):

def __init__(self):
pa.ExtensionType.__init__(self, pa.binary(16))

def __reduce__(self):
return UuidType, ()
{code}

but the array you can create with this is always a plain ExtensionArray. We should 
provide a way to define a subclass (e.g. `UuidArray` in this case) that can hold 
custom logic.

For example, a user might want to define `UuidArray` such that `arr[i]` returns 
an instance of Python's `uuid.UUID`.
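A hypothetical sketch of what this could look like (the mechanism to attach `UuidArray` to `UuidType` is exactly what is missing today; the names are illustrative):

{code}
import uuid
import pyarrow as pa

class UuidArray(pa.ExtensionArray):

    def __getitem__(self, i):
        # interpret the 16 raw storage bytes as a uuid.UUID
        return uuid.UUID(bytes=self.storage[i].as_py())
{code}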

From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6187) [C++] fallback to storage type when writing ExtensionType to Parquet

2019-08-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6187:


 Summary: [C++] fallback to storage type when writing ExtensionType 
to Parquet
 Key: ARROW-6187
 URL: https://issues.apache.org/jira/browse/ARROW-6187
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Writing a table that contains an ExtensionType array to a parquet file is not 
yet implemented. It currently raises "ArrowNotImplementedError: Unhandled type 
for Arrow to Parquet schema conversion: extension" 
(for a PyExtensionType in this case).

I think minimal support can consist of writing the storage type / array. 
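As a sketch of that idea from the user side (defining a throwaway extension type with the Python ExtensionType API purely for illustration; note that the extension name and metadata are lost when only the storage is written):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

class UuidType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16), 'example.uuid')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

storage = pa.array([b'0123456789abcdef'], pa.binary(16))
ext_arr = pa.ExtensionArray.from_storage(UuidType(), storage)

# fall back to writing the storage array instead of the extension array
pq.write_table(pa.table({'col': ext_arr.storage}), 'test_extension.parquet')
{code}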

We also might want to save the extension name and metadata in the parquet 
FileMetadata. 

Later on, this could potentially be used to restore the extension type when 
reading. This is related to other issues that need to save the arrow schema 
(categorical: ARROW-5480, time zones: ARROW-5888). Only in this case, we 
probably want to store the serialised type in addition to the schema (which 
only has the extension type's name). 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6082) [Python] create pa.dictionary() type with non-integer indices type crashes

2019-07-31 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-6082:


 Summary: [Python] create pa.dictionary() type with non-integer 
indices type crashes
 Key: ARROW-6082
 URL: https://issues.apache.org/jira/browse/ARROW-6082
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


For example, if you mix up the order of the indices and values types:

{code}
In [1]: pa.dictionary(pa.int8(), pa.string())
Out[1]: DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

In [2]: pa.dictionary(pa.string(), pa.int8())
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0731 14:40:42.748589 26310 type.cc:440]  Check failed: 
is_integer(index_type->id()) dictionary index type should be signed integer
*** Check failure stack trace: ***
Aborted (core dumped)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6642) [Python] chained access of ParquetDataset's metadata segfaults

2019-09-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6642:


 Summary: [Python] chained access of ParquetDataset's metadata 
segfaults
 Key: ARROW-6642
 URL: https://issues.apache.org/jira/browse/ARROW-6642
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Creating and reading a parquet dataset:

{code}
table = pa.table({'a': [1, 2, 3]})

import pyarrow.parquet as pq
pq.write_table(table, '__test_statistics_segfault.parquet')
dataset = pq.ParquetDataset('__test_statistics_segfault.parquet')
dataset_piece = dataset.pieces[0]
{code}

If you access the metadata and a column's statistics in steps, this works fine:

{code}
meta = dataset_piece.get_metadata()
row = meta.row_group(0)
col = row.column(0)
{code}

but doing it chained in one step, this segfaults:

{code}
dataset_piece.get_metadata().row_group(0).column(0)
{code}

{{dataset_piece.get_metadata().row_group(0)}} still works, but chaining 
{{.column(0)}} on top of that segfaults. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6704) [C++] Cast from timestamp to higher resolution does not check out of bounds timestamps

2019-09-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6704:


 Summary: [C++] Cast from timestamp to higher resolution does not 
check out of bounds timestamps
 Key: ARROW-6704
 URL: https://issues.apache.org/jira/browse/ARROW-6704
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


When casting eg {{timestamp('s')}} to {{timestamp('ns')}}, we do not check for 
out of bounds timestamps, giving "garbage" timestamps in the result:

{code}
In [74]: a_np = np.array(["2012-01-01", "2412-01-01"], dtype="datetime64[s]")

In [75]: arr = pa.array(a_np)

In [76]: arr
Out[76]: 

[
  2012-01-01 00:00:00,
  2412-01-01 00:00:00
]

In [77]: arr.cast(pa.timestamp('ns'))
Out[77]: 

[
  2012-01-01 00:00:00.0,
  1827-06-13 00:25:26.290448384
]
{code}

Now, this is the same behaviour as numpy, so I am not sure we should change it. 
However, since we have a {{safe=True/False}} option, I would expect that for 
{{safe=True}} we check for overflow and for {{safe=False}} we do not. 
(numpy has a similar {{casting='safe'}} but also does not raise an error in 
that case.)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6763) [Python] Parquet s3 tests are skipped because dependencies are not installed

2019-10-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6763:


 Summary: [Python] Parquet s3 tests are skipped because 
dependencies are not installed
 Key: ARROW-6763
 URL: https://issues.apache.org/jira/browse/ARROW-6763
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Currently the s3 parquet tests are skipped on both Travis and ursabot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-5603) [Python] register pytest markers to avoid warnings

2019-06-14 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5603:


 Summary: [Python] register pytest markers to avoid warnings
 Key: ARROW-5603
 URL: https://issues.apache.org/jira/browse/ARROW-5603
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.14.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5890) [C++][Python] Support ExtensionType arrays in more kernels

2019-07-09 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5890:


 Summary: [C++][Python] Support ExtensionType arrays in more kernels
 Key: ARROW-5890
 URL: https://issues.apache.org/jira/browse/ARROW-5890
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


From a quick test (through Python), it seems that {{slice}} and {{take}} work, 
but the following do not:

- {{cast}}: it could rely on the casting rules of the storage type. Or do we 
want to require explicitly taking the storage array before casting?
- {{dictionary_encode}} / {{unique}}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-7027) [Python] pa.table(..) returns instead of raises error if passing invalid object

2019-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7027:


 Summary: [Python] pa.table(..) returns instead of raises error if 
passing invalid object
 Key: ARROW-7027
 URL: https://issues.apache.org/jira/browse/ARROW-7027
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When passing eg a Series instead of a DataFrame, you get:

{code}
In [4]: df = pd.DataFrame({'a': [1, 2, 3]})

In [5]: table = pa.table(df['a'])

In [6]: table
Out[6]: TypeError('Expected pandas DataFrame or python dictionary')

In [7]: type(table)
Out[7]: TypeError
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7068) [C++] Expose the offsets of a ListArray as an Int32Array

2019-11-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7068:


 Summary: [C++] Expose the offsets of a ListArray as an Int32Array
 Key: ARROW-7068
 URL: https://issues.apache.org/jira/browse/ARROW-7068
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


As a follow-up on ARROW-7031 (https://github.com/apache/arrow/pull/5759), we can 
move this into C++ and use that implementation from Python.

Cf. https://github.com/apache/arrow/pull/5759#discussion_r342244521, this 
could be a {{ListArray::value_offsets_array}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7031) [Python] Expose the offsets of a ListArray in python

2019-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7031:


 Summary: [Python] Expose the offsets of a ListArray in python
 Key: ARROW-7031
 URL: https://issues.apache.org/jira/browse/ARROW-7031
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Assume the following ListArray:

{code}
In [1]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

In [2]: arr
Out[2]: 

[
  [
1,
2,
3
  ],
  [
4,
5
  ]
]
{code}

You can get the actual values as a flat array through {{.values}} / 
{{.flatten()}}, but there is currently no easy way to get back to the offsets 
(except by interpreting the buffers manually). 
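For reference, a sketch of that manual interpretation (assuming a non-sliced list array; the int32 offsets live in the second buffer):

{code}
import numpy as np
import pyarrow as pa

arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

# buffers() -> [validity bitmap, int32 offsets, child validity, child values]
offsets_buf = arr.buffers()[1]
offsets = np.frombuffer(offsets_buf, dtype=np.int32)[:len(arr) + 1]
# -> array([0, 3, 5], dtype=int32)
{code}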

We should probably add an {{offsets}} attribute (there is actually also a TODO 
comment for that).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7154:


 Summary: [C++] Build error when building tests but not with snappy
 Key: ARROW-7154
 URL: https://issues.apache.org/jira/browse/ARROW-7154
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Since the docker-compose PR landed, I am having build errors like:
{code:java}
[361/376] Linking CXX executable debug/arrow-python-test
FAILED: debug/arrow-python-test
: && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
/home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
-Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
-fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
-Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
-msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now 
-Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
debug/arrow-python-test  
-Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
 debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread 
-ldl  -lutil -lrt -ldl 
/home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
/home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
/home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
not found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::system::detail::generic_category_ncx()'
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
{code}
which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by 
debug/libarrow.so.100.0.0, not found" (although this library is certainly present).

The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to 
OFF, it works fine.

It also seems to be related to this specific change in the docker compose PR:
{code:java}
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index c80ac3310..3b3c9eb8f 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -266,6 +266,15 @@ endif(UNIX)
 # Set up various options
 #

-if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
-  # Currently the compression tests require at least these libraries; bz2 and
-  # zstd are optional. See ARROW-3984
-  set(ARROW_WITH_BROTLI ON)
-  set(ARROW_WITH_LZ4 ON)
-  set(ARROW_WITH_SNAPPY ON)
-  set(ARROW_WITH_ZLIB ON)
-endif()
-
 if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
   set(ARROW_JSON ON)
 endif()
{code}

If I add that back, the build works.

- With only `set(ARROW_WITH_BROTLI ON)`, it still fails.
- With only `set(ARROW_WITH_LZ4 ON)`, it also fails, but with an error about 
liblz4 instead of libboost (although liblz4 is also actually present).
- With only `set(ARROW_WITH_SNAPPY ON)`, it works.
- With only `set(ARROW_WITH_ZLIB ON)`, it also fails, but with an error about 
libz.so.1 not being found.

So it seems that the absence of snappy causes others to fail.

In the recommended build settings in the development docs 
(https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test), 
the compression libraries are enabled. But I was still building without them 
(stemming from the time they were enabled by default). So I was using:

{code}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DARROW_PARQUET=ON \
 -DARROW_PYTHON=ON \
 -DARROW_BUILD_TESTS=ON \
 ..
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?

2019-11-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7066:


 Summary: [Python] support returning ChunkedArray from 
__arrow_array__ ?
 Key: ARROW-7066
 URL: https://issues.apache.org/jira/browse/ARROW-7066
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can 
define how they should be converted to a pyarrow Array (similar to numpy's 
{{\_\_array\_\_}}). This is then also used to support converting pandas 
DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if 
the pandas ExtensionArray, such as nullable integer type, implements this 
{{\_\_arrow_array\_\_}} method).
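For reference, a minimal sketch of the protocol as it stands today (the class and values are illustrative):

{code}
import pyarrow as pa

class MyCustomData:
    def __init__(self, values):
        self.values = values

    def __arrow_array__(self, type=None):
        # pa.array(..) detects this method and uses its return value,
        # which currently must be a pyarrow.Array
        return pa.array(self.values, type=type)

pa.array(MyCustomData([1, 2, 3]))  # -> Int64Array
{code}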

This last use case could also be useful for fletcher 
(https://github.com/xhochy/fletcher/, a package that implements pandas 
ExtensionArrays that wrap pyarrow arrays, so they can be stored as is in a 
pandas DataFrame).  
However, fletcher stores ChunkedArrays in the ExtensionArray / the columns of a 
pandas DataFrame (to have a better mapping with a Table, where the columns also 
consist of chunked arrays), while we currently require that the return value of 
{{\_\_arrow_array\_\_}} is a pyarrow.Array.

So I was wondering: could we relax this constraint and also allow ChunkedArray 
as return value? 
However, this protocol is currently called in the {{pa.array(..)}} function, 
which probably should keep returning an Array (and not ChunkedArray in certain 
cases).

cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2019-12-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7365:


 Summary: [Python] Support FixedSizeList type in conversion to 
numpy/pandas
 Key: ARROW-7365
 URL: https://issues.apache.org/jira/browse/ARROW-7365
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-7261, still need to add support for FixedSizeListType in the 
arrow -> python conversion (arrow_to_pandas.cc)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6885) [Python] Remove superfluous skipped timedelta test

2019-10-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6885:


 Summary: [Python] Remove superfluous skipped timedelta test
 Key: ARROW-6885
 URL: https://issues.apache.org/jira/browse/ARROW-6885
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Now that we support timedelta / duration type, there is an old xfailed test 
that can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7022) [Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas

2019-10-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7022:


 Summary: [Python] __arrow_array__ does not work for ExtensionTypes 
in Table.from_pandas
 Key: ARROW-7022
 URL: https://issues.apache.org/jira/browse/ARROW-7022
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When someone has a custom ExtensionType defined in Python, and an array class 
that gets converted to that (through {{\_\_arrow_array\_\_}}), the conversion 
in pyarrow works with the array class, but not yet for the array stored in a 
pandas DataFrame.

Eg using my definition of ArrowPeriodType in 
https://github.com/pandas-dev/pandas/pull/28371, I see:

{code}
In [15]: pd_array = pd.period_range("2012-01-01", periods=3, freq="D").array

In [16]: pd_array
Out[16]: 

['2012-01-01', '2012-01-02', '2012-01-03']
Length: 3, dtype: period[D]

In [17]: pa.array(pd_array)
Out[17]: 

[
  15340,
  15341,
  15342
]

In [18]: df = pd.DataFrame({'periods': pd_array})

In [19]: pa.table(df)
...
ArrowInvalid: ('Could not convert 2012-01-01 with type Period: did not 
recognize Python value type when inferring an Arrow data type', 'Conversion 
failed for column periods with type period[D]')
{code}

(This works correctly for array objects whose {{\_\_arrow_array\_\_}} returns 
a built-in pyarrow Array.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7023) [Python] pa.array does not use "from_pandas" semantics for pd.Index

2019-10-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7023:


 Summary: [Python] pa.array does not use "from_pandas" semantics 
for pd.Index
 Key: ARROW-7023
 URL: https://issues.apache.org/jira/browse/ARROW-7023
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 1.0.0


{code}
In [15]: idx = pd.Index([1, 2, np.nan], dtype=object)

In [16]: pa.array(idx)
Out[16]: 

[
  1,
  2,
  nan
]

In [17]: pa.array(idx, from_pandas=True)
Out[17]: 

[
  1,
  2,
  null
]

In [18]: pa.array(pd.Series(idx))
Out[18]: 

[
  1,
  2,
  null
]
{code}

We should probably handle Series and Index the same in this regard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6974) [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern

2019-10-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6974:


 Summary: [C++] Implement Cast kernel for time-likes with 
ArrayDataVisitor pattern
 Key: ARROW-6974
 URL: https://issues.apache.org/jira/browse/ARROW-6974
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently, the casting for time-like data is done with the {{ShiftTime}} 
function. It _might_ be possible to simplify this with ArrayDataVisitor (to 
avoid looping / checking the bitmap).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6923) [C++] Option for Filter kernel how to handle nulls in the selection vector

2019-10-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6923:


 Summary: [C++] Option for Filter kernel how to handle nulls in the 
selection vector
 Key: ARROW-6923
 URL: https://issues.apache.org/jira/browse/ARROW-6923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


How nulls in the boolean mask (selection vector) are handled by a filter kernel 
varies between languages / data analytics systems (e.g. base R propagates 
nulls, dplyr in R skips them (treats them as False), SQL generally skips them 
as well I think, and Julia raises an error).

Currently, in Arrow C++ we "propagate" nulls (null in the selection vector 
gives a null in the output):

{code}
In [7]: arr = pa.array([1, 2, 3]) 

In [8]: mask = pa.array([True, False, None]) 

In [9]: arr.filter(mask) 
Out[9]: 

[
  1,
  null
]
{code}

Given the different ways this could be done (propagate, skip, error), should we 
provide an option to control this behaviour?
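Until such an option exists, the "skip" variant can be emulated manually; a minimal sketch:

{code}
import pyarrow as pa

arr = pa.array([1, 2, 3])
mask = pa.array([True, False, None])

# treat null as False by filling the mask first
mask_skip = pa.array([m if m is not None else False for m in mask.to_pylist()])
arr.filter(mask_skip)  # -> [1]
{code}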



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6922) [Python] Pandas master build is failing (MultiIndex.levels change)

2019-10-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6922:


 Summary: [Python] Pandas master build is failing 
(MultiIndex.levels change)
 Key: ARROW-6922
 URL: https://issues.apache.org/jira/browse/ARROW-6922
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.15.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7217) Docker compose / github actions ignores PYTHON env

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7217:


 Summary: Docker compose / github actions ignores PYTHON env
 Key: ARROW-7217
 URL: https://issues.apache.org/jira/browse/ARROW-7217
 Project: Apache Arrow
  Issue Type: Test
  Components: CI
Reporter: Joris Van den Bossche


The "AMD64 Conda Python 2.7" build is actually using Python 3.6. 

This Python 3.6 version is hard-coded in conda-python.dockerfile: 
https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24 
and I am not fully sure whether or how the ENV variable overrides that.

cc [~kszucs]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7218) [Python] Conversion from boolean numpy scalars not working

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7218:


 Summary: [Python] Conversion from boolean numpy scalars not working
 Key: ARROW-7218
 URL: https://issues.apache.org/jira/browse/ARROW-7218
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


In general, we are fine with accepting a list of numpy scalars:

{code}
In [12]: type(list(np.array([1, 2]))[0])
Out[12]: numpy.int64

In [13]: pa.array(list(np.array([1, 2])))
Out[13]: 

[
  1,
  2
]
{code}

But for booleans, this doesn't work:

{code}
In [14]: pa.array(list(np.array([True, False])))
---
ArrowInvalid  Traceback (most recent call last)
 in 
> 1 pa.array(list(np.array([True, False])))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

ArrowInvalid: Could not convert True with type numpy.bool_: tried to convert to 
boolean
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7220:


 Summary: [CI] Docker compose (github actions) Mac Python 3 build 
is using Python 2
 Key: ARROW-7220
 URL: https://issues.apache.org/jira/browse/ARROW-7220
 Project: Apache Arrow
  Issue Type: Test
Reporter: Joris Van den Bossche


The "AMD64 MacOS 10.15 Python 3" build is also running in python 2.

Possibly related to how brew installs python 2 / 3, or because it is using 
the system python, ... (I am not familiar with mac).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas

2019-11-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7209:


 Summary: [Python] tests with pandas master are failing now 
__from_arrow__ support landed in pandas
 Key: ARROW-7209
 URL: https://issues.apache.org/jira/browse/ARROW-7209
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


I implemented the pandas <-> arrow roundtrip for pandas' integer and string dtypes in 
https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our 
tests were assuming this did not yet work in pandas, and thus need to be 
updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type

2019-11-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7261:


 Summary: [Python] Python support for fixed size list type
 Key: ARROW-7261
 URL: https://issues.apache.org/jira/browse/ARROW-7261
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is 
not yet exposed in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet

2019-11-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7273:


 Summary: [Python] Non-nullable null field is allowed / crashes 
when writing to parquet
 Key: ARROW-7273
 URL: https://issues.apache.org/jira/browse/ARROW-7273
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


It seems to be possible to create a "non-nullable null field". While this does 
not make any sense (so already a reason to disallow it, I think), it can 
also lead to crashes in further operations, such as writing to parquet:

{code}
In [18]: table = pa.table([pa.array([None, None], pa.null())], 
schema=pa.schema([pa.field('a', pa.null(), nullable=False)]))

In [19]: table
Out[19]:
pyarrow.Table
a: null not null

In [20]: pq.write_table(table, "test_null.parquet")
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1128 14:08:30.267439 27560 column_writer.cc:837]  Check failed: (nullptr) != 
(values)
*** Check failure stack trace: ***
Aborted (core dumped)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions

2019-11-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7167:


 Summary: [CI][Python] Add nightly tests for older pandas versions 
to Github Actions
 Key: ARROW-7167
 URL: https://issues.apache.org/jira/browse/ARROW-7167
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6823) [C++][Python][R] Support metadata in the feather format?

2019-10-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6823:


 Summary: [C++][Python][R] Support metadata in the feather format?
 Key: ARROW-6823
 URL: https://issues.apache.org/jira/browse/ARROW-6823
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche


This might need to wait for / could be enabled by feather v2 (ARROW-5510), but 
I thought to open a specific issue about it: do we want to support saving 
metadata in feather files?

With Parquet files, you can have file-level metadata (which we currently use to 
eg store the pandas_metadata). I think it would be useful to have a similar 
mechanism for Feather files.

A use case where this came up is in GeoPandas where we would like to store the 
Coordinate Reference System identifier of the geometry data inside the file, to 
avoid needing a sidecar file just for that.

In a v2 world (using the IPC format), I suppose this could be the metadata of 
the Schema.
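For example, with the schema-level metadata that the IPC format already supports, this could look like (the field name and metadata key are illustrative):

{code}
import pyarrow as pa

# attach the CRS as schema-level metadata, avoiding a sidecar file
schema = pa.schema(
    [pa.field('geometry', pa.binary())],
    metadata={'crs': 'EPSG:4326'}
)
{code}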



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6778) [C++] Support DurationType in Cast kernel

2019-10-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6778:


 Summary: [C++] Support DurationType in Cast kernel
 Key: ARROW-6778
 URL: https://issues.apache.org/jira/browse/ARROW-6778
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6779) [Python] Conversion from datetime.datetime to timestamp('ns') can overflow

2019-10-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6779:


 Summary: [Python] Conversion from datetime.datetime to timestamp('ns') can overflow
 Key: ARROW-6779
 URL: https://issues.apache.org/jira/browse/ARROW-6779
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


In the python conversion of datetime scalars, there is no check for integer 
overflow:

{code}
In [32]: pa.array([datetime.datetime(3000, 1, 1)], pa.timestamp('ns'))
Out[32]: 

[
  1830-11-23 00:50:52.580896768
]
{code}

So in case the target type has nanosecond unit, this can give wrong results (I 
don't think the other resolutions can overflow, given the limited range 
of years of datetime.datetime).

We should probably check for this case and raise an error.
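A sketch of such a check (the bounds below are the approximate limits of int64 nanoseconds since the epoch, cf. {{pd.Timestamp.min}} / {{pd.Timestamp.max}}):

{code}
import datetime

# approximate range representable as int64 nanoseconds since 1970-01-01
NS_MIN = datetime.datetime(1677, 9, 21, 0, 12, 44)
NS_MAX = datetime.datetime(2262, 4, 11, 23, 47, 16)

def fits_in_ns(dt):
    return NS_MIN <= dt <= NS_MAX

fits_in_ns(datetime.datetime(3000, 1, 1))  # -> False: should raise, not wrap around
{code}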



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6780) [C++][Parquet] Support DurationType in writing/reading parquet

2019-10-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6780:


 Summary: [C++][Parquet] Support DurationType in writing/reading 
parquet
 Key: ARROW-6780
 URL: https://issues.apache.org/jira/browse/ARROW-6780
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche


Currently this is not supported:

{code}
In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))}) 

In [39]: table
Out[39]: 
pyarrow.Table
a: duration[s]

In [41]: pq.write_table(table, 'test_duration.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema 
conversion: duration[s]
{code}

There is no direct mapping to Parquet logical types. There is an INTERVAL type, 
but this more closely matches Arrow's interval type (YEAR_MONTH or DAY_TIME). 

But, those duration values could be stored as just integers, and based on the 
serialized arrow schema, it could be restored when reading back in.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6877:


 Summary: [C++] Boost not found from the correct environment
 Key: ARROW-6877
 URL: https://issues.apache.org/jira/browse/ARROW-6877
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche


My local dev build started to fail, due to cmake finding a wrong boost (it 
found {{-- Found Boost 1.70.0 at 
/home/joris/miniconda3/lib/cmake/Boost-1.70.0}} while building in a different 
conda environment).

I can reproduce this with creating a new conda env from scratch following our 
documentation.

By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7431) [Python] Add dataset API to reference docs

2019-12-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7431:


 Summary: [Python] Add dataset API to reference docs
 Key: ARROW-7431
 URL: https://issues.apache.org/jira/browse/ARROW-7431
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Add dataset to python API docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7432) [Python] Add higher-level datasets functions

2019-12-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7432:


 Summary: [Python] Add higher-level datasets functions
 Key: ARROW-7432
 URL: https://issues.apache.org/jira/browse/ARROW-7432
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


From [~kszucs]: we need to define a more pythonic API for the dataset 
bindings, because the current one is pretty low-level.

One option is to provide an "open_dataset" function similar to what is available 
in R.

A short-cut to go from a Dataset to a Table might also be useful.
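A hypothetical sketch of what such a higher-level API could look like (the function name and parameters are placeholders, mirroring the R function):

{code}
import pyarrow.dataset as ds

# hypothetical convenience function, analogous to R's open_dataset()
dataset = ds.open_dataset('path/to/files/', partitioning='hive')

# hypothetical short-cut from a Dataset to a materialized Table
table = dataset.to_table()
{code}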



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7430) [Python] Add more docstrings to dataset bindings

2019-12-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7430:


 Summary: [Python] Add more docstrings to dataset bindings
 Key: ARROW-7430
 URL: https://issues.apache.org/jira/browse/ARROW-7430
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7839) [Python][Dataset] Add IPC format to python bindings

2020-02-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7839:


 Summary: [Python][Dataset] Add IPC format to python bindings
 Key: ARROW-7839
 URL: https://issues.apache.org/jira/browse/ARROW-7839
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The C++ / R side was done in ARROW-7415; we should add bindings for it in Python as 
well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7963) [C++][Python][Dataset] Expose listing fragments

2020-02-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7963:


 Summary: [C++][Python][Dataset] Expose listing fragments
 Key: ARROW-7963
 URL: https://issues.apache.org/jira/browse/ARROW-7963
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche
Assignee: Ben Kietzman


It would be useful to be able to list the fragments, to get their paths / 
partition expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault

2020-02-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7781:


 Summary: [C++][Dataset] Filtering on a non-existent column gives a 
segfault
 Key: ARROW-7781
 URL: https://issues.apache.org/jira/browse/ARROW-7781
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Example with python code:

{code}
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2, 3]})

In [3]: df.to_parquet("test-filter-crash.parquet")

In [4]: import pyarrow.dataset as ds

In [5]: dataset = ds.dataset("test-filter-crash.parquet")

In [6]: dataset.to_table(filter=ds.field('a') > 1).to_pandas()
Out[6]:
   a
0  2
1  3

In [7]: dataset.to_table(filter=ds.field('b') > 1).to_pandas()
../src/arrow/dataset/filter.cc:929:  Check failed: _s.ok() Operation failed: 
maybe_value.status()
Bad status: Invalid: attempting to cast non-null scalar to NullScalar
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f744c)[0x7fb1390f444c]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ca)[0x7fb1390f43ca]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ec)[0x7fb1390f43ec]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7fb1390f4759]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x169fc6)[0x7fb145594fc6]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x16b9be)[0x7fb1455969be]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset15VisitExpressionINS0_23InsertImplicitCastsImplEEEDTclfp0_fp_EERKNS0_10ExpressionEOT_+0x2ae)[0x7fb1455a0dee]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset19InsertImplicitCastsERKNS0_10ExpressionERKNS_6SchemaE+0x44)[0x7fb145596d4e]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x48286)[0x7fb1456dd286]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x49220)[0x7fb1456de220]
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x170f37)[0x55e5127e1f37]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x22bd6)[0x7fb1456b7bd6]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x33b81)[0x7fb1456c8b81]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x305)[0x55e5127d9c75]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x5460)[0x55e512847c40]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
/home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCodeEx+0x44)[0x55e512789064]
/home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCode+0x1c)[0x55e51278908c]
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x1e1650)[0x55e512852650]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x55e5127d9a59]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x48e4)[0x55e5128470c4]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x55e5127d99fc]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x55e5127e1fdf]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x4ddc)[0x55e5128475bc]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x416)[0x55e512842bf6]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x6f3)[0x55e512842ed3]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0x387)[0x55e5127d93e7]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x14e4)[0x55e512843cc4]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]

[jira] [Created] (ARROW-7677) [C++] Handle Windows file paths with backslashes in GetTargetStats

2020-01-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7677:


 Summary: [C++] Handle Windows file paths with backslashes in 
GetTargetStats
 Key: ARROW-7677
 URL: https://issues.apache.org/jira/browse/ARROW-7677
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Currently, if the base path passed to {{GetTargetStats}} has backslashes, the 
produced FileStats also include them, resulting in some other functionality 
(such as splitting the path) not working. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7703) [C++][Dataset] Give more informative error message for mismatching schemas for FileSystemSources

2020-01-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7703:


 Summary: [C++][Dataset] Give more informative error message for 
mismatching schemas for FileSystemSources
 Key: ARROW-7703
 URL: https://issues.apache.org/jira/browse/ARROW-7703
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche


Currently, if you try to create a dataset from files with different schemas, 
you get this error:

{code}
ArrowInvalid: Unable to merge: Field a has incompatible types: int64 vs int32
{code}

If you are reading a directory of files, it would be very helpful if the error 
message can indicate which files are involved here (eg if you have a lot of 
files and only one has an error).

You can already inspect the schemas if you first create a SourceFactory 
manually, but that also only gives a list of schemas, not mapped to the 
original files (this last item probably relates to ARROW-7608).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

2020-01-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7702:


 Summary: [C++][Dataset] Provide (optional) deterministic order of 
batches
 Key: ARROW-7702
 URL: https://issues.apache.org/jira/browse/ARROW-7702
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche


Example with python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)}) 
pq.write_table(table, "test_chunks.parquet", chunk_size=3) 

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}

gives a non-deterministic result (the order of the row groups in the parquet file varies):

{code}
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]: 
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]: 
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7762) [Python] Exceptions in ParquetWriter get ignored

2020-02-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7762:


 Summary: [Python] Exceptions in ParquetWriter get ignored
 Key: ARROW-7762
 URL: https://issues.apache.org/jira/browse/ARROW-7762
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


For example:

{code:python}
In [43]: table = pa.table({'a': [1, 2, 3]})

In [44]: pq.write_table(table, "test.parquet", version="2.2")
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
ArrowException: Unsupported Parquet format version
Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_version'
pyarrow.lib.ArrowException: Unsupported Parquet format version
{code}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7907) [Python] Conversion to pandas of empty table with timestamp type aborts

2020-02-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7907:


 Summary: [Python] Conversion to pandas of empty table with 
timestamp type aborts
 Key: ARROW-7907
 URL: https://issues.apache.org/jira/browse/ARROW-7907
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.1


Creating an empty table:

{code}
In [1]: table = pa.table({'a': pa.array([], type=pa.timestamp('us'))})

In [2]: table['a']
Out[2]: 

[
  []
]

In [3]: table.to_pandas()
Out[3]: 
Empty DataFrame
Columns: [a]
Index: []
{code}

the above works. But the ChunkedArray still has 1 empty chunk. When filtering 
data, you can actually get no chunks, and this fails:


{code}
In [4]: table2 = table.slice(0, 0)

In [5]: table2['a']
Out[5]: 

[

]

In [6]: table2.to_pandas()
../src/arrow/table.cc:48:  Check failed: (chunks.size()) > (0) cannot construct 
ChunkedArray from empty vector and omitted type
...
Aborted (core dumped)
{code}

This seems to happen specifically for the timestamp type, and specifically with 
a non-ns unit (e.g. us as above, which is the default in arrow).

I noticed this when reading a parquet file of the taxi dataset, where the 
filter I used resulted in an empty batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-02-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7854:


 Summary: [C++][Dataset] Option to memory map when reading IPC 
format
 Key: ARROW-7854
 URL: https://issues.apache.org/jira/browse/ARROW-7854
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


For the IPC format, it would be interesting to have the option to memory map the 
files.

cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7892) [Python] Expose FilesystemSource.format attribute

2020-02-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7892:


 Summary: [Python] Expose FilesystemSource.format attribute
 Key: ARROW-7892
 URL: https://issues.apache.org/jira/browse/ARROW-7892
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type

2020-02-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7858:


 Summary: [C++][Python] Support casting an Extension type to its 
storage type
 Key: ARROW-7858
 URL: https://issues.apache.org/jira/browse/ARROW-7858
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Python
Reporter: Joris Van den Bossche


Currently, casting an extension type will always fail: "No cast implemented 
from extension to ...".

However, for casting, we could fall back to the storage array's casting rules?





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7857) [Python] Failing test with pandas master for extension type conversion

2020-02-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7857:


 Summary: [Python] Failing test with pandas master for extension 
type conversion
 Key: ARROW-7857
 URL: https://issues.apache.org/jira/browse/ARROW-7857
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


The pandas master test build has one failure


{code}
___ test_conversion_extensiontype_to_extensionarray 

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcd6c580bd0>

def test_conversion_extensiontype_to_extensionarray(monkeypatch):
# converting extension type to linked pandas ExtensionDtype/Array
import pandas.core.internals as _int

storage = pa.array([1, 2, 3, 4], pa.int64())
arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage)
table = pa.table({'a': arr})

if LooseVersion(pd.__version__) < "0.26.0.dev":
# ensure pandas Int64Dtype has the protocol method (for older 
pandas)
monkeypatch.setattr(
pd.Int64Dtype, '__from_arrow__', _Int64Dtype__from_arrow__,
raising=False)

# extension type points to Int64Dtype, which knows how to create a
# pandas ExtensionArray
>   result = table.to_pandas()

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:3560:
 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/ipc.pxi:559: in pyarrow.lib.read_message
???
pyarrow/table.pxi:1369: in pyarrow.lib.Table._to_pandas
???
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:764: 
in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: 
in _table_to_blocks
for item in result]
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: 
in <listcomp>
    for item in result]
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:723: 
in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/arrays/integer.py:108:
 in __from_arrow__
array = array.cast(pyarrow_type)
pyarrow/table.pxi:240: in pyarrow.lib.ChunkedArray.cast
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   pyarrow.lib.ArrowNotImplementedError: No cast implemented from 
extension to int64
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7528) [Python] The pandas.datetime class (import of datetime.datetime) is deprecated

2020-01-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7528:


 Summary: [Python] The pandas.datetime class (import of 
datetime.datetime) is deprecated
 Key: ARROW-7528
 URL: https://issues.apache.org/jira/browse/ARROW-7528
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.16.0


{{pd.datetime}} was actually just an import of {{datetime.datetime}}, and 
is being removed from pandas (to use the stdlib class directly).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7527) [Python] pandas/feather tests failing on pandas master

2020-01-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7527:


 Summary: [Python] pandas/feather tests failing on pandas master
 Key: ARROW-7527
 URL: https://issues.apache.org/jira/browse/ARROW-7527
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Because I merged a PR in pandas to support the Period dtype, some tests in pyarrow 
are now failing (they were using the period dtype to test "unsupported" dtypes).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7593) [CI][Python] Python datasets failing on master / not run on CI

2020-01-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7593:


 Summary: [CI][Python] Python datasets failing on master / not run 
on CI
 Key: ARROW-7593
 URL: https://issues.apache.org/jira/browse/ARROW-7593
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7649) [Python] Expose dataset PartitioningFactory.inspect ?

2020-01-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7649:


 Summary: [Python] Expose dataset PartitioningFactory.inspect ?
 Key: ARROW-7649
 URL: https://issues.apache.org/jira/browse/ARROW-7649
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


In C++, the PartitioningFactory has an {{Inspect}} method which, given paths, 
will infer the schema. 

We could expose this in Python as well; it could e.g. be used to easily explore 
or illustrate which types are inferred from a path (int32, string, ...).
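A hypothetical usage sketch (the Python binding names are illustrative; only the C++ {{Inspect}} exists today):

{code}
import pyarrow.dataset as ds

# hypothetical Python binding mirroring the C++ Inspect method
factory = ds.DirectoryPartitioning.discover(['year', 'color'])
factory.inspect(['2019/red', '2020/blue'])
# -> year: int32, color: string
{code}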



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7652) [Python] Insert implicit cast in ScannerBuilder.filter

2020-01-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7652:


 Summary: [Python] Insert implicit cast in ScannerBuilder.filter
 Key: ARROW-7652
 URL: https://issues.apache.org/jira/browse/ARROW-7652
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

