[jira] [Created] (ARROW-1883) [Python] BUG: Table.to_pandas metadata checking fails if columns are not present
Joris Van den Bossche created ARROW-1883:

Summary: [Python] BUG: Table.to_pandas metadata checking fails if columns are not present
Key: ARROW-1883
URL: https://issues.apache.org/jira/browse/ARROW-1883
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.1
Reporter: Joris Van den Bossche

Found this bug in the example in the pandas documentation, which does:

```
df = pd.DataFrame({'a': list('abc'),
                   'b': list(range(1, 4)),
                   'c': np.arange(3, 6).astype('u1'),
                   'd': np.arange(4.0, 7.0, dtype='float64'),
                   'e': [True, False, True],
                   'f': pd.date_range('20130101', periods=3),
                   'g': pd.date_range('20130101', periods=3, tz='US/Eastern')})
df.to_parquet('example_pa.parquet', engine='pyarrow')
pd.read_parquet('example_pa.parquet', engine='pyarrow', columns=['a', 'b'])
```

and this raises in the last line when reading a subset of the columns:

```
...
/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
    357     for i, col_meta in enumerate(pandas_metadata['columns']):
    358         if col_meta['pandas_type'] == 'datetimetz':
--> 359             col = table[i]
    360             converted = col.to_pandas()
    361             tz = col_meta['metadata']['timezone']

table.pxi in pyarrow.lib.Table.__getitem__()

table.pxi in pyarrow.lib.Table.column()

IndexError: Table column index 6 is out of range
```

This is due to checking the `pandas_metadata` for all columns (and in this case trying to deal with a datetime tz column), while in practice not all columns are present in this case (a 'mismatch' between the pandas metadata and the actual schema).
A smaller example without parquet:

```
In [38]: df = pd.DataFrame({'a': [1, 2, 3], 'b': pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')})

In [39]: table = pyarrow.Table.from_pandas(df)

In [40]: table
Out[40]:
pyarrow.Table
a: int64
b: timestamp[ns, tz=Europe/Brussels]
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
            b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
            b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
            b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
            b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
            b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
            b': "0.22.0.dev0+277.gd61f411"}'}

In [41]: table.to_pandas()
Out[41]:
   a                         b
0  1 2017-01-01 00:00:00+01:00
1  2 2017-01-02 00:00:00+01:00
2  3 2017-01-03 00:00:00+01:00

In [44]: table_without_tz = table.remove_column(1)

In [45]: table_without_tz
Out[45]:
pyarrow.Table
a: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
            b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
            b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
            b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
            b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
            b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
            b': "0.22.0.dev0+277.gd61f411"}'}

In [46]: table_without_tz.to_pandas()  # <-- wrong output !
Out[46]:
                             a
1970-01-01 01:00:00+01:00    1
1970-01-01 01:00:00.1+01:00  2
1970-01-01 01:00:00.2+01:00  3

In [47]: table_without_tz2 = table_without_tz.remove_column(1)

In [48]: table_without_tz2
Out[48]:
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"columns": [{"pandas_type": "int64", "metadata": null, "numpy_t'
            b'ype": "int64", "name": "a"}, {"pandas_type": "datetimetz", "meta'
            b'data": {"timezone": "Europe/Brussels"}, "numpy_type": "datetime6'
            b'4[ns, Europe/Brussels]", "name": "b"}, {"pandas_type": "int64", '
            b'"metadata": null, "numpy_type": "int64", "name": "__index_level_'
            b'0__"}], "index_columns": ["__index_level_0__"], "pandas_version"'
            b': "0.22.0.dev0+277.gd61f411"}'}

In [49]: table_without_tz2.to_pandas()  # <-- error !
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-49> in <module>()
----> 1 table_without_tz2.to_pandas()

table.pxi in pyarrow.lib.Table.to_pandas()

/home/joris/miniconda3/envs/dev/lib/python3.5/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
    289
```
[jira] [Created] (ARROW-3953) Pandas MultiIndex renamed labels to codes (pd 0.24)
Joris Van den Bossche created ARROW-3953:

Summary: Pandas MultiIndex renamed labels to codes (pd 0.24)
Key: ARROW-3953
URL: https://issues.apache.org/jira/browse/ARROW-3953
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche

Pandas deprecated `MultiIndex.labels` in favor of `MultiIndex.codes` (https://github.com/pandas-dev/pandas/pull/23752). In the pandas parquet/feather tests, we are now seeing warnings about this (and I assume there will be warnings in the pyarrow tests as well when running on pandas master).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
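For reference, a minimal sketch of the renamed accessor (assuming pandas >= 0.24, where `labels` still works but emits a deprecation warning):

```python
import pandas as pd

# 'codes' are the integer positions of each row's value within 'levels';
# this attribute was called 'labels' before pandas 0.24.
mi = pd.MultiIndex.from_arrays([["a", "a", "b"], [1, 2, 1]], names=["x", "y"])

print([list(c) for c in mi.codes])
print([list(l) for l in mi.levels])
```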
[jira] [Created] (ARROW-5514) [C++] Printer for uint64 shows wrong values
Joris Van den Bossche created ARROW-5514:

Summary: [C++] Printer for uint64 shows wrong values
Key: ARROW-5514
URL: https://issues.apache.org/jira/browse/ARROW-5514
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche

From the example in ARROW-5430:

{code}
In [16]: pa.array([14989096668145380166, 15869664087396458664], type=pa.uint64())
Out[16]:
[
  -3457647405564171450,
  -2577079986313092952
]
{code}

I _think_ the actual conversion is correct, and it's only the printer that is going wrong, as {{to_numpy}} gives the correct values:

{code}
In [17]: pa.array([14989096668145380166, 15869664087396458664], type=pa.uint64()).to_numpy()
Out[17]: array([14989096668145380166, 15869664087396458664], dtype=uint64)
{code}
[jira] [Created] (ARROW-5436) [Python] expose filters argument in parquet.read_table
Joris Van den Bossche created ARROW-5436:

Summary: [Python] expose filters argument in parquet.read_table
Key: ARROW-5436
URL: https://issues.apache.org/jira/browse/ARROW-5436
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
Fix For: 0.14.0

Currently, the {{parquet.read_table}} function can be used both for reading a single file (interface to ParquetFile) and a directory (interface to ParquetDataset). ParquetDataset has some extra keywords, such as {{filters}}, that would be nice to expose through {{read_table}} as well. Of course one can always use {{ParquetDataset}} directly if you need its full power, but for pandas wrapping pyarrow it is easier to be able to pass keywords through to just {{parquet.read_table}} instead of calling either {{read_table}} or {{ParquetDataset}} depending on the case.

Context: https://github.com/pandas-dev/pandas/issues/26551
[jira] [Created] (ARROW-5572) [Python] raise error message when passing invalid filter in parquet reading
Joris Van den Bossche created ARROW-5572:

Summary: [Python] raise error message when passing invalid filter in parquet reading
Key: ARROW-5572
URL: https://issues.apache.org/jira/browse/ARROW-5572
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche

From https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset

For example, when specifying a column in the filter which is a normal column and not a key in your partitioned folder hierarchy, the filter gets silently ignored. It would be nice to get an error message for this. Reproducible example:

{code:python}
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1], 'c': [1, 2, 3, 4]})
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, 'test_parquet_row_filters', partition_cols=['a'])

# filter on 'a' (partition column) -> works
pq.read_table('test_parquet_row_filters', filters=[('a', '=', 1)]).to_pandas()

# filter on normal column (in future could do row group filtering) -> silently does nothing
pq.read_table('test_parquet_row_filters', filters=[('b', '=', 1)]).to_pandas()
{code}
[jira] [Created] (ARROW-5606) [Python] pandas.RangeIndex._start/_stop/_step are deprecated
Joris Van den Bossche created ARROW-5606:

Summary: [Python] pandas.RangeIndex._start/_stop/_step are deprecated
Key: ARROW-5606
URL: https://issues.apache.org/jira/browse/ARROW-5606
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
Fix For: 0.14.0

Public {{start}}/{{stop}}/{{step}} attributes were added to {{RangeIndex}}, and the private {{_start/_stop/_step}} ones are deprecated. See https://github.com/pandas-dev/pandas/pull/26581
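The public replacements from the linked pandas PR behave as a straight rename; a quick check:

```python
import pandas as pd

# The public accessors replace the deprecated private _start/_stop/_step.
idx = pd.RangeIndex(0, 10, 2)
print(idx.start, idx.stop, idx.step)
```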
[jira] [Created] (ARROW-5655) [Python] Table.from_pydict/from_arrays not using types in specified schema correctly
Joris Van den Bossche created ARROW-5655:

Summary: [Python] Table.from_pydict/from_arrays not using types in specified schema correctly
Key: ARROW-5655
URL: https://issues.apache.org/jira/browse/ARROW-5655
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

Example with {{from_pydict}} (from https://github.com/apache/arrow/pull/4601#issuecomment-503676534):

{code:python}
In [15]: table = pa.Table.from_pydict(
    ...:     {'a': [1, 2, 3], 'b': [3, 4, 5]},
    ...:     schema=pa.schema([('a', pa.int64()), ('c', pa.int32())]))

In [16]: table
Out[16]:
pyarrow.Table
a: int64
c: int32

In [17]: table.to_pandas()
Out[17]:
   a  c
0  1  3
1  2  0
2  3  4
{code}

Note that the specified schema 1) has different column names and 2) has a non-default type (int32 vs int64), which leads to corrupted values. This is partly due to {{Table.from_pydict}} not using the type information in the schema to convert the dictionary items to pyarrow arrays. But it is also {{Table.from_arrays}} that is not correctly casting the arrays to another dtype when the schema specifies one. An additional question for {{Table.from_pydict}} is whether it actually should override the 'b' key from the dictionary as column 'c' as defined in the schema (this behaviour depends on the order of the dictionary, which is not guaranteed below Python 3.6).
[jira] [Created] (ARROW-5654) [C++] ChunkedArray should validate the types of the arrays
Joris Van den Bossche created ARROW-5654:

Summary: [C++] ChunkedArray should validate the types of the arrays
Key: ARROW-5654
URL: https://issues.apache.org/jira/browse/ARROW-5654
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Joris Van den Bossche
Fix For: 1.0.0

Example from Python, showing that you can currently create a ChunkedArray with incompatible types:

{code:python}
In [8]: a1 = pa.array([1, 2])

In [9]: a2 = pa.array(['a', 'b'])

In [10]: pa.chunked_array([a1, a2])
Out[10]:
[
  [
    1,
    2
  ],
  [
    "a",
    "b"
  ]
]
{code}

So a {{ChunkedArray::Validate}} method could be implemented (and it should probably be called by default upon creation?).
[jira] [Created] (ARROW-5295) [Python] accept pyarrow values / scalars in constructor functions ?
Joris Van den Bossche created ARROW-5295:

Summary: [Python] accept pyarrow values / scalars in constructor functions ?
Key: ARROW-5295
URL: https://issues.apache.org/jira/browse/ARROW-5295
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Currently, functions like {{pyarrow.array}} don't accept pyarrow Arrays, nor pyarrow scalars:

{code}
In [42]: arr = pa.array([1, 2, 3])

In [43]: pa.array(arr)
...
ArrowInvalid: Could not convert 1 with type pyarrow.lib.Int64Value: did not recognize Python value type when inferring an Arrow data type

In [44]: pa.array(list(arr))
...
ArrowInvalid: Could not convert 1 with type pyarrow.lib.Int64Value: did not recognize Python value type when inferring an Arrow data type
{code}

Do we want to allow / recognize those here? (The first case could even have a fast path, as we don't need to do it element by element.)

Also scalars are not supported:

{code}
In [46]: type(arr.sum())
Out[46]: pyarrow.lib.Int64Scalar

In [47]: pa.array([arr.sum()])
...
ArrowInvalid: Could not convert 6 with type pyarrow.lib.Int64Scalar: did not recognize Python value type when inferring an Arrow data type
{code}

And also in other functions we don't accept arrow scalars / values:

{code}
In [48]: string = pa.array(['a'])[0]

In [49]: type(string)
Out[49]: pyarrow.lib.StringValue

In [50]: pa.field(string, pa.int64())
...
TypeError: expected bytes, pyarrow.lib.StringValue found
{code}
[jira] [Created] (ARROW-5291) [Python] Add wrapper for "take" kernel on Array
Joris Van den Bossche created ARROW-5291:

Summary: [Python] Add wrapper for "take" kernel on Array
Key: ARROW-5291
URL: https://issues.apache.org/jira/browse/ARROW-5291
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche

Expose the {{take}} kernel (for primitive types, ARROW-2102) on the python {{Array}} class. Part of ARROW-2667.
[jira] [Created] (ARROW-5293) [C++] Take kernel on DictionaryArray does not preserve ordered flag
Joris Van den Bossche created ARROW-5293:

Summary: [C++] Take kernel on DictionaryArray does not preserve ordered flag
Key: ARROW-5293
URL: https://issues.apache.org/jira/browse/ARROW-5293
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche

In the Python tests I was adding, this was failing for an ordered DictionaryArray: https://github.com/apache/arrow/pull/4281/commits/1f65936e1a06ae415647af7d5c7f54c5937861f6#diff-01b63f189a63c0d4016f2f91370e08fcR92
[jira] [Created] (ARROW-5301) [Python] parquet documentation outdated on nthreads argument
Joris Van den Bossche created ARROW-5301:

Summary: [Python] parquet documentation outdated on nthreads argument
Key: ARROW-5301
URL: https://issues.apache.org/jira/browse/ARROW-5301
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Fix For: 0.14.0

https://arrow.apache.org/docs/python/parquet.html#multithreaded-reads still mentions {{nthreads}} instead of {{use_threads}}.

From https://github.com/pandas-dev/pandas/issues/26340
[jira] [Created] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel
Joris Van den Bossche created ARROW-5311:

Summary: [C++] Return more specific invalid Status in Take kernel
Key: ARROW-5311
URL: https://issues.apache.org/jira/browse/ARROW-5311
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Fix For: 0.14.0

Currently the {{Take}} kernel returns a generic Invalid Status for certain cases that could use a more specific error:

- indices of wrong type (eg floats) -> TypeError instead of Invalid?
- out of bounds index -> new IndexError?

From review in https://github.com/apache/arrow/pull/4281

cc [~bkietz]
[jira] [Created] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory
Joris Van den Bossche created ARROW-5310:

Summary: [Python] better error message on creating ParquetDataset from empty directory
Key: ARROW-5310
URL: https://issues.apache.org/jira/browse/ARROW-5310
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

Currently, when {{path}} is an existing but empty directory, you get:

{code:python}
>>> dataset = pq.ParquetDataset(path)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 dataset = pq.ParquetDataset(path)

~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
    989
    990         if validate_schema:
--> 991             self.validate_schemas()
    992
    993         if filters is not None:

~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self)
   1025                 self.schema = self.common_metadata.schema
   1026             else:
-> 1027                 self.schema = self.pieces[0].get_metadata().schema
   1028         elif self.schema is None:
   1029             self.schema = self.metadata.schema

IndexError: list index out of range
{code}

That could be a nicer error message. Unless we actually want to allow this? (Although I am not sure there are good use cases of empty directories to support, because from an empty directory we cannot get any schema or metadata information.)

It is only failing when validating the schemas, so with {{validate_schema=False}} it actually returns a ParquetDataset object, just with an empty list for {{pieces}} and no schema. So it would be easy to not error when validating the schemas for this empty-directory case as well.
[jira] [Created] (ARROW-5379) [Python] support pandas' nullable Integer type in from_pandas
Joris Van den Bossche created ARROW-5379:

Summary: [Python] support pandas' nullable Integer type in from_pandas
Key: ARROW-5379
URL: https://issues.apache.org/jira/browse/ARROW-5379
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

From https://github.com/apache/arrow/issues/4168. We should add support for pandas' nullable Integer extension dtypes, as those could map nicely to Arrow's integer types. Ideally this happens in a generic way though, and not specifically for this extension type; that is discussed in ARROW-5271.
[jira] [Created] (ARROW-5349) [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
Joris Van den Bossche created ARROW-5349:

Summary: [Python/C++] Provide a way to specify the file path in parquet ColumnChunkMetaData
Key: ARROW-5349
URL: https://issues.apache.org/jira/browse/ARROW-5349
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Joris Van den Bossche
Fix For: 0.14.0

After ARROW-5258 / https://github.com/apache/arrow/pull/4236 it is now possible to collect the file metadata while writing different files (how to write that metadata out was not yet addressed there -> original issue ARROW-1983).

However, currently the {{file_path}} information in the ColumnChunkMetaData object is not set. This is, I think, expected / correct for the metadata as included within the single file; but for using the metadata in the combined dataset {{_metadata}}, it needs a file path set. So if you want to use this metadata for a partitioned dataset, there needs to be a way to specify this file path.

Ideas I am currently thinking of: either we could specify a file path to be used when writing, or expose the {{set_file_path}} method on the Python side so you can create an updated version of the metadata after collecting it.

cc [~pearu] [~mdurant]
[jira] [Created] (ARROW-5237) [Python] pandas_version key in pandas metadata no longer populated
Joris Van den Bossche created ARROW-5237:

Summary: [Python] pandas_version key in pandas metadata no longer populated
Key: ARROW-5237
URL: https://issues.apache.org/jira/browse/ARROW-5237
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.13.0
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
Fix For: 0.14.0

While looking at the pandas metadata, I noticed that the {{pandas_version}} field is now None. I suppose this is due to the recent refactoring of the pandas api compat (https://github.com/apache/arrow/pull/3893). PR coming.
[jira] [Created] (ARROW-5271) [Python] Interface for converting pandas ExtensionArray / other custom array objects to pyarrow Array
Joris Van den Bossche created ARROW-5271:

Summary: [Python] Interface for converting pandas ExtensionArray / other custom array objects to pyarrow Array
Key: ARROW-5271
URL: https://issues.apache.org/jira/browse/ARROW-5271
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Related to ARROW-2428, which describes the issue of converting back to an ExtensionArray in {{to_pandas}}.

To start supporting the conversion of custom ExtensionArrays (eg the nullable Int64Dtype in pandas, or the arrow-backed fletcher arrays, ...) to arrow Arrays (eg in {{pyarrow.array(..)}}), I think it would be good to define an interface or hook that external projects can implement and that pyarrow will call if available. This would allow external projects to define how they can be converted to arrow arrays, without the need for pyarrow itself to gather a lot of special-cased code for certain types (like pandas' nullable Int64).

This could be similar to how numpy looks for the {{__array__}} method, so we might call it {{__arrow_array__}}. See also https://github.com/pandas-dev/pandas/issues/20612 for an issue discussing this on the pandas side.
[jira] [Created] (ARROW-5248) [Python] support dateutil timezones
Joris Van den Bossche created ARROW-5248:

Summary: [Python] support dateutil timezones
Key: ARROW-5248
URL: https://issues.apache.org/jira/browse/ARROW-5248
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche

The {{dateutil}} package also provides a set of timezone objects (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone fixed offsets):

{code}
In [2]: import dateutil.tz

In [3]: import pyarrow as pa

In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels'))
...
~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.tzinfo_to_string()

ValueError: Unable to convert timezone `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string
{code}

But pandas also supports dateutil timezones. As a consequence, when you have a pandas DataFrame that uses a dateutil timezone, you get an error when converting it to an arrow table.
[jira] [Created] (ARROW-5287) [Python] automatic type inference for arrays of tuples
Joris Van den Bossche created ARROW-5287:

Summary: [Python] automatic type inference for arrays of tuples
Key: ARROW-5287
URL: https://issues.apache.org/jira/browse/ARROW-5287
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Arrays of tuples are supported to be converted to either ListArray or StructArray, if you specify the type explicitly:

{code}
In [6]: pa.array([(1, 2), (3, 4, 5)], type=pa.list_(pa.int64()))
Out[6]:
[
  [1, 2],
  [3, 4, 5]
]

In [7]: pa.array([(1, 2), (3, 4)], type=pa.struct([('a', pa.int64()), ('b', pa.int64())]))
Out[7]:
-- is_valid: all not null
-- child 0 type: int64
  [1, 3]
-- child 1 type: int64
  [2, 4]
{code}

But not when no type is specified:

{code}
In [8]: pa.array([(1, 2), (3, 4)])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-8> in <module>
----> 1 pa.array([(1, 2), (3, 4)])

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Could not convert (1, 2) with type tuple: did not recognize Python value type when inferring an Arrow data type
{code}

Do we want to do automatic type inference for tuples as well (defaulting to the ListArray case, just as arrays of Python lists are supported)? Or was there a specific reason not to support this by default?
[jira] [Created] (ARROW-5857) [Python] converting multidimensional numpy arrays to nested list type
Joris Van den Bossche created ARROW-5857:

Summary: [Python] converting multidimensional numpy arrays to nested list type
Key: ARROW-5857
URL: https://issues.apache.org/jira/browse/ARROW-5857
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Currently we only support 1-dimensional numpy arrays:

{code:python}
In [28]: pa.array([np.array([[1, 2], [3, 4]])], type=pa.list_(pa.list_(pa.int64())))
...
ArrowInvalid: Can only convert 1-dimensional array values
{code}

So to create a nested list array, you have to do that with lists of lists, or with object numpy arrays that have arrays as elements. We could expand this support to multi-dimensional numpy arrays. I am not sure we should do inference by default for this case, but at least when specifying a nested ListType, this would be nice.

It can be an alternative way to have some support for tensors, next to an ExtensionType (ARROW-1614 / ARROW-5819).

Related discussions: https://lists.apache.org/thread.html/9b142c1709aa37dc35f1ce8db4e1ced94fcc4cdd96cc72b5772b373b@%3Cdev.arrow.apache.org%3E, https://github.com/apache/arrow/issues/4802
[jira] [Created] (ARROW-5858) [Doc] Better document the Tensor classes in the prose documentation
Joris Van den Bossche created ARROW-5858:

Summary: [Doc] Better document the Tensor classes in the prose documentation
Key: ARROW-5858
URL: https://issues.apache.org/jira/browse/ARROW-5858
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Documentation, Python
Reporter: Joris Van den Bossche

From a comment from [~wesmckinn] in ARROW-2714:

{quote}
The Tensor classes are independent from the columnar data structures, though they reuse pieces of metadata, metadata serialization, memory management, and IPC. The purpose of adding these to the library was to have in-memory data structures for handling Tensor/ndarray data and metadata that "plug in" to the rest of the Arrow C++ system (Plasma store, IO subsystem, memory pools, buffers, etc.). Theoretically you could return a Tensor when creating a non-contiguous slice of an Array; in light of the above, I don't think that would be intuitive.

When we started the project, our focus was creating an open standard for in-memory columnar data, a hitherto unsolved problem. The project's scope has expanded into peripheral problems in the same domain in the meantime (with the mantra of creating interoperable components, a use-what-you-need development platform for system developers). I think this aspect of the project could be better documented / advertised, since the project's initial focus on the columnar standard has given some the mistaken impression that we are not interested in any work outside of that.
{quote}
[jira] [Created] (ARROW-5853) [Python] Expose boolean filter kernel on Array
Joris Van den Bossche created ARROW-5853:

Summary: [Python] Expose boolean filter kernel on Array
Key: ARROW-5853
URL: https://issues.apache.org/jira/browse/ARROW-5853
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Expose the filter kernel (https://issues.apache.org/jira/browse/ARROW-1558) on the python Array class.
[jira] [Created] (ARROW-5855) [Python] Add support for Duration type
Joris Van den Bossche created ARROW-5855:

Summary: [Python] Add support for Duration type
Key: ARROW-5855
URL: https://issues.apache.org/jira/browse/ARROW-5855
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche
Fix For: 1.0.0

Add support for the Duration type (added in C++: ARROW-835, ARROW-5261):

- add DurationType and DurationArray wrappers
- add inference support for datetime.timedelta / np.timedelta64
[jira] [Created] (ARROW-5859) [Python] Support ExtensionType on conversion to numpy/pandas
Joris Van den Bossche created ARROW-5859:

Summary: [Python] Support ExtensionType on conversion to numpy/pandas
Key: ARROW-5859
URL: https://issues.apache.org/jira/browse/ARROW-5859
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Currently, converting a Table or RecordBatch with an ExtensionType array to pandas gives:

{code}
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type extension is known.
{code}

And similarly, converting the array itself to a Python object ({{to_pandas}} or {{to_pylist}}) gives an ArrowNotImplementedError or a "KeyError: 28".

Initial support could be to fall back to the storage type.
[jira] [Created] (ARROW-5854) [Python] Expose compare kernels on Array class
Joris Van den Bossche created ARROW-5854:

Summary: [Python] Expose compare kernels on Array class
Key: ARROW-5854
URL: https://issues.apache.org/jira/browse/ARROW-5854
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

Expose the compare kernels for comparing with a scalar or array (https://issues.apache.org/jira/browse/ARROW-3087, https://issues.apache.org/jira/browse/ARROW-4990) on the python Array class. This can implement the {{__eq__}} et al. dunder methods on the Array class.
[jira] [Created] (ARROW-5864) [Python] simplify cython wrapping of Result
Joris Van den Bossche created ARROW-5864:

Summary: [Python] simplify cython wrapping of Result
Key: ARROW-5864
URL: https://issues.apache.org/jira/browse/ARROW-5864
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

See the answer in https://github.com/cython/cython/issues/3018
[jira] [Created] (ARROW-5915) [C++] [Python] Set up testing for backwards compatibility of the parquet reader
Joris Van den Bossche created ARROW-5915:

Summary: [C++] [Python] Set up testing for backwards compatibility of the parquet reader
Key: ARROW-5915
URL: https://issues.apache.org/jira/browse/ARROW-5915
Project: Apache Arrow
Issue Type: Test
Components: C++, Python
Reporter: Joris Van den Bossche

Given the recent parquet compat problems, we should have better testing for this. For easy testing of backwards compatibility, we could add some files (with different types) written with older versions to /pyarrow/tests/data/parquet (we already have some files from 0.7 there) and ensure they are read correctly with the current version.

Similar to what Kartothek is doing: https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat
[jira] [Created] (ARROW-5905) [Python] support conversion to decimal type from floats?
Joris Van den Bossche created ARROW-5905:

Summary: [Python] support conversion to decimal type from floats?
Key: ARROW-5905
URL: https://issues.apache.org/jira/browse/ARROW-5905
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Joris Van den Bossche

We currently allow constructing a decimal array from decimal.Decimal objects or from ints:

{code}
In [14]: pa.array([1, 0], type=pa.decimal128(2))
Out[14]:
[
  1,
  0
]

In [31]: pa.array([decimal.Decimal('0.1'), decimal.Decimal('0.2')], pa.decimal128(2, 1))
Out[31]:
[
  0.1,
  0.2
]
{code}

but not from floats (or strings):

{code}
In [18]: pa.array([0.1, 0.2], pa.decimal128(2))
...
ArrowTypeError: int or Decimal object expected, got float
{code}

Is this something we would like to support? There are certainly precision issues you can run into, but if the decimal type is fully specified, it seems clear what the user wants. In general, since decimal objects are not that easy to work with in pandas, many people might have plain float columns that they want to convert to decimal.
[jira] [Created] (ARROW-5201) [Python] Import ABCs from collections is deprecated in Python 3.7
Joris Van den Bossche created ARROW-5201: Summary: [Python] Import ABCs from collections is deprecated in Python 3.7 Key: ARROW-5201 URL: https://issues.apache.org/jira/browse/ARROW-5201 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche From running the tests, I see a few deprecation warnings related to this: on Python 3, abstract base classes should be imported from `collections.abc` instead of `collections`:
{code:none}
pyarrow/tests/test_array.py:808
  /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_array.py:808: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    pa.struct([pa.field('a', pa.int64()), pa.field('b', pa.string())]))

pyarrow/tests/test_table.py:18
  /home/joris/scipy/repos/arrow/python/pyarrow/tests/test_table.py:18: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import OrderedDict, Iterable

pyarrow/tests/test_feather.py::TestFeatherReader::test_non_string_columns
  /home/joris/scipy/repos/arrow/python/pyarrow/pandas_compat.py:294: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    elif isinstance(name, collections.Sequence):
{code}
These could be imported depending on Python 2/3 in the ``pyarrow.compat`` module.
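The suggested compat-module shim is the usual try/except import pattern; a minimal sketch of what such a ``pyarrow.compat`` helper could look like:

```python
# Version-agnostic ABC imports: Python 3.3+ exposes the ABCs in
# collections.abc; Python 2 only has them in collections.
try:
    from collections.abc import Iterable, Sequence
except ImportError:  # Python 2
    from collections import Iterable, Sequence

# The rest of the codebase then imports the ABCs from the compat module.
print(isinstance("abc", Sequence))  # True
```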
[jira] [Created] (ARROW-5210) [Python] editable install (pip install -e .) is failing
Joris Van den Bossche created ARROW-5210: Summary: [Python] editable install (pip install -e .) is failing Key: ARROW-5210 URL: https://issues.apache.org/jira/browse/ARROW-5210 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Following the Python development documentation on building arrow and pyarrow (https://arrow.apache.org/docs/developers/python.html#build-and-test), building pyarrow in place with {{python setup.py build_ext --inplace}} works fine. But if you also want to install this in-place version in the current Python environment (editable install / development install) using pip ({{pip install -e .}}), it fails during the {{build_ext}} / cmake phase:
{code:none}
-- Looking for python3.7m
-- Found Python lib /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so
CMake Error at cmake_modules/FindNumPy.cmake:62 (message):
  NumPy import failure:

  Traceback (most recent call last):

    File "", line 1, in

  ModuleNotFoundError: No module named 'numpy'

Call Stack (most recent call first):
  CMakeLists.txt:186 (find_package)

-- Configuring incomplete, errors occurred!
See also "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeOutput.log".
See also "/home/joris/scipy/repos/arrow/python/build/temp.linux-x86_64-3.7/CMakeFiles/CMakeError.log".
error: command 'cmake' failed with exit status 1
Cleaning up...
{code}
Alternatively, `python setup.py develop`, which achieves the same, does work.
[jira] [Created] (ARROW-5220) [Python] index / unknown columns in specified schema in Table.from_pandas
Joris Van den Bossche created ARROW-5220: Summary: [Python] index / unknown columns in specified schema in Table.from_pandas Key: ARROW-5220 URL: https://issues.apache.org/jira/browse/ARROW-5220 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche The {{Table.from_pandas}} method allows specifying a schema ("This can be used to indicate the type of columns if we cannot infer it automatically."). But if you also want to specify the type of the index, you get an error:
{code:python}
df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]})
df.index = pd.Index(['a', 'b', 'c'], name='index')
my_schema = pa.schema([('index', pa.string()),
                       ('a', pa.int64()),
                       ('b', pa.float64()),
                       ])
table = pa.Table.from_pandas(df, schema=my_schema)
{code}
gives {{KeyError: 'index'}} (because it tries to look up the "column names" from the schema in the dataframe, and thus does not find a column 'index'). This also has the consequence that re-using a schema does not work: {{table1 = pa.Table.from_pandas(df1); table2 = pa.Table.from_pandas(df2, schema=table1.schema)}}
Extra note: unknown columns in general give this error as well (columns specified in the schema that are not in the dataframe). At least in pyarrow 0.11 this did not give an error (e.g. noticed this from the code in the example in ARROW-3861). So before, unknown columns in the specified schema were ignored, while now they raise an error. Was this a conscious change? (Specifying the index in the schema also "worked" before in the sense that it didn't raise an error, but it was ignored, so it didn't actually do what you would expect.)
Questions:
- I think we should support specifying the index in the passed {{schema}}, so that the example above works (although this might be complicated with RangeIndex, which is not serialized any more).
- But what to do in general with additional columns in the schema that are not in the DataFrame? Are we fine with continuing to raise an error as happens now (the error message could be improved then)? Or do we again want to ignore them? (Or they could actually be added to the table as all nulls.)
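The index-aware lookup the first question asks for can be sketched in plain Python (`resolve_schema_names` is a hypothetical helper, not pyarrow's code): resolve each schema field name against the DataFrame's columns first, then its index names, and only error on truly unknown names:

```python
def resolve_schema_names(schema_names, column_names, index_names,
                         strict=True):
    """Map each schema field name to a ('column', name) or ('index', name)
    source; unknown names raise (or could be treated as all-null)."""
    resolved, unknown = [], []
    for name in schema_names:
        if name in column_names:
            resolved.append(('column', name))
        elif name in index_names:
            resolved.append(('index', name))
        else:
            unknown.append(name)
    if strict and unknown:
        raise KeyError('schema fields not found in DataFrame: %r' % unknown)
    return resolved

print(resolve_schema_names(['index', 'a', 'b'], ['a', 'b'], ['index']))
```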
[jira] [Created] (ARROW-6321) [Python] Ability to create ExtensionBlock on conversion to pandas
Joris Van den Bossche created ARROW-6321: Summary: [Python] Ability to create ExtensionBlock on conversion to pandas Key: ARROW-6321 URL: https://issues.apache.org/jira/browse/ARROW-6321 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche To be able to create a pandas DataFrame in {{to_pandas()}} that holds ExtensionArrays (e.g. towards ARROW-2428, to register a conversion), we first need to add to the {{table_to_blockmanager}} / {{ConvertTableToPandas}} conversion utilities the ability to create a pandas {{ExtensionBlock}} that can hold a pandas {{ExtensionArray}}.
[jira] [Created] (ARROW-6305) [Python] scalar pd.NaT incorrectly parsed in conversion from Python
Joris Van den Bossche created ARROW-6305: Summary: [Python] scalar pd.NaT incorrectly parsed in conversion from Python Key: ARROW-6305 URL: https://issues.apache.org/jira/browse/ARROW-6305 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche When converting from scalar values, using {{pd.NaT}} (the missing value indicator that pandas uses for datetime64 data) results in an incorrect timestamp:
{code}
In [6]: pa.array([pd.Timestamp("2012-01-01"), pd.NaT])
Out[6]:
[
  2012-01-01 00:00:00.00,
  0001-01-01 00:00:00.00
]
{code}
Here {{pd.NaT}} is converted to "0001-01-01", which is strange, as that does not even correspond to the integer value of pd.NaT. Numpy's version ({{np.datetime64('NaT')}}) is handled correctly, which also means that a pandas Series holding pd.NaT is handled correctly (since the conversion to numpy uses numpy's NaT). Related to ARROW-842.
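A scalar-level null check does not need pandas at all: pd.NaT, like NaN, compares unequal to itself, so conversion code could detect both generically. A plain-Python sketch (`is_null_scalar` is a hypothetical helper, not pyarrow's converter):

```python
def is_null_scalar(value):
    """Treat None and any self-unequal scalar as null. NaN -- and
    pandas' NaT, which also compares unequal to itself -- both hit
    the `value != value` branch."""
    if value is None:
        return True
    try:
        return value != value
    except TypeError:
        return False

print(is_null_scalar(float('nan')))  # True
```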
[jira] [Created] (ARROW-6325) [Python] wrong conversion of DataFrame with boolean values
Joris Van den Bossche created ARROW-6325: Summary: [Python] wrong conversion of DataFrame with boolean values Key: ARROW-6325 URL: https://issues.apache.org/jira/browse/ARROW-6325 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.1 Reporter: Joris Van den Bossche Fix For: 0.15.0 From https://github.com/pandas-dev/pandas/issues/28090
{code}
In [19]: df = pd.DataFrame(np.ones((5, 2), dtype=bool), columns=['a', 'b'])

In [20]: df
Out[20]:
      a     b
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

In [21]: table = pa.table(df)

In [23]: table.column(0)
Out[23]:
[
  [
    true,
    false,
    false,
    false,
    false
  ]
]
{code}
The resulting table has False values while the original DataFrame had only True values. It seems this has to do with there being multiple columns, as with a single column it converts correctly.
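For background: the Arrow format stores boolean arrays bit-packed, LSB-first, and a stride or offset mistake in that packing path for multi-column input could plausibly produce exactly this kind of corruption. A minimal sketch of correct packing (illustrative only, not pyarrow's actual conversion code):

```python
def pack_bits(values):
    """Bit-pack booleans LSB-first, one bit per value, as the Arrow
    format lays out boolean data buffers."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

print(pack_bits([True] * 5))  # b'\x1f'
```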
[jira] [Created] (ARROW-6548) [Python] consistently handle conversion of all-NaN arrays across types
Joris Van den Bossche created ARROW-6548: Summary: [Python] consistently handle conversion of all-NaN arrays across types Key: ARROW-6548 URL: https://issues.apache.org/jira/browse/ARROW-6548 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche In ARROW-5682 (https://github.com/apache/arrow/pull/5333), next to fixing actual conversion bugs, I added the ability to convert all-NaN float arrays when converting to string type (and only with {{from_pandas=True}}). So this now works: {code} >>> pa.array(np.array([np.nan, np.nan], dtype=float), type=pa.string()) [ null, null ] {code} However, I only added this for string type (and it already works for float and int types). If we are happy with this behaviour, we should also add it for other types. -- This message was sent by Atlassian Jira (v8.3.2#803003)
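The from_pandas semantics in question amount to mapping NaN to null before casting to the target type. A plain-Python sketch of that mapping (`floats_to_nullable` is a hypothetical helper):

```python
import math

def floats_to_nullable(values):
    """from_pandas-style semantics: map NaN to None so an all-NaN float
    column can be converted to any target type as all-null."""
    return [None if (isinstance(v, float) and math.isnan(v)) else v
            for v in values]

print(floats_to_nullable([float('nan'), float('nan')]))  # [None, None]
```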
[jira] [Created] (ARROW-6492) [Python] file written with latest fastparquet cannot be read with latest pyarrow
Joris Van den Bossche created ARROW-6492: Summary: [Python] file written with latest fastparquet cannot be read with latest pyarrow Key: ARROW-6492 URL: https://issues.apache.org/jira/browse/ARROW-6492 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche From a report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252 With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), a file written with pandas using the fastparquet engine cannot be read with the pyarrow engine:
{code}
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_parquet("test.parquet", engine="fastparquet", compression=None)
pd.read_parquet("test.parquet", engine="pyarrow")
{code}
gives the following error when reading:
{code}
> 1 pd.read_parquet("test.parquet", engine="pyarrow")

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    292
    293     impl = get_engine(engine)
--> 294     return impl.read(path, columns=columns, **kwargs)

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    123         kwargs["use_pandas_metadata"] = True
    124         result = self.api.parquet.read_table(
--> 125             path, columns=columns, **kwargs
    126         ).to_pandas()
    127         if should_close:

~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
    642         column_indexes = pandas_metadata.get('column_indexes', [])
    643         index_descriptors = pandas_metadata['index_columns']
--> 644         table = _add_any_metadata(table, pandas_metadata)
    645         table, index = _reconstruct_index(table, index_descriptors,
    646                                           all_columns)

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
    965             raw_name = 'None'
    966
--> 967         idx = schema.get_field_index(raw_name)
    968         if idx != -1:
    969             if col_meta['pandas_type'] == 'datetimetz':

~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()

~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, dict found
{code}
[jira] [Created] (ARROW-6529) [C++] Feather: slow writing of NullArray
Joris Van den Bossche created ARROW-6529: Summary: [C++] Feather: slow writing of NullArray Key: ARROW-6529 URL: https://issues.apache.org/jira/browse/ARROW-6529 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche From https://stackoverflow.com/questions/57877017/pandas-feather-format-is-slow-when-writing-a-column-of-none A smaller example using just pyarrow: it seems that writing an array of nulls takes much longer than an array of, for example, ints, which seems a bit strange:
{code}
In [93]: arr = pa.array([1]*1000)

In [94]: %%timeit
    ...: w = pyarrow.feather.FeatherWriter('__test.feather')
    ...: w.writer.write_array('x', arr)
    ...: w.writer.close()
31.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 1 loops each)

In [95]: arr = pa.array([None]*1000)

In [96]: arr
Out[96]:
1000 nulls

In [97]: %%timeit
    ...: w = pyarrow.feather.FeatherWriter('__test.feather')
    ...: w.writer.write_array('x', arr)
    ...: w.writer.close()
3.75 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{code}
So writing a NullArray of the same length takes about 100× more time.
[jira] [Created] (ARROW-6488) [Python] pyarrow.NULL equals to itself
Joris Van den Bossche created ARROW-6488: Summary: [Python] pyarrow.NULL equals to itself Key: ARROW-6488 URL: https://issues.apache.org/jira/browse/ARROW-6488 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.15.0 Somewhat related to ARROW-6386 on the interpretation of nulls, we currently have the following behaviour:
{code}
In [28]: pa.NULL == pa.NULL
Out[28]: True
{code}
This is certainly unexpected for a null / missing value. I still need to check what the array-level compare kernel does (NULL or False? Ideally NULL, I think), but we should follow that.
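For comparison, SQL-style (Kleene) three-valued logic, which the issue argues for, propagates null through equality rather than answering True. A minimal sketch using None as the null value:

```python
def kleene_equal(a, b):
    """Three-valued equality: comparing anything with null yields
    null (None), matching SQL/Kleene semantics."""
    if a is None or b is None:
        return None
    return a == b

print(kleene_equal(None, None))  # None -- null == null is null, not True
```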
[jira] [Created] (ARROW-6506) [C++] Validation of ExtensionType with nested type fails
Joris Van den Bossche created ARROW-6506: Summary: [C++] Validation of ExtensionType with nested type fails Key: ARROW-6506 URL: https://issues.apache.org/jira/browse/ARROW-6506 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Fix For: 0.15.0 A reproducer using the Python ExtensionType: {code} class MyStructType(pa.ExtensionType): def __init__(self): storage_type = pa.struct([('a', pa.int64()), ('b', pa.int64())]) pa.ExtensionType.__init__(self, storage_type, 'my_struct_type') def __arrow_ext_serialize__(self): return b'' @classmethod def __arrow_ext_deserialize__(self, storage_type, serialized): return MyStructType() ty = MyStructType() storage_array = pa.array([{'a': 1, 'b': 2}], ty.storage_type) arr = pa.ExtensionArray.from_storage(ty, storage_array) {code} then validating this array fails because it expects no children (the extension array itself has no children, only the storage array): {code} In [8]: arr.validate() --- ArrowInvalid Traceback (most recent call last) in > 1 arr.validate() ~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.validate() ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() ArrowInvalid: Expected 0 child arrays in array of type extension, got 2 {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (ARROW-6507) [C++] Add ExtensionArray::ExtensionValidate for custom validation?
Joris Van den Bossche created ARROW-6507: Summary: [C++] Add ExtensionArray::ExtensionValidate for custom validation? Key: ARROW-6507 URL: https://issues.apache.org/jira/browse/ARROW-6507 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche From discussing ARROW-6506, [~bkietz] said: an extension type might place more constraints on an array than those implicit in its storage type, and users will probably expect to be able to plug those into {{Validate}}. So we could have an {{ExtensionArray::ExtensionValidate}} that the visitor for {{ExtensionArray}} can call, similar to how there is an {{ExtensionType::ExtensionEquals}} that the visitor calls when extension types are checked for equality.
[jira] [Created] (ARROW-6556) [Python] prepare on pandas release without SparseDataFrame
Joris Van den Bossche created ARROW-6556: Summary: [Python] prepare on pandas release without SparseDataFrame Key: ARROW-6556 URL: https://issues.apache.org/jira/browse/ARROW-6556 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche We still have a few places where we use SparseDataFrame. An upcoming release of pandas will remove this class, so we should make sure pyarrow already works without it.
[jira] [Created] (ARROW-6132) [Python] ListArray.from_arrays does not check validity of input arrays
Joris Van den Bossche created ARROW-6132: Summary: [Python] ListArray.from_arrays does not check validity of input arrays Key: ARROW-6132 URL: https://issues.apache.org/jira/browse/ARROW-6132 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche From https://github.com/apache/arrow/pull/4979#issuecomment-517593918. When creating a ListArray from offsets and values in Python, there is no validation that the offsets start with 0 and end with the length of the values array (but is that required? The docs seem to indicate so: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.")). The array you get "seems" ok (the repr), but on conversion to Python or flattening, things go wrong:
{code}
In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5))

In [62]: a
Out[62]:
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]:
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <-- includes more elements as garbage
{code}
Calling {{validate}} manually correctly raises:
{code}
In [65]: a.validate()
...
ArrowInvalid: Final offset invariant not equal to values length: 10!=5
{code}
In C++ the main constructors are not safe, and as the caller you need to ensure that the data is correct or call a safe (slower) constructor. But do we want to use the unsafe / fast constructors without validation as the default in Python as well? Or should we call {{validate}} here? A quick search seems to indicate that `pa.Array.from_buffers` does validation, but other `from_arrays` methods don't seem to do this explicitly.
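The invariants quoted from the format docs are easy to state as a standalone check. A plain-Python sketch of what a validating `from_arrays` could enforce (`validate_list_offsets` is a hypothetical helper, not pyarrow's code):

```python
def validate_list_offsets(offsets, values_length):
    """Check the List-layout invariants from the Arrow format spec:
    offsets start at 0, end at len(values), and never decrease."""
    if offsets[0] != 0:
        raise ValueError('first offset must be 0, got %d' % offsets[0])
    if offsets[-1] != values_length:
        raise ValueError('final offset %d != values length %d'
                         % (offsets[-1], values_length))
    if any(a > b for a, b in zip(offsets, offsets[1:])):
        raise ValueError('offsets must be non-decreasing')

validate_list_offsets([0, 3, 5], 5)       # ok
try:
    validate_list_offsets([1, 3, 10], 5)  # the offsets from the report
except ValueError as exc:
    print(exc)  # first offset must be 0, got 1
```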
[jira] [Created] (ARROW-6159) [C++] PrettyPrint of arrow::Schema missing indentation for first line
Joris Van den Bossche created ARROW-6159: Summary: [C++] PrettyPrint of arrow::Schema missing indentation for first line Key: ARROW-6159 URL: https://issues.apache.org/jira/browse/ARROW-6159 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.14.1 Reporter: Joris Van den Bossche Minor issue, but I noticed it when printing a Schema with indentation, like:
{code}
std::shared_ptr<arrow::Field> field1 = arrow::field("column1", arrow::int32());
std::shared_ptr<arrow::Field> field2 = arrow::field("column2", arrow::utf8());
std::shared_ptr<arrow::Schema> schema = arrow::schema({field1, field2});
arrow::PrettyPrintOptions options{4};
arrow::PrettyPrint(*schema, options, &std::cout);
{code}
you get
{code}
column1: int32
    column2: string
{code}
so the indent is not applied to the first line.
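The expected behaviour has a convenient Python reference point: the standard library's `textwrap.indent` prefixes every line, including the first, which is what an indent of 4 should produce here:

```python
import textwrap

# What PrettyPrint with indent=4 should emit for this schema repr:
schema_repr = "column1: int32\ncolumn2: string"
print(textwrap.indent(schema_repr, "    "))
```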
[jira] [Created] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
Joris Van den Bossche created ARROW-6157: Summary: [Python][C++] UnionArray with invalid data passes validation / leads to segfaults Key: ARROW-6157 URL: https://issues.apache.org/jira/browse/ARROW-6157 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche From the Python side, you can create an "invalid" UnionArray:
{code}
binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
int64 = pa.array([1, 2, 3], type='int64')
types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out of bounds for the number of children
value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
{code}
E.g. on conversion to Python this leads to a segfault:
{code}
In [7]: a.to_pylist()
Segmentation fault (core dumped)
{code}
On the other hand, doing an explicit validation does not give an error:
{code}
In [8]: a.validate()
{code}
Should the validation raise errors for this case? (The C++ {{ValidateVisitor}} for UnionArray does nothing.)
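The missing check is simple to state: every entry in the types buffer must index an existing child array. A plain-Python sketch of such a check (hypothetical helper, not the C++ ValidateVisitor):

```python
def validate_union_type_codes(type_codes, num_children):
    """Every type id in a union's types buffer must index an existing
    child array; out-of-range ids (like the 2 above) should fail."""
    bad = [t for t in type_codes if not 0 <= t < num_children]
    if bad:
        raise ValueError('type ids out of range for %d children: %r'
                         % (num_children, bad))

try:
    validate_union_type_codes([0, 1, 0, 0, 2, 1, 0], 2)
except ValueError as exc:
    print(exc)  # type ids out of range for 2 children: [2]
```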
[jira] [Created] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
Joris Van den Bossche created ARROW-6158: Summary: [Python] possible to create StructArray with type that conflicts with child array's types Key: ARROW-6158 URL: https://issues.apache.org/jira/browse/ARROW-6158 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Using the Python interface as an example, this creates a {{StructArray}} where the field types don't match the child array types:
{code}
a = pa.array([1, 2, 3], type=pa.int64())
b = pa.array(['a', 'b', 'c'], type=pa.string())
inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]
a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields)
{code}
The above works fine. I didn't find anything that errors (e.g. conversion to pandas, slicing), and validation passes, but the type actually has the inconsistent child types:
{code}
In [2]: a
Out[2]:
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    2,
    3
  ]
-- child 1 type: string
  [
    "a",
    "b",
    "c"
  ]

In [3]: a.type
Out[3]: StructType(struct)

In [4]: a.to_pandas()
Out[4]:
array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
      dtype=object)

In [5]: a.validate()
{code}
Shouldn't this be disallowed somehow? (It could be checked in the Python {{from_arrays}} method, but maybe also in {{StructArray::Make}}, which already checks the number of fields vs arrays and a consistent array length.) Similar to the discussion in ARROW-6132, I would also expect {{ValidateArray}} to catch this.
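The suggested consistency check can be sketched in plain Python, with types modelled as strings (a hypothetical helper, not {{StructArray::Make}} itself):

```python
def validate_struct_types(declared_types, child_types):
    """Compare each declared field type against the corresponding
    child array's actual type; any mismatch should be rejected."""
    mismatches = [(i, d, c)
                  for i, (d, c) in enumerate(zip(declared_types, child_types))
                  if d != c]
    if mismatches:
        raise TypeError('field/child type mismatch: %r' % mismatches)

try:
    # the declared vs actual types from the example above
    validate_struct_types(['int32', 'float64'], ['int64', 'string'])
except TypeError as exc:
    print(exc)
```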
[jira] [Created] (ARROW-6115) [Python] support LargeList, LargeString, LargeBinary in conversion to pandas
Joris Van den Bossche created ARROW-6115: Summary: [Python] support LargeList, LargeString, LargeBinary in conversion to pandas Key: ARROW-6115 URL: https://issues.apache.org/jira/browse/ARROW-6115 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche General python support for those 3 new types has been added: ARROW-6000, ARROW-6084 However, one aspect that is not yet implemented is conversion to pandas (or numpy array): {code} In [67]: a = pa.array(['a', 'b', 'c'], pa.large_string()) In [68]: a.to_pandas() ... ArrowNotImplementedError: large_utf8 In [69]: pa.table({'a': a}).to_pandas() ... ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of type large_string is known. {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6179) [C++] ExtensionType subclass for "unknown" types?
Joris Van den Bossche created ARROW-6179: Summary: [C++] ExtensionType subclass for "unknown" types? Key: ARROW-6179 URL: https://issues.apache.org/jira/browse/ARROW-6179 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche In C++, when receiving IPC with extension type metadata for a type that is unknown (the name is not registered), we currently fall back to returning the "raw" storage array. The custom metadata (extension name and metadata) is still available in the Field metadata. Alternatively, we could also have a generic {{ExtensionType}} class that can hold such an "unknown" extension type (e.g. {{UnknownExtensionType}} or {{GenericExtensionType}}), keeping the extension name and metadata in the Array's type. This could be a single class where several instances can be created given a storage type, extension name and optionally extension metadata. It would be a way to have an unregistered extension type.
[jira] [Created] (ARROW-6176) [Python] Allow to subclass ExtensionArray to attach to custom extension type
Joris Van den Bossche created ARROW-6176: Summary: [Python] Allow to subclass ExtensionArray to attach to custom extension type Key: ARROW-6176 URL: https://issues.apache.org/jira/browse/ARROW-6176 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Currently, you can define a custom extension type in Python with
{code}
class UuidType(pa.ExtensionType):

    def __init__(self):
        pa.ExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        return UuidType, ()
{code}
but the array you can create with this is always a plain ExtensionArray. We should provide a way to define a subclass (e.g. `UuidArray` in this case) that can hold custom logic. For example, a user might want to define `UuidArray` such that `arr[i]` returns an instance of Python's `uuid.UUID`. From https://github.com/apache/arrow/pull/4532#pullrequestreview-249396691
[jira] [Created] (ARROW-6187) [C++] fallback to storage type when writing ExtensionType to Parquet
Joris Van den Bossche created ARROW-6187: Summary: [C++] fallback to storage type when writing ExtensionType to Parquet Key: ARROW-6187 URL: https://issues.apache.org/jira/browse/ARROW-6187 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Writing a table that contains an ExtensionType array to a parquet file is not yet implemented. It currently raises "ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: extension" (for a PyExtensionType in this case). I think minimal support can consist of writing the storage type / array. We also might want to save the extension name and metadata in the parquet FileMetadata. Later on, this could potentially be used to restore the extension type when reading. This is related to other issues that need to save the arrow schema (categorical: ARROW-5480, time zones: ARROW-5888). Only in this case, we probably want to store the serialised type in addition to the schema (which only has the extension type's name).
[jira] [Created] (ARROW-6082) [Python] create pa.dictionary() type with non-integer indices type crashes
Joris Van den Bossche created ARROW-6082: Summary: [Python] create pa.dictionary() type with non-integer indices type crashes Key: ARROW-6082 URL: https://issues.apache.org/jira/browse/ARROW-6082 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche For example if you mixed the order of the indices and values type: {code} In [1]: pa.dictionary(pa.int8(), pa.string()) Out[1]: DictionaryType(dictionary) In [2]: pa.dictionary(pa.string(), pa.int8()) WARNING: Logging before InitGoogleLogging() is written to STDERR F0731 14:40:42.748589 26310 type.cc:440] Check failed: is_integer(index_type->id()) dictionary index type should be signed integer *** Check failure stack trace: *** Aborted (core dumped) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (ARROW-6642) [Python] chained access of ParquetDataset's metadata segfaults
Joris Van den Bossche created ARROW-6642: Summary: [Python] chained access of ParquetDataset's metadata segfaults Key: ARROW-6642 URL: https://issues.apache.org/jira/browse/ARROW-6642 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Creating and reading a parquet dataset: {code} table = pa.table({'a': [1, 2, 3]}) import pyarrow.parquet as pq pq.write_table(table, '__test_statistics_segfault.parquet') dataset = pq.ParquetDataset('__test_statistics_segfault.parquet') dataset_piece = dataset.pieces[0] {code} If you access the metadata and a column's statistics in steps, this works fine: {code} meta = dataset_piece.get_metadata() row = meta.row_group(0) col = row.column(0) {code} but doing it chained in one step, this segfaults: {code} dataset_piece.get_metadata().row_group(0).column(0) {code} {{dataset_piece.get_metadata().row_group(0)}} still works, but additionally with {{.column(0)}} then it segfaults. -- This message was sent by Atlassian Jira (v8.3.4#803005)
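A plausible (unconfirmed) explanation is an object-lifetime bug: the chained call drops the only strong reference to the parent metadata object while the child still points into its memory. A pure-Python model of that failure mode, using hypothetical `Metadata`/`RowGroup` classes and a weak reference to stand in for "does not keep the parent's buffer alive":

```python
import gc
import weakref

class Metadata:
    def row_group(self):
        return RowGroup(self)

class RowGroup:
    def __init__(self, parent):
        # A weak reference models a child that does NOT keep its
        # parent alive -- the suspected bug.
        self._parent = weakref.ref(parent)

meta = Metadata()
held = meta.row_group()           # parent kept alive by `meta`
chained = Metadata().row_group()  # temporary parent is collected
gc.collect()
print(held._parent() is None)     # False: parent still alive
print(chained._parent() is None)  # True: dangling, like the segfault
```

The usual fix is for the child to hold a strong reference to its parent (as the non-chained usage implicitly does via the local variable).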
[jira] [Created] (ARROW-6704) [C++] Cast from timestamp to higher resolution does not check out of bounds timestamps
Joris Van den Bossche created ARROW-6704: Summary: [C++] Cast from timestamp to higher resolution does not check out of bounds timestamps Key: ARROW-6704 URL: https://issues.apache.org/jira/browse/ARROW-6704 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche When casting e.g. {{timestamp('s')}} to {{timestamp('ns')}}, we do not check for out-of-bounds timestamps, giving "garbage" timestamps in the result:
{code}
In [74]: a_np = np.array(["2012-01-01", "2412-01-01"], dtype="datetime64[s]")

In [75]: arr = pa.array(a_np)

In [76]: arr
Out[76]:
[
  2012-01-01 00:00:00,
  2412-01-01 00:00:00
]

In [77]: arr.cast(pa.timestamp('ns'))
Out[77]:
[
  2012-01-01 00:00:00.0,
  1827-06-13 00:25:26.290448384
]
{code}
Now, this is the same behaviour as numpy, so I am not sure we should change it. However, since we have {{safe=True/False}}, I would expect that for {{safe=True}} we check this and for {{safe=False}} we do not. (numpy has a similar {{casting='safe'}} but also does not raise an error in that case.)
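The bounds check a {{safe=True}} cast could perform is cheap: before multiplying the seconds value by 10^9, verify it fits in int64 nanoseconds. A plain-Python sketch with a hypothetical `seconds_to_nanos_safe` helper (not pyarrow's cast kernel):

```python
from datetime import datetime, timezone

INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def seconds_to_nanos_safe(dt):
    """Upcast a second-resolution timestamp to nanoseconds, raising
    instead of silently wrapping past the int64 range."""
    seconds = int((dt - EPOCH).total_seconds())
    if abs(seconds) > INT64_MAX // 10**9:
        raise OverflowError('timestamp out of range for ns resolution: %s' % dt)
    return seconds * 10**9

seconds_to_nanos_safe(datetime(2012, 1, 1, tzinfo=timezone.utc))   # fine
try:
    seconds_to_nanos_safe(datetime(2412, 1, 1, tzinfo=timezone.utc))
except OverflowError as exc:
    print(exc)
```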
[jira] [Created] (ARROW-6763) [Python] Parquet s3 tests are skipped because dependencies are not installed
Joris Van den Bossche created ARROW-6763: Summary: [Python] Parquet s3 tests are skipped because dependencies are not installed Key: ARROW-6763 URL: https://issues.apache.org/jira/browse/ARROW-6763 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Currently the s3 parquet tests are skipped on both Travis and Ursabot.
[jira] [Created] (ARROW-5603) [Python] register pytest markers to avoid warnings
Joris Van den Bossche created ARROW-5603: Summary: [Python] register pytest markers to avoid warnings Key: ARROW-5603 URL: https://issues.apache.org/jira/browse/ARROW-5603 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.14.0
[jira] [Created] (ARROW-5890) [C++][Python] Support ExtensionType arrays in more kernels
Joris Van den Bossche created ARROW-5890: Summary: [C++][Python] Support ExtensionType arrays in more kernels Key: ARROW-5890 URL: https://issues.apache.org/jira/browse/ARROW-5890 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche From a quick test (through Python), it seems that {{slice}} and {{take}} work, but the following do not:
- {{cast}}: it could rely on the casting rules of the storage type. Or do we want users to explicitly take the storage array before casting?
- {{dictionary_encode}} / {{unique}}
[jira] [Created] (ARROW-7027) [Python] pa.table(..) returns instead of raises error if passing invalid object
Joris Van den Bossche created ARROW-7027: Summary: [Python] pa.table(..) returns instead of raises error if passing invalid object Key: ARROW-7027 URL: https://issues.apache.org/jira/browse/ARROW-7027 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 When passing eg a Series instead of a DataFrame, you get: {code} In [4]: df = pd.DataFrame({'a': [1, 2, 3]}) In [5]: table = pa.table(df['a']) In [6]: table Out[6]: TypeError('Expected pandas DataFrame or python dictionary') In [7]: type(table) Out[7]: TypeError {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
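The bug pattern here is a converter that builds an exception object but returns it instead of raising it. A sketch of the fix with a hypothetical `to_table` stand-in (not pyarrow's actual code):

```python
def to_table(obj):
    """Toy converter: accept a dict, reject everything else."""
    if isinstance(obj, dict):
        return {'columns': obj}  # stand-in for the real conversion
    # raise the error -- the bug was `return TypeError(...)`,
    # which hands the exception object back as the "result"
    raise TypeError('Expected pandas DataFrame or python dictionary')

try:
    to_table([1, 2, 3])
except TypeError as exc:
    print(exc)  # Expected pandas DataFrame or python dictionary
```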
[jira] [Created] (ARROW-7068) [C++] Expose the offsets of a ListArray as an Int32Array
Joris Van den Bossche created ARROW-7068: Summary: [C++] Expose the offsets of a ListArray as an Int32Array Key: ARROW-7068 URL: https://issues.apache.org/jira/browse/ARROW-7068 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche As follow-up on ARROW-7031 (https://github.com/apache/arrow/pull/5759), we can move this into C++ and use that implementation from Python. Cf. [https://github.com/apache/arrow/pull/5759#discussion_r342244521], this could be a {{ListArray::value_offsets_array}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7031) [Python] Expose the offsets of a ListArray in python
Joris Van den Bossche created ARROW-7031: Summary: [Python] Expose the offsets of a ListArray in python Key: ARROW-7031 URL: https://issues.apache.org/jira/browse/ARROW-7031 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assume the following ListArray: {code} In [1]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5]) In [2]: arr Out[2]: [ [ 1, 2, 3 ], [ 4, 5 ] ] {code} You can get the actual values as a flat array through {{.values}} / {{.flatten()}}, but there is currently no easy way to get back to the offsets (except from interpreting the buffers manually). We should probably add an {{offsets}} attribute (there is actually also a TODO comment for that). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7154) [C++] Build error when building tests but not with snappy
Joris Van den Bossche created ARROW-7154: Summary: [C++] Build error when building tests but not with snappy Key: ARROW-7154 URL: https://issues.apache.org/jira/browse/ARROW-7154 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Since the docker-compose PR landed, I am having build errors like: {code:java} [361/376] Linking CXX executable debug/arrow-python-test FAILED: debug/arrow-python-test : && /home/joris/miniconda3/envs/arrow-dev/bin/ccache /home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++ -Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0 -Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror -msse4.2 -g -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -rdynamic src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o -o debug/arrow-python-test -Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 /home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread -ldl -lutil -lrt -ldl /home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a /home/joris/miniconda3/envs/arrow-dev/lib/libglog.so jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt /home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && : /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link) 
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found (try using -rpath or -rpath-link) /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::system::detail::generic_category_ncx()' /home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: debug/libarrow.so.100.0.0: undefined reference to `boost::filesystem::path::operator/=(boost::filesystem::path const&)' collect2: error: ld returned 1 exit status {code} which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, not found" (although this is certainly present). The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to OFF, it works fine. It also seems to be related to this specific change in the docker compose PR: {code:java} diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index c80ac3310..3b3c9eb8f 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -266,6 +266,15 @@ endif(UNIX) # Set up various options # -if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) - # Currently the compression tests require at least these libraries; bz2 and - # zstd are optional. See ARROW-3984 - set(ARROW_WITH_BROTLI ON) - set(ARROW_WITH_LZ4 ON) - set(ARROW_WITH_SNAPPY ON) - set(ARROW_WITH_ZLIB ON) -endif() - if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION) set(ARROW_JSON ON) endif() {code} If I add that back, the build works. 
Testing the flags individually:
- With only `set(ARROW_WITH_BROTLI ON)`, it still fails.
- With only `set(ARROW_WITH_LZ4 ON)`, it also fails, but with an error about liblz4 instead of libboost (liblz4 is also actually present).
- With only `set(ARROW_WITH_SNAPPY ON)`, it works.
- With only `set(ARROW_WITH_ZLIB ON)`, it also fails, but with an error about libz.so.1 not being found.

So it seems that the absence of snappy causes the others to fail. In the recommended build settings in the development docs ([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]), the compression libraries are enabled. But I was still building without them (stemming from the time they were enabled by default). So I was using: {code}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_BUILD_TESTS=ON \
      ..
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?
Joris Van den Bossche created ARROW-7066: Summary: [Python] support returning ChunkedArray from __arrow_array__ ? Key: ARROW-7066 URL: https://issues.apache.org/jira/browse/ARROW-7066 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can define how they should be converted to a pyarrow Array (similar to numpy's {{\_\_array\_\_}}). This is then also used to support converting pandas DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if the pandas ExtensionArray, such as the nullable integer type, implements this {{\_\_arrow_array\_\_}} method). This last use case could also be useful for fletcher (https://github.com/xhochy/fletcher/, a package that implements pandas ExtensionArrays that wrap pyarrow arrays, so they can be stored as-is in a pandas DataFrame). However, fletcher stores ChunkedArrays in the ExtensionArray / the columns of a pandas DataFrame (to have a better mapping with a Table, where the columns also consist of chunked arrays), while we currently require that the return value of {{\_\_arrow_array\_\_}} is a pyarrow.Array. So I was wondering: could we relax this constraint and also allow ChunkedArray as a return value? However, this protocol is currently called in the {{pa.array(..)}} function, which probably should keep returning an Array (and not a ChunkedArray in certain cases). cc [~uwe] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas
Joris Van den Bossche created ARROW-7365: Summary: [Python] Support FixedSizeList type in conversion to numpy/pandas Key: ARROW-7365 URL: https://issues.apache.org/jira/browse/ARROW-7365 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Follow-up on ARROW-7261, still need to add support for FixedSizeListType in the arrow -> python conversion (arrow_to_pandas.cc) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6885) [Python] Remove superfluous skipped timedelta test
Joris Van den Bossche created ARROW-6885: Summary: [Python] Remove superfluous skipped timedelta test Key: ARROW-6885 URL: https://issues.apache.org/jira/browse/ARROW-6885 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 Now that we support timedelta / duration type, there is an old xfailed test that can be removed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7022) [Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas
Joris Van den Bossche created ARROW-7022: Summary: [Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas Key: ARROW-7022 URL: https://issues.apache.org/jira/browse/ARROW-7022 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 When someone has a custom ExtensionType defined in Python, and an array class that gets converted to that (through {{\_\_arrow_array\_\_}}), the conversion in pyarrow works with the array class, but not yet for the array stored in a pandas DataFrame. Eg using my definition of ArrowPeriodType in https://github.com/pandas-dev/pandas/pull/28371, I see: {code} In [15]: pd_array = pd.period_range("2012-01-01", periods=3, freq="D").array In [16]: pd_array Out[16]: ['2012-01-01', '2012-01-02', '2012-01-03'] Length: 3, dtype: period[D] In [17]: pa.array(pd_array) Out[17]: [ 15340, 15341, 15342 ] In [18]: df = pd.DataFrame({'periods': pd_array}) In [19]: pa.table(df) ... ArrowInvalid: ('Could not convert 2012-01-01 with type Period: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column periods with type period[D]') {code} (this is working correctly for array objects whose {{\_\_arrow_array\_\_}} is returning a built-in pyarrow Array). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7023) [Python] pa.array does not use "from_pandas" semantics for pd.Index
Joris Van den Bossche created ARROW-7023: Summary: [Python] pa.array does not use "from_pandas" semantics for pd.Index Key: ARROW-7023 URL: https://issues.apache.org/jira/browse/ARROW-7023 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 1.0.0 {code} In [15]: idx = pd.Index([1, 2, np.nan], dtype=object) In [16]: pa.array(idx) Out[16]: [ 1, 2, nan ] In [17]: pa.array(idx, from_pandas=True) Out[17]: [ 1, 2, null ] In [18]: pa.array(pd.Series(idx)) Out[18]: [ 1, 2, null ] {code} We should probably handle Series and Index the same in this regard. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6974) [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern
Joris Van den Bossche created ARROW-6974: Summary: [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern Key: ARROW-6974 URL: https://issues.apache.org/jira/browse/ARROW-6974 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Currently, the casting for time-like data is done with the {{ShiftTime}} function. It _might_ be possible to simplify this with ArrayDataVisitor (to avoid looping / checking the bitmap). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6923) [C++] Option for Filter kernel how to handle nulls in the selection vector
Joris Van den Bossche created ARROW-6923: Summary: [C++] Option for Filter kernel how to handle nulls in the selection vector Key: ARROW-6923 URL: https://issues.apache.org/jira/browse/ARROW-6923 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche How nulls are handled in the boolean mask (selection vector) in a filter kernel varies between languages / data analytics systems (e.g. base R propagates nulls, dplyr R skips (sees as False), SQL generally skips them as well I think, Julia raises an error). Currently, in Arrow C++ we "propagate" nulls (null in the selection vector gives a null in the output): {code} In [7]: arr = pa.array([1, 2, 3]) In [8]: mask = pa.array([True, False, None]) In [9]: arr.filter(mask) Out[9]: [ 1, null ] {code} Given the different ways this could be done (propagate, skip, error), should we provide an option to control this behaviour? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6922) [Python] Pandas master build is failing (MultiIndex.levels change)
Joris Van den Bossche created ARROW-6922: Summary: [Python] Pandas master build is failing (MultiIndex.levels change) Key: ARROW-6922 URL: https://issues.apache.org/jira/browse/ARROW-6922 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Fix For: 0.15.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7217) Docker compose / github actions ignores PYTHON env
Joris Van den Bossche created ARROW-7217: Summary: Docker compose / github actions ignores PYTHON env Key: ARROW-7217 URL: https://issues.apache.org/jira/browse/ARROW-7217 Project: Apache Arrow Issue Type: Test Components: CI Reporter: Joris Van den Bossche The "AMD64 Conda Python 2.7" build is actually using Python 3.6. This Python 3.6 version is hardcoded in the conda-python.dockerfile: https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24 and I am not fully sure whether or how the ENV variable overrides that. cc [~kszucs] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7218) [Python] Conversion from boolean numpy scalars not working
Joris Van den Bossche created ARROW-7218: Summary: [Python] Conversion from boolean numpy scalars not working Key: ARROW-7218 URL: https://issues.apache.org/jira/browse/ARROW-7218 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche In general, we are fine to accept a list of numpy scalars: {code} In [12]: type(list(np.array([1, 2]))[0]) Out[12]: numpy.int64 In [13]: pa.array(list(np.array([1, 2]))) Out[13]: [ 1, 2 ] {code} But for booleans, this doesn't work: {code}
In [14]: pa.array(list(np.array([True, False])))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
----> 1 pa.array(list(np.array([True, False])))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

ArrowInvalid: Could not convert True with type numpy.bool_: tried to convert to boolean
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2
Joris Van den Bossche created ARROW-7220: Summary: [CI] Docker compose (github actions) Mac Python 3 build is using Python 2 Key: ARROW-7220 URL: https://issues.apache.org/jira/browse/ARROW-7220 Project: Apache Arrow Issue Type: Test Reporter: Joris Van den Bossche The "AMD64 MacOS 10.15 Python 3" build is also running on Python 2. Possibly related to how brew installs python 2 / 3, or because it is using the system python, ... (not familiar with mac) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now that __from_arrow__ support landed in pandas
Joris Van den Bossche created ARROW-7209: Summary: [Python] tests with pandas master are failing now that __from_arrow__ support landed in pandas Key: ARROW-7209 URL: https://issues.apache.org/jira/browse/ARROW-7209 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche I implemented the pandas <-> arrow roundtrip for pandas' integer and string dtypes in https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But our tests were assuming this did not yet work in pandas, and thus need to be updated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type
Joris Van den Bossche created ARROW-7261: Summary: [Python] Python support for fixed size list type Key: ARROW-7261 URL: https://issues.apache.org/jira/browse/ARROW-7261 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is not yet exposed in Python. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet
Joris Van den Bossche created ARROW-7273: Summary: [Python] Non-nullable null field is allowed / crashes when writing to parquet Key: ARROW-7273 URL: https://issues.apache.org/jira/browse/ARROW-7273 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Joris Van den Bossche It seems to be possible to create a "non-nullable null field". While this does not make any sense (so already a reason to disallow it, I think), it can also lead to crashes in further operations, such as writing to parquet: {code} In [18]: table = pa.table([pa.array([None, None], pa.null())], schema=pa.schema([pa.field('a', pa.null(), nullable=False)])) In [19]: table Out[19]: pyarrow.Table a: null not null In [20]: pq.write_table(table, "test_null.parquet") WARNING: Logging before InitGoogleLogging() is written to STDERR F1128 14:08:30.267439 27560 column_writer.cc:837] Check failed: (nullptr) != (values) *** Check failure stack trace: *** Aborted (core dumped) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions
Joris Van den Bossche created ARROW-7167: Summary: [CI][Python] Add nightly tests for older pandas versions to Github Actions Key: ARROW-7167 URL: https://issues.apache.org/jira/browse/ARROW-7167 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6823) [C++][Python][R] Support metadata in the feather format?
Joris Van den Bossche created ARROW-6823: Summary: [C++][Python][R] Support metadata in the feather format? Key: ARROW-6823 URL: https://issues.apache.org/jira/browse/ARROW-6823 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche This might need to wait for / could be enabled by the feather v2 effort (ARROW-5510), but I thought I'd open a specific issue about it: do we want to support saving metadata in feather files? With Parquet files, you can have file-level metadata (which we currently use to e.g. store the pandas metadata). I think it would be useful to have a similar mechanism for Feather files. A use case where this came up is in GeoPandas, where we would like to store the Coordinate Reference System identifier of the geometry data inside the file, to avoid needing a sidecar file just for that. In a v2 world (using the IPC format), I suppose this could be the metadata of the Schema. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6778) [C++] Support DurationType in Cast kernel
Joris Van den Bossche created ARROW-6778: Summary: [C++] Support DurationType in Cast kernel Key: ARROW-6778 URL: https://issues.apache.org/jira/browse/ARROW-6778 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6779) [Python] Conversion from datetime.datetime to timestamp('ns') can overflow
Joris Van den Bossche created ARROW-6779: Summary: [Python] Conversion from datetime.datetime to timestamp('ns') can overflow Key: ARROW-6779 URL: https://issues.apache.org/jira/browse/ARROW-6779 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche In the python conversion of datetime scalars, there is no check for integer overflow: {code} In [32]: pa.array([datetime.datetime(3000, 1, 1)], pa.timestamp('ns')) Out[32]: [ 1830-11-23 00:50:52.580896768 ] {code} So in case the target type has nanosecond unit, this can give wrong results (I don't think the other resolutions can overflow, given the limited range of years of datetime.datetime). We should probably check for this case and raise an error. -- This message was sent by Atlassian Jira (v8.3.4#803005)
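A sketch of such an overflow check in pure Python (the helper is hypothetical, not pyarrow API): compute the exact nanosecond offset from the epoch and compare it against the signed 64-bit range:

```python
import datetime

_EPOCH = datetime.datetime(1970, 1, 1)

# Hypothetical helper: does this datetime fit in an int64 nanosecond
# timestamp? Computed with exact integer arithmetic to avoid float error.
def fits_in_ns_timestamp(dt):
    delta = dt - _EPOCH
    microseconds = (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds
    nanoseconds = microseconds * 1000
    return -2**63 <= nanoseconds <= 2**63 - 1
```

The int64 nanosecond range only reaches the year 2262, so the year 3000 from the example above would be rejected.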
[jira] [Created] (ARROW-6780) [C++][Parquet] Support DurationType in writing/reading parquet
Joris Van den Bossche created ARROW-6780: Summary: [C++][Parquet] Support DurationType in writing/reading parquet Key: ARROW-6780 URL: https://issues.apache.org/jira/browse/ARROW-6780 Project: Apache Arrow Issue Type: Improvement Reporter: Joris Van den Bossche Currently this is not supported: {code} In [37]: table = pa.table({'a': pa.array([1, 2], pa.duration('s'))}) In [39]: table Out[39]: pyarrow.Table a: duration[s] In [41]: pq.write_table(table, 'test_duration.parquet') ... ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[s] {code} There is no direct mapping to Parquet logical types. There is an INTERVAL type, but that matches Arrow's interval type (YEAR_MONTH or DAY_TIME) more closely. But those duration values could be stored as just integers, and based on the serialized arrow schema, the type could be restored when reading back in. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-6877) [C++] Boost not found from the correct environment
Joris Van den Bossche created ARROW-6877: Summary: [C++] Boost not found from the correct environment Key: ARROW-6877 URL: https://issues.apache.org/jira/browse/ARROW-6877 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche My local dev build started to fail due to cmake finding the wrong Boost (it found {{-- Found Boost 1.70.0 at /home/joris/miniconda3/lib/cmake/Boost-1.70.0}}) while building in a different conda environment. I can reproduce this by creating a new conda env from scratch following our documentation. By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7431) [Python] Add dataset API to reference docs
Joris Van den Bossche created ARROW-7431: Summary: [Python] Add dataset API to reference docs Key: ARROW-7431 URL: https://issues.apache.org/jira/browse/ARROW-7431 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Add dataset to python API docs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7432) [Python] Add higher-level datasets functions
Joris Van den Bossche created ARROW-7432: Summary: [Python] Add higher-level datasets functions Key: ARROW-7432 URL: https://issues.apache.org/jira/browse/ARROW-7432 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 1.0.0 From [~kszucs]: We need to define a more pythonic API for the dataset bindings, because the current one is pretty low-level. One option is to provide an "open_dataset" function similar to what is available in R. A short-cut to go from a Dataset to a Table might also be useful. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7430) [Python] Add more docstrings to dataset bindings
Joris Van den Bossche created ARROW-7430: Summary: [Python] Add more docstrings to dataset bindings Key: ARROW-7430 URL: https://issues.apache.org/jira/browse/ARROW-7430 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7839) [Python][Dataset] Add IPC format to python bindings
Joris Van den Bossche created ARROW-7839: Summary: [Python][Dataset] Add IPC format to python bindings Key: ARROW-7839 URL: https://issues.apache.org/jira/browse/ARROW-7839 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche The C++ / R side was done in ARROW-7415; we should add bindings for it in Python as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7963) [C++][Python][Dataset] Expose listing fragments
Joris Van den Bossche created ARROW-7963: Summary: [C++][Python][Dataset] Expose listing fragments Key: ARROW-7963 URL: https://issues.apache.org/jira/browse/ARROW-7963 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset, Python Reporter: Joris Van den Bossche Assignee: Ben Kietzman It would be useful to be able to list the fragments, to get their paths / partition expressions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault
Joris Van den Bossche created ARROW-7781: Summary: [C++][Dataset] Filtering on a non-existent column gives a segfault Key: ARROW-7781 URL: https://issues.apache.org/jira/browse/ARROW-7781 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset Reporter: Joris Van den Bossche Fix For: 1.0.0 Example with python code: {code} In [1]: import pandas as pd In [2]: df = pd.DataFrame({'a': [1, 2, 3]}) In [3]: df.to_parquet("test-filter-crash.parquet") In [4]: import pyarrow.dataset as ds In [5]: dataset = ds.dataset("test-filter-crash.parquet") In [6]: dataset.to_table(filter=ds.field('a') > 1).to_pandas() Out[6]: a 0 2 1 3 In [7]: dataset.to_table(filter=ds.field('b') > 1).to_pandas() ../src/arrow/dataset/filter.cc:929: Check failed: _s.ok() Operation failed: maybe_value.status() Bad status: Invalid: attempting to cast non-null scalar to NullScalar /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f744c)[0x7fb1390f444c] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ca)[0x7fb1390f43ca] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ec)[0x7fb1390f43ec] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7fb1390f4759] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x169fc6)[0x7fb145594fc6] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x16b9be)[0x7fb1455969be] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset15VisitExpressionINS0_23InsertImplicitCastsImplEEEDTclfp0_fp_EERKNS0_10ExpressionEOT_+0x2ae)[0x7fb1455a0dee] /home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset19InsertImplicitCastsERKNS0_10ExpressionERKNS_6SchemaE+0x44)[0x7fb145596d4e] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x48286)[0x7fb1456dd286] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x49220)[0x7fb1456de220] 
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x170f37)[0x55e5127e1f37] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x22bd6)[0x7fb1456b7bd6] /home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x33b81)[0x7fb1456c8b81] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x305)[0x55e5127d9c75] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x5460)[0x55e512847c40] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCodeEx+0x44)[0x55e512789064] /home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCode+0x1c)[0x55e51278908c] /home/joris/miniconda3/envs/arrow-dev/bin/python(+0x1e1650)[0x55e512852650] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x55e5127d9a59] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x48e4)[0x55e5128470c4] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x55e5127d99fc] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x55e5127e1fdf] 
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x4ddc)[0x55e5128475bc] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x416)[0x55e512842bf6] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x6f3)[0x55e512842ed3] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0x387)[0x55e5127d93e7] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x14e4)[0x55e512843cc4] /home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
[jira] [Created] (ARROW-7677) [C++] Handle Windows file paths with backslashes in GetTargetStats
Joris Van den Bossche created ARROW-7677: Summary: [C++] Handle Windows file paths with backslashes in GetTargetStats Key: ARROW-7677 URL: https://issues.apache.org/jira/browse/ARROW-7677 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Joris Van den Bossche Currently, if the base path passed to {{GetTargetStats}} has backslashes, the produced FileStats also include them, resulting in some other functionality (such as splitting the path) not working. -- This message was sent by Atlassian Jira (v8.3.4#803005)
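A workaround sketch on the calling side: normalize the separators before handing the path over (plain Python, not pyarrow API):

```python
# Normalize Windows-style separators to forward slashes so that
# downstream path splitting works uniformly.
def normalize_path(path):
    return path.replace("\\", "/")
```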
[jira] [Created] (ARROW-7703) [C++][Dataset] Give more informative error message for mismatching schemas for FileSystemSources
Joris Van den Bossche created ARROW-7703: Summary: [C++][Dataset] Give more informative error message for mismatching schemas for FileSystemSources Key: ARROW-7703 URL: https://issues.apache.org/jira/browse/ARROW-7703 Project: Apache Arrow Issue Type: Bug Reporter: Joris Van den Bossche Currently, if you try to create a dataset from files with different schemas, you get this error: {code} ArrowInvalid: Unable to merge: Field a has incompatible types: int64 vs int32 {code} If you are reading a directory of files, it would be very helpful if the error message could indicate which files are involved (e.g. if you have a lot of files and only one has an error). You can already inspect the schemas if you first make a SourceFactory manually, but that also only gives a list of schemas, not mapped to the original files (this last item probably relates to ARROW-7608). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches
Joris Van den Bossche created ARROW-7702: Summary: [C++][Dataset] Provide (optional) deterministic order of batches Key: ARROW-7702 URL: https://issues.apache.org/jira/browse/ARROW-7702 Project: Apache Arrow Issue Type: Bug Components: C++ - Dataset, Python Reporter: Joris Van den Bossche Example with Python: {code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)})
pq.write_table(table, "test_chunks.parquet", chunk_size=3)

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code} gives a non-deterministic result (the order of the row groups in the parquet file varies between reads): ```
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7
```
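One way to provide a deterministic order can be sketched in plain Python (names are illustrative, not the actual Arrow implementation): tag each batch with the index of the fragment or row group it came from when scan tasks are created, and sort by that tag when collecting, so results are stable regardless of the completion order of the parallel reads.

```python
# Illustrative sketch: batches arrive as (fragment_index, batch) pairs in
# whatever order the parallel scan tasks finish; sorting by the index
# restores a deterministic fragment order before concatenation.

def reorder_batches(tagged_batches):
    """Return batches sorted by their fragment index."""
    return [batch for _, batch in sorted(tagged_batches, key=lambda t: t[0])]
```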
[jira] [Created] (ARROW-7762) [Python] Exceptions in ParquetWriter get ignored
Joris Van den Bossche created ARROW-7762: Summary: [Python] Exceptions in ParquetWriter get ignored Key: ARROW-7762 URL: https://issues.apache.org/jira/browse/ARROW-7762 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche For example: {code:python}
In [43]: table = pa.table({'a': [1, 2, 3]})

In [44]: pq.write_table(table, "test.parquet", version="2.2")
---
ArrowException                            Traceback (most recent call last)
ArrowException: Unsupported Parquet format version
Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_version'
pyarrow.lib.ArrowException: Unsupported Parquet format version
{code}
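The "Exception ignored in" message indicates the error is raised in a context where it cannot propagate, so the write continues as if nothing happened. A minimal sketch of the eager-validation alternative, in plain Python (the set of accepted version strings here is an assumption for illustration, not the real API):

```python
# Hypothetical sketch: validate the requested Parquet format version up
# front, in plain Python, so an invalid value raises a normal exception
# instead of being swallowed inside a setter that cannot propagate it.
_SUPPORTED_VERSIONS = {"1.0", "2.0"}  # illustrative set, not the real API

def check_version(version):
    if version not in _SUPPORTED_VERSIONS:
        raise ValueError(f"Unsupported Parquet format version: {version!r}")
    return version
```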
[jira] [Created] (ARROW-7907) [Python] Conversion to pandas of empty table with timestamp type aborts
Joris Van den Bossche created ARROW-7907: Summary: [Python] Conversion to pandas of empty table with timestamp type aborts Key: ARROW-7907 URL: https://issues.apache.org/jira/browse/ARROW-7907 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Fix For: 0.16.1 Creating an empty table: {code}
In [1]: table = pa.table({'a': pa.array([], type=pa.timestamp('us'))})

In [2]: table['a']
Out[2]:
[
  []
]

In [3]: table.to_pandas()
Out[3]:
Empty DataFrame
Columns: [a]
Index: []
{code} the above works, but the ChunkedArray still has 1 empty chunk. When filtering data, you can actually end up with no chunks at all, and then it fails: {code}
In [4]: table2 = table.slice(0, 0)

In [5]: table2['a']
Out[5]:
[
]

In [6]: table2.to_pandas()
../src/arrow/table.cc:48:  Check failed: (chunks.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
...
Aborted (core dumped)
{code} This seems to happen specifically for the timestamp type, and specifically for a non-ns unit (e.g. us as above, which is the default in Arrow). I noticed this when reading a parquet file of the taxi dataset, where the filter I used resulted in an empty batch.
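The shape of a possible fix can be sketched in plain Python (the class below is a stand-in for the real pyarrow object, not the actual implementation): when the conversion path encounters a chunked array with zero chunks, synthesize a single empty chunk of the declared type instead of constructing a ChunkedArray from an empty vector.

```python
# Illustrative sketch: a stand-in ChunkedArray that keeps its declared
# dtype, and a guard that supplies one empty chunk when there are none,
# so downstream conversion never sees an empty chunk vector.

class ChunkedArray:
    def __init__(self, chunks, dtype):
        self.chunks = chunks  # list of chunks (lists, as a stand-in)
        self.dtype = dtype    # declared type, e.g. "timestamp[us]"

def chunks_for_conversion(chunked):
    """Return the chunks to convert, synthesizing one empty chunk if needed."""
    if not chunked.chunks:
        return [[]]  # one empty array of chunked.dtype
    return chunked.chunks
```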
[jira] [Created] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format
Joris Van den Bossche created ARROW-7854: Summary: [C++][Dataset] Option to memory map when reading IPC format Key: ARROW-7854 URL: https://issues.apache.org/jira/browse/ARROW-7854 Project: Apache Arrow Issue Type: Improvement Components: C++ - Dataset Reporter: Joris Van den Bossche For the IPC format it would be interesting to be able to memory map the IPC files. cc [~fsaintjacques] [~bkietz]
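The mechanism itself is simple; a minimal stdlib sketch of what an opt-in mmap-based reader builds on (this only illustrates memory mapping in general, not the Arrow dataset API):

```python
# Minimal stdlib sketch of memory-mapping a file for reads: an mmap-backed
# IPC reader could hand out slices of this buffer rather than copying the
# file contents into memory up front.
import mmap

def read_mapped(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + length])
```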
[jira] [Created] (ARROW-7892) [Python] Expose FilesystemSource.format attribute
Joris Van den Bossche created ARROW-7892: Summary: [Python] Expose FilesystemSource.format attribute Key: ARROW-7892 URL: https://issues.apache.org/jira/browse/ARROW-7892 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche
[jira] [Created] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type
Joris Van den Bossche created ARROW-7858: Summary: [C++][Python] Support casting an Extension type to its storage type Key: ARROW-7858 URL: https://issues.apache.org/jira/browse/ARROW-7858 Project: Apache Arrow Issue Type: Test Components: C++, Python Reporter: Joris Van den Bossche Currently, casting an extension type always fails: "No cast implemented from extension to ...". However, for casting we could fall back to the storage array's casting rules.
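The proposed fallback can be sketched in plain Python (the classes and the converter table are illustrative stand-ins, not the real pyarrow objects or cast kernels): when asked to cast an extension array, delegate to its storage array.

```python
# Illustrative sketch of the proposed fallback: casting an extension array
# recurses into its storage array, which uses the ordinary casting rules.

class ExtensionArray:
    def __init__(self, storage):
        self.storage = storage  # the underlying storage array (a list here)

def cast(array, target_type):
    if isinstance(array, ExtensionArray):
        # Fall back to the storage array's casting rules.
        return cast(array.storage, target_type)
    # Stand-in for the real cast kernels: convert each value.
    converters = {"int64": int, "float64": float, "string": str}
    return [converters[target_type](v) for v in array]
```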
[jira] [Created] (ARROW-7857) [Python] Failing test with pandas master for extension type conversion
Joris Van den Bossche created ARROW-7857: Summary: [Python] Failing test with pandas master for extension type conversion Key: ARROW-7857 URL: https://issues.apache.org/jira/browse/ARROW-7857 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche The pandas master test build has one failure: {code}
_______________ test_conversion_extensiontype_to_extensionarray _______________

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcd6c580bd0>

    def test_conversion_extensiontype_to_extensionarray(monkeypatch):
        # converting extension type to linked pandas ExtensionDtype/Array
        import pandas.core.internals as _int
        storage = pa.array([1, 2, 3, 4], pa.int64())
        arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage)
        table = pa.table({'a': arr})

        if LooseVersion(pd.__version__) < "0.26.0.dev":
            # ensure pandas Int64Dtype has the protocol method (for older pandas)
            monkeypatch.setattr(
                pd.Int64Dtype, '__from_arrow__', _Int64Dtype__from_arrow__,
                raising=False)

        # extension type points to Int64Dtype, which knows how to create a
        # pandas ExtensionArray
>       result = table.to_pandas()

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:3560:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/ipc.pxi:559: in pyarrow.lib.read_message
    ???
pyarrow/table.pxi:1369: in pyarrow.lib.Table._to_pandas
    ???
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:764: in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: in _table_to_blocks
    for item in result]
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: in <listcomp>
    for item in result]
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:723: in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/arrays/integer.py:108: in __from_arrow__
    array = array.cast(pyarrow_type)
pyarrow/table.pxi:240: in pyarrow.lib.ChunkedArray.cast
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowNotImplementedError: No cast implemented from extension to int64
{code}
[jira] [Created] (ARROW-7528) [Python] The pandas.datetime class (import of datetime.datetime) is deprecated
Joris Van den Bossche created ARROW-7528: Summary: [Python] The pandas.datetime class (import of datetime.datetime) is deprecated Key: ARROW-7528 URL: https://issues.apache.org/jira/browse/ARROW-7528 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Assignee: Joris Van den Bossche Fix For: 0.16.0 {{pd.datetime}} was actually just an import of {{datetime.datetime}}, and is being removed from pandas (the stdlib class should be used directly).
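The replacement is simply to import the class from the stdlib directly instead of going through pandas:

```python
# Instead of the deprecated pd.datetime (a re-export of the stdlib class),
# import datetime.datetime directly.
from datetime import datetime

ts = datetime(2013, 1, 1)
```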
[jira] [Created] (ARROW-7527) [Python] pandas/feather tests failing on pandas master
Joris Van den Bossche created ARROW-7527: Summary: [Python] pandas/feather tests failing on pandas master Key: ARROW-7527 URL: https://issues.apache.org/jira/browse/ARROW-7527 Project: Apache Arrow Issue Type: Test Components: Python Reporter: Joris Van den Bossche Because I merged a PR in pandas adding support for the Period dtype, some tests in pyarrow are now failing (they were using the period dtype to test "unsupported" dtypes).
[jira] [Created] (ARROW-7593) [CI][Python] Python datasets failing on master / not run on CI
Joris Van den Bossche created ARROW-7593: Summary: [CI][Python] Python datasets failing on master / not run on CI Key: ARROW-7593 URL: https://issues.apache.org/jira/browse/ARROW-7593 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche
[jira] [Created] (ARROW-7649) [Python] Expose dataset PartitioningFactory.inspect ?
Joris Van den Bossche created ARROW-7649: Summary: [Python] Expose dataset PartitioningFactory.inspect ? Key: ARROW-7649 URL: https://issues.apache.org/jira/browse/ARROW-7649 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche In C++, the PartitioningFactory has an {{Inspect}} method which, given a path, will infer the schema. We could expose this in Python as well; it could e.g. be used to easily explore or illustrate what types are inferred from a path (int32, string).
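What such an inspection does can be sketched in plain Python (the function and type names are illustrative, not the actual Arrow implementation): parse hive-style "key=value" path segments and infer a type per key.

```python
# Illustrative sketch of hive-style partitioning inference: numeric values
# are inferred as int32, everything else as string.

def infer_partition_schema(path):
    """Infer {key: type_name} from a path like 'year=2020/name=a/file.parquet'."""
    schema = {}
    for segment in path.split("/"):
        if "=" not in segment:
            continue  # not a partition segment (e.g. the file name)
        key, value = segment.split("=", 1)
        schema[key] = "int32" if value.lstrip("-").isdigit() else "string"
    return schema
```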
[jira] [Created] (ARROW-7652) [Python] Insert implicit cast in ScannerBuilder.filter
Joris Van den Bossche created ARROW-7652: Summary: [Python] Insert implicit cast in ScannerBuilder.filter Key: ARROW-7652 URL: https://issues.apache.org/jira/browse/ARROW-7652 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche