[jira] [Created] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field

2020-06-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9105:


 Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle 
filter on partition field
 Key: ARROW-9105
 URL: https://issues.apache.org/jira/browse/ARROW-9105
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When splitting a fragment into row group fragments, filtering on the partition 
field raises an error.

Python reproducer:

{code:python}
import pandas as pd

df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols="part", engine="pyarrow")

import pyarrow.dataset as ds
dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
fragment = list(dataset.get_fragments())[0]
{code}

{code}
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]: 
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-32-...> in <module>
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
{code}

This is probably a "strange" thing to do, since a fragment of a partitioned dataset already comes from a single partition (so it will always satisfy only a single equality expression). But it's still nice that, as a user, you don't have to care about passing only the relevant part of the filter down to {{split_by_row_group}}.






[jira] [Created] (ARROW-9103) [Python] Clarify behaviour of CSV reader for non-UTF8 text data

2020-06-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9103:


 Summary: [Python] Clarify behaviour of CSV reader for non-UTF8 
text data
 Key: ARROW-9103
 URL: https://issues.apache.org/jira/browse/ARROW-9103
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


See 
https://stackoverflow.com/questions/62153229/how-does-pyarrow-read-csv-handle-different-file-encodings/62321673#62321673
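
In short (per the answer there): the CSV reader assumes UTF-8 input, so text in another encoding needs to be transcoded before it reaches pyarrow. A minimal sketch of the workaround, assuming a latin-1 encoded file (the file name is hypothetical):

{code:python}
import io

import pyarrow.csv as csv

# pyarrow.csv.read_csv assumes UTF-8; transcode other encodings first
with open("data_latin1.csv", "rb") as f:  # hypothetical file
    data = f.read().decode("latin-1").encode("utf-8")

table = csv.read_csv(io.BytesIO(data))
{code}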





[jira] [Created] (ARROW-9089) [Python] A PyFileSystem handler for fsspec-based filesystems

2020-06-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9089:


 Summary: [Python] A PyFileSystem handler for fsspec-based 
filesystems
 Key: ARROW-9089
 URL: https://issues.apache.org/jira/browse/ARROW-9089
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-8766 to use this machinery to add an FSSpecHandler
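
Roughly, the goal would be usage like the following ({{FSSpecHandler}} being the class to add, so all names here are tentative):

{code:python}
import fsspec
from pyarrow.fs import PyFileSystem

# tentative: FSSpecHandler would implement the handler interface from
# ARROW-8766 on top of an arbitrary fsspec filesystem
handler = FSSpecHandler(fsspec.filesystem("memory"))
fs = PyFileSystem(handler)
fs.get_file_info("some/path")
{code}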





[jira] [Created] (ARROW-9078) [C++] Parquet writing of extension type with nested storage type fails

2020-06-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9078:


 Summary: [C++] Parquet writing of extension type with nested 
storage type fails
 Key: ARROW-9078
 URL: https://issues.apache.org/jira/browse/ARROW-9078
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


A reproducer in Python:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq


class MyStructType(pa.PyExtensionType):

    def __init__(self):
        pa.PyExtensionType.__init__(
            self, pa.struct([('left', pa.int64()), ('right', pa.int64())]))

    def __reduce__(self):
        return MyStructType, ()


struct_array = pa.StructArray.from_arrays(
    [
        pa.array([0, 1], type="int64", from_pandas=True),
        pa.array([1, 2], type="int64", from_pandas=True),
    ],
    names=["left", "right"],
)

# works
table = pa.table({'a': struct_array})
pq.write_table(table, "test_struct.parquet")

# doesn't work
mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)
table = pa.table({'a': mystruct_array})
pq.write_table(table, "test_struct.parquet")
{code}

Writing the plain StructArray works nowadays (as does reading it back in), but when the struct array is the storage array of an ExtensionType, it fails with the following error:

{code}
ArrowException: Unknown error: data type leaf_count != builder_leaf_count1 2
{code}





[jira] [Created] (ARROW-9027) [Python] Split in multiple files + clean-up pyarrow.parquet tests

2020-06-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9027:


 Summary: [Python] Split in multiple files + clean-up 
pyarrow.parquet tests
 Key: ARROW-9027
 URL: https://issues.apache.org/jira/browse/ARROW-9027
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


The current {{test_parquet.py}} file is already above 4000 lines of code, and it is becoming a bit unwieldy to work with. 
Better structuring it, and maybe splitting it into multiple files, would help (separate test files could cover tests for basic reading/writing, tests for metadata/statistics objects, and tests for multi-file datasets).







[jira] [Created] (ARROW-9021) [Python] The filesystem keyword in parquet.read_table is not documented

2020-06-03 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9021:


 Summary: [Python] The filesystem keyword in parquet.read_table is 
not documented
 Key: ARROW-9021
 URL: https://issues.apache.org/jira/browse/ARROW-9021
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


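For reference, a minimal usage sketch of the keyword in question (assuming a filesystem object is accepted for a local path):

{code:python}
import pyarrow.parquet as pq
from pyarrow.fs import LocalFileSystem

# `filesystem` is accepted by parquet.read_table but missing from the docs
table = pq.read_table("test.parquet", filesystem=LocalFileSystem())
{code}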






[jira] [Created] (ARROW-9009) [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files

2020-06-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-9009:


 Summary: [C++][Dataset] ARROW:schema should be removed from 
schema's metadata when reading Parquet files
 Key: ARROW-9009
 URL: https://issues.apache.org/jira/browse/ARROW-9009
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


When reading a parquet file (which was written by Arrow) with the datasets API, 
it preserves the "ARROW:schema" field in the metadata:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")

dataset = ds.dataset("test.parquet", format="parquet")
{code}

{code}
In [7]: dataset.schema
Out[7]: 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114

In [8]: dataset.to_table().schema
Out[8]: 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/3gQAAAKAAwABgAFAAgACgABAwAMCAAIBA' + 114
{code}

while when reading with the {{parquet}} module reader, we do not preserve this metadata:

{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]: 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

Since the "ARROW:schema" information is used to properly reconstruct the Arrow 
schema from the ParquetSchema, it is no longer needed once you already have the 
arrow schema, so it's probably not needed to keep it as metadata in the arrow 
schema.





[jira] [Created] (ARROW-8946) [Python] Add tests for parquet.write_metadata metadata_collector

2020-05-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8946:


 Summary: [Python] Add tests for parquet.write_metadata 
metadata_collector
 Key: ARROW-8946
 URL: https://issues.apache.org/jira/browse/ARROW-8946
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-8062: the PR added functionality to {{parquet.write_metadata}} to pass a collection of metadata objects to be concatenated. We should add some specific tests for this.
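
Such a test could look roughly like this (a sketch; the dataset path is arbitrary):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]})

# collect the per-file metadata while writing ...
collector = []
pq.write_to_dataset(table, "dataset_root", metadata_collector=collector)

# ... and concatenate it into a single _metadata file
pq.write_metadata(table.schema, "dataset_root/_metadata",
                  metadata_collector=collector)

meta = pq.read_metadata("dataset_root/_metadata")
assert meta.num_row_groups == sum(m.num_row_groups for m in collector)
{code}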





[jira] [Created] (ARROW-8943) [C++] Add support for Partitioning to ParquetDatasetFactory

2020-05-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8943:


 Summary: [C++] Add support for Partitioning to 
ParquetDatasetFactory
 Key: ARROW-8943
 URL: https://issues.apache.org/jira/browse/ARROW-8943
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Follow-up on ARROW-8062: the ParquetDatasetFactory currently does not yet 
support specifying a Partitioning / inferring with a PartitioningFactory.





[jira] [Created] (ARROW-8860) [C++] Compressed Feather file with struct array roundtrips incorrectly

2020-05-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8860:


 Summary: [C++] Compressed Feather file with struct array 
roundtrips incorrectly
 Key: ARROW-8860
 URL: https://issues.apache.org/jira/browse/ARROW-8860
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


When writing a table with a struct-typed column, it is read back with garbage values when using compression (which is the default):

{code:python}
>>> table = pa.table({'col': pa.StructArray.from_arrays(
...     [[0, 1, 2], [1, 2, 3]], names=["f1", "f2"])})
>>> table.column("col")
<pyarrow.lib.ChunkedArray object at 0x...>
[
  -- is_valid: all not null
  -- child 0 type: int64
[
  0,
  1,
  2
]
  -- child 1 type: int64
[
  1,
  2,
  3
]
]

# roundtrip through feather
>>> feather.write_feather(table, "test_struct.feather")
>>> table2 = feather.read_table("test_struct.feather")

>>> table2.column("col")
<pyarrow.lib.ChunkedArray object at 0x...>
[
  -- is_valid: all not null
  -- child 0 type: int64
[
  24,
  1261641627085906436,
  1369095386551025664
]
  -- child 1 type: int64
[
  24,
  1405756815161762308,
  281479842103296
]
]
{code}

When not using compression, it is read back correctly:

{code:python}
>>> feather.write_feather(table, "test_struct.feather", compression="uncompressed")
>>> table2 = feather.read_table("test_struct.feather")
>>> table2.column("col")
<pyarrow.lib.ChunkedArray object at 0x...>
[
  -- is_valid: all not null
  -- child 0 type: int64
[
  0,
  1,
  2
]
  -- child 1 type: int64
[
  1,
  2,
  3
]
]
{code}






[jira] [Created] (ARROW-8802) [C++][Dataset] Schema metadata are lost when reading a subset of columns

2020-05-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8802:


 Summary: [C++][Dataset] Schema metadata are lost when reading a 
subset of columns
 Key: ARROW-8802
 URL: https://issues.apache.org/jira/browse/ARROW-8802
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Python example:

{code}
import pandas as pd
import pyarrow.dataset as ds

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_parquet("test_metadata.parquet")

dataset = ds.dataset("test_metadata.parquet")
{code}

gives:
{code}
>>> dataset.to_table().schema 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 397
ARROW:schema: '/4ACAAAQAAAKAA4ABgAFAAgACgABAwAQAAAKAAwAAA' + 806

>>> dataset.to_table(columns=['a']).schema 
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

So when specifying a subset of the columns, the additional metadata entries are lost (while those can still be informative, e.g. for the conversion to pandas).





[jira] [Created] (ARROW-8799) [C++][Dataset] Reading list column as nested dictionary segfaults

2020-05-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8799:


 Summary: [C++][Dataset] Reading list column as nested dictionary 
segfaults
 Key: ARROW-8799
 URL: https://issues.apache.org/jira/browse/ARROW-8799
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Python example:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.tests import util

repeats = 10
nunique = 5

data = [
    [[util.rands(10)] for i in range(nunique)] * repeats,
]
table = pa.table(data, names=['f0'])

pq.write_table(table, "test_dictionary.parquet")
{code}

Reading with the parquet code works:

{code}
>>> pq.read_table("test_dictionary.parquet", read_dictionary=['f0.list.item'])
pyarrow.Table
f0: list<item: dictionary<values=string, indices=int32, ordered=0>>
  child 0, item: dictionary<values=string, indices=int32, ordered=0>
{code}

but doing the same with the datasets API segfaults:

{code}
>>> import pyarrow.dataset as ds
>>> fmt = ds.ParquetFileFormat(read_options=dict(dictionary_columns=["f0.list.item"]))
>>> dataset = ds.dataset("test_dictionary.parquet", format=fmt)
>>> dataset.to_table()
Segmentation fault (core dumped)
{code}





[jira] [Created] (ARROW-8780) [Python] A fsspec-compatible wrapper for pyarrow.fs filesystems

2020-05-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8780:


 Summary: [Python] A fsspec-compatible wrapper for pyarrow.fs 
filesystems
 Key: ARROW-8780
 URL: https://issues.apache.org/jira/browse/ARROW-8780
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The new {{pyarrow.fs}} FileSystem objects have a limited Python API (currently mimicking the C++ API).

In Python, [fsspec|https://filesystem-spec.readthedocs.io/en/latest] defines a common API for a variety of filesystem implementations. 

We could try to implement an fsspec-compatible class wrapping the {{pyarrow.fs}} native filesystems. Such a class would provide the methods expected by fsspec, and implement those using the actual {{pyarrow.fs.FileSystem}} under the hood.

This might be mainly useful for two use cases (see the sketch after this list):

- {{pyarrow.fs}} filesystems can be used in settings that expect an fsspec-compatible filesystem object
- it provides a way to have a "richer" API around our {{pyarrow.fs}} filesystems (which has been requested before, cfr ARROW-7584), without expanding the core filesystem objects
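
A rough sketch of what such a class could look like (method names on the arrow side follow the current {{pyarrow.fs}} API; the fsspec contract is simplified to two methods here):

{code:python}
import fsspec
from pyarrow.fs import FileSelector, FileType


class ArrowFSWrapper(fsspec.AbstractFileSystem):
    """Sketch: expose a pyarrow.fs.FileSystem through the fsspec API."""

    def __init__(self, arrow_fs, **kwargs):
        super().__init__(**kwargs)
        self.arrow_fs = arrow_fs

    def ls(self, path, detail=True, **kwargs):
        infos = self.arrow_fs.get_file_info(FileSelector(path))
        if not detail:
            return [info.path for info in infos]
        return [
            {"name": info.path,
             "size": info.size,
             "type": "directory" if info.type == FileType.Directory else "file"}
            for info in infos
        ]

    def _open(self, path, mode="rb", **kwargs):
        if mode == "rb":
            return self.arrow_fs.open_input_stream(path)
        if mode == "wb":
            return self.arrow_fs.open_output_stream(path)
        raise NotImplementedError(mode)
{code}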





[jira] [Created] (ARROW-8766) [Python] A FileSystem implementation based on Python callbacks

2020-05-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8766:


 Summary: [Python] A FileSystem implementation based on Python 
callbacks
 Key: ARROW-8766
 URL: https://issues.apache.org/jira/browse/ARROW-8766
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The new {{pyarrow.fs}} filesystems are now actual C++ objects, and no longer "just" a Python interface. So they can't easily be extended from the Python side, and the existing integration with {{fsspec}} filesystems therefore no longer works. 

One possible solution is to have a C++ filesystem that calls back into a Python object for each of its methods (possibly similar to how you can implement a Flight server in Python, I suppose). 

Such a FileSystem implementation would make it possible to write a {{pyarrow.fs}} wrapper for {{fsspec}} filesystems, and thus allow such filesystems to be used in pyarrow wherever the new filesystems are expected.
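
The Python-facing side could then look roughly like this (purely a hypothetical sketch; all names are made up for illustration):

{code:python}
# hypothetical: a C++ filesystem implementation would hold a reference
# to a handler object like this and call back into it for each operation
class FSSpecHandler:

    def __init__(self, fsspec_fs):
        self.fs = fsspec_fs

    def get_file_info(self, paths):
        # would translate fsspec's info dicts to pyarrow FileInfo objects
        return [self.fs.info(path) for path in paths]

    def open_input_stream(self, path):
        return self.fs.open(path, mode="rb")

    def open_output_stream(self, path):
        return self.fs.open(path, mode="wb")
{code}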





[jira] [Created] (ARROW-8733) [C++][Dataset][Python] ParquetFileFragment should provide access to parquet FileMetadata

2020-05-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8733:


 Summary: [C++][Dataset][Python] ParquetFileFragment should provide 
access to parquet FileMetadata
 Key: ARROW-8733
 URL: https://issues.apache.org/jira/browse/ARROW-8733
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Related to ARROW-8062 (as there we will also need a way to expose the global FileMetadata). But independently, it would be useful to get access to the FileMetadata on each {{ParquetFileFragment}} (e.g. to get access to the statistics).

This would be relatively simple to code on the Python/R side: since we have access to the file path, we could read the metadata from the file backing the fragment and return it as a FileMetadata object.
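
On the Python side, that could be as small as the following (a sketch; the {{metadata}} attribute name is just a suggestion):

{code:python}
import pyarrow.parquet as pq

# sketch of a ParquetFileFragment.metadata property: read the
# FileMetadata from the parquet file backing the fragment
@property
def metadata(self):
    return pq.read_metadata(self.path)
{code}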

I am wondering if we want to integrate this with ARROW-8062: when the fragments were created from a {{_metadata}} file, a {{ParquetFileFragment.metadata}} attribute would not need to read it from the parquet file, but could take it from the global metadata (at least for e.g. the row group data).

Another question: what should this return for a ParquetFileFragment that maps to a single row group?





[jira] [Created] (ARROW-8729) [C++][Dataset] Only selecting a partition column results in empty table

2020-05-07 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8729:


 Summary: [C++][Dataset] Only selecting a partition column results 
in empty table
 Key: ARROW-8729
 URL: https://issues.apache.org/jira/browse/ARROW-8729
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Python reproducer:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
path = "test_dataset"

table = pa.table({'part': ['a', 'a', 'b', 'b'], 'col': [1, 2, 3, 4]})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}

gives

{code}
In [38]: ds.dataset(str(path), partitioning="hive").to_table().num_rows
Out[38]: 4

In [39]: ds.dataset(str(path), partitioning="hive").to_table(columns=["part"]).num_rows
Out[39]: 0
{code}

The schema correctly only includes the "part" column, but there are no rows.

cc [~bkietz]





[jira] [Created] (ARROW-8693) [Python] Dataset.get_fragments is missing an implicit cast when filtering

2020-05-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8693:


 Summary: [Python] Dataset.get_fragments is missing an implicit 
cast when filtering
 Key: ARROW-8693
 URL: https://issues.apache.org/jira/browse/ARROW-8693
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


This currently segfaults:

{code}
dataset.get_fragments(filter=ds.field("col") > 1)
{code}

in case "col" is not int64 (like default inferred partition columns are int32)





[jira] [Created] (ARROW-8690) [Python] Clean-up dataset+parquet tests now order is deterministic

2020-05-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8690:


 Summary: [Python] Clean-up dataset+parquet tests now order is deterministic
 Key: ARROW-8690
 URL: https://issues.apache.org/jira/browse/ARROW-8690
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Follow-up on ARROW-8447: we should now be able to clean up some tests.






[jira] [Created] (ARROW-8655) [C++][Dataset][Python][R] Preserve partitioning information for a discovered Dataset

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8655:


 Summary: [C++][Dataset][Python][R] Preserve partitioning 
information for a discovered Dataset
 Key: ARROW-8655
 URL: https://issues.apache.org/jira/browse/ARROW-8655
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Currently, we have the {{HivePartitioning}} and {{DirectoryPartitioning}} classes that describe a partitioning used in the discovery phase. But once a dataset object is created, it doesn't know anything about this any more; it just has partition expressions for the fragments. And the partition keys are added to the schema, but you can't directly tell which columns of the schema originated from the partitions.

However, there are use cases where it would be useful for a dataset to still "know" from what kind of partitioning it was created:

- The "read CSV, write back Parquet" use case, where the CSV was already partitioned and you want to automatically preserve that partitioning for Parquet (kind of roundtripping the partitioning on read/write)
- Converting the dataset to another representation, e.g. to pandas: it can be useful to know which columns were partition columns (for pandas, those columns might be good candidates to set as the index of the pandas/dask DataFrame). I can imagine conversions to other systems could use similar information.






[jira] [Created] (ARROW-8652) [Python] Test error message when discovering dataset with invalid files

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8652:


 Summary: [Python] Test error message when discovering dataset with 
invalid files
 Key: ARROW-8652
 URL: https://issues.apache.org/jira/browse/ARROW-8652
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


There is a comment in test_parquet.py about the Dataset API needing a better error message for invalid files:

https://github.com/apache/arrow/blob/ff92a6886ca77515173a50662a1949a792881222/python/pyarrow/tests/test_parquet.py#L3633-L3648

Although, this seems to work now:

{code}
import tempfile
import pathlib
import pyarrow.dataset as ds

tempdir = pathlib.Path(tempfile.mkdtemp())

with open(str(tempdir / "data.parquet"), 'wb') as f:
    pass

In [10]: ds.dataset(str(tempdir / "data.parquet"), format="parquet")
...
OSError: Could not open parquet input source '/tmp/tmp312vtjmw/data.parquet': Invalid: Parquet file size is 0 bytes
{code}

So we need to update the test to actually test this instead of skipping it.

The only difference from the python ParquetDataset implementation is that the datasets API raises an OSError instead of an ArrowInvalid error.





[jira] [Created] (ARROW-8651) [Python][Dataset] Support pickling of Dataset objects

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8651:


 Summary: [Python][Dataset] Support pickling of Dataset objects
 Key: ARROW-8651
 URL: https://issues.apache.org/jira/browse/ARROW-8651
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


We already made several parts of a Dataset serializable (the formats, the expressions, the filesystems). With those, it should also be possible to pickle FileFragments, and with that also Dataset.
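
Once that works, a roundtrip check could be as simple as (a sketch):

{code:python}
import pickle

import pyarrow.dataset as ds

dataset = ds.dataset("test.parquet", format="parquet")
restored = pickle.loads(pickle.dumps(dataset))
assert restored.schema.equals(dataset.schema)
{code}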





[jira] [Created] (ARROW-8647) [C++][Dataset] Optionally encode partition field values as dictionary type

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8647:


 Summary: [C++][Dataset] Optionally encode partition field values 
as dictionary type
 Key: ARROW-8647
 URL: https://issues.apache.org/jira/browse/ARROW-8647
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


In the Python ParquetDataset implementation, the partition fields are returned 
as dictionary type columns. 

In the new Dataset API, we now use a plain type (integer or string when inferred). But you can already manually specify that the partition keys should be of dictionary type, by specifying the partitioning schema (in the {{Partitioning}} passed to the dataset factory).
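
For example, with the existing Python API (the dataset path is arbitrary):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# explicitly declare the partition field as dictionary type
partitioning = ds.partitioning(
    pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))]),
    flavor="hive")
dataset = ds.dataset("test_dataset", partitioning=partitioning)
{code}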

Since using dictionary type can be more efficient (partition keys will typically be repeated values in the resulting table), it might be good to still have an option in the DatasetFactory to use dictionary types for the partition fields.





[jira] [Created] (ARROW-8644) [Python] Dask integration tests failing due to change in not including partition columns

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8644:


 Summary: [Python] Dask integration tests failing due to change in 
not including partition columns
 Key: ARROW-8644
 URL: https://issues.apache.org/jira/browse/ARROW-8644
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


In ARROW-3861 (https://github.com/apache/arrow/pull/7050), I "fixed" a bug where the partition columns were always included, even when the user did a manual column selection.

But apparently this behaviour was being relied upon by dask. See the failing nightly integration tests: 
https://circleci.com/gh/ursa-labs/crossbow/11854

So the best option might be to just keep the "old" behaviour for the legacy ParquetDataset; when using the new datasets API ({{use_legacy_datasets=False}}), you get the new / correct behaviour.





[jira] [Created] (ARROW-8643) [Python] Tests with pandas master failing due to freq assertion

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8643:


 Summary: [Python] Tests with pandas master failing due to freq 
assertion 
 Key: ARROW-8643
 URL: https://issues.apache.org/jira/browse/ARROW-8643
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Nightly pandas master tests are failing, e.g. 
https://circleci.com/gh/ursa-labs/crossbow/11858

This is caused by a change in pandas, see 
https://github.com/pandas-dev/pandas/pull/33815#issuecomment-620820134





[jira] [Created] (ARROW-8641) [Python] Regression in feather: no longer supports permutation in column selection

2020-04-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8641:


 Summary: [Python] Regression in feather: no longer supports 
permutation in column selection
 Key: ARROW-8641
 URL: https://issues.apache.org/jira/browse/ARROW-8641
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


A quite annoying regression (original report from https://github.com/pandas-dev/pandas/issues/33878): when specifying {{columns}} to read, this now fails if the order of the columns is not exactly the same as in the file:

{code:python}
In [27]: table = pa.table([[1, 2, 3], [4, 5, 6], [7, 8, 9]], names=['a', 'b', 'c'])

In [29]: from pyarrow import feather

In [30]: feather.write_feather(table, "test.feather")

# this works fine
In [32]: feather.read_table("test.feather", columns=['a', 'b'])
Out[32]: 
pyarrow.Table
a: int64
b: int64

In [33]: feather.read_table("test.feather", columns=['b', 'a'])
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-33-...> in <module>
----> 1 feather.read_table("test.feather", columns=['b', 'a'])

~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map)
    237         return reader.read_indices(columns)
    238     elif all(map(lambda t: t == str, column_types)):
--> 239         return reader.read_names(columns)
    240 
    241     column_type_names = [t.__name__ for t in column_types]

~/scipy/repos/arrow/python/pyarrow/feather.pxi in pyarrow.lib.FeatherReader.read_names()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Schema at index 0 was different: 
b: int64
a: int64
vs
a: int64
b: int64
{code}





[jira] [Created] (ARROW-8613) [C++][Dataset] Raise error for unparsable partition value

2020-04-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8613:


 Summary: [C++][Dataset] Raise error for unparsable partition value
 Key: ARROW-8613
 URL: https://issues.apache.org/jira/browse/ARROW-8613
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Currently, when specifying a partitioning schema where one of the partition field values cannot be parsed according to the specified type, you silently get null values for that partition field.

Python example:
{code:python}
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

path = pathlib.Path(".") / "dataset_partition_schema_errors"
path.mkdir(exist_ok=True)

table = pa.table({"part": ["1_2", "1_2", "3_4", "3_4"], "values": range(4)})
pq.write_to_dataset(table, str(path), partition_cols=["part"])
{code}
{code}
In [17]: ds.dataset(path, partitioning="hive").to_table().to_pandas()
Out[17]: 
   values part
0       0  1_2
1       1  1_2
2       2  3_4
3       3  3_4

In [18]: partitioning = ds.partitioning(pa.schema([("part", pa.int64())]), flavor="hive")

In [19]: ds.dataset(path, partitioning=partitioning).to_table().to_pandas()
Out[19]: 
   values  part
0       0   NaN
1       1   NaN
2       2   NaN
3       3   NaN
{code}

Silently ignoring such a parse error doesn't seem like the best default to me (since partition keys are quite essential). I think raising an error might be better.





[jira] [Created] (ARROW-8446) [Python][Dataset] Detect and use _metadata file in a list of file paths

2020-04-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8446:


 Summary: [Python][Dataset] Detect and use _metadata file in a list 
of file paths
 Key: ARROW-8446
 URL: https://issues.apache.org/jira/browse/ARROW-8446
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


>From https://github.com/dask/dask/pull/6047#discussion_r402391318

When specifying a directory to {{ParquetDataset}}, we will detect if a 
{{_metadata}} file is present in the directory and use that to populate the 
{{metadata}} attribute (and not include this file in the list of "pieces", 
since it does not include any data).
 
However, when passing a list of files to {{ParquetDataset}}, with one of them being "_metadata", the metadata attribute is not populated, and the "_metadata" path is included as one of the ParquetDatasetPiece objects instead (which leads to an ArrowIOError when reading that piece).

We _could_ detect it in a list of paths as well.

Note: I mentioned {{ParquetDataset}}, but if working on this, we should probably directly do it in the datasets-API-based version.  
Also, I labeled this as Python and not C++ for now, as this might be something that can be handled on the Python side (once the C++ side knows how to process this kind of metadata -> ARROW-8062).





[jira] [Created] (ARROW-8442) [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy

2020-04-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8442:


 Summary: [Python] NullType.to_pandas_dtype inconsistent with dtype returned in to_pandas/to_numpy
 Key: ARROW-8442
 URL: https://issues.apache.org/jira/browse/ARROW-8442
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


There is this behaviour of {{to_pandas_dtype}} returning float, while all 
actual conversions to numpy or pandas use object dtype:

{code}
In [23]: pa.null().to_pandas_dtype()
Out[23]: numpy.float64

In [24]: pa.array([], pa.null()).to_pandas()
Out[24]: Series([], dtype: object)

In [25]: pa.array([], pa.null()).to_numpy(zero_copy_only=False)
Out[25]: array([], dtype=object)
{code}

So we should probably fix {{NullType.to_pandas_dtype}} to return object, which is what is used in practice.





[jira] [Created] (ARROW-8439) [Python] Filesystem docs are outdated

2020-04-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8439:


 Summary: [Python] Filesystem docs are outdated
 Key: ARROW-8439
 URL: https://issues.apache.org/jira/browse/ARROW-8439
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0








[jira] [Created] (ARROW-8427) [C++][Dataset] Do not ignore file paths with underscore/dot when full path was specified

2020-04-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8427:


 Summary: [C++][Dataset] Do not ignore file paths with 
underscore/dot when full path was specified
 Key: ARROW-8427
 URL: https://issues.apache.org/jira/browse/ARROW-8427
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Currently, when passing a list of file paths to FileSystemDatasetFactory, files that have a parent directory starting with an underscore or dot are skipped. Since the file paths were passed as an explicit list, we should maybe not skip them.

For comparison, when specifying a directory (Selector), only child directories are checked for skipping, not parent directories.





[jira] [Created] (ARROW-8416) [Python] Provide a "feather" alias in the dataset API

2020-04-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8416:


 Summary: [Python] Provide a "feather" alias in the dataset API
 Key: ARROW-8416
 URL: https://issues.apache.org/jira/browse/ARROW-8416
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


I don't know what the plans are on the C++ side (ARROW-7586), but for 0.17 I think it would be nice if we could at least support {{ds.dataset(..., format="feather")}} (instead of needing to tell people to use {{ds.dataset(..., format="ipc")}} to read feather files).





[jira] [Created] (ARROW-8414) [Python] Non-deterministic row order failure in test_parquet.py

2020-04-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8414:


 Summary: [Python] Non-deterministic row order failure in 
test_parquet.py
 Key: ARROW-8414
 URL: https://issues.apache.org/jira/browse/ARROW-8414
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0








[jira] [Created] (ARROW-8345) [Python] feather.read_table should not require pandas

2020-04-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8345:


 Summary: [Python] feather.read_table should not require pandas
 Key: ARROW-8345
 URL: https://issues.apache.org/jira/browse/ARROW-8345
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0


We still check the pandas version, while pandas is not actually needed. Will do 
a quick fix.





[jira] [Created] (ARROW-8342) [Python] dask and kartothek integration tests are failing

2020-04-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8342:


 Summary: [Python] dask and kartothek integration tests are failing
 Key: ARROW-8342
 URL: https://issues.apache.org/jira/browse/ARROW-8342
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


The integration tests for both dask and kartothek, for both master and the latest released version of each, started failing in the last days.

Dask latest: 
https://circleci.com/gh/ursa-labs/crossbow/10629

Kartothek latest: 
https://circleci.com/gh/ursa-labs/crossbow/10604

I think both are related to the KeyValueMetadata changes (ARROW-8079).

The kartothek one is clearly related, as it gives: TypeError: 'pyarrow.lib.KeyValueMetadata' object does not support item assignment

And I think the dask one is related to the "pandas" key now being present twice, and therefore it is using the "wrong" one.






[jira] [Created] (ARROW-8314) [Python] Provide a method to select a subset of columns of a Table

2020-04-02 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8314:


 Summary: [Python] Provide a method to select a subset of columns 
of a Table
 Key: ARROW-8314
 URL: https://issues.apache.org/jira/browse/ARROW-8314
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Joris Van den Bossche


I looked through the open issues and our API, but didn't directly find anything about selecting a subset of the columns of a table.

Assume you have a table like:

{code}
table = pa.table({'a': [1, 2], 'b': [.1, .2], 'c': ['a', 'b']})
{code}

You can select a single column with {{table.column('a')}} or {{table['a']}} to get a chunked array. You can add, append, remove and replace columns (with {{add_column}}, {{append_column}}, {{remove_column}}, {{set_column}}). 
But an easy way to get a subset of the columns (without manually removing the ones you don't want one by one) doesn't seem possible. 

I would propose something like:

{code}
table.select(['a', 'c'])
{code}
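
In the meantime, this can be emulated with the existing API (a sketch):

{code:python}
import pyarrow as pa

table = pa.table({'a': [1, 2], 'b': [.1, .2], 'c': ['a', 'b']})

# rebuild a table from the selected columns; pa.table() accepts the
# ChunkedArrays returned by table[name]
def select(table, names):
    return pa.table({name: table[name] for name in names})

subset = select(table, ['a', 'c'])
{code}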





[jira] [Created] (ARROW-8292) [Python][Dataset] Passthrough schema to Factory.finish() in dataset() function

2020-03-31 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8292:


 Summary: [Python][Dataset] Passthrough schema to Factory.finish() 
in dataset() function
 Key: ARROW-8292
 URL: https://issues.apache.org/jira/browse/ARROW-8292
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Reporter: Joris Van den Bossche


This would already be a very simple fix to allow manually specifying the schema, without exposing any other options.





[jira] [Created] (ARROW-8290) [Python][Dataset] Improve ergonomy of the FileSystemDataset constructor

2020-03-31 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8290:


 Summary: [Python][Dataset] Improve ergonomy of the 
FileSystemDataset constructor
 Key: ARROW-8290
 URL: https://issues.apache.org/jira/browse/ARROW-8290
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Currently, to manually create a FileSystemDataset, you can do something like:

{code}
dataset = ds.FileSystemDataset(
schema, None, ds.ParquetFileFormat(), pa.fs.LocalFileSystem(),
["data_file1.parquet", "data_file2.parquet"],
[ds.field('file') == 1, ds.field('file') == 2])
{code}

There are some usability improvements we could make though:

- Allow passing the arguments by name to improve readability of the calling code (now they all need to be passed positionally, due to the way they are implemented in cython as {{not None}})
- I would maybe change the order of the arguments (e.g. start with the paths; we don't need to match the order of the C++ constructor)
- Potentially allow {{partitions}} to be optional, in which case it would be set to a list of ScalarExpression(True) values.





[jira] [Created] (ARROW-8286) [Python] Creating dataset from pathlib results in UnionDataset instead of FileSystemDataset

2020-03-31 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8286:


 Summary: [Python] Creating dataset from pathlib results in 
UnionDataset instead of FileSystemDataset
 Key: ARROW-8286
 URL: https://issues.apache.org/jira/browse/ARROW-8286
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.17.0


{code}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': np.random.randn(10), 'b': range(10), 'c': ['a', 'b'] * 5})
pq.write_table(table, "test.parquet")

import pathlib

ds.dataset(pathlib.Path("./test.parquet"))
# gives UnionDataset

ds.dataset(str(pathlib.Path("./test.parquet")))
# correctly gives FileSystemDataset
{code}

and since those two dataset classes have different APIs, it is important that this gives a FileSystemDataset





[jira] [Created] (ARROW-8276) [C++][Dataset] Scanning a Fragment does not take into account the partition columns

2020-03-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8276:


 Summary: [C++][Dataset] Scanning a Fragment does not take into account the partition columns
 Key: ARROW-8276
 URL: https://issues.apache.org/jira/browse/ARROW-8276
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, C++ - Dataset
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Follow-up on ARROW-8061, the {{to_table}} method doesn't work for fragments 
created from a partitioned dataset.

(will add a reproducer later)

cc [~bkietz]





[jira] [Created] (ARROW-8220) [Python] Make dataset FileFormat objects serializable

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8220:


 Summary: [Python] Make dataset FileFormat objects serializable
 Key: ARROW-8220
 URL: https://issues.apache.org/jira/browse/ARROW-8220
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Similar to ARROW-8060, ARROW-8059, also the FileFormats need to be pickleable.





[jira] [Created] (ARROW-8213) [Python][Dataset] Opening a dataset with a local incorrect path gives confusing error message

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8213:


 Summary: [Python][Dataset] Opening a dataset with a local incorrect path gives confusing error message
 Key: ARROW-8213
 URL: https://issues.apache.org/jira/browse/ARROW-8213
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Even after the previous PRs related to local paths (https://github.com/apache/arrow/pull/6643, https://github.com/apache/arrow/pull/6655), I don't find the user experience optimal in case you are working with local files and pass a wrong, non-existent path (e.g. due to a typo).

Currently, you get this error:

{code}
>>> dataset = ds.dataset("data_with_typo.parquet", format="parquet")
...
ArrowInvalid: URI has empty scheme: 'data_with_typo.parquet'
{code}

where "URI has empty scheme" is rather confusing for the user in case of a 
non-existent path.  I think ideally we should raise a "No such file or 
directory" error.

I am not fully sure what the best solution is, as {{FileSystem.from_uri}} can also give other errors that we do want to propagate to the user. 
The most straightforward option I can think of is checking whether "URI has empty scheme" is in the error message and then rewording it, but that's not very clean ..





[jira] [Created] (ARROW-8209) [Python] Accessing duplicate column of Table by name gives wrong error

2020-03-25 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8209:


 Summary: [Python] Accessing duplicate column of Table by name 
gives wrong error
 Key: ARROW-8209
 URL: https://issues.apache.org/jira/browse/ARROW-8209
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


When you have a table with duplicate column names and you try to access this 
column, you get an error about the column not existing:

{code}
>>> table = pa.table([pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([7, 8, 9])], names=['a', 'b', 'a'])
>>> table.column('a')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input> in <module>
----> 1 table.column('a')

~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.column()

KeyError: 'Column a does not exist in table'
{code}

It should rather give an error message about the column name being a duplicate.





[jira] [Created] (ARROW-8196) [Python] Empty table creation from schema with nested dictionary type

2020-03-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8196:


 Summary: [Python] Empty table creation from schema with nested 
dictionary type
 Key: ARROW-8196
 URL: https://issues.apache.org/jira/browse/ARROW-8196
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-6872 / https://github.com/apache/arrow/pull/6698: creating an empty table from a schema in python ({{Schema.empty_table()}}) still fails with a nested dictionary type (e.g. a list of dictionary type).
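
A minimal reproducer (a sketch; the exact types involved may vary):

{code:python}
import pyarrow as pa

# works: flat dictionary type
pa.schema([('a', pa.dictionary(pa.int8(), pa.string()))]).empty_table()

# still fails: dictionary type nested in a list
schema = pa.schema([('a', pa.list_(pa.dictionary(pa.int8(), pa.string())))])
schema.empty_table()
{code}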





[jira] [Created] (ARROW-8186) [Python] Dataset expression != returns bool instead of expression for invalid value

2020-03-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8186:


 Summary: [Python] Dataset expression != returns bool instead of 
expression for invalid value
 Key: ARROW-8186
 URL: https://issues.apache.org/jira/browse/ARROW-8186
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


It's a bit of a strange case, but e.g. when comparing with a set like {{ {3} }}, you get a boolean result instead of an expression:

{code}
In [8]: ds.field('col') != 3
Out[8]: <pyarrow.dataset.Expression ...>

In [9]: ds.field('col') != {3}
Out[9]: True
{code}





[jira] [Created] (ARROW-8136) [C++][Python] Creating dataset from relative path no longer working

2020-03-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8136:


 Summary: [C++][Python] Creating dataset from relative path no 
longer working
 Key: ARROW-8136
 URL: https://issues.apache.org/jira/browse/ARROW-8136
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Since https://github.com/apache/arrow/pull/6597, local relative paths no longer work:

{code}
In [1]: import pyarrow.dataset as ds

In [2]: ds.dataset("test.parquet")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-2-...> in <module>
----> 1 ds.dataset("test.parquet")

~/scipy/repos/arrow/python/pyarrow/dataset.py in dataset(paths_or_factories, filesystem, partitioning, format)
    327 
    328     if isinstance(paths_or_factories, str):
--> 329         return factory(paths_or_factories, **kwargs).finish()
    330 
    331     if not isinstance(paths_or_factories, list):

~/scipy/repos/arrow/python/pyarrow/dataset.py in factory(path_or_paths, filesystem, partitioning, format)
    246     factories = []
    247     for path in path_or_paths:
--> 248         fs, paths_or_selector = _ensure_fs_and_paths(path, filesystem)
    249         factories.append(FileSystemDatasetFactory(fs, paths_or_selector,
    250                                                   format, options))

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs_and_paths(path, filesystem)
    165     from pyarrow.fs import FileType, FileSelector
    166 
--> 167     filesystem, path = _ensure_fs(filesystem, _stringify_path(path))
    168     infos = filesystem.get_target_infos([path])[0]
    169     if infos.type == FileType.Directory:

~/scipy/repos/arrow/python/pyarrow/dataset.py in _ensure_fs(filesystem, path)
    158     if filesystem is not None:
    159         return filesystem, path
--> 160     return FileSystem.from_uri(path)
    161 
    162 

~/scipy/repos/arrow/python/pyarrow/_fs.pyx in pyarrow._fs.FileSystem.from_uri()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: URI has empty scheme: 'test.parquet'
{code}

[~apitrou] Is this something that should be fixed in {{FileSystemFromUriOrPath}}, or rather on the python side? ({{FileSystem.from_uri}} ensures it gets the absolute path for pathlib objects, but not for strings.)





[jira] [Created] (ARROW-8088) [C++][Dataset] Partition columns with specified dictionary type result in all nulls

2020-03-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8088:


 Summary: [C++][Dataset] Partition columns with specified 
dictionary type result in all nulls
 Key: ARROW-8088
 URL: https://issues.apache.org/jira/browse/ARROW-8088
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


When specifying an explicit schema for the Partitioning, and when using a 
dictionary type, the materialization of the partition keys goes wrong: you 
don't get an error, but you get columns with all nulls.

Python example:

{code}
foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

df = pd.DataFrame({
'foo': np.array(foo_keys, dtype='i4').repeat(15),
'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
'values': np.random.randn(N)
})

pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
{code}

When reading with discovery, all is fine:

{code}
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().schema
values: double
bar: string
foo: int32
>>> ds.dataset("test_order", format="parquet", partitioning="hive").to_table().to_pandas().head(2)
     values bar  foo
0  2.505903   a    0
1 -1.760135   a    0
{code}

But when specifying the partition columns to be dictionary type with explicit 
{{HivePartitioning}}, you get no error but all null values:

{code}
>>> partitioning = ds.HivePartitioning(pa.schema([
...     ("foo", pa.dictionary(pa.int32(), pa.int64())),
...     ("bar", pa.dictionary(pa.int32(), pa.string()))
... ]))
>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().schema
values: double
foo: dictionary<values=int64, indices=int32, ordered=0>
bar: dictionary<values=string, indices=int32, ordered=0>
>>> ds.dataset("test_order", format="parquet", partitioning=partitioning).to_table().to_pandas().head(2)
     values  foo  bar
0  2.505903  NaN  NaN
1 -1.760135  NaN  NaN
{code}





[jira] [Created] (ARROW-8087) [C++][Dataset] Order of keys with HivePartitioning is lost in resulting schema

2020-03-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8087:


 Summary: [C++][Dataset] Order of keys with HivePartitioning is 
lost in resulting schema
 Key: ARROW-8087
 URL: https://issues.apache.org/jira/browse/ARROW-8087
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


Currently, when reading a partitioned dataset with hive partitioning, it seems that the partition columns get sorted alphabetically when appended to the schema (while the old ParquetDataset implementation keeps the order as present in the paths).  
For a regular partitioning this order is consistent for all fragments.

So for example for the typical NYC Taxi data, with datasets the schema ends with columns "month, year", while ParquetDataset appends them as "year, month".

Python example:

{code}
foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
N = 30

df = pd.DataFrame({
'foo': np.array(foo_keys, dtype='i4').repeat(15),
'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
'values': np.random.randn(N)
})

pq.write_to_dataset(pa.table(df), "test_order", partition_cols=['foo', 'bar'])
{code}

{code}
>>> pq.read_table("test_order").schema
values: double
foo: dictionary<...>
bar: dictionary<...>

>>> ds.dataset("test_order", format="parquet", partitioning="hive").schema
values: double
bar: string
foo: int32
{code}

so "foo, bar" vs "bar, foo" (the fact that it are dictionaries is something 
else)





[jira] [Created] (ARROW-8074) [C++][Dataset] Support for file-like objects (buffers) in FileSystemDataset?

2020-03-11 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8074:


 Summary: [C++][Dataset] Support for file-like objects (buffers) in 
FileSystemDataset?
 Key: ARROW-8074
 URL: https://issues.apache.org/jira/browse/ARROW-8074
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche


The current {{pyarrow.parquet.read_table}}/{{ParquetFile}} can work with buffer (reader) objects (file-like objects, pyarrow.Buffer, pyarrow.BufferReader) as input when dealing with single files. This functionality is for example being used by pandas and kartothek (in addition to being used extensively in our own tests).

While we could keep the old implementation to handle single files (which is separate from the ParquetDataset logic), there are also some advantages to being able to handle this in the Datasets API.  
For example, it would enable the filtering functionality of the datasets API for this single-file buffer use case, which would be a nice enhancement (currently, {{read_table}} does not support {{filters}} in the case of single files, which is e.g. why kartothek implements this itself).

Would this be possible to support?

The {{arrow::dataset::FileSource}} already has PATH and BUFFER enum types (https://github.com/apache/arrow/blob/08f8bff05af37921ff1e5a2b630ce1e7ec1c0ede/cpp/src/arrow/dataset/file_base.h#L46-L49), so it seems in principle possible to create a FileSource (for a FileSystemDataset / FileFragment) from a buffer instead of from a path?
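
From Python, the desired usage might look something like this (hypothetical; {{ds.dataset}} does not accept buffers today):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

with open("single_file.parquet", "rb") as f:
    buf = pa.py_buffer(f.read())

# hypothetical: a FileSystemDataset/FileFragment backed by an in-memory
# buffer, so the datasets filtering machinery applies here as well
dataset = ds.dataset(pa.BufferReader(buf), format="parquet")
table = dataset.to_table(filter=ds.field("col") > 0)
{code}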





[jira] [Created] (ARROW-8063) [Python] Add user guide documentation for Datasets API

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8063:


 Summary: [Python] Add user guide documentation for Datasets API
 Key: ARROW-8063
 URL: https://issues.apache.org/jira/browse/ARROW-8063
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Joris Van den Bossche
 Fix For: 0.17.0


Currently, we only have API docs (https://arrow.apache.org/docs/python/api/dataset.html), but we also need prose docs explaining what the dataset module does, with examples.

This can also include guidelines on how to use this instead of the ParquetDataset API (depending on how we end up doing ARROW-8039); this aspect is also covered by ARROW-8047.





[jira] [Created] (ARROW-8062) [C++][Dataset] Parquet Dataset factory from a _metadata/_common_metadata file

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8062:


 Summary: [C++][Dataset] Parquet Dataset factory from a 
_metadata/_common_metadata file
 Key: ARROW-8062
 URL: https://issues.apache.org/jira/browse/ARROW-8062
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche


Partitioned parquet datasets sometimes come with {{_metadata}} / {{_common_metadata}} files. Those files include information about the schema of the full dataset and potentially all RowGroup metadata as well (for {{_metadata}}).

Using those files during the creation of a parquet {{Dataset}} can give a more efficient factory (using the stored schema instead of inferring the schema by unioning the schemas of all files, and using the paths to the individual parquet files instead of crawling the directory).

Basically, based on those files, the schema, the list of paths and the partition expressions (the information that is needed to create a Dataset) could be constructed.   
Such logic could be put in a different factory class, e.g. {{ParquetManifestFactory}} (as suggested by [~fsaintjacques]).
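
For illustration, the needed paths (and per-file row group metadata) can already be extracted from a {{_metadata}} file with the existing Python APIs (a sketch; the dataset root is arbitrary):

{code:python}
import pyarrow.parquet as pq

metadata = pq.read_metadata("dataset_root/_metadata")

# each row group records the path of the file it belongs to,
# relative to the dataset root
paths = []
for i in range(metadata.num_row_groups):
    path = metadata.row_group(i).column(0).file_path()
    if path not in paths:
        paths.append(path)
{code}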





[jira] [Created] (ARROW-8061) [C++][Dataset] Ability to specify granularity of ParquetFileFragment (support row groups)

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8061:


 Summary: [C++][Dataset] Ability to specify granularity of 
ParquetFileFragment (support row groups)
 Key: ARROW-8061
 URL: https://issues.apache.org/jira/browse/ARROW-8061
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


Specifically for parquet (not sure if it will be relevant for other file 
formats as well; for IPC/feather potentially the record batch), it would be 
useful to target row groups instead of files as fragments.

Quoting the original design documents: _"In datasets consisting of many 
fragments, the dataset API must expose the granularity of fragments in a public 
way to enable parallel processing, if desired. "._   
And a comment from Wes on that: _"a single Parquet file can "export" one or 
more fragments based on settings. The default might be to split fragments based 
on row group"_

Currently, the level on which fragments are defined (at least in the typical 
partitioned parquet dataset) is "1 file == 1 fragment".

Would it be possible or desirable to make this more fine-grained, so that you 
could also opt to have a fragment per row group?   
We could have a ParquetFragment that has this option, and a 
ParquetFileFormat-specific option to say what the granularity of a fragment is 
(file vs row group)?

cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8060) [Python] Make dataset Expression objects serializable

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8060:


 Summary: [Python] Make dataset Expression objects serializable
 Key: ARROW-8060
 URL: https://issues.apache.org/jira/browse/ARROW-8060
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


It would be good to be able to pickle pyarrow.dataset.Expression objects (eg 
for use in dask.distributed)
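
A sketch of the round-trip this would enable:

{code:python}
import pickle
import pyarrow.dataset as ds

expr = ds.field("part") == "A"

# desired: expressions survive a pickle round-trip, so they can be
# shipped to remote workers (e.g. with dask.distributed)
restored = pickle.loads(pickle.dumps(expr))
{code}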



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8059) [Python] Make FileSystem objects serializable

2020-03-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-8059:


 Summary: [Python] Make FileSystem objects serializable
 Key: ARROW-8059
 URL: https://issues.apache.org/jira/browse/ARROW-8059
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


It would be good to be able to pickle {{pyarrow.fs.FileSystem}} objects (eg for 
use in dask.distributed)

cc [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7963) [C++][Python][Dataset] Expose listing fragments

2020-02-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7963:


 Summary: [C++][Python][Dataset] Expose listing fragments
 Key: ARROW-7963
 URL: https://issues.apache.org/jira/browse/ARROW-7963
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche
Assignee: Ben Kietzman


It would be useful to be able to list the fragments, to get their paths / 
partition expressions.
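
A hypothetical sketch of the usage this would enable (method and attribute 
names are illustrative, not decided here):

{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("partitioned_root/", partitioning="hive")

# iterate over the fragments backing the dataset and inspect
# their paths and partition expressions
for fragment in dataset.get_fragments():
    print(fragment.path, fragment.partition_expression)
{code}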



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7907) [Python] Conversion to pandas of empty table with timestamp type aborts

2020-02-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7907:


 Summary: [Python] Conversion to pandas of empty table with 
timestamp type aborts
 Key: ARROW-7907
 URL: https://issues.apache.org/jira/browse/ARROW-7907
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.1


Creating an empty table:

{code}
In [1]: table = pa.table({'a': pa.array([], type=pa.timestamp('us'))})

In [2]: table['a']
Out[2]:
<pyarrow.lib.ChunkedArray object at 0x...>
[
  []
]

In [3]: table.to_pandas()
Out[3]:
Empty DataFrame
Columns: [a]
Index: []
{code}

the above works. But the ChunkedArray still has 1 empty chunk. When filtering 
data, you can actually get no chunks, and this fails:


{code}
In [4]: table2 = table.slice(0, 0)

In [5]: table2['a']
Out[5]:
<pyarrow.lib.ChunkedArray object at 0x...>
[

]

In [6]: table2.to_pandas()
../src/arrow/table.cc:48:  Check failed: (chunks.size()) > (0) cannot construct ChunkedArray from empty vector and omitted type
...
Aborted (core dumped)
{code}

and this seems to happen specifically for timestamp type, and specifically with 
non-ns unit (eg with us as above, which is the default in arrow).

I noticed this when reading a parquet file of the taxi dataset, where the 
filter I used resulted in an empty batch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7892) [Python] Expose FilesystemSource.format attribute

2020-02-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7892:


 Summary: [Python] Expose FilesystemSource.format attribute
 Key: ARROW-7892
 URL: https://issues.apache.org/jira/browse/ARROW-7892
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7858) [C++][Python] Support casting an Extension type to its storage type

2020-02-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7858:


 Summary: [C++][Python] Support casting an Extension type to its 
storage type
 Key: ARROW-7858
 URL: https://issues.apache.org/jira/browse/ARROW-7858
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Python
Reporter: Joris Van den Bossche


Currently, casting an extension type will always fail: "No cast implemented 
from extension to ...".

However, for casting, we could fall back to the storage array's casting rules?
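
A sketch of that fallback, using a hypothetical extension type defined for 
illustration (only the {{storage}} and {{cast}} calls are existing API):

{code:python}
import pyarrow as pa

class MyCustomIntegerType(pa.ExtensionType):
    # hypothetical extension type wrapping int64 storage
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.int64(), "my_package.my_custom_integer")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

storage = pa.array([1, 2, 3, 4], pa.int64())
arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage)

# works today: cast the storage array directly
arr.storage.cast(pa.int32())

# desired: arr.cast(pa.int32()) delegating to the same storage casting rules
{code}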





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7857) [Python] Failing test with pandas master for extension type conversion

2020-02-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7857:


 Summary: [Python] Failing test with pandas master for extension 
type conversion
 Key: ARROW-7857
 URL: https://issues.apache.org/jira/browse/ARROW-7857
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche


The pandas master test build has one failure


{code}
___ test_conversion_extensiontype_to_extensionarray ___

monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fcd6c580bd0>

    def test_conversion_extensiontype_to_extensionarray(monkeypatch):
        # converting extension type to linked pandas ExtensionDtype/Array
        import pandas.core.internals as _int

        storage = pa.array([1, 2, 3, 4], pa.int64())
        arr = pa.ExtensionArray.from_storage(MyCustomIntegerType(), storage)
        table = pa.table({'a': arr})

        if LooseVersion(pd.__version__) < "0.26.0.dev":
            # ensure pandas Int64Dtype has the protocol method (for older pandas)
            monkeypatch.setattr(
                pd.Int64Dtype, '__from_arrow__', _Int64Dtype__from_arrow__,
                raising=False)

        # extension type points to Int64Dtype, which knows how to create a
        # pandas ExtensionArray
>       result = table.to_pandas()

opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_pandas.py:3560:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/ipc.pxi:559: in pyarrow.lib.read_message
    ???
pyarrow/table.pxi:1369: in pyarrow.lib.Table._to_pandas
    ???
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:764: in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: in _table_to_blocks
    for item in result]
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:1102: in <listcomp>
    for item in result]
opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/pandas_compat.py:723: in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
opt/conda/envs/arrow/lib/python3.7/site-packages/pandas/core/arrays/integer.py:108: in __from_arrow__
    array = array.cast(pyarrow_type)
pyarrow/table.pxi:240: in pyarrow.lib.ChunkedArray.cast
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowNotImplementedError: No cast implemented from extension to int64
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7854) [C++][Dataset] Option to memory map when reading IPC format

2020-02-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7854:


 Summary: [C++][Dataset] Option to memory map when reading IPC 
format
 Key: ARROW-7854
 URL: https://issues.apache.org/jira/browse/ARROW-7854
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


For the IPC format it would be interesting to be able to memory map the IPC 
files?
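
Outside the datasets API this is already possible; a minimal sketch (file name 
illustrative):

{code:python}
import pyarrow as pa

# open the IPC file through a memory map, so the record batches
# reference the mapped memory instead of being copied
with pa.memory_map("data.arrow", "r") as source:
    reader = pa.ipc.open_file(source)
    table = reader.read_all()
{code}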

cc [~fsaintjacques] [~bkietz]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7839) [Python][Dataset] Add IPC format to python bindings

2020-02-12 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7839:


 Summary: [Python][Dataset] Add IPC format to python bindings
 Key: ARROW-7839
 URL: https://issues.apache.org/jira/browse/ARROW-7839
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


The C++ / R support was done in ARROW-7415; we should add bindings for it in 
Python as well.
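
A hypothetical sketch of what the binding could look like (the exact spelling 
of the format string is an assumption):

{code:python}
import pyarrow.dataset as ds

# select the IPC format by name, mirroring the C++/R support
dataset = ds.dataset("data_dir/", format="ipc")
table = dataset.to_table()
{code}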



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7781) [C++][Dataset] Filtering on a non-existent column gives a segfault

2020-02-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7781:


 Summary: [C++][Dataset] Filtering on a non-existent column gives a 
segfault
 Key: ARROW-7781
 URL: https://issues.apache.org/jira/browse/ARROW-7781
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Example with python code:

{code}
In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2, 3]})

In [3]: df.to_parquet("test-filter-crash.parquet")

In [4]: import pyarrow.dataset as ds

In [5]: dataset = ds.dataset("test-filter-crash.parquet")

In [6]: dataset.to_table(filter=ds.field('a') > 1).to_pandas()
Out[6]:
   a
0  2
1  3

In [7]: dataset.to_table(filter=ds.field('b') > 1).to_pandas()
../src/arrow/dataset/filter.cc:929:  Check failed: _s.ok() Operation failed: 
maybe_value.status()
Bad status: Invalid: attempting to cast non-null scalar to NullScalar
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f744c)[0x7fb1390f444c]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ca)[0x7fb1390f43ca]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(+0x11f73ec)[0x7fb1390f43ec]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow.so.16(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7fb1390f4759]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x169fc6)[0x7fb145594fc6]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(+0x16b9be)[0x7fb1455969be]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset15VisitExpressionINS0_23InsertImplicitCastsImplEEEDTclfp0_fp_EERKNS0_10ExpressionEOT_+0x2ae)[0x7fb1455a0dee]
/home/joris/miniconda3/envs/arrow-dev/lib/libarrow_dataset.so.16(_ZN5arrow7dataset19InsertImplicitCastsERKNS0_10ExpressionERKNS_6SchemaE+0x44)[0x7fb145596d4e]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x48286)[0x7fb1456dd286]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x49220)[0x7fb1456de220]
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x170f37)[0x55e5127e1f37]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x22bd6)[0x7fb1456b7bd6]
/home/joris/scipy/repos/arrow/python/pyarrow/_dataset.cpython-37m-x86_64-linux-gnu.so(+0x33b81)[0x7fb1456c8b81]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x305)[0x55e5127d9c75]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x5460)[0x55e512847c40]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
/home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCodeEx+0x44)[0x55e512789064]
/home/joris/miniconda3/envs/arrow-dev/bin/python(PyEval_EvalCode+0x1c)[0x55e51278908c]
/home/joris/miniconda3/envs/arrow-dev/bin/python(+0x1e1650)[0x55e512852650]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x55e5127d9a59]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x55e5127d9cf1]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x48e4)[0x55e5128470c4]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x1a83)[0x55e512844263]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyGen_Send+0x2a2)[0x55e5127e31a2]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x55e5127d99fc]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x55e5127e1fdf]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x4ddc)[0x55e5128475bc]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x416)[0x55e512842bf6]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x55e5127d915b]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x6f3)[0x55e512842ed3]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyFunction_FastCallKeywords+0x387)[0x55e5127d93e7]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalFrameDefault+0x14e4)[0x55e512843cc4]
/home/joris/miniconda3/envs/arrow-dev/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x55e5127881a9]
{code}

[jira] [Created] (ARROW-7762) [Python] Exceptions in ParquetWriter get ignored

2020-02-04 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7762:


 Summary: [Python] Exceptions in ParquetWriter get ignored
 Key: ARROW-7762
 URL: https://issues.apache.org/jira/browse/ARROW-7762
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


For example:

{code:python}
In [43]: table = pa.table({'a': [1, 2, 3]})

In [44]: pq.write_table(table, "test.parquet", version="2.2")
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
ArrowException: Unsupported Parquet format version
Exception ignored in: 'pyarrow._parquet.ParquetWriter._set_version'
pyarrow.lib.ArrowException: Unsupported Parquet format version
{code}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7703) [C++][Dataset] Give more informative error message for mismatching schemas for FileSystemSources

2020-01-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7703:


 Summary: [C++][Dataset] Give more informative error message for 
mismatching schemas for FileSystemSources
 Key: ARROW-7703
 URL: https://issues.apache.org/jira/browse/ARROW-7703
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche


Currently, if you try to create a dataset from files with different schemas, 
you get this error:

{code}
ArrowInvalid: Unable to merge: Field a has incompatible types: int64 vs int32
{code}

If you are reading a directory of files, it would be very helpful if the error 
message could indicate which files are involved (e.g. if you have a lot of 
files and only one has an error).

You can already inspect the schemas if you first make a SourceFactory 
manually, but that also only gives a list of schemas, not mapped to the 
original files (this last item probably relates to ARROW-7608).


--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

2020-01-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7702:


 Summary: [C++][Dataset] Provide (optional) deterministic order of 
batches
 Key: ARROW-7702
 URL: https://issues.apache.org/jira/browse/ARROW-7702
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche


Example with python:

{code}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': range(12)}) 
pq.write_table(table, "test_chunks.parquet", chunk_size=3) 

# reading with dataset
import pyarrow.dataset as ds
ds.dataset("test_chunks.parquet").to_table().to_pandas()
{code}

gives a non-deterministic result (the order of the row groups in the parquet 
file varies between runs):

{code}
In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[25]:
     a
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
10  10
11  11

In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
Out[26]:
     a
0    0
1    1
2    2
3    3
4    8
5    9
6   10
7   11
8    4
9    5
10   6
11   7
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7677) [C++] Handle Windows file paths with backslashes in GetTargetStats

2020-01-24 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7677:


 Summary: [C++] Handle Windows file paths with backslashes in 
GetTargetStats
 Key: ARROW-7677
 URL: https://issues.apache.org/jira/browse/ARROW-7677
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Currently, if the base path passed to {{GetTargetStats}} has backslashes, the 
produced FileStats also include them, resulting in some other functionality 
(such as splitting the path) not working.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7652) [Python] Insert implicit cast in ScannerBuilder.filter

2020-01-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7652:


 Summary: [Python] Insert implicit cast in ScannerBuilder.filter
 Key: ARROW-7652
 URL: https://issues.apache.org/jira/browse/ARROW-7652
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7649) [Python] Expose dataset PartitioningFactory.inspect ?

2020-01-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7649:


 Summary: [Python] Expose dataset PartitioningFactory.inspect ?
 Key: ARROW-7649
 URL: https://issues.apache.org/jira/browse/ARROW-7649
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


In C++, the PartitioningFactory has an {{Inspect}} method which, given a path, 
will infer the schema. 

We could expose this in Python as well; it could e.g. be used to easily explore 
or illustrate what types are inferred from a path (int32, string).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7636) [Python] Clean-up the pyarrow.dataset.partitioning() API

2020-01-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7636:


 Summary: [Python] Clean-up the pyarrow.dataset.partitioning() API
 Key: ARROW-7636
 URL: https://issues.apache.org/jira/browse/ARROW-7636
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.0


A left-over review comment at 
https://github.com/apache/arrow/pull/6022#discussion_r367016454 on the API of 
{{partitioning()}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7634) [Python] Dataset tests failing on Windows to parse file path

2020-01-21 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7634:


 Summary: [Python] Dataset tests failing on Windows to parse file 
path
 Key: ARROW-7634
 URL: https://issues.apache.org/jira/browse/ARROW-7634
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.0


See eg 
https://dev.azure.com/ursa-labs/crossbow/_build/results?buildId=5217=logs=4c86bc1b-1091-5192-4404-c74dfaad23e7=ec99a26b-0264-5e86-36fb-9cfd0ca0f9f3=4066

Failing on the backward slashes of the pathlib file paths, and clearly not run 
in CI since this was not caught.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7593) [CI][Python] Python datasets failing on master / not run on CI

2020-01-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7593:


 Summary: [CI][Python] Python datasets failing on master / not run 
on CI
 Key: ARROW-7593
 URL: https://issues.apache.org/jira/browse/ARROW-7593
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7591) [Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array

2020-01-16 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7591:


 Summary: [Python] DictionaryArray.to_numpy returns dict of parts 
instead of numpy array
 Key: ARROW-7591
 URL: https://issues.apache.org/jira/browse/ARROW-7591
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Currently, the {{to_numpy}} method doesn't return an ndarray in case of 
dictionary-type data:

{code}
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

In [55]: a
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x...>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
 'dictionary': array(['a', 'b'], dtype=object),
 'ordered': False}
{code}

This is actually just an internal representation that is passed from C++ to 
Python so that on the Python side a {{pd.Categorical}} / {{CategoricalBlock}} 
can be constructed, but it's not something we should return as such to the 
user. Rather, I think we should return a decoded / dense numpy array (or at 
least raise an error instead of returning this dict).

(Also, if the user wants those parts, they are already available from the 
dictionary array as {{a.indices}}, {{a.dictionary}} and {{a.type.ordered}})
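
For illustration, one way to build the decoded / dense array from those parts 
(using the {{a}} from the example above):

{code:python}
# take the dictionary values at the stored indices to get a dense array;
# something like this is what to_numpy could return for dictionary data
decoded = a.dictionary.take(a.indices)
decoded.to_numpy(zero_copy_only=False)
# array(['a', 'b', 'a'], dtype=object)
{code}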



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7569) [Python] Add API to map Arrow types to pandas ExtensionDtypes for to_pandas conversions

2020-01-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7569:


 Summary: [Python] Add API to map Arrow types to pandas 
ExtensionDtypes for to_pandas conversions
 Key: ARROW-7569
 URL: https://issues.apache.org/jira/browse/ARROW-7569
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.16.0


ARROW-2428 was about adding such a mapping, and described three use cases (see 
this 
[comment|https://issues.apache.org/jira/browse/ARROW-2428?focusedCommentId=16914231=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16914231]
 for details):

* Basic roundtrip based on the pandas_metadata (in {{to_pandas}}, we check if 
the pandas_metadata specify pandas extension dtypes, and if so, use this as the 
target dtype for that column)
* Conversion for pyarrow extension types that can define their equivalent 
pandas extension dtype
* A way to override default conversion (eg for the built-in types, or in 
absence of pandas_metadata in the schema). This would require the user to be 
able to specify some mapping of pyarrow type or column name to the pandas 
extension dtype to use.

The PR that closed ARROW-2428 (https://github.com/apache/arrow/pull/5512) only 
covered the first two cases, and not the third case.

I think it is still interesting to also cover the third case in some way.  

An example use case is the new nullable dtypes that are being introduced in 
pandas (e.g. the nullable integer dtype). Assume I want to read a parquet file 
into a pandas DataFrame using this nullable integer dtype. The pyarrow Table 
has no pandas_metadata indicating to use this dtype (unless it was created from 
a pandas DataFrame that was already using this dtype, but that will often not 
be the case), and the pyarrow.int64() type is also not an extension type that 
can define its equivalent pandas extension dtype. 
Currently, the only solution is to first read it into a pandas DataFrame (which 
will use floats for the integers if there are nulls), and then afterwards 
convert those floats back to a nullable integer dtype. 

A possible API for this could look like:

{code}
table.to_pandas(types_mapping={pa.int64(): pd.Int64Dtype()})
{code}

to indicate that you want to convert all columns of the pyarrow table with 
int64 type to a pandas column using the nullable Int64 dtype.
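
An alternative sketch (an assumption, not something decided in this issue) 
would be to accept a callable instead of a dict, which also allows matching on 
type classes or column names:

{code:python}
import pyarrow as pa
import pandas as pd

def types_mapper(arrow_type):
    # return a pandas extension dtype, or None for the default conversion
    if arrow_type == pa.int64():
        return pd.Int64Dtype()
    return None

# hypothetical: table.to_pandas(types_mapper=types_mapper)
{code}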
 

--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7547) [C++] [Python] [Dataset] Additional reader options in ParquetFileFormat

2020-01-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7547:


 Summary: [C++] [Python] [Dataset] Additional reader options in 
ParquetFileFormat
 Key: ARROW-7547
 URL: https://issues.apache.org/jira/browse/ARROW-7547
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Dataset, Python
Reporter: Joris Van den Bossche


[looking into using the datasets machinery in the current python parquet code]

In the current Python API, we expose several options that influence how the 
parquet file is read (e.g. {{read_dictionary}} to indicate that certain 
BYTE_ARRAY columns should be read directly into a dictionary type, or 
{{memory_map}} and {{buffer_size}}).

Those could be added to {{ParquetFileFormat}}.
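
A hypothetical sketch of how such options could be passed (option names are 
illustrative, not decided here):

{code:python}
import pyarrow.dataset as ds

# configure the format with parquet-specific reader options
parquet_format = ds.ParquetFileFormat(
    read_options={"dictionary_columns": ["col1"], "buffer_size": 8096}
)
dataset = ds.dataset("data_dir/", format=parquet_format)
{code}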



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7545) [C++] Scanning dataset with dictionary type hangs

2020-01-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7545:


 Summary: [C++] Scanning dataset with dictionary type hangs
 Key: ARROW-7545
 URL: https://issues.apache.org/jira/browse/ARROW-7545
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Dataset
Reporter: Joris Van den Bossche


I assume it is an issue on the C++ side of the datasets code, but here is a 
reproducer in Python. 

I create a small parquet file with a single column of dictionary type. Reading 
it with {{pq.read_table}} works fine, reading it with the datasets machinery 
hangs when scanning:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
arrow_table = pa.Table.from_pandas(df)

filename = "test.parquet"
pq.write_table(arrow_table, filename)

from pyarrow.fs import LocalFileSystem
from pyarrow.dataset import (ParquetFileFormat, Dataset,
                             FileSystemDataSourceDiscovery,
                             FileSystemDiscoveryOptions)

filesystem = LocalFileSystem()
format = ParquetFileFormat()
options = FileSystemDiscoveryOptions()

discovery = FileSystemDataSourceDiscovery(
    filesystem, [filename], format, options)
inspected_schema = discovery.inspect()
dataset = Dataset([discovery.finish()], inspected_schema)

# dataset.schema works fine and gives correct schema
dataset.schema

scanner_builder = dataset.new_scan()
scanner = scanner_builder.finish()
# this hangs
scanner.to_table()
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7528) [Python] The pandas.datetime class (import of datetime.datetime) is deprecated

2020-01-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7528:


 Summary: [Python] The pandas.datetime class (import of 
datetime.datetime) is deprecated
 Key: ARROW-7528
 URL: https://issues.apache.org/jira/browse/ARROW-7528
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 0.16.0


The {{pd.datetime}} was actually just an import from {{datetime.datetime}}, and 
is being removed from pandas (to use the stdlib one directly).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7527) [Python] pandas/feather tests failing on pandas master

2020-01-09 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7527:


 Summary: [Python] pandas/feather tests failing on pandas master
 Key: ARROW-7527
 URL: https://issues.apache.org/jira/browse/ARROW-7527
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


Because I merged a PR in pandas to support Period dtype, some tests in pyarrow 
are now failing (they were using period dtype to test "unsupported" dtypes)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7497) [Python] pandas master failures: pandas.util.testing is deprecated

2020-01-06 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7497:


 Summary: [Python] pandas master failures: pandas.util.testing is 
deprecated
 Key: ARROW-7497
 URL: https://issues.apache.org/jira/browse/ARROW-7497
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


The nightly pandas-master tests are failing (eg 
https://circleci.com/gh/ursa-labs/crossbow/6815?utm_campaign=vcs-integration-link_medium=referral_source=github-build-link)
 due to the deprecation of {{pandas.util.testing}} in pandas. 

This deprecation gives a lot of warnings (which we should solve), but also some 
errors because the deprecation was not fully done properly on the pandas side; 
opened https://github.com/pandas-dev/pandas/issues/30735 for this (will be 
fixed shortly).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7432) [Python] Add higher-level datasets functions

2019-12-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7432:


 Summary: [Python] Add higher-level datasets functions
 Key: ARROW-7432
 URL: https://issues.apache.org/jira/browse/ARROW-7432
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


From [~kszucs]: We need to define a more pythonic API for the dataset bindings, 
because the current one is pretty low-level.

One option is to provide an "open_dataset" function similar to what is 
available in R.

A short-cut to go from a Dataset to a Table might also be useful.
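
A sketch of the kind of API this envisions (function and argument names are 
illustrative):

{code:python}
import pyarrow.dataset as ds

# one-liner to open a (possibly partitioned) dataset, as in R
dataset = ds.dataset("path/to/dataset/", format="parquet", partitioning="hive")

# short-cut from Dataset to Table
table = dataset.to_table()
{code}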



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7431) [Python] Add dataset API to reference docs

2019-12-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7431:


 Summary: [Python] Add dataset API to reference docs
 Key: ARROW-7431
 URL: https://issues.apache.org/jira/browse/ARROW-7431
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Add dataset to python API docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7430) [Python] Add more docstrings to dataset bindings

2019-12-18 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7430:


 Summary: [Python] Add more docstrings to dataset bindings
 Key: ARROW-7430
 URL: https://issues.apache.org/jira/browse/ARROW-7430
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7365) [Python] Support FixedSizeList type in conversion to numpy/pandas

2019-12-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7365:


 Summary: [Python] Support FixedSizeList type in conversion to 
numpy/pandas
 Key: ARROW-7365
 URL: https://issues.apache.org/jira/browse/ARROW-7365
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Follow-up on ARROW-7261, still need to add support for FixedSizeListType in the 
arrow -> python conversion (arrow_to_pandas.cc)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7273) [Python] Non-nullable null field is allowed / crashes when writing to parquet

2019-11-28 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7273:


 Summary: [Python] Non-nullable null field is allowed / crashes 
when writing to parquet
 Key: ARROW-7273
 URL: https://issues.apache.org/jira/browse/ARROW-7273
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Joris Van den Bossche


It seems to be possible to create a "non-nullable null field". While this does 
not make any sense (so already a reason to disallow it, I think), it can also 
lead to crashes in further operations, such as writing to parquet:

{code}
In [18]: table = pa.table([pa.array([None, None], pa.null())], schema=pa.schema([pa.field('a', pa.null(), nullable=False)]))

In [19]: table
Out[19]:
pyarrow.Table
a: null not null

In [20]: pq.write_table(table, "test_null.parquet")
WARNING: Logging before InitGoogleLogging() is written to STDERR
F1128 14:08:30.267439 27560 column_writer.cc:837]  Check failed: (nullptr) != (values)
*** Check failure stack trace: ***
Aborted (core dumped)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7261) [Python] Python support for fixed size list type

2019-11-26 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7261:


 Summary: [Python] Python support for fixed size list type
 Key: ARROW-7261
 URL: https://issues.apache.org/jira/browse/ARROW-7261
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


I didn't see any issue about this, but {{FixedSizeListArray}} (ARROW-1280) is 
not yet exposed in Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7220) [CI] Docker compose (github actions) Mac Python 3 build is using Python 2

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7220:


 Summary: [CI] Docker compose (github actions) Mac Python 3 build 
is using Python 2
 Key: ARROW-7220
 URL: https://issues.apache.org/jira/browse/ARROW-7220
 Project: Apache Arrow
  Issue Type: Test
Reporter: Joris Van den Bossche


The "AMD64 MacOS 10.15 Python 3" build is also running in python 2.

Possibly related to how brew installs Python 2 / 3, or because it is using the 
system Python, ... (not familiar with mac)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7218) [Python] Conversion from boolean numpy scalars not working

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7218:


 Summary: [Python] Conversion from boolean numpy scalars not working
 Key: ARROW-7218
 URL: https://issues.apache.org/jira/browse/ARROW-7218
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


In general, we are fine to accept a list of numpy scalars:

{code}
In [12]: type(list(np.array([1, 2]))[0])
Out[12]: numpy.int64

In [13]: pa.array(list(np.array([1, 2])))
Out[13]:
<pyarrow.lib.Int64Array object at 0x...>
[
  1,
  2
]
{code}

But for booleans, this doesn't work:

{code}
In [14]: pa.array(list(np.array([True, False])))
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-14-...> in <module>
----> 1 pa.array(list(np.array([True, False])))

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

ArrowInvalid: Could not convert True with type numpy.bool_: tried to convert to boolean
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7217) Docker compose / github actions ignores PYTHON env

2019-11-20 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7217:


 Summary: Docker compose / github actions ignores PYTHON env
 Key: ARROW-7217
 URL: https://issues.apache.org/jira/browse/ARROW-7217
 Project: Apache Arrow
  Issue Type: Test
  Components: CI
Reporter: Joris Van den Bossche


The "AMD64 Conda Python 2.7" build is actually using Python 3.6. 

This Python 3.6 version is written in the conda-python.dockerfile: 
https://github.com/apache/arrow/blob/master/ci/docker/conda-python.dockerfile#L24
and I am not fully sure whether the ENV variable overrides that or not.

cc [~kszucs]




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7209) [Python] tests with pandas master are failing now __from_arrow__ support landed in pandas

2019-11-19 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7209:


 Summary: [Python] tests with pandas master are failing now 
__from_arrow__ support landed in pandas
 Key: ARROW-7209
 URL: https://issues.apache.org/jira/browse/ARROW-7209
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche


I implemented the pandas <-> arrow roundtrip for pandas' integer+string dtypes 
in https://github.com/pandas-dev/pandas/pull/29483, which is now merged. But 
our tests were assuming this did not yet work in pandas, and thus need to be 
updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7167) [CI][Python] Add nightly tests for older pandas versions to Github Actions

2019-11-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7167:


 Summary: [CI][Python] Add nightly tests for older pandas versions 
to Github Actions
 Key: ARROW-7167
 URL: https://issues.apache.org/jira/browse/ARROW-7167
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7154) [C++] Build error when building tests but not with snappy

2019-11-13 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7154:


 Summary: [C++] Build error when building tests but not with snappy
 Key: ARROW-7154
 URL: https://issues.apache.org/jira/browse/ARROW-7154
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Joris Van den Bossche


Since the docker-compose PR landed, I am having build errors like:
{code:java}
[361/376] Linking CXX executable debug/arrow-python-test
FAILED: debug/arrow-python-test
: && /home/joris/miniconda3/envs/arrow-dev/bin/ccache 
/home/joris/miniconda3/envs/arrow-dev/bin/x86_64-conda_cos6-linux-gnu-c++  
-Wno-noexcept-type -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 
-march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong 
-fno-plt -O2 -ffunction-sections -pipe -fdiagnostics-color=always -ggdb -O0  
-Wall -Wno-conversion -Wno-sign-conversion -Wno-unused-variable -Werror 
-msse4.2  -g  -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now 
-Wl,--disable-new-dtags -Wl,--gc-sections   -rdynamic 
src/arrow/python/CMakeFiles/arrow-python-test.dir/python_test.cc.o  -o 
debug/arrow-python-test  
-Wl,-rpath,/home/joris/scipy/repos/arrow/cpp/build/debug:/home/joris/miniconda3/envs/arrow-dev/lib
 debug/libarrow_python_test_main.a debug/libarrow_python.so.100.0.0 
debug/libarrow_testing.so.100.0.0 debug/libarrow.so.100.0.0 
/home/joris/miniconda3/envs/arrow-dev/lib/libpython3.7m.so -lpthread -lpthread 
-ldl  -lutil -lrt -ldl 
/home/joris/miniconda3/envs/arrow-dev/lib/libdouble-conversion.a 
/home/joris/miniconda3/envs/arrow-dev/lib/libglog.so 
jemalloc_ep-prefix/src/jemalloc_ep/dist//lib/libjemalloc_pic.a -lrt 
/home/joris/miniconda3/envs/arrow-dev/lib/libgtest.so -pthread && :
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_filesystem.so.1.68.0, needed by debug/libarrow.so.100.0.0, 
not found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 warning: libboost_system.so.1.68.0, needed by debug/libarrow.so.100.0.0, not 
found (try using -rpath or -rpath-link)
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::system::detail::generic_category_ncx()'
/home/joris/miniconda3/envs/arrow-dev/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld:
 debug/libarrow.so.100.0.0: undefined reference to 
`boost::filesystem::path::operator/=(boost::filesystem::path const&)'
collect2: error: ld returned 1 exit status
{code}
which contains warnings like "warning: libboost_filesystem.so.1.68.0, needed by 
debug/libarrow.so.100.0.0, not found" (although this library is certainly 
present).

The error is triggered by having {{-DARROW_BUILD_TESTS=ON}}. If that is set to 
OFF, it works fine.

It also seems to be related to this specific change in the docker compose PR:
{code:java}
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index c80ac3310..3b3c9eb8f 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -266,6 +266,15 @@ endif(UNIX)
 # Set up various options
 #

-if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
-  # Currently the compression tests require at least these libraries; bz2 and
-  # zstd are optional. See ARROW-3984
-  set(ARROW_WITH_BROTLI ON)
-  set(ARROW_WITH_LZ4 ON)
-  set(ARROW_WITH_SNAPPY ON)
-  set(ARROW_WITH_ZLIB ON)
-endif()
-
 if(ARROW_BUILD_TESTS OR ARROW_BUILD_INTEGRATION)
   set(ARROW_JSON ON)
 endif()
{code}

If I add that back, the build works.

* With only {{set(ARROW_WITH_BROTLI ON)}}, it still fails.
* With only {{set(ARROW_WITH_LZ4 ON)}}, it also fails, but with an error about 
liblz4 instead of libboost (although liblz4 is also actually present).
* With only {{set(ARROW_WITH_SNAPPY ON)}}, it works.
* With only {{set(ARROW_WITH_ZLIB ON)}}, it also fails, but with an error about 
libz.so.1 not found.

So it seems that the absence of snappy causes the others to fail.

In the recommended build settings in the development docs 
([https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst#build-and-test]),
 the compression libraries are enabled. But I was still building without them 
(stemming from the time they were enabled by default), so I was using:

{code}
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME -GNinja \
 -DCMAKE_INSTALL_LIBDIR=lib \
 -DARROW_PARQUET=ON \
 -DARROW_PYTHON=ON \
 -DARROW_BUILD_TESTS=ON \
 ..
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7068) [C++] Expose the offsets of a ListArray as a Int32Array

2019-11-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7068:


 Summary: [C++] Expose the offsets of a ListArray as a Int32Array
 Key: ARROW-7068
 URL: https://issues.apache.org/jira/browse/ARROW-7068
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


As a follow-up on ARROW-7031 (https://github.com/apache/arrow/pull/5759), we 
can move this into C++ and use that implementation from Python.

Cf. [https://github.com/apache/arrow/pull/5759#discussion_r342244521], this 
could be a {{ListArray::value_offsets_array}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7066) [Python] support returning ChunkedArray from __arrow_array__ ?

2019-11-05 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7066:


 Summary: [Python] support returning ChunkedArray from 
__arrow_array__ ?
 Key: ARROW-7066
 URL: https://issues.apache.org/jira/browse/ARROW-7066
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


The {{\_\_arrow_array\_\_}} protocol was added so that custom objects can 
define how they should be converted to a pyarrow Array (similar to numpy's 
{{\_\_array\_\_}}). This is then also used to support converting pandas 
DataFrames with columns using pandas' ExtensionArrays to a pyarrow Table (if 
the pandas ExtensionArray, such as nullable integer type, implements this 
{{\_\_arrow_array\_\_}} method).

This last use case could also be useful for fletcher 
(https://github.com/xhochy/fletcher/, a package that implements pandas 
ExtensionArrays that wrap pyarrow arrays, so they can be stored as-is in a 
pandas DataFrame).  
However, fletcher stores ChunkedArrays in the ExtensionArray / the columns of a 
pandas DataFrame (to have a better mapping with a Table, where the columns also 
consist of chunked arrays), while we currently require that the return value of 
{{\_\_arrow_array\_\_}} is a pyarrow.Array.

So I was wondering: could we relax this constraint and also allow ChunkedArray 
as return value? 
However, this protocol is currently called in the {{pa.array(..)}} function, 
which probably should keep returning an Array (and not ChunkedArray in certain 
cases).
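
For reference, a minimal sketch of the protocol as it works today (toy class 
for illustration):

{code:python}
import pyarrow as pa

class MyArray:
    def __init__(self, data):
        self.data = data

    def __arrow_array__(self, type=None):
        # currently required to return a pyarrow.Array; the question
        # here is whether a ChunkedArray could be allowed as well
        return pa.array(self.data, type=type)

arr = pa.array(MyArray([1, 2, 3]))
{code}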

cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7031) [Python] Expose the offsets of a ListArray in python

2019-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7031:


 Summary: [Python] Expose the offsets of a ListArray in python
 Key: ARROW-7031
 URL: https://issues.apache.org/jira/browse/ARROW-7031
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


Assume the following ListArray:

{code}
In [1]: arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

In [2]: arr
Out[2]:
<pyarrow.lib.ListArray object at 0x...>
[
  [
    1,
    2,
    3
  ],
  [
    4,
    5
  ]
]
{code}

You can get the actual values as a flat array through {{.values}} / 
{{.flatten()}}, but there is currently no easy way to get back to the offsets 
(except by interpreting the buffers manually; see the sketch below). 

We should probably add an {{offsets}} attribute (there is actually also a TODO 
comment for that).
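
For reference, a sketch of the current manual workaround (assumes the array 
has a zero offset):

{code:python}
import numpy as np
import pyarrow as pa

arr = pa.ListArray.from_arrays(offsets=[0, 3, 5], values=[1, 2, 3, 4, 5])

# buffer 0 is the validity bitmap, buffer 1 holds the int32 offsets;
# slice to len(arr) + 1 to drop any trailing buffer padding
offsets = np.frombuffer(arr.buffers()[1], dtype=np.int32)[: len(arr) + 1]
# array([0, 3, 5], dtype=int32)
{code}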



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7027) [Python] pa.table(..) returns instead of raises error if passing invalid object

2019-10-30 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7027:


 Summary: [Python] pa.table(..) returns instead of raises error if 
passing invalid object
 Key: ARROW-7027
 URL: https://issues.apache.org/jira/browse/ARROW-7027
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When passing eg a Series instead of a DataFrame, you get:

{code}
In [4]: df = pd.DataFrame({'a': [1, 2, 3]})

In [5]: table = pa.table(df['a'])

In [6]: table
Out[6]: TypeError('Expected pandas DataFrame or python dictionary')

In [7]: type(table)
Out[7]: TypeError
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7023) [Python] pa.array does not use "from_pandas" semantics for pd.Index

2019-10-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7023:


 Summary: [Python] pa.array does not use "from_pandas" semantics 
for pd.Index
 Key: ARROW-7023
 URL: https://issues.apache.org/jira/browse/ARROW-7023
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
Assignee: Joris Van den Bossche
 Fix For: 1.0.0


{code}
In [15]: idx = pd.Index([1, 2, np.nan], dtype=object)

In [16]: pa.array(idx)
Out[16]:
[
  1,
  2,
  nan
]

In [17]: pa.array(idx, from_pandas=True)
Out[17]:
[
  1,
  2,
  null
]

In [18]: pa.array(pd.Series(idx))
Out[18]:
[
  1,
  2,
  null
]
{code}

We should probably handle Series and Index the same in this regard.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7022) [Python] __arrow_array__ does not work for ExtensionTypes in Table.from_pandas

2019-10-29 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-7022:


 Summary: [Python] __arrow_array__ does not work for ExtensionTypes 
in Table.from_pandas
 Key: ARROW-7022
 URL: https://issues.apache.org/jira/browse/ARROW-7022
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


When someone has a custom ExtensionType defined in Python, and an array class 
that gets converted to that (through {{\_\_arrow_array\_\_}}), the conversion 
in pyarrow works with the array class, but not yet for the array stored in a 
pandas DataFrame.

Eg using my definition of ArrowPeriodType in 
https://github.com/pandas-dev/pandas/pull/28371, I see:

{code}
In [15]: pd_array = pd.period_range("2012-01-01", periods=3, freq="D").array

In [16]: pd_array
Out[16]:
<PeriodArray>
['2012-01-01', '2012-01-02', '2012-01-03']
Length: 3, dtype: period[D]

In [17]: pa.array(pd_array)
Out[17]:
[
  15340,
  15341,
  15342
]

In [18]: df = pd.DataFrame({'periods': pd_array})

In [19]: pa.table(df)
...
ArrowInvalid: ('Could not convert 2012-01-01 with type Period: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column periods with type period[D]')
{code}

(this is working correctly for array objects whose {{\_\_arrow_array\_\_}} is 
returning a built-in pyarrow Array).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6974) [C++] Implement Cast kernel for time-likes with ArrayDataVisitor pattern

2019-10-23 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6974:


 Summary: [C++] Implement Cast kernel for time-likes with 
ArrayDataVisitor pattern
 Key: ARROW-6974
 URL: https://issues.apache.org/jira/browse/ARROW-6974
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


Currently, the casting for time-like data is done with the {{ShiftTime}} 
function. It _might_ be possible to simplify this with ArrayDataVisitor (to 
avoid looping / checking the bitmap).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6923) [C++] Option for Filter kernel how to handle nulls in the selection vector

2019-10-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6923:


 Summary: [C++] Option for Filter kernel how to handle nulls in the 
selection vector
 Key: ARROW-6923
 URL: https://issues.apache.org/jira/browse/ARROW-6923
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche


How nulls are handled in the boolean mask (selection vector) in a filter kernel 
varies between languages / data analytics systems (e.g. base R propagates 
nulls, dplyr R skips (sees as False), SQL generally skips them as well I think, 
Julia raises an error).

Currently, in Arrow C++ we "propagate" nulls (null in the selection vector 
gives a null in the output):

{code}
In [7]: arr = pa.array([1, 2, 3])

In [8]: mask = pa.array([True, False, None])

In [9]: arr.filter(mask)
Out[9]:
<pyarrow.lib.Int64Array object at 0x...>
[
  1,
  null
]
{code}

Given the different ways this could be done (propagate, skip, error), should we 
provide an option to control this behaviour?
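
For example, such an option could look like this (the spelling is illustrative, 
using the {{arr}} and {{mask}} from above):

{code:python}
arr.filter(mask, null_selection_behavior="drop")       # -> [1]
arr.filter(mask, null_selection_behavior="emit_null")  # -> [1, null]
{code}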



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6922) [Python] Pandas master build is failing (MultiIndex.levels change)

2019-10-17 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6922:


 Summary: [Python] Pandas master build is failing 
(MultiIndex.levels change)
 Key: ARROW-6922
 URL: https://issues.apache.org/jira/browse/ARROW-6922
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 0.15.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6885) [Python] Remove superfluous skipped timedelta test

2019-10-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6885:


 Summary: [Python] Remove superfluous skipped timedelta test
 Key: ARROW-6885
 URL: https://issues.apache.org/jira/browse/ARROW-6885
 Project: Apache Arrow
  Issue Type: Test
  Components: Python
Reporter: Joris Van den Bossche
 Fix For: 1.0.0


Now that we support timedelta / duration type, there is an old xfailed test 
that can be removed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6877) [C++] Boost not found from the correct environment

2019-10-14 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6877:


 Summary: [C++] Boost not found from the correct environment
 Key: ARROW-6877
 URL: https://issues.apache.org/jira/browse/ARROW-6877
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Joris Van den Bossche


My local dev build started to fail due to cmake finding the wrong Boost (it 
found {{-- Found Boost 1.70.0 at 
/home/joris/miniconda3/lib/cmake/Boost-1.70.0}} while building in a different 
conda environment).

I can reproduce this by creating a new conda env from scratch following our 
documentation.

By specifying {{-DBOOST_ROOT=/home/joris/miniconda3/envs/arrow-dev/lib}} it 
works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

