[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658921634


   -> https://github.com/apache/arrow/pull/
   
   (note that this will only impact direct users of `read_table` with partitioned datasets, which, e.g., does not include dask, but I suppose it might impact pandas users)
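   For context, a minimal sketch of how a pandas user could hit this (the directory name and partition column are hypothetical; `pd.read_parquet` with the pyarrow engine goes through `pyarrow.parquet.read_table`):
   
   ```python
   import pandas as pd

   # hypothetical hive-partitioned directory: data/part=a/..., data/part=b/...
   df = pd.read_parquet("data/", engine="pyarrow")

   # with the legacy reader the "part" column came back as a pandas Categorical
   # (dictionary type in Arrow); with the datasets API it is a plain string column
   print(df["part"].dtype)
   ```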
   
   @bkietz could you merge this?







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658919336


   @martindurant I am doing the follow-up PR as we speak







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658806394


   @github-actions crossbow submit test-conda-python-3.7-pandas-master 
test-conda-python-3.7-kartothek-master test-conda-python-3.7-kartothek-latest 
test-conda-python-3.7-dask-latest test-conda-python-3.6-pandas-0.23 
test-conda-python-3.8-pandas-latest







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658728008


   The existing tests are already failing (the above reproducible snippets were 
based on those), *if* the dictionary encoding gets enabled. 
   But I can write a `pyarrow.dataset`-specific test that captures the failure 
as well. Opened https://issues.apache.org/jira/browse/ARROW-9476 for the bug.
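   A rough sketch of what such a `pyarrow.dataset`-specific test could look like (pytest-style; the test name and `tempdir` fixture are illustrative, and this is not the test actually added for ARROW-9476):
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds


   def test_hive_partitioning_dictionary_inference(tempdir):
       # two partition levels written hive-style (same data as the reproducer
       # further down in this thread)
       table = pa.table({
           'foo': np.array([0, 1, 3]).repeat(10),
           'bar': np.tile(np.tile(np.array(['a', 'b', 'c'], dtype=object), 5), 2),
           'values': np.random.randn(30),
       })
       pq.write_to_dataset(table, str(tempdir), partition_cols=["bar", "foo"])

       part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
       # currently raises ArrowInvalid ("No dictionary provided for dictionary
       # field bar") during discovery; once ARROW-9476 is fixed this should
       # succeed and return dictionary-typed partition fields
       dataset = ds.dataset(str(tempdir), partitioning=part)
       assert dataset.to_table().num_rows == 30
   ```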







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658720185


   A slightly simplified example:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds

   foo_keys = np.array([0, 1, 3])
   bar_keys = np.array(['a', 'b', 'c'], dtype=object)
   N = 30

   table = pa.table({
       'foo': foo_keys.repeat(10),
       'bar': np.tile(np.tile(bar_keys, 5), 2),
       'values': np.random.randn(N)
   })

   base_path = "test_partition_directories3"
   pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

   # works
   ds.dataset(base_path, partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset(base_path, partitioning=part)
   ```
   
   This also fails, but with "ArrowInvalid: No dictionary provided for dictionary field bar: dictionary" (so a slightly different error message).
   
   From playing with different keys for foo/bar, it seems that it might be 
trying to use the dictionary of the first field to parse the values of the 
second field (this might be a bug in my fix for HivePartitioning). 
   
   Because replacing the keys with:
   
   ```python
   foo_keys = np.array(['a', 'b', 'c'], dtype=object)
   bar_keys = np.array(['a', 'b', 'c'], dtype=object)
   ```
   
   works, while this
   
   ```python
   foo_keys = np.array(['a', 'b', 'c'], dtype=object) 
   bar_keys = np.array(['e', 'f', 'g'], dtype=object) 
   ```
   
   fails with "Dictionary supplied for field bar: dictionary does not contain 'e'"







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658714201


   When enabling dictionary encoding for string partition fields, there are actually a bunch of failing tests...
   
   E.g. this one (based on `test_read_partitioned_directory`):
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as ds

   foo_keys = [0, 1]
   bar_keys = ['a', 'b', 'c']
   partition_spec = [
       ['foo', foo_keys],
       ['bar', bar_keys]
   ]
   N = 30

   df = pd.DataFrame({
       'index': np.arange(N),
       'foo': np.array(foo_keys, dtype='i4').repeat(15),
       'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
       'values': np.random.randn(N)
   }, columns=['index', 'foo', 'bar', 'values'])

   from pyarrow.tests.test_parquet import _generate_partition_directories
   fs = pa.filesystem.LocalFileSystem()
   _generate_partition_directories(fs, "test_partition_directories",
                                   partition_spec, df)

   # works
   ds.dataset("test_partition_directories/", partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset("test_partition_directories/", partitioning=part)
   ```
   
   fails with 
   
   ```
   ArrowInvalid: Dictionary supplied for field bar: dictionary does not contain 'c'
   In ../src/arrow/dataset/partition.cc, line 55, code: (_error_or_value13).status()
   In ../src/arrow/dataset/discovery.cc, line 243, code: (_error_or_value16).status()
   ```
   
   Another reproducible example (based on 
`test_write_to_dataset_with_partitions`) giving a similar error:
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds

   output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                             'group2': list('eefeffgeee'),
                             'num': list(range(10)),
                             'nan': [np.nan] * 10,
                             'date': np.arange('2017-01-01', '2017-01-11',
                                               dtype='datetime64[D]')})
   cols = output_df.columns.tolist()
   partition_by = ['group1', 'group2']
   output_table = pa.Table.from_pandas(output_df, safe=False,
                                       preserve_index=False)
   filesystem = pa.filesystem.LocalFileSystem()
   base_path = "test_partition_directories2/"
   pq.write_to_dataset(output_table, base_path, partition_by,
                       filesystem=filesystem)

   # works
   ds.dataset("test_partition_directories2/", partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset("test_partition_directories2/", partitioning=part)
   ```
   
   I couldn't yet figure out why it is failing in those cases, though.
   
   
   I should have tested the dictionary encoding feature more thoroughly earlier, sorry about that.
   But in the current state (unless someone can fix it today, but I don't have much time), the choice seems quite simple: merge as is without dictionary encoding, or delay until after 1.0.
   







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-15 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658662358


   To clarify:
   
   - The current PR right now doesn't use dictionary encoding for any type of partition field, so not for strings either.
   - For strings I could rather easily add it (it's an option in the datasets API that can be set; see the sketch after this list).
   - For ints it's not actually possible as long as the datasets API doesn't support it (dictionary encoding the ints after reading is possible, but it won't necessarily give you all unique values in the dictionary if you applied a filter).
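   For reference, a small sketch of the datasets-API option referred to in the second bullet (the dataset path is hypothetical; this is the same `max_partition_dictionary_size` knob used in the reproducers earlier in this thread):
   
   ```python
   import pyarrow.dataset as ds

   # hypothetical hive-partitioned directory with a string partition field "bar"
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   dataset = ds.dataset("partitioned_data/", partitioning=part)

   # with this option the inferred "bar" field should come back dictionary-encoded;
   # note this is also the code path that currently fails with multiple partition
   # levels (see the reproducers above)
   print(dataset.schema)
   ```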
   
   I will at least quickly experiment with enabling the dictionary encoding, or 
providing an option for it.







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-14 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658396423


   Note that the new option with the datasets API only dictionary encodes string partition fields, and not integer partition fields. So it would still not keep exactly the same behaviour as we had before.







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-14 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658004350


   > I'm not sure what to say about the handling of the partition fields. Is it 
ok to accept things as they are for now?
   
   I think that might be OK, yes (it means there is a change in behaviour: partition fields are no longer returned as dictionary type). But maybe we should at least provide the option to get back the old behaviour? With `pyarrow.dataset` this is possible, but with the current PR it is not possible in `read_table`.







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-12 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-657186524


   Rebased. 
   
   This depends on https://github.com/apache/arrow/pull/7704 (ARROW-9297) for 
fixing the large_memory failure noted above 
(https://github.com/apache/arrow/pull/7545#issuecomment-649631989).
   
   In addition, we should probably also decide whether we want to use dictionary type for the (string) partition fields or not. Right now we do (actually not only for strings, but also for integers), while the default with the datasets API is to use the plain (string or int) type. We could specify an option to keep the existing behaviour for `parquet.read_table`, although that also creates an inconsistency between `pyarrow.dataset` and `pyarrow.parquet` using datasets.
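   To make the behaviour difference concrete, a minimal sketch (path and field name are made up; the dictionary-vs-plain typing follows the description above) comparing the legacy reader with the datasets API default:
   
   ```python
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds

   path = "partitioned_data/"  # hypothetical hive-partitioned directory

   # legacy behaviour: string partition fields come back dictionary-encoded
   legacy = pq.ParquetDataset(path).read()
   print(legacy.schema.field("part").type)   # e.g. dictionary<values=string, indices=int32>

   # datasets API default: plain string (or int) partition fields
   new = ds.dataset(path, partitioning="hive").to_table()
   print(new.schema.field("part").type)      # e.g. string
   ```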







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-02 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-652971061


   I opened https://issues.apache.org/jira/browse/ARROW-9297 for the 
large_memory issue (BinaryArray overflow to chunked array works for 
parquet.read_table, but not in dataset scanner)







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-07-02 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-652875909


   Additional question: do we need to care about the case where the dataset 
module is not built? (which actually is the case on the Ursabot builds) 
   
   Right now, you can use `pq.read_table` without the `pyarrow.dataset` module being built. With this PR, that will stop working. However, for single files without advanced features (e.g. a filter specified), we could simply fall back to `ParquetFile(..).read()` to keep `pq.read_table` working for this simple case without `pyarrow.dataset` being available.
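   A rough sketch of the fallback idea (not the actual implementation; the helper name is made up), assuming we only fall back for a single file and raise for options that need the datasets API:
   
   ```python
   import pyarrow.parquet as pq

   try:
       import pyarrow.dataset as ds
   except ImportError:
       ds = None


   def _read_table_fallback(path, columns=None, filters=None, use_threads=True):
       # hypothetical helper: without pyarrow.dataset, handle the simple
       # single-file case directly via ParquetFile
       if ds is None:
           if filters is not None:
               raise ValueError(
                   "the 'filters' argument requires the pyarrow.dataset module")
           return pq.ParquetFile(path).read(
               columns=columns, use_threads=use_threads)
       # otherwise go through the datasets-based implementation
       return ds.dataset(path).to_table(columns=columns, use_threads=use_threads)
   ```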
   
   cc @wesm 







[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

2020-06-25 Thread GitBox


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649631989


   Still some work to do:
   
   Need to add tests for the different filesystems that can be passed.
   
   There are still some skipped tests:
   
   * `ARROW:schema` is not yet removed from the metadata -> ARROW-9009
   * Partition fields as dictionary keys
   * Specifying `metadata` object (not very important IMO)
   
   One of the `large_memory` tests is also failing 
(`test_binary_array_overflow_to_chunked`):
   
   ```
   $ pytest python/pyarrow/tests/test_parquet.py -v -r s -m large_memory --enable-large_memory
   === test session starts ===
   platform linux -- Python 3.7.3, pytest-5.2.1, py-1.8.0, pluggy-0.12.0 -- /home/joris/miniconda3/envs/arrow-dev/bin/python
   cachedir: .pytest_cache
   hypothesis profile 'dev' -> max_examples=10, database=DirectoryBasedExampleDatabase('/home/joris/scipy/repos/arrow/.hypothesis/examples')
   rootdir: /home/joris/scipy/repos/arrow/python, inifile: setup.cfg
   plugins: hypothesis-4.47.5, lazy-fixture-0.6.1
   collected 277 items / 273 deselected / 4 selected

   python/pyarrow/tests/test_parquet.py::test_large_table_int32_overflow PASSED        [ 25%]
   python/pyarrow/tests/test_parquet.py::test_byte_array_exactly_2gb PASSED            [ 50%]
   python/pyarrow/tests/test_parquet.py::test_binary_array_overflow_to_chunked FAILED  [ 75%]
   python/pyarrow/tests/test_parquet.py::test_list_of_binary_large_cell PASSED         [100%]

   === FAILURES ===
   ______ test_binary_array_overflow_to_chunked ______

   assert t.equals(result)

       @pytest.mark.pandas
       @pytest.mark.large_memory
       def test_binary_array_overflow_to_chunked():
           # ARROW-3762

           # 2^31 + 1 bytes
           values = [b'x'] + [
               b'x' * (1 << 20)
           ] * 2 * (1 << 10)
   >       df = pd.DataFrame({'byte_col': values})

   python/pyarrow/tests/test_parquet.py:3043:
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   python/pyarrow/tests/test_parquet.py:3010: in _simple_table_roundtrip
       stream = pa.BufferOutputStream()
   python/pyarrow/tests/test_parquet.py:82: in _read_table
       return pq.read_table(*args, **kwargs)
   python/pyarrow/parquet.py:1555: in read_table
       raise ValueError(
   python/pyarrow/parquet.py:1468: in read
       use_threads=use_threads
   pyarrow/_dataset.pyx:403: in pyarrow._dataset.Dataset.to_table
       ???
   pyarrow/_dataset.pyx:1893: in pyarrow._dataset.Scanner.to_table
       ???
   pyarrow/error.pxi:122: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   >   ???
   E   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays

   pyarrow/error.pxi:105: ArrowNotImplementedError
   === 1 failed, 3 passed, 273 deselected in 512.87s (0:08:32) ===
   ```
   
   
   
   
   
   
   
   
   


