[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658921634

-> https://github.com/apache/arrow/pull/

(note that this will only impact direct users of `read_table` with partitioned datasets, which e.g. does not include dask, but I suppose might impact pandas users)

@bkietz will you merge this?
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658919336

@martindurant I am doing the follow-up PR as we speak.
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658806394

@github-actions crossbow submit test-conda-python-3.7-pandas-master test-conda-python-3.7-kartothek-master test-conda-python-3.7-kartothek-latest test-conda-python-3.7-dask-latest test-conda-python-3.6-pandas-0.23 test-conda-python-3.8-pandas-latest
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658728008

The existing tests are already failing (the above reproducible snippets were based on those), *if* the dictionary encoding gets enabled. But I can write a `pyarrow.dataset`-specific test that captures the failure as well.

Opened https://issues.apache.org/jira/browse/ARROW-9476 for the bug.
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658720185

A bit simplified example:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

foo_keys = np.array([0, 1, 3])
bar_keys = np.array(['a', 'b', 'c'], dtype=object)
N = 30

table = pa.table({
    'foo': foo_keys.repeat(10),
    'bar': np.tile(np.tile(bar_keys, 5), 2),
    'values': np.random.randn(N)
})

base_path = "test_partition_directories3"
pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

# works
ds.dataset(base_path, partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset(base_path, partitioning=part)
```

This also fails, with "ArrowInvalid: No dictionary provided for dictionary field bar: dictionary" (so a slightly different error message).

From playing with different keys for foo/bar, it seems that it might be trying to use the dictionary of the first field to parse the values of the second field (this might be a bug in my fix for HivePartitioning). Because replacing the keys with:

```python
foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['a', 'b', 'c'], dtype=object)
```

works, while this

```python
foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['e', 'f', 'g'], dtype=object)
```

fails with "Dictionary supplied for field bar: dictionary does not contain 'e'".
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658714201

When enabling dictionary encoding for string partition fields, there are actually a bunch of failing tests. E.g. this one (based on `test_read_partitioned_directory`):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
partition_spec = [
    ['foo', foo_keys],
    ['bar', bar_keys]
]
N = 30

df = pd.DataFrame({
    'index': np.arange(N),
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
}, columns=['index', 'foo', 'bar', 'values'])

from pyarrow.tests.test_parquet import _generate_partition_directories

fs = pa.filesystem.LocalFileSystem()
_generate_partition_directories(fs, "test_partition_directories", partition_spec, df)

# works
ds.dataset("test_partition_directories/", partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset("test_partition_directories/", partitioning=part)
```

fails with

```
ArrowInvalid: Dictionary supplied for field bar: dictionary does not contain 'c'
In ../src/arrow/dataset/partition.cc, line 55, code: (_error_or_value13).status()
In ../src/arrow/dataset/discovery.cc, line 243, code: (_error_or_value16).status()
```

Another reproducible example (based on `test_write_to_dataset_with_partitions`) giving a similar error:

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                          'group2': list('eefeffgeee'),
                          'num': list(range(10)),
                          'nan': [np.nan] * 10,
                          'date': np.arange('2017-01-01', '2017-01-11',
                                            dtype='datetime64[D]')})
cols = output_df.columns.tolist()
partition_by = ['group1', 'group2']
output_table = pa.Table.from_pandas(output_df, safe=False, preserve_index=False)

filesystem = pa.filesystem.LocalFileSystem()
base_path = "test_partition_directories2/"
pq.write_to_dataset(output_table, base_path, partition_by, filesystem=filesystem)

# works
ds.dataset("test_partition_directories2/", partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset("test_partition_directories2/", partitioning=part)
```

I couldn't yet figure out the reason it is failing in those cases, though. I should have tested the dictionary encoding feature more thoroughly earlier, sorry about that. But in the current state (unless someone can fix it today, but I don't have much time), the choice seems quite simple: merge as is without dictionary encoding, or delay until after 1.0.
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658662358

To clarify:

- The current PR right now doesn't use dictionary encoding for any type of partition field, so also not for strings
- For strings I could rather easily add it (it's an option in the datasets API that can be set)
- For ints it's not actually possible, as long as the datasets API doesn't support it (dictionary encoding the ints after reading is possible, but won't necessarily give you all unique values in the dictionary if you applied a filter)

I will at least quickly experiment with enabling the dictionary encoding, or providing an option for it.
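As a hedged illustration of that last point (a sketch, not code from the PR; the path and filter reuse the `test_partition_directories3` example above, with partition values `foo in [0, 1, 3]`): dictionary-encoding an integer partition column after reading is easy, but the resulting dictionary only reflects the values that survived the read.

```python
import pyarrow.parquet as pq

# Read with a filter on the integer partition column (illustrative path/filter).
table = pq.read_table("test_partition_directories3", filters=[("foo", "=", 1)])

# Dictionary-encode the integer partition column after the fact.
foo_dict = table.column("foo").dictionary_encode()

# The dictionary now only holds the values present after filtering (here just 1),
# not all partition values that exist on disk (0, 1 and 3).
print(foo_dict.type)
```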
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658396423

Note that the new option in the datasets API only dictionary-encodes string partition fields, not integer partition fields. So it would still not keep exactly the same behaviour as we had before.
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658004350

> I'm not sure what to say about the handling of the partition fields. Is it ok to accept things as they are for now?

I think that might be OK, yes (it means there is a change in behaviour to no longer return partition fields as dictionary type). But maybe we should at least provide the option to get back the old behaviour? With `pyarrow.dataset` this is possible, but with the current PR this is not possible in `read_table`.
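Roughly, the asymmetry being described looks like this (a sketch under the assumptions of the earlier snippets, not the PR's actual code; the path is the illustrative partitioned directory from above):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = "test_partition_directories3"  # illustrative partitioned dataset

# pyarrow.dataset: the old behaviour (dictionary-typed string partition columns)
# can be requested explicitly via the partitioning option.
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
table_dict = ds.dataset(path, partitioning=part).to_table()

# parquet.read_table as of this PR: no equivalent keyword is exposed, so the
# partition columns come back as plain string/int columns.
table_plain = pq.read_table(path)
```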
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-657186524

Rebased. This depends on https://github.com/apache/arrow/pull/7704 (ARROW-9297) for fixing the large_memory failure noted above (https://github.com/apache/arrow/pull/7545#issuecomment-649631989).

In addition, we should probably also decide on whether we want to use dictionary type for the (string) partition fields or not. Right now we do (actually not only for strings, but also for integers), while the default with the datasets API is to use the plain (string or int) type. We could specify an option to keep the existing behaviour for `parquet.read_table` (although that also creates an inconsistency between `pyarrow.dataset` and `pyarrow.parquet` using datasets).
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-652971061

I opened https://issues.apache.org/jira/browse/ARROW-9297 for the large_memory issue (BinaryArray overflow to chunked array works for parquet.read_table, but not in the dataset scanner).
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-652875909

Additional question: do we need to care about the case where the dataset module is not built? (which actually is the case on the Ursabot builds)

Right now, you can use `pq.read_table` without the `pyarrow.dataset` module being built. With this PR, that will stop working. However, for single files without advanced features (e.g. a filter specified), we could simply fall back to `ParquetFile(..).read()` to keep `pq.read_table` working for this simple case without `pyarrow.dataset` being available.

cc @wesm
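A minimal sketch of what such a fallback could look like (the `_read_table` helper name and the import-based availability check are assumptions for illustration, not the PR's actual implementation):

```python
import pyarrow.parquet as pq

try:
    import pyarrow.dataset as ds  # may not be built in all configurations
except ImportError:
    ds = None


def _read_table(path, columns=None, filters=None):
    """Hypothetical read_table with a fallback when pyarrow.dataset is missing."""
    if ds is not None:
        # Full-featured path: directories, hive partitioning, filtering, ...
        dataset = ds.dataset(path, format="parquet", partitioning="hive")
        # (filters would be translated to a dataset filter expression here)
        return dataset.to_table(columns=columns)
    if filters is not None:
        raise ValueError("the 'filters' argument requires the pyarrow.dataset module")
    # Simple single-file fallback that works without pyarrow.dataset.
    return pq.ParquetFile(path).read(columns=columns)
```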
jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649631989

Still some work: need to add tests for the different filesystems that can be passed.

There are still some skipped tests:

* `ARROW:schema` is not yet removed from the metadata -> ARROW-9009
* Partition fields as dictionary keys
* Specifying `metadata` object (not very important IMO)

One of the `large_memory` tests is also failing (`test_binary_array_overflow_to_chunked`):

```
$ pytest python/pyarrow/tests/test_parquet.py -v -r s -m large_memory --enable-large_memory
=== test session starts ===
platform linux -- Python 3.7.3, pytest-5.2.1, py-1.8.0, pluggy-0.12.0 -- /home/joris/miniconda3/envs/arrow-dev/bin/python
cachedir: .pytest_cache
hypothesis profile 'dev' -> max_examples=10, database=DirectoryBasedExampleDatabase('/home/joris/scipy/repos/arrow/.hypothesis/examples')
rootdir: /home/joris/scipy/repos/arrow/python, inifile: setup.cfg
plugins: hypothesis-4.47.5, lazy-fixture-0.6.1
collected 277 items / 273 deselected / 4 selected

python/pyarrow/tests/test_parquet.py::test_large_table_int32_overflow PASSED      [ 25%]
python/pyarrow/tests/test_parquet.py::test_byte_array_exactly_2gb PASSED          [ 50%]
python/pyarrow/tests/test_parquet.py::test_binary_array_overflow_to_chunked FAILED [ 75%]
python/pyarrow/tests/test_parquet.py::test_list_of_binary_large_cell PASSED       [100%]

=== FAILURES ===
___ test_binary_array_overflow_to_chunked ___

assert t.equals(result)

    @pytest.mark.pandas
    @pytest.mark.large_memory
    def test_binary_array_overflow_to_chunked():
        # ARROW-3762
        # 2^31 + 1 bytes
        values = [b'x'] + [
            b'x' * (1 << 20)
        ] * 2 * (1 << 10)
>       df = pd.DataFrame({'byte_col': values})

python/pyarrow/tests/test_parquet.py:3043:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/pyarrow/tests/test_parquet.py:3010: in _simple_table_roundtrip
    stream = pa.BufferOutputStream()
python/pyarrow/tests/test_parquet.py:82: in _read_table
    return pq.read_table(*args, **kwargs)
python/pyarrow/parquet.py:1555: in read_table
    raise ValueError(
python/pyarrow/parquet.py:1468: in read
    use_threads=use_threads
pyarrow/_dataset.pyx:403: in pyarrow._dataset.Dataset.to_table
    ???
pyarrow/_dataset.pyx:1893: in pyarrow._dataset.Scanner.to_table
    ???
pyarrow/error.pxi:122: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays

pyarrow/error.pxi:105: ArrowNotImplementedError
= 1 failed, 3 passed, 273 deselected in 512.87s (0:08:32) =
```
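For reference, a condensed reproduction of that failing test (a sketch based on the traceback above; it needs several GB of memory, and the in-memory roundtrip is inlined here instead of the test suite's `_simple_table_roundtrip` helper):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 2^31 + 1 bytes of binary data (ARROW-3762): enough to make the column
# overflow into a chunked BinaryArray.
values = [b'x'] + [b'x' * (1 << 20)] * 2 * (1 << 10)
df = pd.DataFrame({'byte_col': values})
table = pa.Table.from_pandas(df)

# In-memory parquet roundtrip.
stream = pa.BufferOutputStream()
pq.write_table(table, stream)

# With the datasets-based read_table this raised
# "ArrowNotImplementedError: This class cannot yet iterate chunked arrays"
# at the time of this comment (tracked as ARROW-9297).
result = pq.read_table(pa.BufferReader(stream.getvalue()))
assert table.equals(result)
```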