[ https://issues.apache.org/jira/browse/ARROW-16526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535984#comment-17535984 ]

Joris Van den Bossche commented on ARROW-16526:
-----------------------------------------------

Indeed, we should currently mark those tests appropriately, because we allow 
building pyarrow with Parquet but without Dataset (or the other way around). But 
because we no longer have a regular build of pyarrow without datasets, we often 
forget to add this mark and have to fix it afterwards (I think in the past we 
had an ursabot build without dataset).

> [Python] test_partitioned_dataset fails when building with PARQUET but 
> without DATASET
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-16526
>                 URL: https://issues.apache.org/jira/browse/ARROW-16526
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>            Reporter: Raúl Cumplido
>            Priority: Minor
>              Labels: good-first-issue, pull-request-available
>             Fix For: 9.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Our current [minimal_build 
> examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
>  for Python build with *-DARROW_PARQUET=ON* but without *DATASET*. This 
> produces the following failure:
> {code:java}
> _________________________ test_partitioned_dataset[True] _________________________
>
> tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True
>
>     @pytest.mark.pandas
>     @parametrize_legacy_dataset
>     def test_partitioned_dataset(tempdir, use_legacy_dataset):
>         # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
>         # to a Parquet file
>         path = tempdir / "ARROW-3208"
>         df = pd.DataFrame({
>             'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
>             'two': [-1, 10, 2, 100, 1000, 1, 11],
>             'three': [0, 0, 0, 0, 0, 0, 0]
>         })
>         table = pa.Table.from_pandas(df)
> >       pq.write_to_dataset(table, root_path=str(path),
>                             partition_cols=['one', 'two'])
>
> pyarrow/tests/parquet/test_dataset.py:1544: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> pyarrow/parquet/__init__.py:3110: in write_to_dataset
>     import pyarrow.dataset as ds
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _     
> """Dataset is currently unstable. APIs subject to change without notice."""
>     
>     import pyarrow as pa
>     from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
>     
> >   from pyarrow._dataset import (  # noqa
>         CsvFileFormat,
>         CsvFragmentScanOptions,
>         Dataset,
>         DatasetFactory,
>         DirectoryPartitioning,
>         FilenamePartitioning,
>         FileFormat,
>         FileFragment,
>         FileSystemDataset,
>         FileSystemDatasetFactory,
>         FileSystemFactoryOptions,
>         FileWriteOptions,
>         Fragment,
>         FragmentScanOptions,
>         HivePartitioning,
>         IpcFileFormat,
>         IpcFileWriteOptions,
>         InMemoryDataset,
>         Partitioning,
>         PartitioningFactory,
>         Scanner,
>         TaggedRecordBatch,
>         UnionDataset,
>         UnionDatasetFactory,
>         _get_partition_keys,
>         _filesystemdataset_write,
>     )
> E   ModuleNotFoundError: No module named 'pyarrow._dataset'
> {code}
> This can be reproduced via running the minimal_build examples:
> {code:java}
> $ cd arrow/python/examples/minimal_build
> $ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
> {code}
> or via building arrow and pyarrow with PARQUET but without DATASET.
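For reference, a hedged sketch of what such a build might look like outside the Docker example. `ARROW_PARQUET` and `ARROW_DATASET` are real Arrow CMake options; the paths, install prefix, and the `PYARROW_WITH_*` environment variables mirroring them in the Python build are illustrative of the era's build flow and may differ between Arrow versions.

```shell
# Sketch: build Arrow C++ with Parquet but without the Dataset module,
# then build pyarrow against it. Paths are placeholders.
cmake -S arrow/cpp -B arrow-cpp-build \
    -DCMAKE_INSTALL_PREFIX=/usr/local \
    -DARROW_PARQUET=ON \
    -DARROW_DATASET=OFF
cmake --build arrow-cpp-build --target install

# pyarrow's setup reads feature toggles from the environment.
export PYARROW_WITH_PARQUET=1
export PYARROW_WITH_DATASET=0
python -m pip install -e arrow/python --no-build-isolation
```

Running the parquet test suite against such a build reproduces the `ModuleNotFoundError` above for any unmarked dataset-dependent test.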



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
