[ 
https://issues.apache.org/jira/browse/ARROW-16526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido updated ARROW-16526:
----------------------------------
    Description: 
Our current [minimal_build 
examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
 for Python build with:
{code:java}
  -DARROW_PARQUET=ON \{code}
but without DATASET.

This produces the following failure:
{code:java}
 _________________________________________________________ 
test_partitioned_dataset[True] 
_________________________________________________________tempdir = 
PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), 
use_legacy_dataset = True    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned 
dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path),
                            partition_cols=['one', 
'two'])pyarrow/tests/parquet/test_dataset.py:1544: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _     
"""Dataset is currently unstable. APIs subject to change without notice."""
    
    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
    
>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
This can be reproduced by running the minimal_build examples:
{code:java}
$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu . {code}
or via building arrow and pyarrow with PARQUET but without DATASET.

  was:
Our current [minimal_build 
examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
 for python build with:
{code:java}
 -DARROW_PARQUET=ON \{code}
but without DATASET.

This produces the following failure:
{code:java}
 _________________________________________________________ 
test_partitioned_dataset[True] 
_________________________________________________________tempdir = 
PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), 
use_legacy_dataset = True    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned 
dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path),
                            partition_cols=['one', 
'two'])pyarrow/tests/parquet/test_dataset.py:1544: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _     
"""Dataset is currently unstable. APIs subject to change without notice."""
    
    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
    
>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
This can be reproduced by running the minimal_build examples:
{code:java}
$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu . {code}
or via building arrow and pyarrow with PARQUET but without DATASET.


> [Python] test_partitioned_dataset fails when building with PARQUET but 
> without DATASET
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-16526
>                 URL: https://issues.apache.org/jira/browse/ARROW-16526
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>            Reporter: Raúl Cumplido
>            Priority: Minor
>              Labels: good-first-issue
>             Fix For: 9.0.0
>
>
> Our current [minimal_build 
> examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
>  for python build with:
> {code:java}
>   -DARROW_PARQUET=ON \{code}
> but without DATASET.
> This produces the following failure:
> {code:java}
>  _________________________________________________________ 
> test_partitioned_dataset[True] 
> _________________________________________________________tempdir = 
> PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), 
> use_legacy_dataset = True    @pytest.mark.pandas
>     @parametrize_legacy_dataset
>     def test_partitioned_dataset(tempdir, use_legacy_dataset):
>         # ARROW-3208: Segmentation fault when reading a Parquet partitioned 
> dataset
>         # to a Parquet file
>         path = tempdir / "ARROW-3208"
>         df = pd.DataFrame({
>             'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
>             'two': [-1, 10, 2, 100, 1000, 1, 11],
>             'three': [0, 0, 0, 0, 0, 0, 0]
>         })
>         table = pa.Table.from_pandas(df)
> >       pq.write_to_dataset(table, root_path=str(path),
>                             partition_cols=['one', 
> 'two'])pyarrow/tests/parquet/test_dataset.py:1544: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/parquet/__init__.py:3110: in write_to_dataset
>     import pyarrow.dataset as ds
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _     
> """Dataset is currently unstable. APIs subject to change without notice."""
>     
>     import pyarrow as pa
>     from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
>     
> >   from pyarrow._dataset import (  # noqa
>         CsvFileFormat,
>         CsvFragmentScanOptions,
>         Dataset,
>         DatasetFactory,
>         DirectoryPartitioning,
>         FilenamePartitioning,
>         FileFormat,
>         FileFragment,
>         FileSystemDataset,
>         FileSystemDatasetFactory,
>         FileSystemFactoryOptions,
>         FileWriteOptions,
>         Fragment,
>         FragmentScanOptions,
>         HivePartitioning,
>         IpcFileFormat,
>         IpcFileWriteOptions,
>         InMemoryDataset,
>         Partitioning,
>         PartitioningFactory,
>         Scanner,
>         TaggedRecordBatch,
>         UnionDataset,
>         UnionDatasetFactory,
>         _get_partition_keys,
>         _filesystemdataset_write,
>     )
> E   ModuleNotFoundError: No module named 'pyarrow._dataset'
> {code}
> This can be reproduced by running the minimal_build examples:
> {code:java}
> $ cd arrow/python/examples/minimal_build
> $ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu . {code}
> or via building arrow and pyarrow with PARQUET but without DATASET.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to