[ https://issues.apache.org/jira/browse/ARROW-16526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534805#comment-17534805 ]

Raúl Cumplido commented on ARROW-16526:
---------------------------------------

[~alenkaf] [~jorisvandenbossche] should we fix this case? Is building with 
PARQUET but without DATASET a supported configuration? Do you think we should 
change the minimal build examples to build with DATASET enabled?
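If we keep supporting PARQUET-only builds, the dataset-dependent tests could 
skip instead of fail. A minimal sketch using standard pytest (where exactly the 
guard would live in the test module is my assumption):

{code:python}
import pytest

# Skip the dataset-dependent tests when pyarrow was built without
# ARROW_DATASET: importorskip turns the ImportError into a pytest skip.
ds = pytest.importorskip("pyarrow.dataset")
{code}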

> [Python] test_partitioned_dataset fails when building with PARQUET but 
> without DATASET
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-16526
>                 URL: https://issues.apache.org/jira/browse/ARROW-16526
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 8.0.0
>            Reporter: Raúl Cumplido
>            Priority: Minor
>              Labels: good-first-issue
>             Fix For: 9.0.0
>
>
> Our current [minimal_build 
> examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
>  for Python build with {*}-DARROW_PARQUET=ON{*} but without {*}ARROW_DATASET{*}. 
> This produces the following failure:
> {code:java}
> _________________________________ test_partitioned_dataset[True] _________________________________
> 
> tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0')
> use_legacy_dataset = True
> 
>     @pytest.mark.pandas
>     @parametrize_legacy_dataset
>     def test_partitioned_dataset(tempdir, use_legacy_dataset):
>         # ARROW-3208: Segmentation fault when reading a Parquet partitioned
>         # dataset to a Parquet file
>         path = tempdir / "ARROW-3208"
>         df = pd.DataFrame({
>             'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
>             'two': [-1, 10, 2, 100, 1000, 1, 11],
>             'three': [0, 0, 0, 0, 0, 0, 0]
>         })
>         table = pa.Table.from_pandas(df)
> >       pq.write_to_dataset(table, root_path=str(path),
>                             partition_cols=['one', 'two'])
> 
> pyarrow/tests/parquet/test_dataset.py:1544: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> pyarrow/parquet/__init__.py:3110: in write_to_dataset
>     import pyarrow.dataset as ds
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> 
>     """Dataset is currently unstable. APIs subject to change without notice."""
>     
>     import pyarrow as pa
>     from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
>     
> >   from pyarrow._dataset import (  # noqa
>         CsvFileFormat,
>         CsvFragmentScanOptions,
>         Dataset,
>         DatasetFactory,
>         DirectoryPartitioning,
>         FilenamePartitioning,
>         FileFormat,
>         FileFragment,
>         FileSystemDataset,
>         FileSystemDatasetFactory,
>         FileSystemFactoryOptions,
>         FileWriteOptions,
>         Fragment,
>         FragmentScanOptions,
>         HivePartitioning,
>         IpcFileFormat,
>         IpcFileWriteOptions,
>         InMemoryDataset,
>         Partitioning,
>         PartitioningFactory,
>         Scanner,
>         TaggedRecordBatch,
>         UnionDataset,
>         UnionDatasetFactory,
>         _get_partition_keys,
>         _filesystemdataset_write,
>     )
> E   ModuleNotFoundError: No module named 'pyarrow._dataset'
> {code}
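> A guarded import could turn the bare ModuleNotFoundError above into an 
> actionable error. A sketch only ({{_require_dataset}} is a hypothetical 
> helper, not current pyarrow code):
> {code:python}
> # Hypothetical helper: probe for the optional dataset extension and
> # fail with a message that names the missing build flag.
> def _require_dataset():
>     try:
>         import pyarrow.dataset  # noqa: F401
>     except ImportError as exc:
>         raise ImportError(
>             "write_to_dataset() requires pyarrow.dataset; "
>             "rebuild Arrow with -DARROW_DATASET=ON") from exc
> {code}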
> This can be reproduced by running the minimal_build examples:
> {code:java}
> $ cd arrow/python/examples/minimal_build
> $ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
> {code}
> or by building Arrow and pyarrow with PARQUET but without DATASET.
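> A quick way to confirm what a given build ships is to probe the optional 
> modules directly (plain import checks, no special API):
> {code:python}
> import importlib
> 
> # In the failing build, pyarrow.parquet imports fine while
> # pyarrow.dataset is missing, matching the traceback above.
> for mod in ("pyarrow.parquet", "pyarrow.dataset"):
>     try:
>         importlib.import_module(mod)
>         print(mod, "-> available")
>     except ImportError:
>         print(mod, "-> missing")
> {code}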



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
