[
https://issues.apache.org/jira/browse/ARROW-16526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534849#comment-17534849
]
David Li commented on ARROW-16526:
----------------------------------
I guess the challenge is that we only want to apply the mark/skip the test when
we're not using the legacy parquet dataset.
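
For illustration only, here is a minimal sketch of how a skip could be attached to just the non-legacy parametrization. It assumes that `pyarrow._dataset` is the module missing from builds without DATASET, as the traceback quoted below suggests. The names `module_available`, `needs_dataset` and `parametrize_legacy_dataset_sketch` are hypothetical and are not the helpers used in the pyarrow test suite.

```python
import importlib.util

import pytest


def module_available(name):
    """Return True if `name` can be located.

    Parent packages are imported during the lookup; the module itself is not.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example pyarrow) cannot be imported.
        return False


# Assumption: this extension module is absent when building without DATASET.
HAS_DATASET = module_available("pyarrow._dataset")

# Hypothetical mark: skip whenever the dataset module is unavailable.
needs_dataset = pytest.mark.skipif(
    not HAS_DATASET, reason="pyarrow was built without DATASET")

# Only the non-legacy case (use_legacy_dataset=False) carries the skip mark;
# the legacy case (True) is left unmarked.
parametrize_legacy_dataset_sketch = pytest.mark.parametrize(
    "use_legacy_dataset",
    [True, pytest.param(False, marks=needs_dataset)],
)


@parametrize_legacy_dataset_sketch
def test_example(use_legacy_dataset):
    # Placeholder body; a real test would exercise the Parquet code path.
    assert use_legacy_dataset in (True, False)
```

Note that the failure reported in the issue is for the `[True]` case, so a mark on the `False` parameter alone may not be enough for this particular test.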
> [Python] test_partitioned_dataset fails when building with PARQUET but
> without DATASET
> --------------------------------------------------------------------------------------
>
> Key: ARROW-16526
> URL: https://issues.apache.org/jira/browse/ARROW-16526
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Reporter: Raúl Cumplido
> Priority: Minor
> Labels: good-first-issue
> Fix For: 9.0.0
>
>
> Our current [minimal_build
> examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
> for python build with -{*}DARROW_PARQUET=ON{*} but without {*}DATASET{*}.
> This produces the following failure:
> {code:java}
> _________________________ test_partitioned_dataset[True] _________________________
>
> tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0')
> use_legacy_dataset = True
>
>     @pytest.mark.pandas
>     @parametrize_legacy_dataset
>     def test_partitioned_dataset(tempdir, use_legacy_dataset):
>         # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
>         # to a Parquet file
>         path = tempdir / "ARROW-3208"
>         df = pd.DataFrame({
>             'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
>             'two': [-1, 10, 2, 100, 1000, 1, 11],
>             'three': [0, 0, 0, 0, 0, 0, 0]
>         })
>         table = pa.Table.from_pandas(df)
> >       pq.write_to_dataset(table, root_path=str(path),
>                             partition_cols=['one', 'two'])
>
> pyarrow/tests/parquet/test_dataset.py:1544:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> pyarrow/parquet/__init__.py:3110: in write_to_dataset
>     import pyarrow.dataset as ds
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> """Dataset is currently unstable. APIs subject to change without notice."""
>
> import pyarrow as pa
> from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
>
> > from pyarrow._dataset import ( # noqa
> CsvFileFormat,
> CsvFragmentScanOptions,
> Dataset,
> DatasetFactory,
> DirectoryPartitioning,
> FilenamePartitioning,
> FileFormat,
> FileFragment,
> FileSystemDataset,
> FileSystemDatasetFactory,
> FileSystemFactoryOptions,
> FileWriteOptions,
> Fragment,
> FragmentScanOptions,
> HivePartitioning,
> IpcFileFormat,
> IpcFileWriteOptions,
> InMemoryDataset,
> Partitioning,
> PartitioningFactory,
> Scanner,
> TaggedRecordBatch,
> UnionDataset,
> UnionDatasetFactory,
> _get_partition_keys,
> _filesystemdataset_write,
> )
> E ModuleNotFoundError: No module named 'pyarrow._dataset'
> {code}
> This can be reproduced by running the minimal_build examples:
> {code:java}
> $ cd arrow/python/examples/minimal_build
> $ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
> {code}
> or by building Arrow and pyarrow with PARQUET but without DATASET.