Raúl Cumplido created ARROW-16526:
-------------------------------------
Summary: [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET
Key: ARROW-16526
URL: https://issues.apache.org/jira/browse/ARROW-16526
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Reporter: Raúl Cumplido
Fix For: 9.0.0
Our current [minimal_build examples|https://github.com/apache/arrow/tree/master/python/examples/minimal_build]
for Python build with:
{code:java}
-DARROW_PARQUET=ON \
{code}
but without DATASET.
This produces the following failure:
{code:java}
_________________________ test_partitioned_dataset[True] _________________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path),
                            partition_cols=['one', 'two'])

pyarrow/tests/parquet/test_dataset.py:1544:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    """Dataset is currently unstable. APIs subject to change without notice."""
    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'
{code}
This can be reproduced by running the minimal_build examples:
{code:java}
$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
{code}
or by building Arrow and PyArrow with PARQUET but without DATASET.
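To confirm that a local build matches the failing configuration, something like the following might help. It is only a sketch: the helper names {{has_module}} and {{build_matches_report}} are hypothetical, and it assumes Parquet support shows up as the {{pyarrow._parquet}} extension module in the same way Dataset support shows up as {{pyarrow._dataset}} in the traceback above.

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be located, without importing `name` itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec imports the parent package first; this is raised when
        # the parent (e.g. pyarrow itself) is missing or broken.
        return False


def build_matches_report():
    """True when Parquet support is present but Dataset support is not.

    Assumption: each optional component maps to one extension module.
    """
    return has_module("pyarrow._parquet") and not has_module("pyarrow._dataset")


if __name__ == "__main__":
    print("parquet built:", has_module("pyarrow._parquet"))
    print("dataset built:", has_module("pyarrow._dataset"))
    print("matches this report:", build_matches_report())
```

If the last line prints True, test_partitioned_dataset would be expected to hit the ModuleNotFoundError shown above.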