Joris Van den Bossche created ARROW-10247:
---------------------------------------------
Summary: [C++][Dataset] Cannot write dataset with dictionary column as partition field
Key: ARROW-10247
URL: https://issues.apache.org/jira/browse/ARROW-10247
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche
Fix For: 2.0.0
When the column to use for partitioning is dictionary encoded, we get this
error:
{code}
In [8]: import pyarrow as pa
In [9]: import pyarrow.dataset as ds
In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
...: table = pa.table([
...: pa.array(range(len(part))),
...: pa.array(part).dictionary_encode(),
...: ], names=['col', 'part'])
In [11]: part = ds.partitioning(table.select(["part"]).schema)
In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", partitioning=part)
---------------------------------------------------------------------------
ArrowTypeError Traceback (most recent call last)
<ipython-input-12-c7b81c9b0bda> in <module>
----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet",
partitioning=part)
~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, base_dir,
basename_template, format, partitioning, schema, filesystem, file_options,
use_threads)
773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
--> 775 filesystem, partitioning, file_options, use_threads,
776 )
~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in
pyarrow._dataset._filesystemdataset_write()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowTypeError: scalar xxx (of type string) is invalid for part: dictionary<values=string, indices=int32, ordered=0>
In ../src/arrow/dataset/filter.cc, line 1082, code:
VisitConjunctionMembers(*and_.left_operand(), visitor)
In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, [&](const
std::string& name, const std::shared_ptr<Scalar>& value) { auto&&
_error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do {
::arrow::Status __s =
::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if
((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s);
_st.AddContextLine("../src/arrow/dataset/partition.cc", 257,
"(_error_or_value28).status()"); return _st; } } while (0); } while (false);
auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const
auto& field = schema_->field(match[0]); if
(!value->type->Equals(field->type())) { return Status::TypeError("scalar ",
value->ToString(), " (of type ", *value->type, ") is invalid for ",
field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); })
In ../src/arrow/dataset/file_base.cc, line 321, code:
(_error_or_value24).status()
In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
{code}
This seems like a quite normal use case: the partition column will typically repeat the same values many times, and we also support reading partition fields back as dictionary type, so a roundtrip is currently not possible in this case.
I tagged it for 2.0.0 for the moment in case a fix is possible today, but I didn't yet look into how easy it would be.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)