[
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265940#comment-17265940
]
Joris Van den Bossche commented on ARROW-10247:
-----------------------------------------------
bq. how would you generally go about finding the array of values?
Well, that's up to you (e.g. parsing it from the file paths, or storing that
information somewhere). But my hunch is that we shouldn't actually _require_
the user to pass it, since pyarrow can infer it itself by parsing the file
paths when it's not provided. I opened ARROW-11260 for this.
> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -----------------------------------------------------------------------------
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet",
> partitioning=part)
> ---------------------------------------------------------------------------
> ArrowTypeError Traceback (most recent call last)
> <ipython-input-12-c7b81c9b0bda> in <module>
> ----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet",
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data,
> base_dir, basename_template, format, partitioning, schema, filesystem,
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part:
> dictionary<values=string, indices=int32, ordered=0>
> In ../src/arrow/dataset/filter.cc, line 1082, code:
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr,
> [&](const std::string& name, const std::shared_ptr<Scalar>& value) { auto&&
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do {
> ::arrow::Status __s =
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s);
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257,
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false);
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const
> auto& field = schema_->field(match[0]); if
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ",
> value->ToString(), " (of type ", *value->type, ") is invalid for ",
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK();
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code:
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> This seems like a quite normal use case, as such a column will typically
> contain many repeated values (and we also support reading the partition field
> back as dictionary type, so a roundtrip is currently not possible in that
> case).
> I tagged it for 2.0.0 for the moment in case it's possible to fix now, but I
> haven't yet looked into how easy that would be.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)