[
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265940#comment-17265940
]
Joris Van den Bossche commented on ARROW-10247:
-----------------------------------------------
bq. how would you generally go about finding the array of values?
Well, that's up to you (e.g. parsing it from the file paths, or storing that
information somewhere). But my hunch is that we shouldn't actually _require_
the user to pass it, since pyarrow can infer it itself by parsing the file
paths when it's not provided. I opened ARROW-11260 for this.
> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -----------------------------------------------------------------------------
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 5h 40m
> Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet",
> partitioning=part)
> ---------------------------------------------------------------------------
> ArrowTypeError Traceback (most recent call last)
> <ipython-input-12-c7b81c9b0bda> in <module>
> ----> 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet",
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data,
> base_dir, basename_template, format, partitioning, schema, filesystem,
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part:
> dictionary<values=string, indices=int32, ordered=0>
> In ../src/arrow/dataset/filter.cc, line 1082, code:
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr,
> [&](const std::string& name, const std::shared_ptr<Scalar>& value) { auto&&
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do {
> ::arrow::Status __s =
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s);
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257,
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false);
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const
> auto& field = schema_->field(match[0]); if
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ",
> value->ToString(), " (of type ", *value->type, ") is invalid for ",
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK();
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code:
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> This seems like a quite normal use case, as such a column will typically
> contain many repeated values (and we also support reading the partition field
> back as dictionary type, so a roundtrip is currently not possible in that
> case).
> I tagged it for 2.0.0 for the moment in case it's possible to fix now, but I
> haven't yet looked into how easy that would be.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)