[ https://issues.apache.org/jira/browse/ARROW-13813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408162#comment-17408162 ]
David Li commented on ARROW-13813:
----------------------------------
{quote}Currently we don't support "transforms" for the partitioning column
(something we maybe should, so that you could also say "year(date_column)" to
partition on). This means that you need to compute such a URL-encoded column
up front, which is not necessarily ideal, both for performance/memory
(although in a lazy/batched query execution context this might not matter)
and for usability.
{quote}
I agree, this would be useful to support. I assume once we have a node for
writing, this will be easier.
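
To illustrate the "transform" idea in the quote above, here is a minimal sketch in plain Python. It is not the Arrow API: `partition_by` and the row layout are hypothetical, and it only shows the partition key being derived per row at write time instead of being stored as an extra column up front.

```python
from collections import defaultdict
from datetime import date


def partition_by(rows, transform, name):
    """Group rows under 'name=<transform(row)>' directory labels.

    The derived key is computed on the fly; the rows themselves are
    not modified, so no extra column has to be materialized.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[f"{name}={transform(row)}"].append(row)
    return dict(groups)


rows = [
    {"date_column": date(2020, 5, 17), "value": 1},
    {"date_column": date(2021, 1, 3), "value": 2},
    {"date_column": date(2021, 8, 30), "value": 3},
]

# Rough equivalent of partitioning on "year(date_column)".
by_year = partition_by(rows, lambda r: r["date_column"].year, "year")
```

The same hook could in principle carry a URL-encoding transform for string fields.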
{quote}should we write this silently? Or should we actually check when creating
the file path that the value inserted for the partition field is a "valid"
string for file paths (e.g. no /, not an empty string, ...), and raise an error
instead of creating a wrong dataset? Or should we automatically encode those?
{quote}
On Windows there are more invalid path characters to consider, but I agree: we
should raise an error instead of writing unreadable data. (I think auto-encoding
values can be done by the bindings and doesn't necessarily have to be done in
C++, e.g. the Python/R bindings could offer an option to wrap string partition
fields in a URL-encode pass.)
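
A rough sketch of what such a check plus an opt-in encode pass could look like on the bindings side. Everything here is an assumption for illustration: `partition_segment`, its `url_encode` flag, and the `INVALID_SEGMENT_CHARS` set are hypothetical names, and the character set is not an exhaustive list of what Windows rejects.

```python
from urllib.parse import quote

# Assumed set of characters that must not appear in a path segment.
# "/" and NUL are invalid on POSIX; the others are additionally
# rejected on Windows. This list is illustrative, not exhaustive.
INVALID_SEGMENT_CHARS = set('/\\<>:"|?*') | {"\0"}


def partition_segment(key, value, url_encode=False):
    """Build a 'key=value' path segment, raising on unusable values."""
    text = str(value)
    if url_encode:
        # safe="" makes quote() percent-encode "/" as well.
        text = quote(text, safe="")
    if text == "":
        raise ValueError(f"empty value for partition field {key!r}")
    bad = sorted(INVALID_SEGMENT_CHARS.intersection(text))
    if bad:
        raise ValueError(
            f"partition field {key!r} contains invalid path characters: {bad}"
        )
    return f"{key}={text}"
```

With `url_encode=False` a value such as `"a/b"` raises instead of silently creating an extra directory level; with `url_encode=True` it becomes `a%2Fb`.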
> [C++][Dataset] Support URL encoding of partition field values for the file path
> -------------------------------------------------------------------------------
>
> Key: ARROW-13813
> URL: https://issues.apache.org/jira/browse/ARROW-13813
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> In ARROW-12644, we added support for _decoding_ the file paths when reading
> datasets. So a valid follow-up question: should we also support _encoding_
> when writing datasets?
> (see also https://github.com/apache/arrow/issues/11027)
> Rereading ARROW-12644, there wasn't yet much discussion on that aspect.
> cc [~westonpace] [~lidavidm]
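
For context, a small sketch of the read side described in the issue: decoding Hive-style `key=value` path segments. This is plain Python with a hypothetical `parse_hive_path` helper, not Arrow's actual implementation. It only shows why a value that was percent-encoded when the path was built can be recovered when reading.

```python
from urllib.parse import quote, unquote


def parse_hive_path(path):
    """Extract URL-decoded partition fields from 'key=value' segments."""
    fields = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # skip segments without '=', such as the file name
            fields[key] = unquote(value)
    return fields


original = "2021/08/31 10:00"
# Encoding on write keeps the value inside a single path segment.
path = f"ts={quote(original, safe='')}/part-0.parquet"
decoded = parse_hive_path(path)
```

Without the encoding step, the slashes in the value would have been read as directory separators and the original value could not be reconstructed.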