[jira] [Commented] (ARROW-13813) [C++][Dataset] Support URL encoding of partition field values for the file path

David Li (Jira) Wed, 01 Sep 2021 06:21:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408162#comment-17408162
 ]


David Li commented on ARROW-13813:
----------------------------------

{quote}Currently we don't support "transforms" for the partitioning column 
(something we maybe should? so that you could also say "year(date_column)" to 
partition on), which means that you need to calculate such URL encoded column 
up front, which is not necessarily ideal, both performance/memory wise 
(although in a (lazy/batched) query execution context, this might not matter) 
and from a usability context.
{quote}
I agree, this would be useful to support. I assume this will be easier once we 
have a node for writing.
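
A minimal sketch of what "calculating the column up front" means in practice, written in plain Python rather than against the Arrow API. The sample rows and the helper name `with_partition_key` are hypothetical and only serve the illustration; a real dataset would do this on Arrow columns.

```python
from datetime import date
from urllib.parse import quote


def with_partition_key(rows, transform, source, target):
    """Materialize transform(row[source]) as an extra column named target."""
    return [{**row, target: transform(row[source])} for row in rows]


# Hypothetical sample data.
rows = [
    {"date_column": date(2021, 9, 1), "city": "New York/NY"},
    {"date_column": date(2020, 1, 15), "city": "Gent"},
]

# Stand-in for partitioning on "year(date_column)": the derived value
# has to exist as a real column before the write.
rows = with_partition_key(rows, lambda d: d.year, "date_column", "year")

# Same idea for a URL-encoded copy of a string column.
rows = with_partition_key(
    rows, lambda s: quote(s, safe=""), "city", "city_encoded"
)
```

Each derived column is a full extra copy of the data, which is the memory cost mentioned in the quote; a transform applied at write time would avoid materializing it.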
{quote}should we write this silently? Or should we actually check when creating 
the file path that the value inserted for the partition field is a "valid" 
string for file paths (so eg no /, not an empty string, ..), and raise an error 
instead of creating a wrong dataset? Or should we automatically encode those?
{quote}
On Windows there are more invalid path characters to consider, but I agree: we 
should raise an error instead of writing unreadable data. (I think auto-encoding 
values can be done by the bindings and doesn't necessarily have to be done in 
C++. For example, the Python/R bindings could offer an option to wrap string 
partition fields in a URL-encode pass.)
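
To make the two options concrete, here is a rough sketch in plain Python of a binding-level helper that either rejects an unsafe partition value or URL-encodes it first. The function name `partition_segment`, the `encode` flag, and the `field=value` directory layout are assumptions for illustration, not an existing Arrow API; the rejected character set is `/` plus the characters Windows reserves in file names.

```python
from urllib.parse import quote, unquote

# "/" is invalid in a path segment everywhere; the rest are reserved on Windows.
INVALID_CHARS = set('<>:"/\\|?*')


def partition_segment(field, value, encode=False):
    """Build a `field=value` directory name (hypothetical helper).

    With encode=True the value is percent-encoded first; otherwise an
    unsafe value raises instead of producing an unreadable dataset.
    """
    if encode:
        # safe="" so that "/" is percent-encoded as well.
        value = quote(value, safe="")
    if value == "":
        raise ValueError(f"empty value for partition field {field!r}")
    bad = sorted(set(value) & INVALID_CHARS)
    if bad:
        raise ValueError(
            f"invalid path characters {bad} in partition field {field!r}"
        )
    return f"{field}={value}"
```

With `encode=True`, a value such as `a/b` becomes `a%2Fb`, and `unquote` recovers the original on read, which matches the decoding direction added in ARROW-12644. An empty string is still rejected, since encoding it does not make it a usable directory name.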

> [C++][Dataset] Support URL encoding of partition field values for the file 
> path
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-13813
>                 URL: https://issues.apache.org/jira/browse/ARROW-13813
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> In ARROW-12644, we added support for _decoding_ the file paths when reading 
> datasets. So a valid follow-up question: should we also support _encoding_ 
> when writing datasets?
> (see also https://github.com/apache/arrow/issues/11027)
> Rereading ARROW-12644, there wasn't yet much discussion on that aspect.
> cc [~westonpace] [~lidavidm]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
