[GitHub] [arrow] westonpace commented on issue #11027: PyArrow Parquet column partitioning

GitBox Mon, 30 Aug 2021 20:23:55 -0700


westonpace commented on issue #11027:
URL: https://github.com/apache/arrow/issues/11027#issuecomment-908867010



   I think ARROW-12644 fixes something different.  My gut reaction would be to 
not do this.  It seems reasonable to expect that partition columns only contain 
filesystem-safe paths.
   
   Spark URL encodes non-safe characters (I'm not sure if it does this in all 
cases or just when using timestamps as a partition column).  ARROW-12644 was 
about making sure we could read these paths but, as discussed in the JIRA, it 
isn't clear that we should support writing them.
   
   `/'date=2021/08/30'/somedata.parquet` is not going to be a safe path on all 
filesystems so I don't think that is a viable alternative.  If we were to URL 
encode paths, you would get `2021%2F08%2F30`, which is an odd thing to have 
in the filesystem, but it should at least work.  Perhaps we need a URL encoding 
kernel, and then you could partition on that column projected with URL encoding 
(although I don't think projection support is quite there yet).
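
   To illustrate what URL encoding a partition value would produce, here is a 
minimal sketch using Python's standard library rather than an Arrow kernel (no 
such kernel is assumed to exist).  The helper name `encode_partition_value` is 
hypothetical and only for illustration.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Percent-encode a partition value so it fits in a single path segment.

    safe="" is passed so that "/" is escaped as well; by default quote()
    leaves "/" untouched.
    """
    return quote(value, safe="")


encoded = encode_partition_value("2021/08/30")
print(encoded)           # 2021%2F08%2F30
print(unquote(encoded))  # 2021/08/30
```

   A reader would need to decode the directory name with `unquote` to recover 
the original value.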
   
   Today, as a possible workaround, you could use pyarrow compute to replace 
`/` with a character of your choosing before partitioning: 
https://arrow.apache.org/docs/cpp/compute.html#string-transforms

