[ https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510462#comment-17510462 ]
David Li edited comment on ARROW-16000 at 3/22/22, 12:30 PM:
-------------------------------------------------------------
AFAIK, encoding support isn't implemented in C++ at all (neither the C++ CSV
reader nor Datasets has such an option). Instead, the bindings handle it by
providing a file object wrapper that converts the encoding on the fly (e.g.
[https://github.com/apache/arrow/blob/778b1772fd20766e52b2bdccbd37668726f67e0c/r/src/io.cpp#L368-L374]).
This works for reading individual files, but probably doesn't work so well for
datasets, since the files there are opened through a filesystem rather than
handed in one at a time. I wonder if what we need is a filesystem wrapper that
decodes different encodings on read.
Or I suppose we could just hardcode this into the CSV Datasets support.
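
To make the "file object wrapper" idea concrete, here is a minimal sketch in Python using only the standard library. It is not the R binding's code or any Arrow API: the class name `TranscodingReader`, its `chunk_size` parameter, and the sample data are all hypothetical, chosen for illustration. The wrapper reads bytes in a source encoding (Latin-1 here) and hands out UTF-8 bytes, so a reader that only understands UTF-8 never sees the original encoding.

```python
import codecs
import io


class TranscodingReader(io.RawIOBase):
    """Hypothetical wrapper: reads bytes in `source_encoding` from `raw`
    and exposes them as UTF-8 bytes, converting chunk by chunk."""

    def __init__(self, raw, source_encoding, chunk_size=65536):
        self._raw = raw
        # An incremental decoder copes with multi-byte sequences that are
        # split across chunk boundaries.
        self._decoder = codecs.getincrementaldecoder(source_encoding)()
        self._chunk_size = chunk_size
        self._pending = b""  # UTF-8 bytes produced but not yet handed out
        self._eof = False

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill the pending buffer until there is data or the input ends.
        while not self._pending and not self._eof:
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                self._eof = True
            text = self._decoder.decode(chunk, final=self._eof)
            self._pending = text.encode("utf-8")
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream


# Usage: a small Latin-1 encoded CSV, read back as UTF-8.
sample_text = "name,city\nJosé,Málaga\n"
latin1_bytes = sample_text.encode("latin-1")
reader = io.BufferedReader(
    TranscodingReader(io.BytesIO(latin1_bytes), "latin-1", chunk_size=4)
)
utf8_bytes = reader.read()
```

A wrapper like this has to be applied to each file as it is opened, which is why it fits single-file reads more naturally than datasets, where the files are opened internally.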
> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
> Key: ARROW-16000
> URL: https://issues.apache.org/jira/browse/ARROW-16000
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with
> Latin-1 encoding. I had a look through the docs for the Dataset API and I
> don't think this is currently supported.