[ 
https://issues.apache.org/jira/browse/ARROW-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510462#comment-17510462
 ] 

David Li edited comment on ARROW-16000 at 3/22/22, 12:30 PM:
-------------------------------------------------------------

AFAIK, encoding support isn't implemented in C++ at all (neither the C++ CSV 
reader nor Datasets has such an option). Instead, it's done in the bindings by 
providing a file object wrapper that converts the encoding on the fly (e.g. 
[https://github.com/apache/arrow/blob/778b1772fd20766e52b2bdccbd37668726f67e0c/r/src/io.cpp#L368-L374]).
This works for reading individual files, but probably doesn't work so well for 
datasets. I wonder if what we need is a filesystem wrapper that decodes 
different encodings.

Or I suppose we could just hardcode this into the CSV Datasets support.
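
To illustrate the "file object wrapper that converts the encoding on the fly" idea, here is a minimal sketch in plain Python (standard library only). It is not the code used by the R bindings or by Arrow itself; the class name TranscodingReader and the 4096-byte chunk size are hypothetical choices made for illustration. The wrapper reads bytes in a source encoding such as Latin-1 and hands UTF-8 bytes to whatever consumes it.

```python
import codecs
import io


class TranscodingReader(io.RawIOBase):
    """Hypothetical wrapper: reads bytes in `source_encoding` from `raw`
    and exposes them as UTF-8 bytes, converting chunk by chunk."""

    def __init__(self, raw, source_encoding):
        self._raw = raw
        # An incremental decoder copes with multi-byte sequences that are
        # split across chunk boundaries (irrelevant for Latin-1, but
        # needed for encodings such as UTF-16 or Shift-JIS).
        self._decoder = codecs.getincrementaldecoder(source_encoding)()
        self._pending = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Refill the pending UTF-8 bytes until there is something to
        # return or the underlying stream is exhausted.
        while not self._pending:
            chunk = self._raw.read(4096)
            final = not chunk
            text = self._decoder.decode(chunk, final)
            self._pending = text.encode("utf-8")
            if final:
                break
        n = min(len(buffer), len(self._pending))
        buffer[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n  # 0 signals end of stream


# Usage: a Latin-1 encoded CSV payload is read back as UTF-8 bytes.
latin1_csv = "name,city\nJosé,Málaga\n".encode("latin-1")
reader = io.BufferedReader(TranscodingReader(io.BytesIO(latin1_csv), "latin-1"))
utf8_csv = reader.read()
```

A wrapper like this only sees one open file at a time, which is why it fits single-file reads; a dataset opens its files through a filesystem layer, so the conversion would have to be applied there (or inside the CSV dataset format) instead.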


> [C++][Dataset] Support Latin-1 encoding
> ---------------------------------------
>
>                 Key: ARROW-16000
>                 URL: https://issues.apache.org/jira/browse/ARROW-16000
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> In ARROW-15992 a user is reporting issues with trying to read in files with 
> Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
> don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
