[
https://issues.apache.org/jira/browse/ARROW-14644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454307#comment-17454307
]
Weston Pace commented on ARROW-14644:
-------------------------------------
[~willjones127] Sorry for the assignment dance. I hadn't realized you had
assigned this to yourself, as I had earmarked it this morning. I did a bit of
investigation and I agree it is a C++ issue. The following Python reproduction
runs into the same problem:
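{noformat}
import os
import pyarrow.csv as csv
import pyarrow.dataset as ds
# Make sure the dataset directory exists before writing into it
os.makedirs('/tmp/my_dataset', exist_ok=True)
# Write a small CSV that starts with the UTF-8 BOM (EF BB BF)
with open('/tmp/my_dataset/blah.csv', mode='wb') as f:
    f.write(b'\xef\xbb\xbfa,b\n1,2\n3,4\n')
# Read the file with the plain CSV reader
print(csv.read_csv('/tmp/my_dataset/blah.csv').to_pydict())
# Read the same file through the datasets API
dataset = ds.dataset('/tmp/my_dataset', format='csv')
print(dataset.to_table().to_pydict())
print(dataset.to_table())
{noformat}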
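For reference, here is a minimal standard-library sketch of what "skipping the BOM" means for the bytes written above. It does not use Arrow and says nothing about where the bug actually is. The helper names {{header_names}} and {{strip_utf8_bom}} are hypothetical, introduced only for this illustration.

```python
import codecs
import csv
import io

# Same bytes as the reproduction file: UTF-8 BOM, header row, two data rows.
RAW = b'\xef\xbb\xbfa,b\n1,2\n3,4\n'


def header_names(raw: bytes, encoding: str) -> list:
    """Decode raw CSV bytes and return the header row (hypothetical helper)."""
    text = raw.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


def strip_utf8_bom(raw: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present (hypothetical helper, not Arrow code)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Plain 'utf-8' keeps the BOM as U+FEFF attached to the first column name.
kept = header_names(RAW, 'utf-8')
# 'utf-8-sig' drops a leading BOM while decoding.
skipped = header_names(RAW, 'utf-8-sig')
# Stripping the BOM by hand before decoding gives the same header.
manual = header_names(strip_utf8_bom(RAW), 'utf-8')

print(repr(kept[0]), repr(skipped[0]), repr(manual[0]))
```

If the BOM is not skipped, the first header name is no longer equal to {{a}}, which is the kind of mismatch worth looking for when comparing the two code paths.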
I had thought that maybe it was a streaming / file reader issue, but I tested
both the streaming and file readers and they seem to be properly skipping the
BOM, so that doesn't seem to be the problem. Here are those tests if you want
them:
https://github.com/apache/arrow/compare/master...westonpace:experiment/ARROW-14644-investigation?expand=1
Then I thought that maybe it was the fact that the datasets API specifies the
schema when reading the data (instead of inferring it), but some initial
testing there wasn't very fruitful either.
So I'll let you take this. I'm still not quite sure of the root cause. My guess
would be that it is some option getting passed when the reader is called
from the datasets API, but I don't know which one it would be.
> [C++] open_dataset doesn't ignore BOM in csv file
> -------------------------------------------------
>
> Key: ARROW-14644
> URL: https://issues.apache.org/jira/browse/ARROW-14644
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 6.0.0
> Environment: macOS Mojave, R 4.1.1
> Reporter: Andy Teucher
> Assignee: Will Jones
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> DragosMG: I believe this is a bug that should be fixed in the C++ code as
> there isn't an option we could leverage on the R side.
> I have a draft PR with a failing test, but it's identical to Andy's
> _reproducible example_ below.
> Original description below:
> ======================
> When a CSV file starts with byte order mark, {{arrow::open_dataset()}} reads
> the file but populates the first column with {{NA}} values. It appears a
> similar issue was raised and fixed here:
> https://issues.apache.org/jira/browse/ARROW-5413. {{read_csv_arrow()}} deals
> with the BOM correctly.
> Reproducible Example:
> {code:java}
> library(arrow)
> library(dplyr)
> writeLines('\xef\xbb\xbfa,b\n1,2\n', con = "testfile.csv")
> read_csv_arrow("testfile.csv") # works
> #> # A tibble: 1 × 2
> #> a b
> #> <int> <int>
> #> 1 1 2
> open_dataset("testfile.csv", format = "csv") |>
> collect()
> #> # A tibble: 1 × 2
> #> a b
> #> <int> <int>
> #> 1 NA 2 {code}