[
https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931666#comment-16931666
]
Neal Richardson commented on ARROW-6582:
----------------------------------------
Thanks for the report. A few thoughts:
1. "embedded nul in string" is an error raised by R itself, because R character
strings cannot contain nul bytes. Since the error is being thrown in
{{Table__to_dataframe}}, the Parquet file was already read into Arrow memory
successfully, and the failure happens when the Arrow Table is converted to an
R data frame. That helps isolate the issue.
2. Given that, you could experiment with the {{col_select}} argument to
{{read_parquet}} to identify which column contains the nul, if you don't
already know. If you don't need that column for what you're trying to do,
you could leave it out of the selection and proceed.
3. If you can identify the offending column, it would be interesting to know
what Arrow type it is. To check, run something like
{code:r}
tab <- read_parquet(file, as_tibble=FALSE)
tab$schema
{code}
and report back what type that column is.
4. Check your system locale and encoding and make sure they match the data
in the file. [Googling the error
message|https://www.google.com/search?q=embedded+nul+in+string] suggests that
encoding is often implicated.
5. How are these Parquet files generated? Same host? Or different system,
platform, etc.? Does that tell you something useful about the locale/encoding
you need to set in R to read the data?
6. If any of this leads you to a place where you can write out a sufficiently
anonymized/obfuscated file that reproduces the error, that would of course be
most helpful.
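To narrow down the column as in point 2, a loop along these lines might help. This is only a sketch: {{find_failing_columns}} and {{read_one}} are hypothetical helper names, not part of arrow, and the commented usage assumes that {{col_select}} accepts a tidyselect helper such as {{one_of()}} in your version.
{code:r}
# Hypothetical helper: call read_one() once per column name and return the
# names for which it raised an error.
find_failing_columns <- function(col_names, read_one) {
  failed <- vapply(col_names, function(nm) {
    tryCatch({
      read_one(nm)
      FALSE                      # the read succeeded for this column
    }, error = function(e) TRUE) # any error marks the column as failing
  }, logical(1))
  col_names[failed]
}

# Possible usage (assumptions: the column names below are placeholders copied
# from the tab$schema output, and col_select accepts tidyselect::one_of()):
# col_names <- c("col_a", "col_b")
# find_failing_columns(
#   col_names,
#   function(nm) read_parquet(file, col_select = tidyselect::one_of(nm))
# )
{code}
Any column name this returns is a candidate for the one holding the nul; reading the file with those columns left out of the selection should then succeed.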
> R's read_parquet() fails with embedded nuls in strings
> ------------------------------------------------------
>
> Key: ARROW-6582
> URL: https://issues.apache.org/jira/browse/ARROW-6582
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.14.1
> Environment: Windows 10
> R 3.4.4
> Reporter: John Cassil
> Priority: Major
>
> Apologies if this issue isn't categorized or documented appropriately.
> Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR,
> I have recently decided to try to use arrow::read_parquet() on a few parquet
> files that were on my local machine rather than in Hadoop. I was not able to
> proceed after several attempts due to embedded nuls. For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) :
> embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT TORQUE
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with CSVs using
> data.table::fread(), but readr::read_delim() seems to handle them gracefully,
> proceeding with just a warning.
> Apologies that I do not have a handy reprex. I don't know if I can even
> recreate a parquet file with embedded nuls using arrow if it won't let me
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!