[ https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933608#comment-16933608 ]
Neal Richardson commented on ARROW-6582:
----------------------------------------
My guess is that, if you needed to solve this problem (which it sounds like you
don't), you could try setting different encodings in your R session and see
whether that handles the string column correctly. Alternatively, there's
probably a way to get at that column and dump it as-is to disk so that you
could strip out the nuls by some other means. I suspect the ultimate fix is to
correct the data generating/ETL process so that there are no nuls to begin
with, though I recognize that's not always an option.
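
As a rough illustration of the "some other means" route, here is a minimal
Python sketch that strips NUL bytes from a file. It assumes the string column
has already been dumped to disk as-is in a single-byte or UTF-8 encoding (in
UTF-16, zero bytes are legitimate and must not be removed). The function names
and file names are hypothetical, and nothing here is part of Arrow.

```python
import tempfile
from pathlib import Path


def strip_nul_bytes(data: bytes) -> bytes:
    """Return data with every NUL (0x00) byte removed."""
    return data.replace(b"\x00", b"")


def clean_file(src: Path, dst: Path) -> int:
    """Write a NUL-free copy of src to dst; return the number of bytes dropped."""
    raw = src.read_bytes()  # reads the whole file into memory
    cleaned = strip_nul_bytes(raw)
    dst.write_bytes(cleaned)
    return len(raw) - len(cleaned)


if __name__ == "__main__":
    # Hypothetical file names; the sample value mimics the string in the error.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "column_dump.txt"
        dst = Path(tmp) / "column_dump_clean.txt"
        src.write_bytes(b"TORQUE ARMS\x00 ARMS\n")
        dropped = clean_file(src, dst)
        print(f"removed {dropped} NUL byte(s)")
```

The whole file is read into memory, which is fine for a one-off debugging
dump; a very large file would need to be processed in chunks instead.
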
I'll keep this open for a bit and think about whether there are ways we can
make it easier to dump that data from Arrow to a plain-text format without
going through R first, so that one can debug bad data like this. Ultimately,
though, the error is coming from R, not Arrow.
> R's read_parquet() fails with embedded nuls in strings
> ------------------------------------------------------
>
> Key: ARROW-6582
> URL: https://issues.apache.org/jira/browse/ARROW-6582
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.14.1
> Environment: Windows 10
> R 3.4.4
> Reporter: John Cassil
> Priority: Major
>
> Apologies if this issue isn't categorized or documented appropriately.
> Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR,
> I have recently decided to try to use arrow::read_parquet() on a few parquet
> files that were on my local machine rather than in Hadoop. I was not able to
> proceed after several attempts due to embedded nuls. For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) :
> embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT TORQUE
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with csvs using
> data.table::fread(), but readr::read_delim() seems to handle them gracefully
> with just a warning after proceeding.
> Apologies that I do not have a handy reprex. I don't know if I can even
> recreate a parquet file with embedded nuls using arrow if it won't let me
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!