[jira] [Commented] (ARROW-6582) R's read_parquet() fails with embedded nuls in strings

Neal Richardson (Jira) Tue, 17 Sep 2019 10:13:46 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931666#comment-16931666
 ]


Neal Richardson commented on ARROW-6582:
----------------------------------------

Thanks for the report. A few thoughts:

1. "embedded nul in string" is an error raised by R itself. Because it is 
thrown in {{Table__to_dataframe}}, the Parquet file was already read into Arrow 
memory successfully, and the failure happens when converting the Arrow Table to 
an R data frame. That helps isolate the issue.

2. Given that, you could use the {{col_select}} argument to {{read_parquet}} to 
read one column (or a subset of columns) at a time and identify which column 
contains the nul, if you don't already know. If you don't need that column for 
whatever you're trying to do, you can leave it out of {{col_select}} and proceed.

3. If you can identify the offending column, it would help to know its Arrow 
type. To find out, run something like

{code:r}
tab <- read_parquet(file, as_tibble=FALSE)
tab$schema
{code}

and report back what type that column is.

4. Check your system locale and encoding and make sure they match the encoding 
of the data in the file. [Googling the error 
message|https://www.google.com/search?q=embedded+nul+in+string] suggests that 
encoding is often implicated.

5. How are these Parquet files generated? On the same host, or on a different 
system or platform? Does that tell you something useful about the 
locale/encoding you need to set in R to read the data?

6. If any of this leads you to a place where you can write out a sufficiently 
anonymized/obfuscated file that reproduces the error, that would of course be 
most helpful.
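
As a rough sketch of point 2, the column-by-column check could look something 
like the following. {{find_failing_columns}} is a hypothetical helper, not part 
of arrow, and the stand-in reader only exists so the example runs without a 
Parquet file. The commented-out usage at the end assumes that {{names()}} works 
on the Table and that {{tidyselect::all_of()}} is available in your installed 
versions, which I have not checked against 0.14.1.

```r
# Hypothetical helper (not part of arrow): try to read each column on its own
# and collect the names of the columns whose read raises an error.
find_failing_columns <- function(col_names, read_one) {
  failed <- character(0)
  for (col in col_names) {
    result <- tryCatch(read_one(col), error = function(e) e)
    if (inherits(result, "error")) {
      # Report which column failed and why.
      message(col, ": ", conditionMessage(result))
      failed <- c(failed, col)
    }
  }
  failed
}

# Stand-in reader so the sketch runs without a Parquet file. Base R refuses to
# build a string with a nul byte in the middle, which is the same kind of
# error as in the report.
fake_columns <- list(
  id    = as.raw(c(0x31, 0x32)),        # bytes of "12"
  notes = as.raw(c(0x41, 0x00, 0x42))   # "A", nul, "B"
)
fake_read <- function(col) rawToChar(fake_columns[[col]])

bad <- find_failing_columns(names(fake_columns), fake_read)
print(bad)

# With a real file, assuming `file` holds its path (untested assumption about
# the arrow and tidyselect versions in use):
# tab <- arrow::read_parquet(file, as_tibble = FALSE)
# bad <- find_failing_columns(
#   names(tab),
#   function(col) arrow::read_parquet(file, col_select = tidyselect::all_of(col))
# )
```

Passing the reader in as a function keeps the loop independent of how a single 
column is read, so the same helper works with whatever selection syntax your 
arrow version accepts.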

> R's read_parquet() fails with embedded nuls in strings
> ------------------------------------------------------
>
>                 Key: ARROW-6582
>                 URL: https://issues.apache.org/jira/browse/ARROW-6582
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.14.1
>         Environment: Windows 10
> R 3.4.4
>            Reporter: John Cassil
>            Priority: Major
>
> Apologies if this issue isn't categorized or documented appropriately.  
> Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR, 
> I have recently decided to try to use arrow::read_parquet() on a few parquet 
> files that were on my local machine rather than in hadoop.  I was not able to 
> proceed after several various attempts due to embedded nuls.  For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>   embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT  TORQUE 
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with csvs using 
> data.table::fread(), but readr::read_delim() seems to handle them gracefully 
> with just a warning after proceeding.
> Apologies that I do not have a handy reprex. I don't know if I can even 
> recreate a parquet file with embedded nuls using arrow if it won't let me 
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
