[ https://issues.apache.org/jira/browse/ARROW-6582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931731#comment-16931731 ]

John Cassil commented on ARROW-6582:
------------------------------------

Thanks Neal!

I had previously been able to set as_tibble to FALSE and successfully create 
the <Object containing active binding>, but didn't know I could use schema to 
see the columns. Actually, my whole purpose for reading in this particular file 
in the first place was to see the column names, so problem solved! haha

{code:R}
> df$schema
arrow::Schema 
squishedVin: string
hashedVin: string
tranDate: date32[day]
rawFieldValue: string
displayText: string
odometerReading: double 
{code}
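(As a side note, it looks like the field names alone might be available 
directly off the schema, though I'm going by a newer arrow API here and the 
exact binding may differ in 0.14.1.)

{code:R}
# Just the column names, without converting any data
df$schema$names
{code}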


Based on the error, I can tell that the offending string is in the 
rawFieldValue column, which arrow appears to be interpreting as a string type.

As far as encoding goes, I don't have a strong understanding of encoding 
issues. To back up a bit, this file happens to be a dump of the raw text that 
any of hundreds of thousands of sources has given us, so I am guessing that the 
embedded nul originated from an encoding issue on a partner's server somewhere, 
perhaps in a galaxy far away, many years ago. The file was actually created by 
a Java process that one of our teams built to export it from Hadoop. I know 
very little beyond that, and unfortunately I'm not sure I could create anything 
to reproduce the error yet.

I am curious whether you know of any way to turn this into a dataframe that 
would bypass Table__to_dataframe...
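
In case it helps clarify what I'm after, here's a rough sketch of the kind of 
workaround I was imagining: read the file as a Table and then convert it 
column by column, so only the column with the embedded nul fails instead of 
the whole conversion. I'm going by a newer arrow R API here (names(), [[ and 
as.vector() on a Table), so the exact calls may not match what 0.14.1 exposes:

{code:R}
library(arrow)

# Read as an Arrow Table instead of converting straight to a data.frame
tab <- read_parquet('out_2019-09_data_1.snappy.parquet', as_tibble = FALSE)

# Convert column by column; if a column errors out (embedded nul), fill it
# with NA rather than aborting the whole conversion
cols <- names(tab)
df <- data.frame(
  lapply(setNames(cols, cols), function(nm) {
    tryCatch(as.vector(tab[[nm]]),
             error = function(e) rep(NA, nrow(tab)))
  }),
  stringsAsFactors = FALSE
)
{code}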

Thanks so much again!


> R's read_parquet() fails with embedded nuls in strings
> ------------------------------------------------------
>
>                 Key: ARROW-6582
>                 URL: https://issues.apache.org/jira/browse/ARROW-6582
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.14.1
>         Environment: Windows 10
> R 3.4.4
>            Reporter: John Cassil
>            Priority: Major
>
> Apologies if this issue isn't categorized or documented appropriately.  
> Please be gentle! :)
> As a heavy R user that normally interacts with parquet files using SparklyR, 
> I have recently decided to try to use arrow::read_parquet() on a few parquet 
> files that were on my local machine rather than in Hadoop.  I was not able to 
> proceed after several attempts due to embedded nuls.  For example:
> try({df <- read_parquet('out_2019-09_data_1.snappy.parquet') })
> Error in Table__to_dataframe(x, use_threads = option_use_threads()) : 
>   embedded nul in string: 'INSTALL BOTH LEFT FRONT AND RIGHT FRONT  TORQUE 
> ARMS\0 ARMS'
> Is there a solution to this?
> I have also hit roadblocks with embedded nuls in the past with csvs using 
> data.table::fread(), but readr::read_delim() seems to handle them gracefully 
> with just a warning after proceeding.
> Apologies that I do not have a handy reprex. I don't know if I can even 
> recreate a parquet file with embedded nuls using arrow if it won't let me 
> read one in, and I can't share this file due to company restrictions.
> Please let me know how I can be of any more help!


