[ https://issues.apache.org/jira/browse/ARROW-12162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Wales updated ARROW-12162:
--------------------------------
    Description: 
h2. EDIT:

I've found a solution for my specific use case: adding the argument 
`encoding = "latin1"` to the `DBI::dbConnect` call makes everything work. 
This issue may still be valid for other cases where arrow writes invalid data 
to a Parquet file, though. It would be nice to get an error on write, rather 
than on read!
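
For reference, the working call looks roughly like this (a sketch; the driver, server and database names are placeholders, and I'm assuming an `odbc` connection):
{code:java}
# Hypothetical connection details -- only `encoding = "latin1"` matters here.
# It tells the odbc package which encoding the database uses, so query
# results are re-encoded correctly on the way into R.
con <- DBI::dbConnect(
  odbc::odbc(),
  driver   = "SQL Server",
  server   = "my_server",    # placeholder
  database = "my_database",  # placeholder
  encoding = "latin1"
)
{code}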
h2. Background

I am using the R arrow library.

I am reading from an SQL Server database with the `latin1` encoding using 
`dbplyr` and saving the output as a parquet file: 
{code:java}
# Assume `con` is a previously established connection to the database,
# created with DBI::dbConnect
tbl(con, in_schema("dbo", "latin1_table")) %>%
  collect() %>%
  write_parquet("output.parquet")
{code}

However, when I try to read the file back, I get the error "Invalid UTF8 
payload": 
{code:java}
> read_parquet("output.parquet")
Error: Invalid: Invalid UTF8 payload
{code}

What I would really like is a way to tell arrow "This data is latin1 encoded. 
Please convert it to UTF-8 before you save it as a Parquet file".

Or alternatively "This Parquet file contains latin1 encoded data".
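
Until then, a workaround on the write side is to re-encode the character columns explicitly before calling `write_parquet` (a sketch; it assumes all affected columns are character vectors):
{code:java}
library(dplyr)
library(dbplyr)
library(arrow)

tbl(con, in_schema("dbo", "latin1_table")) %>%
  collect() %>%
  # Convert every character column from latin1 to UTF-8 before writing
  mutate(across(where(is.character),
                ~ iconv(.x, from = "latin1", to = "UTF-8"))) %>%
  write_parquet("output.parquet")
{code}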
h2. Minimal Reproducible Example

I have isolated this issue to a minimal reproducible example.

If the database table contains the latin1/Windows-1252 right single quote 
character (byte 0x92), it will trigger the error.

I have attached a `.rds` file which contains an example tibble.

To reproduce, run the following: 
{code:java}
readRDS(file.path(data_dir, "bad_char.rds")) %>%
  write_parquet(file.path(data_dir, "bad_char.parquet"))

read_parquet(file.path(data_dir, "bad_char.parquet"))
{code}
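
For anyone without the attachment, the offending value can also be constructed directly (a sketch; 0x92 is the Windows-1252 right single quote, which is not a valid UTF-8 byte sequence, and this assumes a latin1-locale Windows session like the one in the environment above):
{code:java}
library(tibble)
library(arrow)

# "it's" with the 0x92 single-quote byte; the string is left in the
# native (latin1) locale encoding rather than being converted to UTF-8
bad <- tibble(x = rawToChar(as.raw(c(0x69, 0x74, 0x92, 0x73))))

write_parquet(bad, "bad_char.parquet")  # writes without complaint
read_parquet("bad_char.parquet")        # should raise "Invalid UTF8 payload"
{code}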
h2. Possibly related issues

https://issues.apache.org/jira/browse/ARROW-12007



> [R] read_parquet returns Invalid UTF8 payload
> ---------------------------------------------
>
>                 Key: ARROW-12162
>                 URL: https://issues.apache.org/jira/browse/ARROW-12162
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 3.0.0
>         Environment: Windows 10
> R 4.0.3
> arrow 3.0.0
> dbplyr 2.0.0
> dplyr 1.0.2
>            Reporter: David Wales
>            Priority: Major
>         Attachments: bad_char.rds
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
