[ https://issues.apache.org/jira/browse/ARROW-16010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Riaz Arbi updated ARROW-16010:
------------------------------
Description:
When we write a dataframe column of type `<dttm>` to Parquet using the arrow
package, reading the Parquet file back into a dataframe returns a slightly
different value. This behaviour does not occur with columns of type `<double>`.
Reprex:
{code:r}
library(arrow)
library(dplyr)

# Create sample dataframe
n <- 1631494810.376999855041503906250000000000000000000000000000000000
df <- data.frame(x = "a",
                 n = n,
                 t = as.POSIXct(n, origin = "1970-01-01"))

# Write to disk
df %>% write_parquet("/tmp/tmp.parquet")

# Extract time-based cols
dft <- df %>%
  filter(x == "a") %>%
  pull(t) %>%
  as.numeric()
pqt <- read_parquet("/tmp/tmp.parquet") %>%
  filter(x == "a") %>%
  pull(t) %>%
  as.numeric()
dft == pqt
sprintf("%.54f", dft)
sprintf("%.54f", pqt)

# Extract numeric cols
dfn <- df %>%
  filter(x == "a") %>%
  pull(n) %>%
  as.numeric()
pqn <- read_parquet("/tmp/tmp.parquet") %>%
  filter(x == "a") %>%
  pull(n) %>%
  as.numeric()
dfn == pqn
sprintf("%.54f", dfn)
sprintf("%.54f", pqn)
{code}
The critical issue is that `dft == pqt` returns `FALSE` while `dfn == pqn`
returns `TRUE`.
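A likely explanation (an assumption on our side, not verified here) is that
arrow stores the POSIXct column as a Parquet timestamp with a fixed sub-second
unit, so any finer fraction of the underlying double cannot survive the round
trip. Two quick checks along those lines:
{code:r}
# Check 1: if only sub-microsecond precision is lost, the two values
# should agree to within one microsecond.
abs(dft - pqt) < 1e-6

# Check 2: inspect the schema arrow wrote; the unit of the `t`
# timestamp column bounds the precision that can round-trip.
read_parquet("/tmp/tmp.parquet", as_data_frame = FALSE)$schema
{code}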
Why is this a problem? We use `arrow` to store dataframes to disk. When we want
to update these parquet files, we first check whether any data has actually
changed, with tripwires in place so that if a significant proportion of the
data has changed, the pipeline fails and is flagged for manual review. With the
behaviour above, every dataframe that contains a `<dttm>` column appears to
have changed, so all of those pipelines are failing.
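If exact equality is not essential to the tripwire, one workaround is to
compare `<dttm>` columns with a tolerance instead of `==`. A minimal sketch
(the helper name and the one-microsecond tolerance are illustrative, not part
of arrow):
{code:r}
# Hypothetical helper: treat two POSIXct vectors as unchanged when they
# agree to within `tol` seconds (default: one microsecond).
same_dttm <- function(old, new, tol = 1e-6) {
  all(abs(as.numeric(old) - as.numeric(new)) < tol, na.rm = TRUE)
}

# Should return TRUE if the only difference is sub-microsecond drift.
same_dttm(df$t, read_parquet("/tmp/tmp.parquet")$t)
{code}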
> write_parquet alters <dttm> value
> ---------------------------------
>
> Key: ARROW-16010
> URL: https://issues.apache.org/jira/browse/ARROW-16010
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 6.0.0
> Environment: Ubuntu focal
> R 4.1.1
> RStudio 1.4.1772
> Reporter: Riaz Arbi
> Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.1#820001)