Riaz Arbi created ARROW-16010:
---------------------------------

             Summary: write_parquet alters <dttm> value
                 Key: ARROW-16010
                 URL: https://issues.apache.org/jira/browse/ARROW-16010
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 6.0.0
         Environment: Ubuntu focal
R 4.1.1
RStudio 1.4.1772
            Reporter: Riaz Arbi


When we write a dataframe column of type `<dttm>` to parquet with the arrow 
package, reading the parquet file back into a dataframe returns a slightly 
different value.

This behaviour does not replicate with columns of type `<double>`.

 

Reprex:

 

```

# Load the packages the reprex relies on
library(arrow)
library(dplyr)

# Create a sample dataframe with a character, a numeric, and a POSIXct column
n <- 1631494810.376999855041503906250000000000000000000000000000000000
df <- data.frame(x = "a",
                 n = n,
                 t = as.POSIXct(n, origin = "1970-01-01"))

# Write to disk
df %>% write_parquet("/tmp/tmp.parquet")

# Extract the time-based column, before and after the round trip
dft <- df %>%
  filter(x == "a") %>%
  pull(t) %>%
  as.numeric()

pqt <- read_parquet("/tmp/tmp.parquet") %>%
  filter(x == "a") %>%
  pull(t) %>%
  as.numeric()

dft == pqt
sprintf("%.54f", dft)
sprintf("%.54f", pqt)

# Extract the numeric column the same way
dfn <- df %>%
  filter(x == "a") %>%
  pull(n) %>%
  as.numeric()

pqn <- read_parquet("/tmp/tmp.parquet") %>%
  filter(x == "a") %>%
  pull(n) %>%
  as.numeric()

dfn == pqn
sprintf("%.54f", dfn)
sprintf("%.54f", pqn)

```
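
In case it helps with triage, the stored type of `t` can be inspected by reading the file back as an Arrow Table instead of a data frame and printing its schema. This is an extra inspection step for illustration, not part of the original reprex:

```
# Read the file back as an Arrow Table (no conversion to data frame)
# and print the schema to see which timestamp type was written for t.
library(arrow)
tbl <- read_parquet("/tmp/tmp.parquet", as_data_frame = FALSE)
tbl$schema
```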

 

The critical issue is that `dft == pqt` returns `FALSE`, while `dfn == pqn` 
returns `TRUE`.
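
To characterise the difference rather than just test equality, one can also look at the absolute gap between the two values. This is shown only for illustration:

```
# Purely illustrative: the size of the round-trip discrepancy, in seconds.
abs(dft - pqt)

# Exact equality fails, but the values may still agree to within a
# microsecond; this only characterises the drift, it is not a fix.
abs(dft - pqt) < 1e-6
```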

 

Why is this a problem? We use `arrow` to store dataframes to disk. When we want 
to update these parquet files, we first check whether any data has actually 
changed, and we have tripwires in place so that, if a significant proportion of 
the data has changed, the pipeline fails and is flagged for manual review.
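
For context, here is a minimal sketch of that change check, assuming it is based on exact equality of corresponding columns. The function name, threshold, and structure are hypothetical and simplified for illustration:

```
# Simplified, hypothetical version of a change-detection tripwire.
# It compares an in-memory dataframe against the copy previously written
# to parquet and fails if too large a fraction of values differ exactly.
library(arrow)

check_unchanged <- function(new_df, path, max_changed_fraction = 0.01) {
  old_df <- read_parquet(path)
  # Fraction of exactly-changed values per column (columns assumed aligned)
  changed <- mapply(function(a, b) mean(a != b), new_df, old_df)
  if (mean(changed) > max_changed_fraction) {
    stop("Too many values changed; flagging for manual review.")
  }
  invisible(TRUE)
}

# Using df and the file from the reprex above: the <dttm> column trips
# the check even though nothing was actually changed, because the
# round-tripped values are no longer exactly equal.
check_unchanged(df, "/tmp/tmp.parquet")
```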

 

With the behaviour shown above, every dataframe that contains a `<dttm>` 
column fails this check.



