[ https://issues.apache.org/jira/browse/ARROW-16010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Riaz Arbi updated ARROW-16010:
------------------------------
Description:
When we write a dataframe column of type `<dttm>` to Parquet using the arrow
package, reading the Parquet file back into a dataframe returns a slightly
different value. This behaviour does not occur with columns of type `<double>`.
Reprex:
{code:r}
library(arrow)
library(dplyr)

# Create sample dataframe
n <- 1631494810.376999855041503906250000000000000000000000000000000000
df <- data.frame(x = "a",
                 n = n,
                 t = as.POSIXct(n, origin = "1970-01-01"))

# Write to disk
df %>% write_parquet("/tmp/tmp.parquet")

# Extract time-based cols
dft <- df %>%
  filter(x == "a") %>%
  pull(t) %>%
  as.numeric()
pqt <- read_parquet("/tmp/tmp.parquet") %>%
  filter(x == "a") %>%
  pull(t) %>%
  as.numeric()
dft == pqt
sprintf("%.54f", dft)
sprintf("%.54f", pqt)

# Extract numeric cols
dfn <- df %>%
  filter(x == "a") %>%
  pull(n) %>%
  as.numeric()
pqn <- read_parquet("/tmp/tmp.parquet") %>%
  filter(x == "a") %>%
  pull(n) %>%
  as.numeric()
dfn == pqn
sprintf("%.54f", dfn)
sprintf("%.54f", pqn)
{code}
The critical issue is that `dft == pqt` returns `FALSE` while `dfn == pqn`
returns `TRUE`.
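A likely explanation (an assumption on our side, not verified here) is that
arrow stores the POSIXct column as a Parquet timestamp with a fixed sub-second
unit, so any finer fraction of the underlying double cannot survive the round
trip. Two quick checks along those lines:
{code:r}
# Check 1: if only sub-microsecond precision is lost, the two values
# should agree to within one microsecond.
abs(dft - pqt) < 1e-6

# Check 2: inspect the schema arrow wrote; the unit of the `t`
# timestamp column bounds the precision that can round-trip.
read_parquet("/tmp/tmp.parquet", as_data_frame = FALSE)$schema
{code}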
Why is this a problem? We use `arrow` to store dataframes to disk. When we want
to update these parquet files, we first check whether any data has actually
changed, with tripwires in place so that if a significant proportion of the
data has changed, the pipeline fails and is flagged for manual review. With the
behaviour above, every dataframe that contains a `<dttm>` column appears to
have changed, so all of those pipelines are failing.
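If exact equality is not essential to the tripwire, one workaround is to
compare `<dttm>` columns with a tolerance instead of `==`. A minimal sketch
(the helper name and the one-microsecond tolerance are illustrative, not part
of arrow):
{code:r}
# Hypothetical helper: treat two POSIXct vectors as unchanged when they
# agree to within `tol` seconds (default: one microsecond).
same_dttm <- function(old, new, tol = 1e-6) {
  all(abs(as.numeric(old) - as.numeric(new)) < tol, na.rm = TRUE)
}

# Should return TRUE if the only difference is sub-microsecond drift.
same_dttm(df$t, read_parquet("/tmp/tmp.parquet")$t)
{code}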
> write_parquet alters <dttm> value
> ---------------------------------
>
> Key: ARROW-16010
> URL: https://issues.apache.org/jira/browse/ARROW-16010
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 6.0.0
> Environment: Ubuntu focal
> R 4.1.1
> RStudio 1.4.1772
> Reporter: Riaz Arbi
> Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.1#820001)