Hi, using the R arrow package version 14.0.2.1, I'm stumped by something seemingly simple. For date columns, I like to use R's Date class, which is stored internally as a number but prints as a YYYY-MM-DD string.
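To illustrate what I mean by that, here is a minimal base-R sketch (nothing arrow-specific; the variable name `d` is just for this illustration):

```r
# A Date is a double counting days since 1970-01-01, with class "Date".
d <- as.Date("2024-01-01")

class(d)         # "Date"
storage.mode(d)  # "double"
unclass(d)       # 19723 (days since 1970-01-01)
format(d)        # "2024-01-01"

# Round trip: string -> Date gives back the same value.
identical(as.Date(format(d)), d)  # TRUE
```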
In most cases arrow handles these Date columns nicely. The exception is when I partition on a Date column, as in column "d1" in my example below. When I read my data back in with open_dataset(), the d1 column is now a string instead of Date. In contrast, the types of all the other columns are preserved, including my "d2" Date column, because I did not partition on that one.

It sort of makes sense that d1 is now a string, because the directory names on disk really are strings like "2024-01-01". But I'd really like to convert it back to the Date class format! In plain R that's easy, but with the Dataset mmap-ed on disk, I don't know how to do it.

What should I do to get arrow to convert the partitioned d1 column to Arrow's date32[day] type, and thus back to R's Date class? Can I somehow do this directly on the Dataset object itself, WITHOUT first converting it to ArrowTabular or data.frame?

Thanks for your help! Example follows:

--------------------------------------------------
require("arrow")
my.dir <- "/tmp/arrow"

# Example data with some Date-class columns:
aa <- do.call("rbind" ,lapply(split(iris ,iris$Species) ,function(xx){
   cbind(head(xx ,5)
        ,d1=(as.Date('2024-01-01') + 0:4)
        ,d2=(as.Date('1980-01-01') + 0:4))
})); rownames(aa) <- NULL

arrow::write_dataset(aa ,my.dir ,partitioning=c('d1') ,hive_style=FALSE
   ,format="feather" ,codec=Codec$create("LZ4_FRAME"))

bb <- arrow::open_dataset(my.dir ,format="feather" ,unify_schemas=TRUE ,partitioning=c('d1'))

# Unfortunately the "d1" column is now a string.

> dim(aa)
[1] 15  7
> class(aa)
[1] "data.frame"
> sapply(aa ,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"       "Date"
> sapply(aa ,storage.mode)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d1           d2
    "double"     "double"     "double"     "double"    "integer"     "double"     "double"

> dim(bb)
[1] 15  7
> class(bb)
[1] "FileSystemDataset" "Dataset"           "ArrowObject"       "R6"
> bb$schema$d1
Field
d1: string
> bb$schema$d2
Field
d2: date32[day]
> bb
FileSystemDataset with 5 Feather files
Sepal.Length: double
Sepal.Width: double
Petal.Length: double
Petal.Width: double
Species: dictionary<values=string, indices=int8>
d2: date32[day]
d1: string

See $metadata for additional Schema metadata

> sapply(arrow:::as.data.frame.ArrowTabular(bb$NewScan()$Finish()$ToTable()) ,class)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species           d2           d1
   "numeric"    "numeric"    "numeric"    "numeric"     "factor"       "Date"  "character"
--------------------------------------------------

--
Andrew Piskorski <a...@piskorski.com>