[
https://issues.apache.org/jira/browse/ARROW-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephanie Hazlitt updated ARROW-11587:
--------------------------------------
Description:
Fixed-width files (fwf) are a common data provisioning format for (very) large administrative data files. We have been converting provisioned fwf files to `.parquet` and then querying them with `arrow::open_dataset()`, with good success. However, we still run into RAM limits during the read-in step and are keen to try new approaches to this in-memory bottleneck (ideally without splitting the source files into chunks). A simple example workflow looks like this:
{code:r}
library(magrittr)  # provides the %>% pipe used below

sample_data <-
  "https://github.com/bcgov/dipr/raw/master/inst/extdata/starwars-fwf.dat.gz"

# Read the fixed-width file with vroom, then write a parquet dataset
# partitioned by the grouping variable.
vroom::vroom_fwf(
  sample_data,
  col_positions = vroom::fwf_positions(
    start     = c(1, 22, 25, 31),
    end       = c(21, 24, 30, 35),
    col_names = c("name", "height", "mass", "has_hair")
  ),
  col_types = "cnnl"
) %>%
  dplyr::group_by(has_hair) %>%
  arrow::write_dataset(path = "starwars_parquet", format = "parquet")
{code}
With an \{arrow} fixed-width reader, we could perhaps leverage `arrow::open_dataset()` (or `as_data_frame = FALSE` on a lower-level read function) directly on a large fwf file and then convert it to partitioned `.parquet` files with `arrow::write_dataset()`?
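As a rough illustration only (the reader function and its arguments below are hypothetical, not an existing \{arrow} API), the wished-for workflow might look something like:
{code:r}
# HYPOTHETICAL sketch: read_fwf_arrow() does not exist in {arrow}. It only
# illustrates the shape of a fixed-width reader that returns an Arrow
# Table/Dataset without materializing the file as an R data frame.
tbl <- arrow::read_fwf_arrow(
  "starwars-fwf.dat.gz",
  col_positions = list(start = c(1, 22, 25, 31), end = c(21, 24, 30, 35)),
  col_names     = c("name", "height", "mass", "has_hair"),
  as_data_frame = FALSE
)

# Writing the result out as a partitioned parquet dataset would then use the
# existing write_dataset() API.
arrow::write_dataset(tbl, path = "starwars_parquet",
                     partitioning = "has_hair", format = "parquet")
{code}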
> [C++] Implement a fixed-width file reader
> ------------------------------------------
>
> Key: ARROW-11587
> URL: https://issues.apache.org/jira/browse/ARROW-11587
> Project: Apache Arrow
> Issue Type: Wish
> Components: C++, R
> Reporter: Stephanie Hazlitt
> Priority: Major
>