[ 
https://issues.apache.org/jira/browse/ARROW-11587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephanie Hazlitt updated ARROW-11587:
--------------------------------------
    Description: 
Fixed-width files are a common data provisioning format for (very) large administrative data files. We have been converting provisioned fwf files to `.parquet` and then leveraging `arrow::open_dataset()` with good success. However, we still run into RAM limits at the read-in step, and we are keen to try new approaches to this in-memory bottleneck (ideally without chunking the files).

A simple example workflow looks like this:
{code}
library(dplyr)  # for the %>% pipe and group_by()

sample_data <- "https://github.com/bcgov/dipr/raw/master/inst/extdata/starwars-fwf.dat.gz"

vroom::vroom_fwf(
  sample_data,
  col_positions = vroom::fwf_positions(
    start = c(1, 22, 25, 31),
    end = c(21, 24, 30, 35),
    col_names = c("name", "height", "mass", "has_hair")
  ),
  col_types = "cnnl"
) %>%
  dplyr::group_by(has_hair) %>%
  arrow::write_dataset(path = "starwars_parquet", format = "parquet")
{code}
 

With a fixed-width file reader in {arrow}, we could perhaps point `arrow::open_dataset()` directly at a large fwf file, scanning it lazily rather than reading it into a data frame, and then convert it to partitioned `.parquet` files with `arrow::write_dataset()`?
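
For context, the delimited-text path in {arrow} already supports this pattern: `arrow::open_dataset()` can scan a CSV lazily, and `arrow::write_dataset()` can repartition it to Parquet without materialising the whole table in RAM. Below is a minimal sketch of that existing CSV workflow (the `starwars.csv` path is just a placeholder), which is the shape of workflow we would hope a fwf reader could plug into:

{code}
library(arrow)
library(dplyr)

# Existing behaviour for delimited text: open_dataset() scans the file lazily,
# so the full table never has to be read into memory at once.
ds <- open_dataset("starwars.csv", format = "csv")  # placeholder file path

# Repartition to Parquet; the grouping variable becomes the partition key.
ds %>%
  group_by(has_hair) %>%
  write_dataset(path = "starwars_parquet", format = "parquet")
{code}

A fixed-width reader would presumably slot into `open_dataset()` in much the same way (e.g. a format value plus the column positions), though that exact interface is only a guess on our part.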



> [C++] Implement a fixed-width file reader 
> ------------------------------------------
>
>                 Key: ARROW-11587
>                 URL: https://issues.apache.org/jira/browse/ARROW-11587
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++, R
>            Reporter: Stephanie Hazlitt
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
