[ 
https://issues.apache.org/jira/browse/ARROW-15123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N D updated ARROW-15123:
------------------------
    Description: 
In `arrow` 6.0.0+ for R, when I read in a CSV file using a schema where the 
order of the columns in the schema doesn't match the order of columns in the 
CSV, the data is read in incorrectly.

The header is included as an observation in the read-in dataset. The columns 
are renamed *but not reordered* to match the schema. So I end up with the 
"quantile" column called "location", etc, as below.
{code:java}
[1] "last few obs in sorted order with arrow"
# A tibble: 6 × 7
  forecast_date target       target_end_date location type       quantile value 
  <chr>         <chr>        <chr>           <chr>    <chr>      <chr>    <chr> 
1 2021-12-12    9 day ahead… 2021-12-21      0.99     946.43313… 06       quant…
2 2021-12-12    9 day ahead… 2021-12-21      0.99     956.43294… 39       quant…
3 2021-12-12    9 day ahead… 2021-12-21      0.99     97.948144… 41       quant…
4 2021-12-12    9 day ahead… 2021-12-21      0.99     98.573545… 49       quant…
5 2021-12-12    9 day ahead… 2021-12-21      0.99     98.978636… 33       quant…
6 forecast_date target       target_end_date quantile value      location type 
{code}
The last line ("forecast_date target...") is the original header.

The [file in 
question|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/JHUAPL-Gecko/2021-12-12-JHUAPL-Gecko.csv]
 has 45360 observations + 1 line for the header. But the read-in dataset has
{code:java}
[1] "dimensions with arrow"
[1] 45361     7
{code}
Reprex attached with working (`packageVersion("arrow") == 4.0.1`; 5.0.0 also 
works) and non-working (`packageVersion("arrow") == 6.0.1`) examples. Run 
examples using `make run-broken` and `make run-works`.

  was:
In `arrow` 6.0.0+ for R, when I read in a CSV file using a schema where the 
order of the columns in the schema doesn't match the order of columns in the 
CSV, the data is read in incorrectly.

The header is included as an observation in the read-in dataset. The columns 
are renamed *but not reordered* to match the schema. So I end up with the 
"quantile" column called "location", etc, as below.
{code:java}
[1] "last few obs in sorted order with arrow"
# A tibble: 6 × 7
  forecast_date target       target_end_date location type       quantile value 
  <chr>         <chr>        <chr>           <chr>    <chr>      <chr>    <chr> 
1 2021-12-12    9 day ahead… 2021-12-21      0.99     946.43313… 06       quant…
2 2021-12-12    9 day ahead… 2021-12-21      0.99     956.43294… 39       quant…
3 2021-12-12    9 day ahead… 2021-12-21      0.99     97.948144… 41       quant…
4 2021-12-12    9 day ahead… 2021-12-21      0.99     98.573545… 49       quant…
5 2021-12-12    9 day ahead… 2021-12-21      0.99     98.978636… 33       quant…
6 forecast_date target       target_end_date quantile value      location type  
[1] "dimensions with arrow"
[1] 45361     7
{code}
The [file in 
question|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/JHUAPL-Gecko/2021-12-12-JHUAPL-Gecko.csv]
 has 45360 observations + 1 line for the header.

Reprex attached with working (`packageVersion("arrow") == 4.0.1`; 5.0.0 also 
works) and non-working (`packageVersion("arrow") == 6.0.1`) examples. Run 
examples using `make run-broken` and `make run-works`.


> [R] Schema order not respected and file header ignored
> ------------------------------------------------------
>
>                 Key: ARROW-15123
>                 URL: https://issues.apache.org/jira/browse/ARROW-15123
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.0, 6.0.1
>            Reporter: N D
>            Priority: Major
>              Labels: schema
>         Attachments: reprex-arrow-6-read.tar.gz
>
>
> In `arrow` 6.0.0+ for R, when I read in a CSV file using a schema where the 
> order of the columns in the schema doesn't match the order of columns in the 
> CSV, the data is read in incorrectly.
> The header is included as an observation in the read-in dataset. The columns 
> are renamed *but not reordered* to match the schema. So I end up with the 
> "quantile" column called "location", etc, as below.
> {code:java}
> [1] "last few obs in sorted order with arrow"
> # A tibble: 6 × 7
>   forecast_date target       target_end_date location type       quantile value 
>   <chr>         <chr>        <chr>           <chr>    <chr>      <chr>    <chr> 
> 1 2021-12-12    9 day ahead… 2021-12-21      0.99     946.43313… 06       quant…
> 2 2021-12-12    9 day ahead… 2021-12-21      0.99     956.43294… 39       quant…
> 3 2021-12-12    9 day ahead… 2021-12-21      0.99     97.948144… 41       quant…
> 4 2021-12-12    9 day ahead… 2021-12-21      0.99     98.573545… 49       quant…
> 5 2021-12-12    9 day ahead… 2021-12-21      0.99     98.978636… 33       quant…
> 6 forecast_date target       target_end_date quantile value      location type 
> {code}
> The last line ("forecast_date target...") is the original header.
> The [file in 
> question|https://raw.githubusercontent.com/reichlab/covid19-forecast-hub/master/data-processed/JHUAPL-Gecko/2021-12-12-JHUAPL-Gecko.csv]
>  has 45360 observations + 1 line for the header. But the read-in dataset has
> {code:java}
> [1] "dimensions with arrow"
> [1] 45361     7
> {code}
> Reprex attached with working (`packageVersion("arrow") == 4.0.1`; 5.0.0 also 
> works) and non-working (`packageVersion("arrow") == 6.0.1`) examples. Run 
> examples using `make run-broken` and `make run-works`.
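
For readers without the attachment, here is a minimal sketch of the scenario described above. It is not the attached reprex: the three-column file, its values, and the use of `read_csv_arrow()` are assumptions made for illustration, since the report does not say which read function the reprex calls.

```r
library(arrow)

# Hypothetical file: header order is id, quantile, location (2 data rows).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,quantile,location",
    "1,0.5,06",
    "2,0.99,39"),
  csv_path
)

# Schema names the same columns, but in a different order than the file.
sch <- schema(id = utf8(), location = utf8(), quantile = utf8())

# Read the file with the schema and no other options.
df <- read_csv_arrow(csv_path, schema = sch)

# Inspect the result: row count, column names, and the rows themselves.
print(dim(df))
print(names(df))
print(df)
```

The file holds 2 data rows. An extra row containing the header text, or quantile values sitting under the "location" name, would correspond to the behavior reported above.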



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
