[
https://issues.apache.org/jira/browse/ARROW-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Guillermo Duran updated ARROW-17432:
------------------------------------
Description:
This is a weird issue that creates new rows when importing a large csv (56 GB)
into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0, but did NOT
occur with the Python Arrow library 9.0.0. Due to the large size of the
original csv it is difficult to create a reproducible example, but I share the
code and outputs below.
The code I use in R to import the csv:
{code:java}
library(arrow)
library(dplyr)

csv_file <- "/ebird_erd2021/full/obs.csv"
dest <- "/ebird_erd2021/full/obs_parquet/"

sch <- arrow::schema(checklist_id = float32(),
                     species_code = string(),
                     exotic_category = float32(),
                     obs_count = float32(),
                     only_presence_reported = float32(),
                     only_slash_reported = float32(),
                     valid = float32(),
                     reviewed = float32(),
                     has_media = float32())

csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)

write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite"){code}
When I load the dataset and check one random _checklist_id_, I get duplicate
rows that are not part of the _obs.csv_ file. There should not be duplicated
species within a checklist ({_}amerob{_}, for example); also note that the
duplicated species have different {_}obs_count{_} values. There are 50 species
in total in that specific {_}checklist_id{_} (below I show how the same
checklist looks in the original csv file):
{code:java}
parquet_arrow <- open_dataset(dest, format = "parquet")
parquet_arrow |>
filter(checklist_id == 18543372) |>
arrange(species_code) |>
collect()
# A tibble: 50 × 3
checklist_id species_code obs_count
<dbl> <chr> <dbl>
1 18543372 altori 3
2 18543372 amekes 1
3 18543372 amered 40
4 18543372 amerob 30
5 18543372 amerob 9
6 18543372 balori 9
7 18543372 blkter 9
8 18543372 blkvul 20
9 18543372 buggna 1
10 18543372 buwwar 1
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows{code}
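One cheap way to quantify the problem across the whole dataset is to count (checklist_id, species_code) pairs that occur more than once; in a correct export every pair should be unique. A minimal sketch with the Python standard library (the rows here are a made-up sample standing in for the exported data, not the real obs.csv; a real check would stream rows out of the parquet dataset instead):

```python
from collections import Counter

# Toy rows standing in for the exported data:
# (checklist_id, species_code, obs_count).
rows = [
    (18543372, "amerob", 30),
    (18543372, "amerob", 9),   # spurious duplicate introduced by the import
    (18543372, "balori", 9),
    (18543372, "buggna", 1),
]

# Count each (checklist_id, species_code) pair; any count > 1 is a duplicate.
pairs = Counter((cid, sp) for cid, sp, _ in rows)
duplicates = {pair: n for pair, n in pairs.items() if n > 1}

print(duplicates)  # {(18543372, 'amerob'): 2}
```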
If I use awk to look up that same checklist id in the csv file, I get something
different:
{code:java}
$ awk -F "," '{ if ($1 == 18543372) { print } }' obs.csv
18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0
18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
18543372.0,robgro,,2.0,0.0,0.0,1.0,0.0,0.0
18543372.0,eastow,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,sedwre1,,2.0,0.0,0.0,1.0,0.0,0.0
18543372.0,ovenbi1,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,buggna,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,reshaw,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,turvul,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,gowwar,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,balori,,9.0,0.0,0.0,1.0,0.0,0.0
18543372.0,buwwar,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,cangoo,,6.0,0.0,0.0,1.0,0.0,0.0
18543372.0,houwre,,1.0,0.0,0.0,1.0,0.0,0.0
18543372.0,amered,,40.0,0.0,0.0,1.0,1.0,0.0
18543372.0,norwat,,2.0,0.0,0.0,1.0,0.0,0.0{code}
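As a complementary command-line check, duplicated species within a checklist can be surfaced straight from the csv by projecting the species column and piping it through sort | uniq -d. Sketched below on a tiny inline sample (the file name and rows are illustrative, not the real 56 GB obs.csv):

```shell
# Build a three-line sample in the same column layout as obs.csv,
# deliberately including one duplicated species code.
printf '%s\n' \
  '18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0' \
  '18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0' \
  '18543372.0,amerob,,9.0,0.0,0.0,1.0,0.0,0.0' > sample.csv

# Keep rows for the checklist, project the species column, and print any
# code that appears more than once; empty output means no duplicates.
awk -F, '$1 == "18543372.0" { print $2 }' sample.csv | sort | uniq -d
```

Running the same pipeline against the real obs.csv should print nothing, while running it against a csv dumped back out of the bad parquet dataset should print the duplicated codes.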
17 distinct species and no repetitions. Note that _amerob_ appears only once,
on the 2nd line, with an _obs_count_ of 30.
If I import the csv into parquet using the Python Arrow library as:
{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pandas as pd

test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv",
                            nrows = 1000)
sch = pa.Schema.from_pandas(test_rows_csv)

csv_file = ds.dataset("/ebird_erd2021/full/obs.csv",
                      schema = sch,
                      format = "csv")

ds.write_dataset(csv_file,
                 "ebird_erd2021/full/obs_parquet_py/",
                 format = "parquet",
                 schema = sch,
                 use_threads = True,
                 max_rows_per_file = 1000000,
                 max_rows_per_group = 1000000,
                 existing_data_behavior = "error"){code}
and then load it in R, running the same query on that checklist:
{code:java}
parquet_py <- "/ebird_erd2021/full/obs_parquet_py/"
parquet_arrow <- open_dataset(parquet_py, format = "parquet")
parquet_arrow |>
filter(checklist_id == 18543372) |>
arrange(species_code) |>
select(checklist_id, species_code, obs_count) |>
collect()
# A tibble: 17 × 3
checklist_id species_code obs_count
<dbl> <chr> <dbl>
1 18543372 amered 40
2 18543372 amerob 30
3 18543372 balori 9
4 18543372 buggna 1
5 18543372 buwwar 1
6 18543372 cangoo 6
7 18543372 eastow 1
8 18543372 gowwar 1
9 18543372 grycat 1
10 18543372 houwre 1
11 18543372 norwat 2
12 18543372 ovenbi1 1
13 18543372 reshaw 1
14 18543372 rewbla 60
15 18543372 robgro 2
16 18543372 sedwre1 2
17 18543372 turvul 1{code}
I get exactly what I should get: no _species_code_ repeated, matching the
original csv.
Given these differences, I suspect something is going wrong in the R arrow
library.
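For what it's worth, a coarser sanity check that catches this class of bug without inspecting individual checklists is to compare total row counts before and after conversion, since duplicated rows inflate the parquet side. A sketch with stand-in numbers (the counts below are hypothetical; in practice the csv side could come from e.g. `wc -l obs.csv` minus the header, and the parquet side from pyarrow's `Dataset.count_rows()`):

```python
# Hypothetical counts; replace with real ones from the source csv
# and the converted dataset.
csv_rows = 1_000_000
parquet_rows = 1_000_213

# Any nonzero difference means rows were invented or dropped in conversion.
inflation = parquet_rows - csv_rows
if inflation != 0:
    print(f"row count mismatch: parquet has {inflation:+d} rows vs the csv")
```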
> [R] duplicated rows when importing large csv into parquet
> ---------------------------------------------------------
>
> Key: ARROW-17432
> URL: https://issues.apache.org/jira/browse/ARROW-17432
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 8.0.0, 9.0.0
> Environment: R version 4.2.1
> Running in Arch Linux - EndeavourOS
> arrow_info()
> Arrow package version: 9.0.0
> Capabilities:
>
> dataset TRUE
> substrait FALSE
> parquet TRUE
> json TRUE
> s3 TRUE
> gcs TRUE
> utf8proc TRUE
> re2 TRUE
> snappy TRUE
> gzip TRUE
> brotli TRUE
> zstd TRUE
> lz4 TRUE
> lz4_frame TRUE
> lzo FALSE
> bz2 TRUE
> jemalloc TRUE
> mimalloc TRUE
> Memory:
>
> Allocator jemalloc
> Current 49.31 Kb
> Max 1.63 Mb
> Runtime:
>
> SIMD Level avx2
> Detected SIMD Level avx2
> Build:
>
> C++ Library Version 9.0.0
> C++ Compiler GNU
> C++ Compiler Version 7.5.0
> ####
> print(pa.__version__)
> 9.0.0
> Reporter: Guillermo Duran
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)