[
https://issues.apache.org/jira/browse/ARROW-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicola Crane updated ARROW-14743:
---------------------------------
Description:
If partitioned data is read back in with a schema that contains the
partitioning variable, an error is raised (see below). The error occurs
whether or not the {partitioning} argument is specified.
{code:r}
library(arrow)
library(dplyr)
data(diamonds, package='ggplot2')
write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')
diamond_schema <- schema(
carat=float64(),
cut=string(),
color=string(),
clarity=string(),
depth=float64(),
table=float64(),
price=float64(),
x=float64(),
y=float64(),
z=float64(),
)
open_dataset('diamonds', format='csv', schema=diamond_schema,
             partitioning="cut") %>%
  collect()
# Error: Invalid: Could not open CSV input source
# '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error:
# Row #1: Expected 10 columns, got 9:
# "carat","color","clarity","depth","table","price","x","y","z"
{code}
was:
If partitioned data is read back in and a schema is used (containing the
partitioning variable), there is an error - see below. The error occurs
whether or not the argument {partitioning} is specified or not.
{code: r}
library(arrow)
library(dplyr)
data(diamonds, package='ggplot2')
write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')
diamond_schema <- schema(
carat=float64(),
cut=string(),
color=string(),
clarity=string(),
depth=float64(),
table=float64(),
price=float64(),
x=float64(),
y=float64(),
z=float64(),
)
open_dataset('diamonds', format='csv', schema=diamond_schema, partitioning =
"cut") %>%
collect()
# Error: Invalid: Could not open CSV input source
'/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error:
Row #1: Expected 10 columns, got 9:
"carat","color","clarity","depth","table","price","x","y","z"
{code}
> [C++] Error reading in dataset when partitioning variable in schema
> -------------------------------------------------------------------
>
> Key: ARROW-14743
> URL: https://issues.apache.org/jira/browse/ARROW-14743
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Nicola Crane
> Priority: Major
>
> If partitioned data is read back in with a schema that contains the
> partitioning variable, an error is raised (see below). The error occurs
> whether or not the {partitioning} argument is specified.
> {code:r}
> library(arrow)
> library(dplyr)
> data(diamonds, package='ggplot2')
> write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')
> diamond_schema <- schema(
> carat=float64(),
> cut=string(),
> color=string(),
> clarity=string(),
> depth=float64(),
> table=float64(),
> price=float64(),
> x=float64(),
> y=float64(),
> z=float64(),
> )
> open_dataset('diamonds', format='csv', schema=diamond_schema,
>              partitioning="cut") %>%
>   collect()
> # Error: Invalid: Could not open CSV input source
> # '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error:
> # Row #1: Expected 10 columns, got 9:
> # "carat","color","clarity","depth","table","price","x","y","z"
> {code}