[jira] [Updated] (ARROW-14743) [C++] Error reading in dataset when partitioning variable in schema

Nicola Crane (Jira) Wed, 17 Nov 2021 14:03:07 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-14743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nicola Crane updated ARROW-14743:
---------------------------------
    Description: 
If a partitioned dataset is read back in with a schema that contains the 
partitioning variable, an error is raised (see below).  The error occurs 
whether or not the {partitioning} argument is specified.

{code:r}
library(arrow)
library(dplyr)

data(diamonds, package='ggplot2')
write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')

diamond_schema <- schema(
    carat=float64(),
    cut=string(),
    color=string(),
    clarity=string(),
    depth=float64(),
    table=float64(),
    price=float64(),
    x=float64(),
    y=float64(),
    z=float64(),
)

open_dataset('diamonds', format='csv', schema=diamond_schema,
             partitioning='cut') %>%
  collect()

# Error: Invalid: Could not open CSV input source 
# '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error: 
# Row #1: Expected 10 columns, got 9: 
# "carat","color","clarity","depth","table","price","x","y","z"

{code}
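
A minimal sketch of where the "Expected 10 columns, got 9" mismatch appears to 
come from, assuming the hive-style layout that write_dataset() produced above: 
the partition value is stored in the directory name (cut=Fair) rather than 
inside part-0.csv, so the file has one column fewer than the schema.  The 
vectors are copied from the schema and the error message; the variable names 
are illustrative only, and the snippet uses base R rather than arrow.

{code:r}
# Field names from the schema in this report (10 fields, including `cut`).
schema_fields <- c("carat", "cut", "color", "clarity", "depth",
                   "table", "price", "x", "y", "z")

# Header row quoted in the error message for cut=Fair/part-0.csv (9 columns).
file_header <- c("carat", "color", "clarity", "depth",
                 "table", "price", "x", "y", "z")

# Hive-style partition keys are encoded in the directory name as key=value.
partition_dir <- "cut=Fair"
partition_key <- strsplit(partition_dir, "=", fixed = TRUE)[[1]][1]

# Fields the schema declares that the file itself does not contain.
missing_from_file <- setdiff(schema_fields, file_header)
print(missing_from_file)
# [1] "cut"

# Setting the partition key aside leaves exactly the columns found in the file.
file_fields <- setdiff(schema_fields, partition_key)
stopifnot(identical(file_fields, file_header))
{code}

The only field missing from the file is the partitioning variable itself, 
which is consistent with the reader expecting every schema field to be 
present as a CSV column.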

  was:
If a partitioned dataset is read back in with a schema that contains the 
partitioning variable, an error is raised (see below).  The error occurs 
whether or not the {partitioning} argument is specified.

{code: r}
library(arrow)
library(dplyr)

data(diamonds, package='ggplot2')
write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')

diamond_schema <- schema(
    carat=float64(),
    cut=string(),
    color=string(),
    clarity=string(),
    depth=float64(),
    table=float64(),
    price=float64(),
    x=float64(),
    y=float64(),
    z=float64(),
)

open_dataset('diamonds', format='csv', schema=diamond_schema,
             partitioning='cut') %>%
  collect()

# Error: Invalid: Could not open CSV input source 
# '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error: 
# Row #1: Expected 10 columns, got 9: 
# "carat","color","clarity","depth","table","price","x","y","z"

{code}


> [C++] Error reading in dataset when partitioning variable in schema
> -------------------------------------------------------------------
>
>                 Key: ARROW-14743
>                 URL: https://issues.apache.org/jira/browse/ARROW-14743
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Nicola Crane
>            Priority: Major
>
> If a partitioned dataset is read back in with a schema that contains the 
> partitioning variable, an error is raised (see below).  The error occurs 
> whether or not the {partitioning} argument is specified.
> {code:r}
> library(arrow)
> library(dplyr)
> data(diamonds, package='ggplot2')
> write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')
> diamond_schema <- schema(
>     carat=float64(),
>     cut=string(),
>     color=string(),
>     clarity=string(),
>     depth=float64(),
>     table=float64(),
>     price=float64(),
>     x=float64(),
>     y=float64(),
>     z=float64(),
> )
> open_dataset('diamonds', format='csv', schema=diamond_schema,
>              partitioning='cut') %>%
>   collect()
> # Error: Invalid: Could not open CSV input source 
> # '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error: 
> # Row #1: Expected 10 columns, got 9: 
> # "carat","color","clarity","depth","table","price","x","y","z"
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
