[ 
https://issues.apache.org/jira/browse/ARROW-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320095#comment-17320095
 ] 

Joris Van den Bossche commented on ARROW-12080:
-----------------------------------------------

[~banderlog] sorry for the slow response. 

bq. The first table schema becomes a common schema for the full Dataset

That's indeed the current behaviour (but I see that this should be documented 
better). See ARROW-8221 for a general issue about expanding this (e.g. to 
unify the schema across all files). 

A workaround for now is to manually specify the schema (of course, in the case 
of CSV you actually need to parse the data to get the schema ..). You could 
read a bigger chunk once to get the proper schema, and then pass that schema 
to {{ds.dataset(..)}}. Or, if you know the schema of the file, you can create 
it manually with {{pa.schema(..)}} (similar to passing a dict of types to 
pandas.read_csv).
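
A rough sketch of the second option (column names here are hypothetical stand-ins for the real CSV columns). The point is that a schema passed explicitly to {{ds.dataset(..)}} overrides the inference from the first file, so an all-NA first fragment no longer poisons the dataset schema:

{code:python}
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

tmpdir = tempfile.mkdtemp()

# An all-NA chunk: without type information, Arrow infers a null type column.
table = pa.table({'itemid': [1, 2], 'param_type': pa.array([None, None])})
pq.write_to_dataset(table, root_path=tmpdir)

# Inferred from the first file, param_type comes back as null ...
print(ds.dataset(tmpdir).schema.field('param_type').type)

# ... but an explicit schema built with pa.schema(..) overrides that.
schema = pa.schema([('itemid', pa.int64()), ('param_type', pa.string())])
dataset = ds.dataset(tmpdir, schema=schema)
print(dataset.schema.field('param_type').type)
{code}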

In your specific case, you can actually already specify the schema in {{table = 
pa.Table.from_pandas(chunk)}} before writing to parquet. By doing that, you can 
ensure that the parquet files have the proper types, and subsequent reading of 
the Parquet dataset will then work fine without needing to specify the schema 
manually.
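
As a sketch (again with hypothetical column names), passing an explicit schema to {{Table.from_pandas}} pins the Arrow type even for a chunk where the column is entirely NA:

{code:python}
import pandas as pd
import pyarrow as pa

# Hypothetical schema for the columns of interest; extend to match the real CSV.
schema = pa.schema([('itemid', pa.int64()), ('param_type', pa.string())])

# A chunk where param_type is entirely NA (pandas stores it as object/NaN).
chunk = pd.DataFrame({'itemid': [1, 2], 'param_type': [None, None]})

# Without a schema, Arrow would infer param_type as null; with it, the type is fixed.
table = pa.Table.from_pandas(chunk, schema=schema, preserve_index=False)
print(table.schema.field('param_type').type)
{code}

Parquet files written from such tables all carry {{param_type: string}}, so the dataset-level schema is consistent regardless of which file is read first.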

> [Python][Dataset] The first table schema becomes a common schema for the full 
> Dataset
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-12080
>                 URL: https://issues.apache.org/jira/browse/ARROW-12080
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Borys Kabakov
>            Priority: Major
>              Labels: dataset, datasets
>
> The first table schema becomes a common schema for the full Dataset. This can 
> cause problems with sparse data.
> Consider the example below: when the first chunk is full of NA values, pyarrow 
> ignores the pandas dtypes for the whole dataset:
> {code:python}
> # get dataset
> !wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv
> 
> import shutil
> from pathlib import Path
> 
> import pandas as pd
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> 
> 
> def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
>     if Path(output).exists():
>         shutil.rmtree(output)
> 
>     # write dataset
>     d_items = pd.read_csv(input_csv, index_col='row_id',
>                           usecols=['row_id', 'itemid', 'label', 'dbsource',
>                                    'category', 'param_type'],
>                           dtype={'row_id': int, 'itemid': int, 'label': str,
>                                  'dbsource': str, 'category': str,
>                                  'param_type': str},
>                           chunksize=chunksize)
>     for i, chunk in enumerate(d_items):
>         table = pa.Table.from_pandas(chunk)
>         if i == 0:
>             schema1 = pa.Schema.from_pandas(chunk)
>             schema2 = table.schema
>         # print(table.field('param_type'))
>         pq.write_to_dataset(table, root_path=output)
> 
>     # read dataset
>     dataset = ds.dataset(output)
> 
>     # compare schemas
>     print('Schemas are equal: ', dataset.schema == schema1 == schema2)
>     print(dataset.schema.types)
>     print('Should be string', dataset.schema.field('param_type'))
>     return dataset
> {code}
> {code:python}
> dataset = foo()
> dataset.to_table()
> >>>Schemas are equal:  False
> [DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
> Should be string pyarrow.Field<param_type: null>
> ---------------------------------------------------------------------------
> ArrowTypeError: fields had matching names but differing types. From: category: string To: category: null
> {code}
> If you list the schemas of the individual files, you'll see that almost all 
> parquet files ignored the pandas dtypes:
> {code:python}
> import os
> for i in os.listdir('tmp.parquet/'):
>     print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))
> >>>pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> {code}
> But if we read a bigger chunk of data that contains non-NA values, everything 
> is OK:
> {code:python}
> dataset = foo(chunksize=10000)
> dataset.to_table()
> >>>Schemas are equal:  True
> [DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
> Should be string pyarrow.Field<param_type: string>
> pyarrow.Table
> itemid: int64
> label: string
> dbsource: string
> category: string
> param_type: string
> row_id: int64
> {code}
> Check NA in data:
> {code:python}
> pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
> >>>array([nan])
> pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
> >>>array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
>        'Checkbox'], dtype=object)
> {code}
>  
>  PS: switching issue reporting from GitHub to Jira is an outstanding move
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
