[ https://issues.apache.org/jira/browse/ARROW-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308207#comment-17308207 ]

Borys Kabakov commented on ARROW-12080:
---------------------------------------

It affects not only datasets, but also writing to a single file:
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path


def bar(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
    # write file
    Path(output).unlink(missing_ok=True)
    pqwriter = None

    d_items = pd.read_csv(input_csv, index_col='row_id',
                          usecols=['row_id', 'itemid', 'label', 'dbsource',
                                   'category', 'param_type'],
                          dtype={'row_id': int, 'itemid': int, 'label': str,
                                 'dbsource': str, 'category': str,
                                 'param_type': str},
                          chunksize=chunksize)
    for i, chunk in enumerate(d_items):
        table = pa.Table.from_pandas(chunk)
        if i == 0:
            # create a parquet writer object, giving it the output file
            pqwriter = pq.ParquetWriter(output, table.schema)
        pqwriter.write_table(table)

    # close the parquet writer
    if pqwriter:
        pqwriter.close()

    df = pd.read_parquet(output)
    return df


# all is fine: the returned dataframe equals 'D_ITEMS.csv'
df = bar(chunksize=10000)

# this one crashes -- same reason as before, only NAs in the first chunk
df = bar(chunksize=1000)

>>>
---------------------------------------------------------------------------
ValueError: Table schema does not match schema used to create file: 
table:
itemid: int64
label: string
dbsource: string
category: null
param_type: null
row_id: int64
-- schema metadata --
pandas: '{"index_columns": ["row_id"], "column_indexes": [{"name": null, ' + 
877 vs. 
file:
itemid: int64
label: string
dbsource: string
category: string
param_type: null
row_id: int64
-- schema metadata --
pandas: '{"index_columns": ["row_id"], "column_indexes": [{"name": null, ' + 879
{code}
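
A workaround sketch for the single-file case (untested beyond my setup, and it assumes Table.from_pandas accepts a schema that names the index column): build the schema from the intended dtypes up front and pass it to both ParquetWriter and Table.from_pandas, so an all-NA first chunk cannot degrade a column to the null type:
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# explicit schema mirroring the dtypes passed to read_csv above
schema = pa.schema([('itemid', pa.int64()),
                    ('label', pa.string()),
                    ('dbsource', pa.string()),
                    ('category', pa.string()),
                    ('param_type', pa.string()),
                    ('row_id', pa.int64())])


def bar_workaround(input_csv='D_ITEMS.csv', output='tmp.parquet',
                   chunksize=1000):
    d_items = pd.read_csv(input_csv, index_col='row_id',
                          usecols=['row_id', 'itemid', 'label', 'dbsource',
                                   'category', 'param_type'],
                          chunksize=chunksize)
    pqwriter = pq.ParquetWriter(output, schema)
    for chunk in d_items:
        # from_pandas converts to the given types instead of inferring
        # null for all-NA object columns
        pqwriter.write_table(pa.Table.from_pandas(chunk, schema=schema))
    pqwriter.close()
    return pd.read_parquet(output)
{code}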

> The first table schema becomes a common schema for the full Dataset
> -------------------------------------------------------------------
>
>                 Key: ARROW-12080
>                 URL: https://issues.apache.org/jira/browse/ARROW-12080
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Borys Kabakov
>            Priority: Major
>
> The first table schema becomes the common schema for the full Dataset, which
> can cause problems with sparse data.
> Consider the example below: when the first chunk is full of NAs, pyarrow
> ignores the dtypes from pandas for the whole dataset:
> {code:java}
> # get dataset
> !wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv
> import pandas as pd 
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pyarrow.dataset as ds
> import shutil
> from pathlib import Path
> def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
>     if Path(output).exists():
>         shutil.rmtree(output)
>
>     # write dataset
>     d_items = pd.read_csv(input_csv, index_col='row_id',
>                           usecols=['row_id', 'itemid', 'label', 'dbsource',
>                                    'category', 'param_type'],
>                           dtype={'row_id': int, 'itemid': int, 'label': str,
>                                  'dbsource': str, 'category': str,
>                                  'param_type': str},
>                           chunksize=chunksize)
>     for i, chunk in enumerate(d_items):
>         table = pa.Table.from_pandas(chunk)
>         if i == 0:
>             schema1 = pa.Schema.from_pandas(chunk)
>             schema2 = table.schema
> #         print(table.field('param_type'))
>         pq.write_to_dataset(table, root_path=output)
>     
>     # read dataset
>     dataset = ds.dataset(output)
>     
>     # compare schemas
>     print('Schemas are equal: ', dataset.schema == schema1 == schema2)
>     print(dataset.schema.types)
>     print('Should be string', dataset.schema.field('param_type'))    
>     return dataset
> {code}
> {code:java}
> dataset = foo()
> dataset.to_table()
> >>>Schemas are equal:  False
> [DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
> Should be string pyarrow.Field<param_type: null>
> ---------------------------------------------------------------------------
> ArrowTypeError: fields had matching names but differing types. From: category: string To: category: null
> {code}
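> The per-file inconsistency comes from type inference at conversion time: a
> chunk whose object column holds only NAs is inferred as the null type. A
> minimal repro of that inference, as far as I can tell:
> {code:java}
> import pandas as pd
> import pyarrow as pa
>
> # an all-NA object column is inferred as null, not string
> all_na = pd.DataFrame({'param_type': pd.Series([None, None], dtype=object)})
> print(pa.Table.from_pandas(all_na).schema.field('param_type'))
> >>>pyarrow.Field<param_type: null>
> {code}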
> If you list the schemas of the individual files, you'll see that almost all
> of the parquet files ignored the pandas dtypes:
> {code:java}
> import os
> for i in os.listdir('tmp.parquet/'):
>     print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))
> >>>pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> {code}
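> A read-side workaround sketch (my assumption, untested): pass an explicit
> schema to ds.dataset so the fragments are unified against it rather than
> against the first file; scanning should then cast the null columns:
> {code:java}
> expected = pa.schema([('itemid', pa.int64()),
>                       ('label', pa.string()),
>                       ('dbsource', pa.string()),
>                       ('category', pa.string()),
>                       ('param_type', pa.string()),
>                       ('row_id', pa.int64())])
> # open the dataset with the intended schema instead of the inferred one
> dataset = ds.dataset('tmp.parquet', schema=expected)
> dataset.to_table()
> {code}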
> But if we use a bigger chunk size, so that the first chunk contains non-NA
> values, everything is OK:
> {code:java}
> dataset = foo(chunksize=10000)
> dataset.to_table()
> >>>Schemas are equal:  True
> [DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
> Should be string pyarrow.Field<param_type: string>
> pyarrow.Table
> itemid: int64
> label: string
> dbsource: string
> category: string
> param_type: string
> row_id: int64
> {code}
> Checking for NAs in the data:
> {code:java}
> pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
> >>>array([nan])
> pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
> >>>array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
>        'Checkbox'], dtype=object)
> {code}
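> On the write side, a similar sketch should work (assuming the 'expected'
> schema from above): convert every chunk with one explicit schema before
> write_to_dataset, so all fragments agree:
> {code:java}
> for i, chunk in enumerate(d_items):
>     table = pa.Table.from_pandas(chunk, schema=expected)
>     pq.write_to_dataset(table, root_path=output)
> {code}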
>  
>  PS: switching issue reporting from GitHub to Jira is an outstanding move
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
