[GitHub] [arrow] njzhl93 opened a new issue #8741: Is it possible that using pyarrow.dataset.FileSystemDataset.from_paths() to read dataframes with different schemas

GitBox Sun, 22 Nov 2020 17:51:15 -0800


njzhl93 opened a new issue #8741:
URL: https://github.com/apache/arrow/issues/8741



   I need to partition a DataFrame into several Parquet files and then read 
them back using FileSystemDataset.from_paths.
   
   For example, I have two DataFrames like this:
   ```
   MDDate=20201113/20201113.parquet
        0  1      2     3    MDDate
   0  1.1  1  [1.1]     a  20201113
   1  2.2  2  [2.2]  None  20201113
   2  3.3  3  [3.3]     c  20201113
   
   MDDate=20201114/20201114.parquet
         0  1      2     3    MDDate
   0  None  1  [1.1]     a  20201114
   1  None  2   None  None  20201114
   2  None  3  [3.3]     c  20201114 
   ```
   I try to read them with FileSystemDataset.from_paths:
   
   ```
   import pyarrow as pa
   import pyarrow.dataset as ds
   from pyarrow import fs  # provides SubTreeFileSystem and LocalFileSystem

   file_list = ['MDDate=20201113/20201113.parquet',
                'MDDate=20201114/20201114.parquet']
   # Schema I expect for the data columns
   schemas = pa.schema([('0', pa.float64()), ('1', pa.int64()),
                        ('2', pa.list_(pa.float64())), ('3', pa.string())])
   # One partition expression per file, in the same order as file_list
   partitions = [ds.field("MDDate") == '20201113',
                 ds.field("MDDate") == '20201114']
   dataset = ds.FileSystemDataset.from_paths(file_list,
                                             schema=schemas,
                                             format=ds.ParquetFileFormat(),
                                             filesystem=fs.SubTreeFileSystem(
                                                 'test_read',
                                                 fs.LocalFileSystem()),
                                             partitions=partitions)
   df = dataset.to_table().to_pandas()
   ```
   Then I get an error like this:
   
   ```
   Traceback (most recent call last):
     File "/app/mount/code/test_save_factor/test_read_from_path.py", line 36, in <module>
       df = dataset.to_table().to_pandas()
     File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
     File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
     File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
     File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: fields had matching names but differing types. From: 0: null To: 0: double
   ```
   I think this error is about the schema. I set column 0 to pa.float64(), but 
in MDDate=20201114/20201114.parquet the type of column 0 is pa.null(), because 
all of its rows are None, and it is not converted to pa.float64() automatically.
   
   Is it possible to read Parquet files with from_paths when some columns do 
not have the same type in every file?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
