[GitHub] [arrow] AlenkaF commented on issue #33845: [Python] _prefixed column names change `write_to_dataset` behavior

via GitHub Mon, 23 Jan 2023 23:44:14 -0800


AlenkaF commented on issue #33845:
URL: https://github.com/apache/arrow/issues/33845#issuecomment-1401504187


   Your example works well for me once I correct a few typos in the code:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   df=pd.DataFrame({'a':[1,2],'b':[3,4],'_part_col':[5,6]})
   df
   #    a  b  _part_col
   # 0  1  3          5
   # 1  2  4          6
   table = pa.Table.from_pandas(df)
   table
   # pyarrow.Table
   # a: int64
   # b: int64
   # _part_col: int64
   # ----
   # a: [[1,2]]
   # b: [[3,4]]
   # _part_col: [[5,6]]
   
   # Try writing with parquet module as in your example
   pq.write_to_dataset(table, root_path='example_pq_ds',
                       use_legacy_dataset=False)
   
   # First read the dataset with dataset module
   dataset_pq = ds.dataset('example_pq_ds')
   dataset_pq.to_table()
   # pyarrow.Table
   # a: int64
   # b: int64
   # _part_col: int64
   # ----
   # a: [[1,2]]
   # b: [[3,4]]
   # _part_col: [[5,6]]
   
   # Then read the dataset with the parquet module
   pq.read_table('example_pq_ds', use_legacy_dataset=False)
   # pyarrow.Table
   # a: int64
   # b: int64
   # _part_col: int64
   # ----
   # a: [[1,2]]
   # b: [[3,4]]
   # _part_col: [[5,6]]
   pq.read_table('example_pq_ds', use_legacy_dataset=True)
   # <stdin>:1: FutureWarning: Passing 'use_legacy_dataset=True' to get the
   # legacy behaviour is deprecated as of pyarrow 8.0.0, and the legacy
   # implementation will be removed in a future version.
   # pyarrow.Table
   # a: int64
   # b: int64
   # _part_col: int64
   # ----
   # a: [[1,2]]
   # b: [[3,4]]
   # _part_col: [[5,6]]
   
   ```
   It also works using `parquet` and `dataset` modules in other ways:
   
   ```python
   # Try with writing to single file
   # and reading with pq.read_table
   pq.write_table(table, 'example.parquet')
   pq.read_table('example.parquet')
   # pyarrow.Table
   # a: int64
   # b: int64
   # _part_col: int64
   # ----
   # a: [[1,2]]
   # b: [[3,4]]
   # _part_col: [[5,6]]
   
   # Try reading a single file with the dataset module
   dataset = ds.dataset('example.parquet', format="parquet")
   dataset.to_table()
   # pyarrow.Table
   # a: int64
   # b: int64
   # _part_col: int64
   # ----
   # a: [[1,2]]
   # b: [[3,4]]
   # _part_col: [[5,6]]
   
   ```
   
   I am running this on the latest master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.