AlenkaF commented on issue #14025:
URL: https://github.com/apache/arrow/issues/14025#issuecomment-1235126469

   The schema must include all the columns that are needed in the output.
   
   For example if you have a Pandas dataframe
   ```python
   >>> import pandas as pd
   >>> df = pd.DataFrame(
   ...     {
   ...         "A": [1, 2, 3],
   ...         "B": ["a", "b", "c"],
   ...         "C": pd.date_range("2016-01-01", freq="d", periods=3),
   ...     },
   ...     index=pd.Index(range(3), name="idx"),
   ... )
   
   >>> df
        A  B          C
   idx                 
   0    1  a 2016-01-01
   1    2  b 2016-01-02
   2    3  c 2016-01-03
   ```
   
   And you define only two columns in the schema:
   
   ```python
   >>> import pyarrow as pa
   >>> schema = pa.schema([
   ...     ('A', pa.int16()),
   ...     ('B', pa.string())
   ... ])
   ```
   
   The output will include only the two columns specified by the schema.
   
   ```python
   >>> df.to_parquet('df.parquet.gzip', schema=schema)
   >>> pd.read_parquet('df.parquet.gzip')
      A  B
   0  1  a
   1  2  b
   2  3  c
   ```
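
   If you want to verify what was actually stored, `pyarrow.parquet.read_schema` reads back only the schema of the file without loading the data (a quick check; the schema metadata lines are trimmed from the output below):
   
   ```python
   >>> import pyarrow.parquet as pq
   >>> pq.read_schema('df.parquet.gzip')
   A: int16
   B: string
   ```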
   
   You will have to **include all of them in the schema** if you do not want any of the columns to be lost.
   If there are many columns, you can save the schema from the pyarrow table and then change the type of just one column:
   
   ```python
   >>> table = pa.table(df)
   >>> table
   pyarrow.Table
   A: int64
   B: string
   C: timestamp[ns]
   ----
   A: [[1,2,3]]
   B: [["a","b","c"]]
   C: [[2016-01-01 00:00:00.000000000,2016-01-02 00:00:00.000000000,2016-01-03 00:00:00.000000000]]
   
   >>> # Save the schema
   >>> schema = table.schema
   >>> schema
   A: int64
   B: string
   C: timestamp[ns]
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": "idx", "start": 0, ' + 600
   
   >>> # Change datatype and save into new schema
   >>> schema_new = schema.set(0, pa.field('A', pa.int16()))
   >>> schema_new
   A: int16
   B: string
   C: timestamp[ns]
   -- schema metadata --
   pandas: '{"index_columns": [{"kind": "range", "name": "idx", "start": 0, ' + 600
   
   >>> # Create a parquet file with all columns and the changed type for one of them
   >>> df.to_parquet('df.parquet.gzip', schema=schema_new)
   >>> df_from_parquet = pd.read_parquet('df.parquet.gzip')
   >>> df_from_parquet
      A  B          C
   0  1  a 2016-01-01
   1  2  b 2016-01-02
   2  3  c 2016-01-03
   >>> pa.table(df_from_parquet)
   pyarrow.Table
   A: int16
   B: string
   C: timestamp[ns]
   ----
   A: [[1,2,3]]
   B: [["a","b","c"]]
   C: [[2016-01-01 00:00:00.000000000,2016-01-02 00:00:00.000000000,2016-01-03 00:00:00.000000000]]
   ```
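
   If several columns need a new type, the same `Schema.set` call can be applied in a loop. This is just a sketch: the mapping in `new_types` is a hypothetical example, and the target types are placeholders for whatever you need:
   
   ```python
   >>> # Hypothetical mapping of column names to target types
   >>> new_types = {'A': pa.int16(), 'C': pa.timestamp('ms')}
   >>> schema_multi = schema
   >>> for name, typ in new_types.items():
   ...     # Look up the field's position and replace it with a new field
   ...     idx = schema_multi.get_field_index(name)
   ...     schema_multi = schema_multi.set(idx, pa.field(name, typ))
   >>> df.to_parquet('df.parquet.gzip', schema=schema_multi)
   ```
   
   As a side note, `pa.Schema.from_pandas(df)` should give you the same starting schema as `pa.table(df).schema` without materializing the whole table.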
   
   If the structure of the dataframes does not change and you only need to do this once, I think this approach might help.

