Spaarsh commented on PR #982: URL: https://github.com/apache/datafusion-python/pull/982#issuecomment-2708491224
Key Points:

1. The `|` operator between types is not supported on Python < 3.10, so anyone who pulls `main` after the merge on an older interpreter will not be able to use `SessionContext` at all.
2. `global_ctx` is already exposed to Python.

Details:

The `|` operator used in the annotations of all the `read_*` functions is supported only on Python >= 3.10. To even import `SessionContext`, I had to replace every `|` operation with `Union`. Until then, I was getting this error:

```
$ python3
Python 3.9.7 (default, Oct 18 2021, 02:25:46)
[Clang 13.0.0 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> imp
KeyboardInterrupt
>>> from datafusion import SessionContext
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spaarsh/gsoc/df-py/datafusion-python/python/datafusion/__init__.py", line 48, in <module>
    from .io import read_avro, read_csv, read_json, read_parquet
  File "/home/spaarsh/gsoc/df-py/datafusion-python/python/datafusion/io.py", line 31, in <module>
    path: str | pathlib.Path,
TypeError: unsupported operand type(s) for |: 'type' and 'type'
```

After replacing all `|` operations with `Union`, it all works.

`global_ctx` is already being exposed to Python, unless I have misunderstood something. I pulled the branch and tested it. It works.
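To make the `|` versus `Union` point concrete, here is a minimal sketch. `normalize_path` and `pipe_union_supported` are hypothetical names used only for illustration; they are not part of datafusion-python, and the real `read_*` signatures take more parameters than shown here.

```python
import pathlib
import sys
from typing import Union


# Hypothetical stand-in for a read_* signature; not the actual datafusion code.
def normalize_path(path: Union[str, pathlib.Path]) -> str:
    """Accept either a str or a pathlib.Path and return a plain string."""
    return str(path)


def pipe_union_supported() -> bool:
    """Return True if `str | pathlib.Path` can be evaluated at runtime."""
    try:
        str | pathlib.Path  # PEP 604 syntax; raises TypeError before Python 3.10
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    # On 3.9 the second value is False, which is why the import failed.
    print(sys.version_info[:2], pipe_union_supported())
    print(normalize_path(pathlib.Path("data.csv")))
```

The `Union[...]` spelling evaluates on every supported Python 3 version, so a module annotated this way imports cleanly on 3.9. Another option would be `from __future__ import annotations`, which postpones annotation evaluation so `str | pathlib.Path` is never executed at import time; whether that fits this project is a separate call.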
```
$ python3 test.py
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
DataFrame()
+-----+----+-----------+----------+---------+------------+
| age | id | is_active | name     | salary  | start_date |
+-----+----+-----------+----------+---------+------------+
| 32  | 1  | true      | John Doe | 75000.5 | 2020-01-15 |
+-----+----+-----------+----------+---------+------------+
DataFrame()
+----+----------+-----+---------+------------+-----------+
| id | name     | age | salary  | start_date | is_active |
+----+----------+-----+---------+------------+-----------+
| 1  | John Doe | 32  | 75000.5 | 2020-01-15 | true      |
+----+----------+-----+---------+------------+-----------+
```

Just for reference, these are the scripts I used to generate and test the functions:

<details>

```
# test.py
from datafusion import SessionContext

# Create a new session
ctx = SessionContext()

# Read different file formats
df1 = ctx.read_csv("data.csv")  # Accepts str or Path
df2 = ctx.read_parquet("data.parquet")
df3 = ctx.read_json("data.json")
df4 = ctx.read_avro("data.avro")

print(df1)
print(df2)
print(df3)
print(df4)
```

```
# create.py - to create the data files
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import json
import fastavro

# Sample data as a dictionary
data = {
    'id': 1,
    'name': ['John Doe'],
    'age': [32],
    'salary': [75000.50],
    'start_date': ['2020-01-15'],
    'is_active': [True]
}

# Create DataFrame
df = pd.DataFrame(data)

# Save as Parquet
df.to_parquet('data.parquet')

# Save as JSON (line-delimited)
with open('data.json', 'w') as f:
    for _, row in df.iterrows():
        json.dump(row.to_dict(), f)
        f.write('\n')

# Save as Avro
schema = {
    'name': 'Employee',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'int'},
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
        {'name': 'salary', 'type': 'double'},
        {'name': 'start_date', 'type': 'string'},
        {'name': 'is_active', 'type': 'boolean'}
    ]
}

records = df.to_dict('records')
with open('data.avro', 'wb') as f:
    fastavro.writer(f, schema, records)
```

</details>
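`test.py` reads `data.csv`, but `create.py` as posted writes only Parquet, JSON, and Avro. The sketch below is one way such a file could be produced, assuming it holds the same single row that appears in the printed output. The `write_csv` helper, the use of the standard-library `csv` module, and the lowercase `true` are assumptions for illustration, not necessarily how the original file was made.

```python
import csv

# Same single record as in create.py (assumed to match the CSV used in the test).
row = {
    "id": 1,
    "name": "John Doe",
    "age": 32,
    "salary": 75000.50,
    "start_date": "2020-01-15",
    "is_active": "true",  # assumption: lowercase, mirroring the printed output
}


def write_csv(path: str = "data.csv") -> None:
    """Write a header line and the single sample row to `path`."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__":
    write_csv()
```

Running it next to `create.py` would leave `data.csv` alongside the other three data files.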