AlenkaF commented on issue #13125: URL: https://github.com/apache/arrow/issues/13125#issuecomment-1124600121
Have you tried using [pyarrow.csv.read_csv](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html#pyarrow.csv.read_csv) to read the CSV into an Arrow table and then write it to Parquet? Hope this helps:

```python
>>> import io
>>> import pyarrow.csv as csv
>>> s = """int__v|Decimal__v|Float__v|Boolean__v|String__v|Null__v|Date__v|Timestamp__v
... 1|43.4|11.02|True|'456'|12|2021-03-02|2019-08-07 10:11:12
... 2|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 3|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 4|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 5|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 6|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 7|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 8|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 9|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 10|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 11|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 12|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 13|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 14|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 15|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 16|43.4|11.02|True|'456'||2021-03-02|2019-08-07 10:11:12
... 17|101.4|11.128|False|'456'||2020-09-09|2019-05-04 11:12:13
... 18|43.4|13.02|True|'456'||2022-03-14|2012-08-07 10:15:12
... 19|202.4|14.128|False|'456'||2020-03-15|2017-09-04 11:17:13
... 4|||||||
... """
>>> source = io.BytesIO(s.encode())

>>> # Read with pyarrow.csv.read_csv
>>> parse_options = csv.ParseOptions(delimiter="|")
>>> table = csv.read_csv(source, parse_options=parse_options)

>>> # Write to parquet
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, 'example.parquet', compression='snappy')

>>> # Check the result
>>> pq.read_table('example.parquet')["Boolean__v"]
<pyarrow.lib.ChunkedArray object at 0x139459450>
[
  [
    true,
    false,
    true,
    false,
    true,
    ...
    true,
    false,
    true,
    false,
    null
  ]
]
```

You can also pass [pyarrow.csv.ReadOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions), with settings such as `block_size` and `encoding`, and [pyarrow.csv.ParseOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions), with settings such as `ignore_empty_lines`.
