Writing Parquet datasets using pyarrow.parquet.ParquetWriter

Palak Harwani Sat, 30 May 2020 07:53:04 -0700

Hi,
I have a few questions regarding pyarrow.parquet. I want to write a Parquet
dataset that is partitioned by one column. I have a large CSV file, which
I read in chunks and write out using the following code:


  # csv_to_parquet.py

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize,
                         low_memory=False)
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)


parquet_writer.close()



But this code writes a single Parquet file, and I don't see any method in
ParquetWriter for writing to a dataset; it only has the write_table method.
Is there a way to do this?

Also, how do I write the metadata file and the common metadata file for the
example above, and how do I write the metadata files for a partitioned
dataset?

Thanks in advance.

-- 
*Regards,*
*Palak Harwani*
