Hi Palak,

The ParquetWriter class is meant to write a single parquet file, so the fact that your code produces only one parquet file is expected.
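To make that concrete, here is a minimal sketch (the file names and the toy dataframe are made up for illustration) that keeps one writer per value of a partition column, so each ParquetWriter instance still writes exactly one file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"col": ["a", "a", "b"], "val": [1, 2, 3]})

# one ParquetWriter (and thus one output file) per value of the partition column
writers = {}
for value, subset in df.groupby("col"):
    table = pa.Table.from_pandas(subset, preserve_index=False)
    if value not in writers:
        writers[value] = pq.ParquetWriter(f"part-{value}.parquet", table.schema)
    writers[value].write_table(table)

for writer in writers.values():
    writer.close()

In your chunked-csv loop you would keep the `writers` dict alive across chunks, so rows from later chunks with the same partition value get appended to the same file.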
If you want to write multiple files, you can either manually create multiple ParquetWriter instances (each with a different parquet file name), as sketched above, or use the `pq.write_to_dataset()` function, which can automatically partition your data into multiple files based on a column. However, this function requires the full dataset in memory as a pandas DataFrame or pyarrow Table, so it is not compatible with the chunked csv reading.

If you want to do it in chunks, it might be easier to use a higher-level package such as dask. Dask can read a csv file in chunks and write it to parquet using pyarrow automatically (see e.g. https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv). It would look like:

import dask.dataframe as dd

df = dd.read_csv(..)
df.to_parquet(.., partition_on=['col'], engine="pyarrow")

Dask can also write a (common) metadata file for you. If you want to do this manually using pyarrow, you can take a look at the `parquet.write_metadata` function (https://github.com/apache/arrow/blob/494e7a9c5714f3ed9e5590aeef8362114d5a3a46/python/pyarrow/parquet.py#L1748-L1783); see also the sketch at the bottom of this mail. This still needs to be better documented (covered by https://issues.apache.org/jira/browse/ARROW-3154).

Best,
Joris

On Sat, 30 May 2020 at 16:52, Palak Harwani <[email protected]> wrote:

> Hi,
> I had a few questions regarding pyarrow.parquet. I want to write a Parquet
> dataset which is partitioned according to one column. I have a large csv
> file and I'm reading it in chunks using the following code:
>
> # csv_to_parquet.py
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> csv_file = '/path/to/my.tsv'
> parquet_file = '/path/to/my.parquet'
> chunksize = 100_000
>
> csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize,
>                          low_memory=False)
>
> for i, chunk in enumerate(csv_stream):
>     print("Chunk", i)
>     if i == 0:
>         # Guess the schema of the CSV file from the first chunk
>         parquet_schema = pa.Table.from_pandas(df=chunk).schema
>         # Open a Parquet file for writing
>         parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
>                                           compression='snappy')
>     # Write CSV chunk to the parquet file
>     table = pa.Table.from_pandas(chunk, schema=parquet_schema)
>     parquet_writer.write_table(table)
>
> parquet_writer.close()
>
> But this code writes a single parquet file and I don't see any method in
> ParquetWriter to write to a dataset; it just has the write_table method.
> Is there a way to do this?
>
> Also, how do I write the metadata file in the example mentioned above, and
> the common metadata file as well as the metadata files in case of a
> partitioned dataset?
>
> Thanks in advance.
>
> --
> *Regards,*
> *Palak Harwani*
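P.S. on the metadata question: assuming a recent pyarrow where `write_to_dataset()` accepts a `metadata_collector` keyword and `write_metadata()` accepts the collected list, writing the `_common_metadata` and `_metadata` files could look roughly like this sketch (the table and the output directory are made up):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"col": ["a", "a", "b"], "val": [1, 2, 3]})
root_path = "my_dataset"  # made-up output directory

# collect the metadata of every file that gets written
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# _common_metadata: just the schema, no row group information
pq.write_metadata(table.schema, root_path + "/_common_metadata")

# _metadata: the schema plus the row group metadata of all written files
pq.write_metadata(table.schema, root_path + "/_metadata",
                  metadata_collector=metadata_collector)

For a dataset partitioned with partition_cols, the partition columns end up in the directory names rather than in the files themselves, so the schema and the file paths recorded in _metadata need some extra care in that case.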
