kato1208 opened a new issue, #13142:
URL: https://github.com/apache/arrow/issues/13142
What is the difference between the "write_table" and "write_batch" methods of ParquetWriter?
The following code compares "write_table" and "write_batch", but they appear to behave the same.
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('id', pa.int32()),
    pa.field("points", pa.list_(pa.int32())),
])

records1 = [
    {
        "name": "Bob",
        "id": 323234,
        "points": [20, 12, 22],
    },
    {
        "name": "Alex",
        "id": 1234
    }
]

records2 = [
    {
        "name": "Niclas",
        "id": 123222,
        "points": [21, 32, 1],
    },
    {
        "name": "lena",
        "id": 12345
    }
]

# Write each chunk as a Table using write_table
writer = pq.ParquetWriter('test_table.parquet', schema=schema)
for records in [records1, records2]:
    table = pa.Table.from_pylist(records, schema=schema)
    writer.write_table(table)
writer.close()

print("show result for write_table")
print(pq.read_table("test_table.parquet").to_pandas())

# Write each chunk as a RecordBatch using write_batch
writer = pq.ParquetWriter('test_batch.parquet', schema=schema)
for records in [records1, records2]:
    batch = pa.RecordBatch.from_pylist(records, schema=schema)
    writer.write_batch(batch)
writer.close()

print("show result for write_batch")
print(pq.read_table("test_batch.parquet").to_pandas())
```
```result
show result for write_table
     name      id        points
0     Bob  323234  [20, 12, 22]
1    Alex    1234          None
2  Niclas  123222   [21, 32, 1]
3    lena   12345          None
show result for write_batch
     name      id        points
0     Bob  323234  [20, 12, 22]
1    Alex    1234          None
2  Niclas  123222   [21, 32, 1]
3    lena   12345          None
```
I would like to split the data and write it incrementally to the same Parquet file to save memory. Should I use "write_table" or "write_batch"?
Also, what would be the best chunk size after splitting?
I would appreciate your response.
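For reference, this is the incremental-write pattern I have in mind (just a sketch: the chunk generator, the output file name, and the `row_group_size` value are placeholders, not my real data):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('id', pa.int32()),
    pa.field("points", pa.list_(pa.int32())),
])

def record_chunks():
    # Placeholder: in the real job this would read the source data
    # incrementally instead of holding everything in memory at once.
    yield [{"name": "Bob", "id": 323234, "points": [20, 12, 22]}]
    yield [{"name": "Niclas", "id": 123222, "points": [21, 32, 1]}]

# Use the writer as a context manager so the file footer is written
# even if an exception occurs part-way through.
with pq.ParquetWriter('chunked.parquet', schema=schema) as writer:
    for records in record_chunks():
        table = pa.Table.from_pylist(records, schema=schema)
        # row_group_size caps the rows per row group; the value here is an arbitrary guess.
        writer.write_table(table, row_group_size=64 * 1024)
```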