kato1208 opened a new issue, #13142:
URL: https://github.com/apache/arrow/issues/13142
What is the difference between the "write_table" and "write_batch" methods of ParquetWriter?
The following code compares "write_table" and "write_batch", but they appear to behave the same.
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('id', pa.int32()),
    pa.field("points", pa.list_(pa.int32())),
])

records1 = [
    {
        "name": "Bob",
        "id": 323234,
        "points": [20, 12, 22],
    },
    {
        "name": "Alex",
        "id": 1234
    }
]

records2 = [
    {
        "name": "Niclas",
        "id": 123222,
        "points": [21, 32, 1],
    },
    {
        "name": "lena",
        "id": 12345
    }
]

# Write each chunk as a Table using write_table
writer = pq.ParquetWriter('test_table.parquet', schema=schema)
for records in [records1, records2]:
    table = pa.Table.from_pylist(records, schema=schema)
    writer.write_table(table)
writer.close()

print("show result for write_table")
print(pq.read_table("test_table.parquet").to_pandas())

# Write each chunk as a RecordBatch using write_batch
writer = pq.ParquetWriter('test_batch.parquet', schema=schema)
for records in [records1, records2]:
    batch = pa.RecordBatch.from_pylist(records, schema=schema)
    writer.write_batch(batch)
writer.close()

print("show result for write_batch")
print(pq.read_table("test_batch.parquet").to_pandas())
```
```result
show result for write_table
     name      id        points
0     Bob  323234  [20, 12, 22]
1    Alex    1234          None
2  Niclas  123222   [21, 32, 1]
3    lena   12345          None
show result for write_batch
     name      id        points
0     Bob  323234  [20, 12, 22]
1    Alex    1234          None
2  Niclas  123222   [21, 32, 1]
3    lena   12345          None
```
I would like to split the data and write it incrementally to the same Parquet file to save memory. Should I use "write_table" or "write_batch"?
Also, what would be the best chunk size after splitting?
I would appreciate your response.
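For reference, this is the incremental-write pattern I have in mind (just a sketch: the chunk generator, the output file name, and the `row_group_size` value are placeholders, not my real data):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('id', pa.int32()),
    pa.field("points", pa.list_(pa.int32())),
])

def record_chunks():
    # Placeholder: in the real job this would read the source data
    # incrementally instead of holding everything in memory at once.
    yield [{"name": "Bob", "id": 323234, "points": [20, 12, 22]}]
    yield [{"name": "Niclas", "id": 123222, "points": [21, 32, 1]}]

# Use the writer as a context manager so the file footer is written
# even if an exception occurs part-way through.
with pq.ParquetWriter('chunked.parquet', schema=schema) as writer:
    for records in record_chunks():
        table = pa.Table.from_pylist(records, schema=schema)
        # row_group_size caps the rows per row group; the value here is an arbitrary guess.
        writer.write_table(table, row_group_size=64 * 1024)
```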