westonpace commented on issue #35269:
URL: https://github.com/apache/arrow/issues/35269#issuecomment-1523819701
Something like this should work. Note that it might crash on pyarrow
11.0.0 (the currently released version) because of a write_dataset bug that
was introduced there. The fix should land in 12.0.0 (releasing soon), and the
example should also work on 10.0.0.
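If you are unsure which pyarrow version is installed, a minimal check like the
following sketch can be run first (the version prefix test and the warning text
are just illustrative, based on the regression described above):

```python
import pyarrow as pa

# Hedged sketch: warn about the write_dataset regression mentioned above,
# which affects the 11.0.x releases.
if pa.__version__.startswith("11.0"):
    print("pyarrow 11.0.x detected; the write_dataset call below may crash. "
          "Consider using 10.x or upgrading to 12.x once it is released.")
```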
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.compute as pc
# Create a table with one column of random 20-character strings and one
# column of incrementing integers
A, Z = np.array(["A","Z"]).view("int32")
LENGTH = 10_000_000
STRLEN = 20
np_arr = np.random.randint(
    low=A, high=Z, size=LENGTH * STRLEN, dtype="int32"
).view(f"U{STRLEN}")
pa_arr = pa.array(np_arr)
other_col = pa.array(range(LENGTH))
table = pa.Table.from_arrays([pa_arr, other_col], names=["strings", "numbers"])
# Write the table out. This will be our "source dataset". You already have this.
pq.write_table(table, "/tmp/source.parquet")
# Create a dataset object to represent our source dataset
my_dataset = ds.dataset(["/tmp/source.parquet"], format="parquet")
# Create a column map. We want to load all the columns as normal but we also
# want to add an additional dynamic column which is the first 2 characters of
# the long strings array.
columns = {}
for field in my_dataset.schema:
    columns[field.name] = pc.field(field.name)
columns["string_code"] = pc.utf8_slice_codeunits(pc.field("strings"), 0, 2)
# Use a scanner as input to write_dataset. This way we don't need to load the
# entire dataset into memory. Partition on our dynamic column.
ds.write_dataset(my_dataset.scanner(columns=columns), "/tmp/my_dataset",
                 partitioning=["string_code"], partitioning_flavor="hive",
                 format="parquet")
```
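For completeness, here is a minimal sketch of reading the partitioned output
back and filtering on the derived column. The paths and the hive flavor are
taken from the example above; the "AB" prefix value is purely illustrative.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Re-open the partitioned output written above, using hive-style partition
# discovery so string_code becomes a regular column of the dataset.
partitioned = ds.dataset("/tmp/my_dataset", format="parquet", partitioning="hive")

# The partition column can now be used for pruning: only the matching
# string_code=AB directory is scanned.
subset = partitioned.to_table(filter=pc.field("string_code") == "AB")
print(subset.num_rows)
```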