westonpace commented on issue #35269:
URL: https://github.com/apache/arrow/issues/35269#issuecomment-1523819701
Something like this should work. Note that it might crash on pyarrow
11.0.0 (the currently released version) because of a write_dataset bug that
was introduced there. The fix should land in 12.0.0 (releasing soon), and the
example should also work on 10.0.0.
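If you are unsure which pyarrow version is installed, a minimal check like the
following sketch can be run first (the version prefix test and the warning text
are just illustrative, based on the regression described above):

```python
import pyarrow as pa

# Hedged sketch: warn about the write_dataset regression mentioned above,
# which affects the 11.0.x releases.
if pa.__version__.startswith("11.0"):
    print("pyarrow 11.0.x detected; the write_dataset call below may crash. "
          "Consider using 10.x or upgrading to 12.x once it is released.")
```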
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.compute as pc
# Create a table with one column of random 20-character strings and one
# column of incrementing integers
A, Z = np.array(["A","Z"]).view("int32")
LENGTH = 10_000_000
STRLEN = 20
np_arr = np.random.randint(
    low=A, high=Z, size=LENGTH * STRLEN, dtype="int32"
).view(f"U{STRLEN}")
pa_arr = pa.array(np_arr)
other_col = pa.array(range(LENGTH))
table = pa.Table.from_arrays([pa_arr, other_col], names=["strings", "numbers"])
# Write the table out. This will be our "source dataset". You already have this.
pq.write_table(table, "/tmp/source.parquet")
# Create a dataset object to represent our source dataset
my_dataset = ds.dataset(["/tmp/source.parquet"], format="parquet")
# Create a column map. We want to load all the columns as normal but we also
# want to add an additional dynamic column which is the first 2 characters of
# the long strings array.
columns = {}
for field in my_dataset.schema:
    columns[field.name] = pc.field(field.name)
columns["string_code"] = pc.utf8_slice_codeunits(pc.field("strings"), 0, 2)
# Use a scanner as input to write_dataset. This way we don't need to load the
# entire dataset into memory. Partition on our dynamic column.
ds.write_dataset(my_dataset.scanner(columns=columns), "/tmp/my_dataset",
                 partitioning=["string_code"], partitioning_flavor="hive",
                 format="parquet")
```
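For completeness, here is a minimal sketch of reading the partitioned output
back and filtering on the derived column. The paths and the hive flavor are
taken from the example above; the "AB" prefix value is purely illustrative.

```python
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Re-open the partitioned output written above, using hive-style partition
# discovery so string_code becomes a regular column of the dataset.
partitioned = ds.dataset("/tmp/my_dataset", format="parquet", partitioning="hive")

# The partition column can now be used for pruning: only the matching
# string_code=AB directory is scanned.
subset = partitioned.to_table(filter=pc.field("string_code") == "AB")
print(subset.num_rows)
```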