TomAugspurger commented on issue #47177:
URL: https://github.com/apache/arrow/issues/47177#issuecomment-3113339892
This script shows that using `pyarrow.parquet.write_to_dataset` works fine:
<details>

```python
import pyarrow as pa
import pyarrow.dataset
import pathlib
import pyarrow.parquet
import shutil
shutil.rmtree("string.parquet", ignore_errors=True)
shutil.rmtree("ds.parquet", ignore_errors=True)
t = pa.table(
    {
        "part": pa.array(["a", "a", "b", "b"], type=pa.large_string()),
        "col": [1, 2, 3, 4],
    }
)
root = pathlib.Path("string.parquet")
a = root / "a/data.parquet"
b = root / "b/data.parquet"
a.parent.mkdir(parents=True, exist_ok=True)
b.parent.mkdir(parents=True, exist_ok=True)
# Manually write the two parts to disk using `write_table`
pyarrow.parquet.write_table(t[:2], a)
pyarrow.parquet.write_table(t[2:], b)
source = list(root.glob("**/*.parquet"))

# Use write_to_dataset to let pyarrow handle the partitioning
ds_root = pathlib.Path("ds.parquet")
pyarrow.parquet.write_to_dataset(t, ds_root, partition_cols=["part"])

print("manual")
print(pyarrow.parquet.read_table(source[0]))

print("\n\ndataset")
# Reading the dataset root rediscovers the hive-style partitioning;
# the partition column comes back dictionary-encoded.
print(pyarrow.parquet.read_table(ds_root))
```
</details>
That prints out
```
manual
pyarrow.Table
part: large_string
col: int64
----
part: [["a","a"]]
col: [[1,2]]
dataset
pyarrow.Table
col: int64
part: dictionary<values=string, indices=int32, ordered=0>
----
col: [[1,2]]
part: [ -- dictionary:
["a"] -- indices:
[0,0]]
```
So the big difference is that `pyarrow.parquet.write_to_dataset(..., partition_cols=...)` will dictionary-encode the partition keys.
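If the dictionary encoding is unwanted, one workaround (a minimal sketch, not part of the script above) is to cast the partition column back to `large_string` after reading, or to pass an explicit partitioning schema so the column is never dictionary-encoded in the first place:

```python
import pyarrow as pa
import pyarrow.dataset
import pyarrow.parquet

# Reading the write_to_dataset output directory; "part" comes back
# dictionary-encoded, per the output above.
tbl = pyarrow.parquet.read_table("ds.parquet")

# Option 1: cast the dictionary column back to large_string after the fact.
idx = tbl.schema.get_field_index("part")
tbl = tbl.set_column(idx, "part", tbl.column("part").cast(pa.large_string()))

# Option 2: declare the partitioning schema up front so "part" is read as
# large_string directly.
part = pyarrow.dataset.partitioning(
    pa.schema([("part", pa.large_string())]), flavor="hive"
)
ds = pyarrow.dataset.dataset("ds.parquet", partitioning=part)
print(ds.to_table().schema)
```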