TomAugspurger commented on issue #47177:
URL: https://github.com/apache/arrow/issues/47177#issuecomment-3113339892
This script shows that using `pyarrow.parquet.write_to_dataset` works fine:
<details>

```python
import pyarrow as pa
import pyarrow.dataset
import pathlib
import pyarrow.parquet
import shutil
shutil.rmtree("string.parquet", ignore_errors=True)
shutil.rmtree("ds.parquet", ignore_errors=True)
t = pa.table(
    {
        "part": pa.array(["a", "a", "b", "b"], type=pa.large_string()),
        "col": [1, 2, 3, 4],
    }
)
root = pathlib.Path("string.parquet")
a = root / "a/data.parquet"
b = root / "b/data.parquet"
a.parent.mkdir(parents=True, exist_ok=True)
b.parent.mkdir(parents=True, exist_ok=True)
# Manually write the two parts to disk using `write_table`
pyarrow.parquet.write_table(t[:2], a)
pyarrow.parquet.write_table(t[2:], b)
source = list(root.glob("**/*.parquet"))

# Use write_to_dataset to let pyarrow handle the partitioning
ds_root = pathlib.Path("ds.parquet")
pyarrow.parquet.write_to_dataset(t, ds_root, partition_cols=["part"])

print("manual")
print(pyarrow.parquet.read_table(source[0]))

print("\n\ndataset")
# Reading the dataset root rediscovers the hive-style partitioning;
# the partition column comes back dictionary-encoded.
print(pyarrow.parquet.read_table(ds_root))
```
</details>
That prints out
```
manual
pyarrow.Table
part: large_string
col: int64
----
part: [["a","a"]]
col: [[1,2]]
dataset
pyarrow.Table
col: int64
part: dictionary<values=string, indices=int32, ordered=0>
----
col: [[1,2]]
part: [ -- dictionary:
["a"] -- indices:
[0,0]]
```
So the big difference is that `pyarrow.parquet.write_to_dataset(..., partition_cols=...)` will dictionary-encode the partition keys.
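If the dictionary encoding is unwanted, one workaround (a minimal sketch, not part of the script above) is to cast the partition column back to `large_string` after reading, or to pass an explicit partitioning schema so the column is never dictionary-encoded in the first place:

```python
import pyarrow as pa
import pyarrow.dataset
import pyarrow.parquet

# Reading the write_to_dataset output directory; "part" comes back
# dictionary-encoded, per the output above.
tbl = pyarrow.parquet.read_table("ds.parquet")

# Option 1: cast the dictionary column back to large_string after the fact.
idx = tbl.schema.get_field_index("part")
tbl = tbl.set_column(idx, "part", tbl.column("part").cast(pa.large_string()))

# Option 2: declare the partitioning schema up front so "part" is read as
# large_string directly.
part = pyarrow.dataset.partitioning(
    pa.schema([("part", pa.large_string())]), flavor="hive"
)
ds = pyarrow.dataset.dataset("ds.parquet", partitioning=part)
print(ds.to_table().schema)
```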