CytoShahar opened a new pull request, #48086:
URL: https://github.com/apache/arrow/pull/48086
### Rationale for this change
Arrow currently always URL-encodes Hive partition values when writing
datasets (e.g., spaces become `%20`, slashes become `%2F`). This behavior:
- Cannot be disabled, even for local filesystems where special characters
are valid
- Creates incompatibility with non-Arrow tools expecting unencoded directory
names
- Makes partition directories difficult to read and debug
- Causes issues when URIs are already encoded by service providers
As reported in #41618, users working with local filesystems need
human-readable directory names (e.g., `category=Product A` instead of
`category=Product%20A`) while maintaining compatibility with existing Arrow
workflows.
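The double-encoding concern from the list above can be illustrated with a small sketch. It uses Python's standard `urllib.parse.quote` as a stand-in for Arrow's escaping (an assumption; Arrow's `UriEscape()` may treat some characters differently), and `encode_value` is a hypothetical helper, not part of any Arrow API.

```python
from urllib.parse import quote


def encode_value(value: str) -> str:
    """Percent-encode a partition value (stand-in for Arrow's escaping)."""
    # safe="" means no reserved character is left unescaped, not even "/".
    return quote(value, safe="")


once = encode_value("Product A")
twice = encode_value(once)  # the value was already encoded upstream

print(once)   # Product%20A
print(twice)  # Product%2520A
```

Because the `%` of an already-encoded value is itself escaped to `%25`, a writer that always encodes cannot pass pre-encoded values through unchanged.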
### What changes are included in this PR?
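Added a new optional boolean parameter `url_encode_hive_values` (default
`TRUE` in R, `True` in Python) to control URL encoding of Hive-style partition
values when writing datasets. The R and Python APIs expose the parameter; the
C++ core honors the existing `segment_encoding` setting when formatting values: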
**C++ Core** (`cpp/src/arrow/dataset/partition.cc`):
- Modified `HivePartitioning::FormatValues()` to conditionally apply
`UriEscape()` based on `segment_encoding()`
- When `SegmentEncoding::None` is set, partition values are used as-is
- When `SegmentEncoding::Uri` is set (default), maintains existing URL
encoding behavior
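As a rough illustration of the conditional behavior described above, the following Python sketch mirrors the idea rather than the actual C++ code. `format_hive_segment` and its `url_encode` argument are hypothetical names, and `urllib.parse.quote` is only assumed to approximate `UriEscape()`.

```python
from urllib.parse import quote


def format_hive_segment(key: str, value: str, url_encode: bool = True) -> str:
    """Build one `key=value` directory name for a Hive-style partition."""
    # With encoding on, escape every reserved character, including "/".
    # With encoding off, the value is used as-is.
    encoded = quote(value, safe="") if url_encode else value
    return f"{key}={encoded}"


print(format_hive_segment("category", "Product A"))
# category=Product%20A
print(format_hive_segment("category", "Product A", url_encode=False))
# category=Product A
```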
**R API** (`r/R/dataset-write.R`):
- Added `url_encode_hive_values = TRUE` parameter to `write_dataset()`,
`write_csv_dataset()`, `write_tsv_dataset()`, `write_delim_dataset()`
- Sets `segment_encoding` parameter when creating `HivePartitioning` objects
- Defaults to `TRUE` to maintain backward compatibility
**Python API** (`python/pyarrow/dataset.py`):
- Added `url_encode_hive_values = True` parameter to `write_dataset()`
- Modified `_ensure_write_partitioning()` to handle the parameter for all
partitioning input types
- Creates `HivePartitioning` objects with appropriate `segment_encoding`
- Defaults to `True` to maintain backward compatibility
The implementation leverages the existing `segment_encoding` parameter in
`HivePartitioning`, requiring no changes to core C++ data structures.
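A minimal sketch of how the boolean flag might map onto `segment_encoding` in the Python layer. The helper name `segment_encoding_for` is hypothetical; the strings `"uri"` and `"none"` are the values `pyarrow.dataset.HivePartitioning` accepts for `segment_encoding`.

```python
def segment_encoding_for(url_encode_hive_values: bool = True) -> str:
    """Translate the boolean flag into a `segment_encoding` string."""
    return "uri" if url_encode_hive_values else "none"


# The result could then be passed to an existing constructor, for example:
#   ds.HivePartitioning(schema, segment_encoding=segment_encoding_for(False))
print(segment_encoding_for())       # uri
print(segment_encoding_for(False))  # none
```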
### Are these changes tested?
Yes, comprehensive test coverage across all three languages:
**C++ Tests** (`cpp/src/arrow/dataset/partition_test.cc`):
- Added `WriteHiveWithSlashesInValuesDisableUrlEncoding` test
- Verifies that partition values with spaces, slashes, ampersands, and
percent signs are written without URL encoding when `SegmentEncoding::None` is
set
- All existing partition tests continue to pass, ensuring backward
compatibility
**R Tests** (`r/tests/testthat/test-dataset-write.R`):
- Added comprehensive test covering special characters: space, slash,
percent, plus, ampersand, equals, question mark
- Validates directory names are correctly encoded/unencoded based on
parameter value
- Verifies data integrity is maintained across both encoding modes
- Tests with CSV, TSV, and Parquet formats
**Python Tests** (`python/pyarrow/tests/test_dataset.py`):
- Added `test_hive_partitioning_url_encoding()` test
- Tests both URL encoding enabled (default) and disabled (new feature)
- Tests with explicitly created `HivePartitioning` objects and string
partition specs
- Validates directory names and data integrity
### Are there any user-facing changes?
Yes, but fully backward compatible:
**New Parameter**: Users can now optionally disable URL encoding in
Hive-style partitioning:
**R Example**:
```r
# Default behavior (URL encoding enabled) - UNCHANGED
write_dataset(data, "path", partitioning = "category",
              hive_style = TRUE) # url_encode_hive_values defaults to TRUE
# Creates: category=Product%20A/

# New behavior (URL encoding disabled)
write_dataset(data, "path", partitioning = "category",
              hive_style = TRUE, url_encode_hive_values = FALSE)
# Creates: category=Product A/
```
**Python Example**:
```python
import pyarrow.dataset as ds

# Default behavior (URL encoding enabled) - UNCHANGED
# url_encode_hive_values defaults to True
ds.write_dataset(table, "path", partitioning=["category"],
                 partitioning_flavor="hive")
# Creates: category=Product%20A/

# New behavior (URL encoding disabled)
ds.write_dataset(table, "path", partitioning=["category"],
                 partitioning_flavor="hive", url_encode_hive_values=False)
# Creates: category=Product A/
```
**Backward Compatibility**:
- Default behavior unchanged: `url_encode_hive_values` defaults to `TRUE` in R
and `True` in Python, maintaining existing URL encoding
- All existing code continues to work without modification
- Only affects Hive-style partitioning, not directory partitioning
- Reading datasets works with both encoded and unencoded partition values
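
The read-side claim in the last bullet can be pictured with a small parsing sketch. It is an illustration, not Arrow's reader: `parse_hive_segment` and `url_decode` are hypothetical names, and `urllib.parse.unquote` stands in for Arrow's URI decoding.

```python
from typing import Tuple
from urllib.parse import unquote


def parse_hive_segment(segment: str, url_decode: bool = True) -> Tuple[str, str]:
    """Split a `key=value` directory name and optionally decode the value."""
    # Split on the first "=" only, so values containing "=" stay intact.
    key, _, raw = segment.partition("=")
    return key, unquote(raw) if url_decode else raw


print(parse_hive_segment("category=Product%20A"))  # ('category', 'Product A')
print(parse_hive_segment("category=Product A"))    # ('category', 'Product A')
```

In this sketch, an encoded and an unencoded directory name both yield the same partition value.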
Closes #41618