CytoShahar opened a new pull request, #48086:
URL: https://github.com/apache/arrow/pull/48086

   ### Rationale for this change
   
   Arrow currently always URL-encodes Hive partition values when writing 
datasets (e.g., spaces become `%20`, slashes become `%2F`). This behavior:
   - Cannot be disabled, even for local filesystems where special characters 
are valid
   - Creates incompatibility with non-Arrow tools expecting unencoded directory 
names
   - Makes partition directories difficult to read and debug
   - Causes issues when URIs are already encoded by service providers
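
One way the last point can go wrong is double encoding. The sketch below uses Python's standard `urllib.parse.quote` as a stand-in for Arrow's escaping (an assumption; Arrow's `UriEscape()` may treat some characters differently), and the sample value is hypothetical.

```python
from urllib.parse import quote

# Hypothetical partition value that an upstream service has already URL-encoded.
already_encoded = "Product%20A"

# Encoding again escapes the percent sign itself, so "%20" turns into "%2520".
double_encoded = quote(already_encoded, safe="")
print(f"category={double_encoded}")
```

A reader that decodes only once would then see `Product%20A` rather than `Product A`.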
   
   As reported in #41618, users working with local filesystems need 
human-readable directory names (e.g., `category=Product A` instead of 
`category=Product%20A`) while maintaining compatibility with existing Arrow 
workflows.
   
   ### What changes are included in this PR?
   
   Added a new optional boolean parameter `url_encode_hive_values` (default 
`true`) to the R and Python APIs to control URL encoding of Hive-style 
partition values, backed by a change in the C++ core:
   
   **C++ Core** (`cpp/src/arrow/dataset/partition.cc`):
   - Modified `HivePartitioning::FormatValues()` to conditionally apply 
`UriEscape()` based on `segment_encoding()`
   - When `SegmentEncoding::None` is set, partition values are used as-is
   - When `SegmentEncoding::Uri` is set (default), maintains existing URL 
encoding behavior
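
The conditional described above can be modeled in a few lines of Python. This is only an illustrative sketch, not the C++ implementation: `format_hive_segment` and its `url_encode` flag are hypothetical names, and `urllib.parse.quote` stands in for `UriEscape()`.

```python
from urllib.parse import quote


def format_hive_segment(name: str, value: str, url_encode: bool = True) -> str:
    """Build one `name=value` directory segment for a Hive-style path.

    `url_encode=True` models SegmentEncoding::Uri (the default);
    `url_encode=False` models SegmentEncoding::None.
    """
    if url_encode:
        # safe="" makes quote() escape "/" as well, which it keeps by default.
        value = quote(value, safe="")
    return f"{name}={value}"


print(format_hive_segment("category", "Product A"))         # encoded
print(format_hive_segment("category", "Product A", False))  # as-is
```

Note that with encoding disabled, a value containing `/` is passed through unchanged, and most filesystems would interpret it as a nested path.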
   
   **R API** (`r/R/dataset-write.R`):
   - Added `url_encode_hive_values = TRUE` parameter to `write_dataset()`, 
`write_csv_dataset()`, `write_tsv_dataset()`, `write_delim_dataset()`
   - Sets `segment_encoding` parameter when creating `HivePartitioning` objects
   - Defaults to `TRUE` to maintain backward compatibility
   
   **Python API** (`python/pyarrow/dataset.py`):
   - Added `url_encode_hive_values = True` parameter to `write_dataset()`
   - Modified `_ensure_write_partitioning()` to handle the parameter for all 
partitioning input types
   - Creates `HivePartitioning` objects with appropriate `segment_encoding`
   - Defaults to `True` to maintain backward compatibility
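
The mapping from the boolean flag to a segment encoding might look roughly like the sketch below. The helper name `resolve_segment_encoding` is hypothetical and is not taken from the PR; the actual logic lives in `_ensure_write_partitioning()` and may be structured differently.

```python
def resolve_segment_encoding(url_encode_hive_values: bool = True) -> str:
    """Translate the boolean flag into a segment-encoding name.

    "uri" keeps the existing URL-encoding behavior; "none" writes
    partition values as-is.
    """
    return "uri" if url_encode_hive_values else "none"


print(resolve_segment_encoding())       # default: encoding stays enabled
print(resolve_segment_encoding(False))  # opt out of encoding
```

The returned string is the kind of value that `pyarrow.dataset.HivePartitioning` accepts for its existing `segment_encoding` argument.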
   
   The implementation leverages the existing `segment_encoding` parameter in 
`HivePartitioning`, requiring no changes to core C++ data structures.
   
   ### Are these changes tested?
   
   Yes. Tests were added for C++, R, and Python:
   
   **C++ Tests** (`cpp/src/arrow/dataset/partition_test.cc`):
   - Added `WriteHiveWithSlashesInValuesDisableUrlEncoding` test
   - Verifies that partition values with spaces, slashes, ampersands, and 
percent signs are written without URL encoding when `SegmentEncoding::None` is 
set
   - All existing partition tests continue to pass, ensuring backward 
compatibility
   
   **R Tests** (`r/tests/testthat/test-dataset-write.R`):
   - Added a comprehensive test covering special characters: space, slash, 
percent, plus, ampersand, equals, question mark
   - Validates directory names are correctly encoded/unencoded based on 
parameter value
   - Verifies data integrity is maintained across both encoding modes
   - Tests with CSV, TSV, and Parquet formats
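
For reference, the special characters listed above map to the following percent-encoded forms. This sketch uses Python's standard library rather than Arrow itself, so it shows generic URL encoding only; the sample values are made up.

```python
from urllib.parse import quote, unquote

# One hypothetical value per special character named in the R test.
SPECIAL_VALUES = ["a b", "a/b", "a%b", "a+b", "a&b", "a=b", "a?b"]

for value in SPECIAL_VALUES:
    encoded = quote(value, safe="")
    # Decoding must give back the original value, otherwise data would change.
    assert unquote(encoded) == value
    print(f"{value!r} -> {encoded!r}")
```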
   
   **Python Tests** (`python/pyarrow/tests/test_dataset.py`):
   - Added `test_hive_partitioning_url_encoding()` test
   - Tests both URL encoding enabled (default) and disabled (new feature)
   - Tests with explicitly created `HivePartitioning` objects and string 
partition specs
   - Validates directory names and data integrity
   
   ### Are there any user-facing changes?
   
   Yes, but fully backward compatible:
   
   **New Parameter**: Users can now optionally disable URL encoding in 
Hive-style partitioning:
   
   **R Example**:
   ```r
   # Default behavior (URL encoding enabled) - UNCHANGED
   write_dataset(data, "path", partitioning = "category",
                 hive_style = TRUE)  # url_encode_hive_values defaults to TRUE
   # Creates: category=Product%20A/
   
   # New behavior (URL encoding disabled)
   write_dataset(data, "path", partitioning = "category",
                 hive_style = TRUE, url_encode_hive_values = FALSE)
   # Creates: category=Product A/
   ```
   
   **Python Example**:
   ```python
   import pyarrow.dataset as ds

   # Default behavior (URL encoding enabled) - UNCHANGED
   # url_encode_hive_values defaults to True
   ds.write_dataset(table, "path", format="parquet", partitioning=["category"],
                    partitioning_flavor="hive")
   # Creates: category=Product%20A/
   
   # New behavior (URL encoding disabled)
   ds.write_dataset(table, "path", format="parquet", partitioning=["category"],
                    partitioning_flavor="hive", url_encode_hive_values=False)
   # Creates: category=Product A/
   ```
   
   **Backward Compatibility**:
   - Default behavior unchanged: `url_encode_hive_values` defaults to `true`, 
maintaining existing URL encoding
   - All existing code continues to work without modification
   - Only affects Hive-style partitioning, not directory partitioning
   - Reading datasets works with both encoded and unencoded partition values
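
The last point can be illustrated with a generic segment parser. This is a sketch under assumptions, not Arrow's reader: `parse_hive_segment` and its `url_decode` flag are hypothetical, and `urllib.parse.unquote` stands in for Arrow's decoding.

```python
from urllib.parse import unquote


def parse_hive_segment(segment: str, url_decode: bool = True) -> tuple[str, str]:
    """Split a `name=value` directory segment, optionally URL-decoding the value."""
    # Split on the first "=" only, so values that contain "=" stay intact.
    name, _, value = segment.partition("=")
    if url_decode:
        value = unquote(value)
    return name, value


# Encoded and unencoded directory names yield the same value here.
print(parse_hive_segment("category=Product%20A"))
print(parse_hive_segment("category=Product A"))
```

In this generic model, an unencoded value that happens to contain a percent escape (for example `100%25`) is altered by decoding, so such data would need to be read with decoding turned off.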
   
   Closes #41618
   

