rdtr opened a new pull request, #56184:
URL: https://github.com/apache/spark/pull/56184

   ### What changes were proposed in this pull request?
   
   `SQLQueryTestHelper.getSparkSettings` splits `--SET` directive values on 
every comma:
   
   ```scala
   val settingLines = comments.filter(_.startsWith("--SET 
")).map(_.substring(6))
   settingLines.flatMap(_.split(",").map { kv =>
     val (conf, value) = kv.span(_ != '=')
     conf.trim -> value.substring(1).trim
   })
   ```
   
   This breaks for any setting whose value itself contains a comma (e.g. 
`spark.sql.optimizer.excludedRules`, which is documented as a comma-separated 
list of rule names). The parser splits inside the value, the next segment has 
no `=`, and `value.substring(1)` then throws `StringIndexOutOfBoundsException`.
   
   This PR changes the split to only occur at commas that are immediately 
followed by what looks like a new `key=`:
   
   ```scala
   settingLines.flatMap(_.split(",(?=[\\w.]+=)").map { ... })
   ```
   
   The positive lookahead `(?=[\w.]+=)` peeks at the following characters and 
only splits when they form a config key followed by `=`, without consuming 
those characters. This preserves the documented multi-setting form `--SET 
k1=v1,k2=v2` while now also correctly handling values that themselves contain 
commas.
   
   ### Why are the changes needed?
   
   Several Spark configs take comma-separated lists as values — 
`spark.sql.optimizer.excludedRules` being the most common one. SQL test files 
currently cannot express such a value in a `--SET` directive at all; the parser 
fails before the test even runs, forcing authors to scope `--SET` down to one 
excluded rule per directive.
   
   This was surfaced while writing a per-file workaround in Apache Gluten 
([apache/incubator-gluten#12165](https://github.com/apache/incubator-gluten/pull/12165)),
 which reuses this test helper and needs to exclude multiple rules at once.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. Test-framework-only change. All existing `--SET` directives in the SQL 
test corpus continue to parse identically.
   
   ### How was this patch tested?
   
   Added `SQLQueryTestHelperSuite` (new file) with focused unit tests covering:
   
   - single `key=value`
   - documented multi-setting form `k1=v1,k2=v2` in a single `--SET`
   - multiple `--SET` lines
   - a value containing commas (e.g. `excludedRules=Rule1,Rule2`) — fails on 
`master`, passes after the fix
   - mixed: multi-setting where one of the values itself contains commas
   - non-`--SET` comments are ignored
   
   All 6 tests pass with the fix applied.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to