xiangfu0 opened a new pull request, #18711:
URL: https://github.com/apache/pinot/pull/18711
## Motivation
The default-value overloads of the `jsonPath*` scalar functions —
`jsonPathString/Long/Double(obj, path, default)` and
`jsonPathArrayDefaultEmpty(obj, path)` — parse their argument with Jackson and
catch any exception to return the default. For columns that are **sometimes
structured JSON and sometimes plain text** (a very common pattern for log
`message` fields ingested via Vector/Fluent Bit/etc.), every plain-text value
throws a `JsonParseException`.
Throwing the exception is cheap; **`fillInStackTrace()` walking the deep
ingestion stack is not**. Profiling a realtime log table (a
`JSONPATHSTRING(message, '$.x', null)`-heavy transform config where `message`
is plain text ~75% of the time) showed roughly **30% of consume-loop CPU spent
in `fillInStackTrace()`** for exceptions that are immediately swallowed.
## Change
Add a conservative, allocation-free pre-check `canExtractJsonPath(object)`
used by the four default-value overloads. It returns the caller's default
**without invoking the parser** when the input is `null` or a string whose
first non-whitespace character cannot begin a JSON value.
Any input that *could* begin a JSON value (`{`, `[`, `"`, `-`, a digit, or a
`true`/`false`/`null` literal) is still handed to the parser, so the parser's
behavior — including the exceptions raised by the **throwing** two-arg
overloads, which are intentionally left untouched — is unchanged.
This is **behavior-preserving**: the skipped inputs already returned the
default, just via a thrown-and-caught exception. Existing `JsonFunctionsTest`
(incl. `testJsonPathStringWithInvalidJson`, which asserts the two-arg overloads
still throw on `"not json"`/`null`) passes unchanged.
## Validation
JMH benchmark of the realtime transform pipeline
(`TransformPipeline.processRow`) over a representative JSON log-ingestion
workload (nested fields arriving as JSON-encoded strings; `message` 75% plain
text):
| | transform throughput | vs baseline |
|---|---:|---:|
| baseline | 21.4K rows/s | 1.0x |
| **+ this fix** | **37.0K rows/s** | **1.7x** |
| + parse-once transform configs (config-side) | 69.7K rows/s | 3.3x |
End-to-end (transform + index into a mutable segment), the fix moves a log
table from ~17K to ~27K rows/s with no config change.
## Testing
- New unit test `testJsonPathDefaultVariantsSkipNonJson` covers plain-text,
`null`, empty/whitespace, valid object/array (incl. leading whitespace), and
the parser-passthrough cases (`"not json"`, bare scalars) for all four
default-value overloads.
- Full `JsonFunctionsTest` passes (38/38).
## Backward compatibility / release notes
None required — observable behavior of the default-value overloads is
preserved; the throwing overloads are unchanged. No config, SPI, or wire-format
changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]