xiangfu0 opened a new pull request, #18711:
URL: https://github.com/apache/pinot/pull/18711

   ## Motivation
   
   The default-value overloads of the `jsonPath*` scalar functions — 
`jsonPathString/Long/Double(obj, path, default)` and 
`jsonPathArrayDefaultEmpty(obj, path)` — parse their argument with Jackson and 
catch any exception to return the default. For columns that are **sometimes 
structured JSON and sometimes plain text** (a very common pattern for log 
`message` fields ingested via Vector/Fluent Bit/etc.), every plain-text value 
throws a `JsonParseException`.
   
   Throwing the exception is cheap; **`fillInStackTrace()` walking the deep 
ingestion stack is not**. Profiling a realtime log table (a 
`JSONPATHSTRING(message, '$.x', null)`-heavy transform config where `message` 
is plain text ~75% of the time) showed roughly **30% of consume-loop CPU spent 
in `fillInStackTrace()`** for exceptions that are immediately swallowed.
   
   ## Change
   
   Add a conservative, allocation-free pre-check `canExtractJsonPath(object)` 
used by the four default-value overloads. It returns the caller's default 
**without invoking the parser** when the input is `null` or a string whose 
first non-whitespace character cannot begin a JSON value.
   
   Any input that *could* begin a JSON value (`{`, `[`, `"`, `-`, a digit, or a 
`true`/`false`/`null` literal) is still handed to the parser, so the parser's 
behavior — including the exceptions raised by the **throwing** two-arg 
overloads, which are intentionally left untouched — is unchanged.
   
   This is **behavior-preserving**: the skipped inputs already returned the 
default, just via a thrown-and-caught exception. Existing `JsonFunctionsTest` 
(incl. `testJsonPathStringWithInvalidJson`, which asserts the two-arg overloads 
still throw on `"not json"`/`null`) passes unchanged.
   
   ## Validation
   
   JMH benchmark of the realtime transform pipeline 
(`TransformPipeline.processRow`) over a representative JSON log-ingestion 
workload (nested fields arriving as JSON-encoded strings; `message` 75% plain 
text):
   
   | | transform throughput | vs baseline |
   |---|---:|---:|
   | baseline | 21.4K rows/s | 1.0x |
   | **+ this fix** | **37.0K rows/s** | **1.7x** |
   | + parse-once transform configs (config-side) | 69.7K rows/s | 3.3x |
   
   End-to-end (transform + index into a mutable segment), the fix moves a log 
table from ~17K to ~27K rows/s with no config change.
   
   ## Testing
   
   - New unit test `testJsonPathDefaultVariantsSkipNonJson` covers plain-text, 
`null`, empty/whitespace, valid object/array (incl. leading whitespace), and 
the parser-passthrough cases (`"not json"`, bare scalars) for all four 
default-value overloads.
   - Full `JsonFunctionsTest` passes (38/38).
   
   ## Backward compatibility / release notes
   
   None required — observable behavior of the default-value overloads is 
preserved; the throwing overloads are unchanged. No config, SPI, or wire-format 
changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to