JoshRosen commented on PR #36885:
URL: https://github.com/apache/spark/pull/36885#issuecomment-1159547579
> Can I ask a newbie questions about version compatibility / optional field
handling? The existing code is capable of being lenient in some cases and
ignoring certain extra/missing fields. Does the translation from Json4s ->
Jackson preserve that behavior?
Yes, it should.
There are two different types of compatibility that we need to consider:
- **Compatibility for reading JSON produced by old versions of Spark:**
- When we add a new field to an existing event type, we need the read path
to treat that field as optional. Both the old and new code achieve this via the
`jsonOption` function, which allows us to control behavior when a field is
missing at read time.
- There are [some existing tests for
this](https://github.com/apache/spark/blob/a859dd25019715165ddb0defe3ddfd8e3cba866e/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala#L239-L241)
in `JsonProtocolSuite`. These tests remove fields from the JSON (simulating
JSON generated by old versions that don't emit those fields) and tests that
events still parse.
- There are also some older eventlogs which are used by history server
tests:
https://github.com/apache/spark/tree/master/core/src/test/resources/spark-events
- As long as we've been consistent about marking new fields with
`jsonOption`, then I think that the existing tests + code review to ensure that
we've preserved `jsonOption` should give us pretty high confidence in terms of
our ability to parse old logs.
- **Writing event JSON which is readable by old HistoryServer versions**:
- `JsonProtocolSuite` has tests which serialize events and check that
Json4s parses the generated JSON into the same tree as an expected JSON
response (stored in a string literal); see
[`assertJsonStringEquals`](https://github.com/apache/spark/blob/a859dd25019715165ddb0defe3ddfd8e3cba866e/core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala#L918-L929).
Right now this PR is replacing that use of Json4s with Jackson, but I will
update this to check that the trees are considered equal by _both_ Jackson and
Json4s (and will leave a comment in the code explaining why we have this
compatibility test).
- A stronger guarantee would be to check that the Json4s and Jackson
versions of `JsonProtocol` produce character-for-character identical output. I
am going to experiment with running a test where I copy the existing `Json4s`
version of `JsonProtocol` to a class named `OldJsonProtocol` then generate
random events (likely via ScalaCheck) and check that the Json output is exactly
identical. If this test passes in all cases then we have strong assurance that
our new output will be compatible with old history servers. I'm not planning to
check this test in because I don't want to have two versions of JsonProtocol
that need to be kept in sync when making updates (and I think the risk of
future changes introducing breakage _which wouldn't be caught by another test_
is small).
I think (but will confirm) that our Json4s code path is using Jackson under
the hood (behind the Json4s interfaces), which gives me further confidence that
things should be compatible; if the underlying json parsers were different then
we might have to worry about corner-cases in how numbers are parsed, etc., but
hopefully they're using the same low-level reader.
I'll post an update here once I've had a chance to run those more extensive
compatibility experiments.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]