> The 95%+ use case for working with JSON for your average java coder is best done with data binding.
To be brave yet controversial: I'm not sure this is necessarily true. I
will elaborate and respond to the other points after a hot cocoa, but the
last point is part of why I think that tree-crawling needs _something_
better as an API to fit the bill. With my sketch, that set of requirements
would be represented as

    record Thing(List<Long> xs) {
        static Thing fromJson(Json json) {
            var defaultList = List.of(0L);
            return new Thing(Decoder.optionalNullableField(
                json,
                "xs",
                Decoder.oneOf(
                    Decoder.array(Decoder.oneOf(
                        x -> Long.parseLong(Decoder.string(x)),
                        Decoder::long_
                    )),
                    Decoder.null_(defaultList),
                    x -> List.of(Decoder.long_(x))
                ),
                defaultList
            ));
        }
    }

Which isn't amazing at first glance, but also

    {}
    {"xs": null}
    {"xs": 5}
    {"xs": [5]}
    {"xs": ["5"]}
    {"xs": [1, "2", "3"]}

these are some wildly varied structures. You could make a solid argument
that something which silently treats these all the same is a bad API, for
all the reasons you would consider it a good one.

On Thu, Dec 15, 2022 at 6:18 PM Johannes Lichtenberger <
lichtenberger.johan...@gmail.com> wrote:

> I'll have to read the whole thing, but are pure JSON parsers really the
> go-to for most people? I'm a big advocate of also providing something
> similar to XPath/XQuery, and that's IMHO JSONiq (90% XQuery). I might be
> biased, of course, as I'm working on Brackit[1] in my spare time (which
> is also a query compiler and intended to be used with proven
> optimizations by document stores / JSON stores), but it can also be used
> as an in-memory query engine.
>
> kind regards
> Johannes
>
> [1] https://github.com/sirixdb/brackit
>
> On Thu, Dec 15, 2022 at 23:03, Reinier Zwitserloot <
> rein...@zwitserloot.com> wrote:
>
>> A recent Advent-of-Code puzzle also made me double-check the support of
>> JSON in the java core libs, and it is indeed a curious situation that
>> the java core libs don't cater to it particularly well.
>>
>> However, I’m not seeing an easy way forward to try to close this hole
>> in the core library offerings.
>>
>> If you need to stream huge swaths of JSON, generally there’s a clear
>> unit size that you can just databind. Something like:
>>
>> String jsonStr = """
>>     {
>>         "version": 5,
>>         "data": [
>>             -- 1 million relatively small records in this list --
>>         ]
>>     }
>>     """;
>>
>> The usual swath of JSON parsers tend to support this (giving you a
>> stream of java instances created by databinding those small records one
>> by one), or if not, the best move forward is presumably to file a pull
>> request with those projects; the java.util.logging experiment shows
>> that trying to ‘core-librarize’ needs that the community at large
>> already fulfills with third-party deps isn’t a good move, especially if
>> the core library variant tries to oversimplify to avoid the trap of
>> being too opinionated (which core libs shouldn’t be). In other words,
>> the need for ’stream this JSON for me’ style APIs is even more exotic
>> than Ethan is suggesting.
>>
>> I see a fundamental problem here:
>>
>>    - The 95%+ use case for working with JSON for your average java
>>    coder is best done with data binding.
>>    - core libs doesn’t want to provide it, partly because it’s got a
>>    large design space, partly because the field’s already covered by
>>    GSON and Jackson-json; java.util.logging proves this doesn’t work.
>>    At least, I gather that’s what Ethan thinks, and I agree with this
>>    assessment.
>>    - A language that claims to be “batteries included” yet doesn’t ship
>>    with a JSON parser in this era is dubious, to say the least.
>>
>> I’m not sure how to square this circle.
>> Hence it feels like core-libs needs to hold some more fundamental
>> debates first:
>>
>>    - Maybe it’s time to state in a more or less official decree that
>>    well-established, large design space jobs will remain the purview of
>>    dependencies no matter how popular they get, unless being part of
>>    the core-libs adds something more fundamental that third-party deps
>>    cannot bring to the table (such as language integration), or the
>>    community standardizes on a single library (JSR310’s story, more or
>>    less). JSON parsing would qualify as ‘well-established’ (GSON and
>>    Jackson) and ‘large design space’, as Ethan pointed out.
>>    - Given that 99% of java projects, even really simple ones, start
>>    with maven/gradle and a list of deps, is that really a problem?
>>
>> I’m honestly not sure what the right answer is. On one hand, the npm
>> ecosystem seems to be doing very well even though their ‘batteries
>> included’ situation is an utter shambles. Then again, the notion that
>> your average nodejs project includes 10x+ more dependencies than other
>> languages is likely a significant part of the security clown fiesta
>> going on over there as far as 3rd-party deps are concerned, so by no
>> means should java just blindly emulate their solutions.
>>
>> I don’t like the idea of shipping a non-data-binding JSON API in the
>> core libs. The root issue with JSON is that you just can’t tell how to
>> interpret any given JSON token, because that’s not how JSON is used in
>> practice. What does 5 mean? Could be that I’m to take that as an int,
>> or as a double, or perhaps even as a j.t.Instant (epoch-millis), and
>> defaulting behaviour (similar to j.u.Map’s .getOrDefault) is *very*
>> convenient for parsing most JSON out there in the real world - omitting
>> k/v pairs whose value is still at its default is very common.
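[Editor's illustration: the getOrDefault-style lenient read described above, sketched over a plain Map-backed tree. Helper names (`longsOrDefault`, `asLong`) are hypothetical, not any existing library's API.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of lenient, defaulting reads over a Map-backed JSON tree.
// Helper names are hypothetical; this is not a proposed API.
final class Lenient {
    // A bare number, a numeric string, or a list of either is accepted;
    // a missing (or null) field falls back to the supplied default.
    static List<Long> longsOrDefault(Map<String, ?> obj, String key, List<Long> dflt) {
        Object v = obj.get(key);
        if (v == null) {
            return dflt; // missing field -> default
        }
        if (v instanceof List<?> items) {
            var out = new ArrayList<Long>();
            for (Object item : items) {
                out.add(asLong(item));
            }
            return List.copyOf(out);
        }
        return List.of(asLong(v)); // single value -> one-element list
    }

    static long asLong(Object v) {
        if (v instanceof Number n) {
            return n.longValue();
        }
        if (v instanceof String s) {
            // strings, because IEEE double rules mangle big longs in
            // javascript-eval style parsers
            return Long.parseLong(s);
        }
        throw new IllegalArgumentException("expected number or string, got: " + v);
    }
}
```

With that, `{}`, `{"xs": 5}`, `{"xs": "5"}`, and `{"xs": [1, "2"]}` all land on the same `List<Long>` - which is exactly the convenience being described, and also exactly the "silently treats these all the same" behaviour the reply at the top pushes back on.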
>> That’s what makes those databind libraries so enticing: instead of
>> trying to pattern match my way into this behaviour:
>>
>>    - If the element isn’t there at all, or is null, give me a
>>    list-of-longs with a single 0 in it.
>>    - If the element is a number, make me a list-of-longs with 1 value
>>    in it: that number, as a long.
>>    - If the element is a string, parse it into a long, then get me a
>>    list with this one long value (because IEEE double rules mean
>>    sometimes you have to put these things in string form or they get
>>    mangled by javascript-eval style parsers).
>>
>> And yet the above is quite common, and can easily be done by a
>> databinder, which sees you want a List<Long> for a field whose default
>> value is List.of(0L), and, armed with that knowledge, can translate the
>> JSON into java in that way.
>>
>> You don’t *need* databinding to cater to this idea: you could, for
>> example, have a jsonNode.asLong(123) method that would parse a string
>> if need be, even. But this has nothing to do with pattern matching
>> either.
>>
>> --Reinier Zwitserloot
>>
>> On 15 Dec 2022 at 21:30:17, Ethan McCue <et...@mccue.dev> wrote:
>>
>>> I'm writing this to drive some forward motion and to nerd-snipe those
>>> who know better than I do into putting their thoughts into words.
>>>
>>> There are three ways to process JSON[1]:
>>>
>>> - Streaming (Push or Pull)
>>> - Traversing a Tree (Realized or Lazy)
>>> - Declarative Databind (N ways)
>>>
>>> Of these, JEP-198 explicitly ruled out providing "JAXB style type safe
>>> data binding."
>>>
>>> No justification is given, but if I had to insert my own: mapping the
>>> Json model to/from the Java/JVM object model is a cursed combo of
>>>
>>> - Huge possible design space
>>> - Unpalatably large surface for backwards compatibility
>>> - Serialization! Boo![2]
>>>
>>> So for an artifact like the JDK, it probably doesn't make sense to
>>> include. That tracks.
>>> It won't make everyone happy - people like databind APIs - but it
>>> tracks.
>>>
>>> So for the "read flow" these are the things to figure out.
>>>
>>>                 | Should Provide? | Intended User(s) |
>>> ----------------+-----------------+------------------+
>>> Streaming Push  |                 |                  |
>>> ----------------+-----------------+------------------+
>>> Streaming Pull  |                 |                  |
>>> ----------------+-----------------+------------------+
>>> Realized Tree   |                 |                  |
>>> ----------------+-----------------+------------------+
>>> Lazy Tree       |                 |                  |
>>> ----------------+-----------------+------------------+
>>>
>>> At which point, we should talk about what "meets the needs of Java
>>> developers using JSON" implies.
>>>
>>> JSON is ubiquitous. Most kinds of software us schmucks write could
>>> have a reason to interact with it. The full set of "user personas"
>>> therefore isn't practical for me to talk about.[3]
>>>
>>> JSON documents, however, are not so varied.
>>>
>>> - There are small ones (1-10kb)
>>> - There are medium ones (10-1000kb)
>>> - There are big ones (1000kb-???)
>>>
>>> - There are shallow ones
>>> - There are deep ones
>>>
>>> So that feels like an easier direction to talk about it from.
>>>
>>> This repo[4] has some convenient toy examples of how some of those
>>> APIs look in libraries in the ecosystem - specifically the Streaming
>>> Pull and Realized Tree models.
>>>
>>> User r = new User();
>>> while (true) {
>>>     JsonToken token = reader.peek();
>>>     switch (token) {
>>>         case BEGIN_OBJECT:
>>>             reader.beginObject();
>>>             break;
>>>         case END_OBJECT:
>>>             reader.endObject();
>>>             return r;
>>>         case NAME:
>>>             String fieldname = reader.nextName();
>>>             switch (fieldname) {
>>>                 case "id":
>>>                     r.setId(reader.nextString());
>>>                     break;
>>>                 case "index":
>>>                     r.setIndex(reader.nextInt());
>>>                     break;
>>>                 ...
>>>                 case "friends":
>>>                     r.setFriends(new ArrayList<>());
>>>                     Friend f = null;
>>>                     carryOn = true;
>>>                     while (carryOn) {
>>>                         token = reader.peek();
>>>                         switch (token) {
>>>                             case BEGIN_ARRAY:
>>>                                 reader.beginArray();
>>>                                 break;
>>>                             case END_ARRAY:
>>>                                 reader.endArray();
>>>                                 carryOn = false;
>>>                                 break;
>>>                             case BEGIN_OBJECT:
>>>                                 reader.beginObject();
>>>                                 f = new Friend();
>>>                                 break;
>>>                             case END_OBJECT:
>>>                                 reader.endObject();
>>>                                 r.getFriends().add(f);
>>>                                 break;
>>>                             case NAME:
>>>                                 String fn = reader.nextName();
>>>                                 switch (fn) {
>>>                                     case "id":
>>>                                         f.setId(reader.nextString());
>>>                                         break;
>>>                                     case "name":
>>>                                         f.setName(reader.nextString());
>>>                                         break;
>>>                                 }
>>>                                 break;
>>>                         }
>>>                     }
>>>                     break;
>>>             }
>>>             break;
>>>     }
>>> }
>>>
>>> I think it's not hard to argue that the streaming APIs are brutalist.
>>> The above is Gson, but Jackson, moshi, etc. seem at least morally
>>> equivalent.
>>>
>>> It's hard to write, hard to write *correctly*, and there is a curious
>>> propensity towards pairing it with anemic, mutable models.
>>>
>>> That being said, it handles big documents and deep documents really
>>> well. It also performs pretty darn well and is good enough as a
>>> "fallback" when the intended user experience is through something like
>>> databind.
>>>
>>> So what could we do meaningfully better with the language we have
>>> today / will have tomorrow?
>>>
>>> - Sealed interfaces + pattern matching could give a nicer model for
>>>   tokens
>>>
>>> sealed interface JsonToken {
>>>     record Field(String name) implements JsonToken {}
>>>     record BeginArray() implements JsonToken {}
>>>     record EndArray() implements JsonToken {}
>>>     record BeginObject() implements JsonToken {}
>>>     record EndObject() implements JsonToken {}
>>>     // ...
>>> }
>>>
>>> // ...
>>>
>>> User r = new User();
>>> while (true) {
>>>     JsonToken token = reader.peek();
>>>     switch (token) {
>>>         case BeginObject __:
>>>             reader.beginObject();
>>>             break;
>>>         case EndObject __:
>>>             reader.endObject();
>>>             return r;
>>>         case Field("id"):
>>>             r.setId(reader.nextString());
>>>             break;
>>>         case Field("index"):
>>>             r.setIndex(reader.nextInt());
>>>             break;
>>>
>>>         // ...
>>>
>>>         case Field("friends"):
>>>             r.setFriends(new ArrayList<>());
>>>             Friend f = null;
>>>             carryOn = true;
>>>             while (carryOn) {
>>>                 token = reader.peek();
>>>                 switch (token) {
>>>                     // ...
>>>
>>> - Value classes can make it all more efficient
>>>
>>> sealed interface JsonToken {
>>>     value record Field(String name) implements JsonToken {}
>>>     value record BeginArray() implements JsonToken {}
>>>     value record EndArray() implements JsonToken {}
>>>     value record BeginObject() implements JsonToken {}
>>>     value record EndObject() implements JsonToken {}
>>>     // ...
>>> }
>>>
>>> - (Fun one) We can transform a simpler-to-write push parser into a
>>>   pull parser with coroutines
>>>
>>> This is just a toy we could play with while making something in the
>>> JDK. I'm pretty sure we could make a parser which feeds into something
>>> like
>>>
>>> interface Listener {
>>>     void onObjectStart();
>>>     void onObjectEnd();
>>>     void onArrayStart();
>>>     void onArrayEnd();
>>>     void onField(String name);
>>>     // ...
>>> }
>>>
>>> and invert a loop like
>>>
>>> while (true) {
>>>     char c = next();
>>>     switch (c) {
>>>         case '{':
>>>             listener.onObjectStart();
>>>             // ...
>>>         // ...
>>>     }
>>> }
>>>
>>> by putting a Coroutine.yield in the callback.
>>>
>>> That might be a meaningful simplification in code structure; I don't
>>> know enough to say.
>>>
>>> But, I think there are some hard questions, like:
>>>
>>> - Is the intent[5] to make the backing parser for ecosystem databind
>>>   APIs?
>>> - Is the intent that users who want to handle big/deep documents fall
>>>   back to this?
>>> - Are those new language features / conveniences enough to offset the
>>>   cost of committing to a new API?
>>> - To whom exactly does a low-level API provide value?
>>> - What benefit is standardization in the JDK?
>>>
>>> and just generally - who would be the consumer(s) of this?
>>>
>>> The other kind of API still on the table is a Tree. There are two ways
>>> to handle this:
>>>
>>> 1. Load it into `Object`. Use a bunch of instanceof checks/casts to
>>>    confirm what it actually is.
>>>
>>> Object v;
>>> User u = new User();
>>>
>>> if ((v = jso.get("id")) != null) {
>>>     u.setId((String) v);
>>> }
>>> if ((v = jso.get("index")) != null) {
>>>     u.setIndex(((Long) v).intValue());
>>> }
>>> if ((v = jso.get("guid")) != null) {
>>>     u.setGuid((String) v);
>>> }
>>> if ((v = jso.get("isActive")) != null) {
>>>     u.setIsActive((Boolean) v);
>>> }
>>> if ((v = jso.get("balance")) != null) {
>>>     u.setBalance((String) v);
>>> }
>>> // ...
>>> if ((v = jso.get("latitude")) != null) {
>>>     u.setLatitude(v instanceof BigDecimal
>>>         ? ((BigDecimal) v).doubleValue()
>>>         : (Double) v);
>>> }
>>> if ((v = jso.get("longitude")) != null) {
>>>     u.setLongitude(v instanceof BigDecimal
>>>         ? ((BigDecimal) v).doubleValue()
>>>         : (Double) v);
>>> }
>>> if ((v = jso.get("greeting")) != null) {
>>>     u.setGreeting((String) v);
>>> }
>>> if ((v = jso.get("favoriteFruit")) != null) {
>>>     u.setFavoriteFruit((String) v);
>>> }
>>> if ((v = jso.get("tags")) != null) {
>>>     List<Object> jsonarr = (List<Object>) v;
>>>     u.setTags(new ArrayList<>());
>>>     for (Object vi : jsonarr) {
>>>         u.getTags().add((String) vi);
>>>     }
>>> }
>>> if ((v = jso.get("friends")) != null) {
>>>     List<Object> jsonarr = (List<Object>) v;
>>>     u.setFriends(new ArrayList<>());
>>>     for (Object vi : jsonarr) {
>>>         Map<String, Object> jso0 = (Map<String, Object>) vi;
>>>         Friend f = new Friend();
>>>         f.setId((String) jso0.get("id"));
>>>         f.setName((String) jso0.get("name"));
>>>         u.getFriends().add(f);
>>>     }
>>> }
>>>
>>> 2. Have an explicit model for Json, and helper methods that do said
>>>    casts[6]
>>>
>>> this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
>>> JsonArray groups = jsonObject.getJsonArray("group");
>>> if (groups != null) {
>>>     int len = groups.size();
>>>     for (int i = 0; i < len; i++) {
>>>         JsonObject grp = groups.getJsonObject(i);
>>>         SNMPSetting grpSetting = readFromJson(grp);
>>>         String grpName = grp.getString("dbgroup", null);
>>>         if (grpName != null && grpSetting != null)
>>>             this.groupSettings.put(grpName, grpSetting);
>>>     }
>>> }
>>> JsonArray hosts = jsonObject.getJsonArray("host");
>>> if (hosts != null) {
>>>     int len = hosts.size();
>>>     for (int i = 0; i < len; i++) {
>>>         JsonObject host = hosts.getJsonObject(i);
>>>         SNMPSetting hostSetting = readFromJson(host);
>>>         String hostName = host.getString("dbhost", null);
>>>         if (hostName != null && hostSetting != null)
>>>             this.hostSettings.put(hostName, hostSetting);
>>>     }
>>> }
>>>
>>> I think what has become easier to represent in the language nowadays
>>> is that explicit model for Json. It's the 101 lesson of sealed
>>> interfaces.[7] It feels nice and clean.
>>>
>>> sealed interface Json {
>>>     final class Null implements Json {}
>>>     final class True implements Json {}
>>>     final class False implements Json {}
>>>     final class Array implements Json {}
>>>     final class Object implements Json {}
>>>     final class String implements Json {}
>>>     final class Number implements Json {}
>>> }
>>>
>>> And the cast-and-check approach is now more viable on account of
>>> pattern matching.
>>>
>>> if (jso.get("id") instanceof String v) {
>>>     u.setId(v);
>>> }
>>> if (jso.get("index") instanceof Long v) {
>>>     u.setIndex(v.intValue());
>>> }
>>> if (jso.get("guid") instanceof String v) {
>>>     u.setGuid(v);
>>> }
>>>
>>> // or
>>>
>>> if (jso.get("id") instanceof String id &&
>>>         jso.get("index") instanceof Long index &&
>>>         jso.get("guid") instanceof String guid) {
>>>     return new User(id, index, guid, ...); // look ma, no setters!
>>> }
>>>
>>> And on the horizon, again, is value types.
>>>
>>> But there are problems with this approach beyond the performance
>>> implications of loading into a tree.
>>>
>>> For one, all the code samples above have different behaviors around
>>> null keys and missing keys that are not obvious at first glance.
>>>
>>> This won't accept any null or missing fields:
>>>
>>> if (jso.get("id") instanceof String id &&
>>>         jso.get("index") instanceof Long index &&
>>>         jso.get("guid") instanceof String guid) {
>>>     return new User(id, index, guid, ...);
>>> }
>>>
>>> This will accept individual null or missing fields, but will also
>>> silently ignore fields with incorrect types:
>>>
>>> if (jso.get("id") instanceof String v) {
>>>     u.setId(v);
>>> }
>>> if (jso.get("index") instanceof Long v) {
>>>     u.setIndex(v.intValue());
>>> }
>>> if (jso.get("guid") instanceof String v) {
>>>     u.setGuid(v);
>>> }
>>>
>>> And, compared to databind - where there is information about the
>>> expected structure of the document and it's the job of the framework
>>> to assert that - I posit that the errors that would be encountered
>>> when writing code against this would be more like
>>>
>>> "something wrong with user"
>>>
>>> than
>>>
>>> "problem at users[5].name, expected string or null. got 5"
>>>
>>> Which feels unideal.
>>>
>>> One approach I find promising is something close to what Elm does with
>>> its decoders[8].
>>> Not just combining assertion and binding like what pattern matching
>>> with records allows, but including a scheme for bubbling/nesting
>>> errors.
>>>
>>> static String string(Json json) throws JsonDecodingException {
>>>     if (!(json instanceof Json.String jsonString)) {
>>>         throw JsonDecodingException.of(
>>>             "expected a string",
>>>             json
>>>         );
>>>     } else {
>>>         return jsonString.value();
>>>     }
>>> }
>>>
>>> static <T> T field(Json json, String fieldName, Decoder<? extends T> valueDecoder)
>>>         throws JsonDecodingException {
>>>     var jsonObject = object(json);
>>>     var value = jsonObject.get(fieldName);
>>>     if (value == null) {
>>>         throw JsonDecodingException.atField(
>>>             fieldName,
>>>             JsonDecodingException.of(
>>>                 "no value for field",
>>>                 json
>>>             )
>>>         );
>>>     } else {
>>>         try {
>>>             return valueDecoder.decode(value);
>>>         } catch (JsonDecodingException e) {
>>>             throw JsonDecodingException.atField(
>>>                 fieldName,
>>>                 e
>>>             );
>>>         } catch (Exception e) {
>>>             throw JsonDecodingException.atField(
>>>                 fieldName,
>>>                 JsonDecodingException.of(e, value)
>>>             );
>>>         }
>>>     }
>>> }
>>>
>>> Which I think has some benefits over the ways I've seen of working
>>> with trees.
>>>
>>> - It is declarative enough that folks who prefer databind might be
>>>   happy enough.
>>>
>>> static User fromJson(Json json) {
>>>     return new User(
>>>         Decoder.field(json, "id", Decoder::string),
>>>         Decoder.field(json, "index", Decoder::long_),
>>>         Decoder.field(json, "guid", Decoder::string)
>>>     );
>>> }
>>>
>>> // ...
>>>
>>> List<User> users = Decoders.array(json, User::fromJson);
>>>
>>> - Handling null and optional fields could be less easily conflated.
>>>
>>> Decoder.field(json, "id", Decoder::string);
>>>
>>> Decoder.nullableField(json, "id", Decoder::string);
>>>
>>> Decoder.optionalField(json, "id", Decoder::string);
>>>
>>> Decoder.optionalNullableField(json, "id", Decoder::string);
>>>
>>> - It composes well with user-defined classes.
>>>
>>> record Guid(String value) {
>>>     Guid {
>>>         // some assertions on the structure of value
>>>     }
>>> }
>>>
>>> Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));
>>>
>>> // or even
>>>
>>> record Guid(String value) {
>>>     Guid {
>>>         // some assertions on the structure of value
>>>     }
>>>
>>>     static Guid fromJson(Json json) {
>>>         return new Guid(Decoder.string(json));
>>>     }
>>> }
>>>
>>> Decoder.field(json, "guid", Guid::fromJson);
>>>
>>> - When something goes wrong, the API can handle the fiddlyness of
>>>   capturing information for feedback.
>>>
>>> In the code I've sketched out it's just what field/index things went
>>> wrong at. Potentially capturing metadata like row/col numbers of the
>>> source would be sensible too.
>>>
>>> It's just not reasonable to expect devs to do extra work to get that,
>>> and it's really nice to give it.
>>>
>>> There are also some downsides, like:
>>>
>>> - I do not know how compatible it would be with lazy trees.
>>>
>>> Lazy trees are the only way that a tree API could handle big or deep
>>> documents. The general concept as applied in libraries like
>>> json-tree[9] is to navigate without doing any work, and that clashes
>>> with wanting to instanceof-check the info at the current path.
>>>
>>> - It *almost* gives enough information to be a general schema
>>>   approach.
>>>
>>> If one field fails, that in the model throws an exception immediately.
>>> If an API should return "errors": [...], that is inconvenient to
>>> construct.
>>> - None of the existing popular libraries are doing this.
>>>
>>> The only mechanics that are strictly required to give this sort of API
>>> are lambdas. Those have been out for a decade. Yes, sealed interfaces
>>> make the data model prettier, but in concept you can build the same
>>> thing on top of anything.
>>>
>>> I could argue that this is because of the "cultural momentum" of
>>> databind or some other reason, but the fact remains that it isn't a
>>> proven-out approach.
>>>
>>> Writing Json libraries is a todo list[10]. There are a lot of bad
>>> ideas and this might be one of them.
>>>
>>> - Performance impact of so many instanceof checks.
>>>
>>> I've gotten a 4.2% slowdown compared to the "regular" tree code
>>> without the repeated casts. But that was with a parser that is 5x
>>> slower than Jackson's (using the same benchmark project as for the
>>> snippets). I think there could be reason to believe that the JIT does
>>> well enough with repeated instanceof checks to consider it.
>>>
>>> My current thinking is that - despite not solving for large or deep
>>> documents - starting with a really "dumb" realized tree API might be
>>> the right place to start for the read side of a potential incubator
>>> module.
>>>
>>> But regardless - this feels like a good time to start more concrete
>>> conversations.
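[Editor's illustration: the kind of "dumb" realized tree that last paragraph points at can be very small. A sketch only - the record names, shapes, and the `stringOrNull` helper are illustrative choices, not a proposal.]

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;

// A minimal "dumb" realized JSON tree as a sealed hierarchy.
// All names and representation choices here are illustrative.
sealed interface JsonValue {
    record JsonNull() implements JsonValue {}
    record JsonBool(boolean value) implements JsonValue {}
    record JsonString(String value) implements JsonValue {}
    record JsonNumber(BigDecimal value) implements JsonValue {}
    record JsonArray(List<JsonValue> items) implements JsonValue {}
    record JsonObject(Map<String, JsonValue> fields) implements JsonValue {}

    // Pattern matching keeps cast-and-check reads short; note this is the
    // lenient variant that quietly returns null for missing or mistyped
    // fields - exactly the conflation the decoder discussion calls out.
    static String stringOrNull(JsonValue json, String field) {
        if (json instanceof JsonObject obj
                && obj.fields().get(field) instanceof JsonString s) {
            return s.value();
        }
        return null;
    }
}
```

A handful of static helpers like `stringOrNull` is roughly where the "dumb" version stops; everything past that (lazy trees, decoders, error paths) is the design space the thread is mapping out.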
>>> I feel I should cap this email since I've reached the point of
>>> decoherence and haven't even mentioned the write side of things.
>>>
>>> [1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
>>> [2]: https://security.snyk.io/vuln/maven?search=jackson-databind
>>> [3]: I only know like 8 people
>>> [4]: https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
>>> [5]: When I say "intent", I do so knowing full well no one has been
>>> actively thinking of this for an entire Game of Thrones
>>> [6]: https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
>>> [7]: https://www.infoq.com/articles/data-oriented-programming-java/
>>> [8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
>>> [9]: https://github.com/jbee/json-tree
>>> [10]: https://stackoverflow.com/a/14442630/2948173
>>> [11]: In 30 days, JEP-198 will be recognizably PI days old for the 2nd
>>> time in its history.
>>> [12]: To me, the fact that it is still an open JEP is more a social
>>> convenience than anything. I could just as easily be writing this
>>> exact same email about TOML.
>>