JonathanGiles opened a new pull request, #1181: URL: https://github.com/apache/arrow-java/pull/1181
In considering using this API, I was concerned by the heaviness of the dependencies, so I set about looking into the feasibility of removing them. This PR is not meant to be a final, complete solution, but a starting point for a discussion around the appetite for reducing the dependency size to make the library more palatable for developers building libraries (as I am with the Azure SDKs for Java).

**AI Disclosure:** I built this on my machine with the help of coding agents (Claude Opus 4.8) using the GitHub Copilot CLI tooling.

## What's Changed

Removes several heavy dependencies from **`arrow-vector`** by migrating its JSON handling to the lightweight `jackson-core` streaming API and the JDK's built-in `HexFormat`.

**Dependencies dropped from `arrow-vector`:**
- `com.fasterxml.jackson.core:jackson-databind`
- `com.fasterxml.jackson.core:jackson-annotations`
- `com.fasterxml.jackson.datatype:jackson-datatype-jsr310`
- `commons-codec:commons-codec`

`jackson-core` is retained (used for streaming JSON read/write).

**How:**
- `Schema`, `Field`, `DictionaryEncoding`, and `ArrowType` (+ generated subtypes) JSON (de)serialization rewritten from `ObjectMapper` + jackson annotations (`@JsonCreator` / `@JsonProperty` / `@JsonTypeInfo`) to `jackson-core` streaming (`JsonGenerator` / `JsonParser`), with two small internal helpers: `JsonValues` (parser→tree + typed extractors) and `JsonStringSerializer` (compact `toString()` JSON).
- `extension/OpaqueType` and the IPC `JsonFileReader` / `JsonFileWriter` migrated to the streaming API.
- Hex encoding/decoding in the IPC JSON files switched from commons-codec `Hex` to `java.util.HexFormat`.
- `Text` no longer carries a jackson `@JsonSerialize` annotation (and its inner `TextSerializer` is removed).

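
As a rough illustration of the `Hex` to `HexFormat` switch, the sketch below round-trips bytes through hex using only the JDK. `HexRoundTrip` and its methods are hypothetical names chosen for this example, not code from the PR; `java.util.HexFormat` requires Java 17 or later.

```java
import java.util.Arrays;
import java.util.HexFormat;

/** Hypothetical helper for illustration only; not part of arrow-vector. */
final class HexRoundTrip {
  // HexFormat.of() produces lowercase digits with no delimiter, prefix, or suffix.
  private static final HexFormat HEX = HexFormat.of();

  /** Encodes bytes as a lowercase hex string, two characters per byte. */
  static String encode(byte[] bytes) {
    return HEX.formatHex(bytes);
  }

  /**
   * Decodes a hex string back to bytes. Accepts upper- or lowercase digits and
   * throws IllegalArgumentException on odd length or non-hex characters.
   */
  static byte[] decode(String hex) {
    return HEX.parseHex(hex);
  }

  public static void main(String[] args) {
    byte[] data = {0x00, 0x0f, (byte) 0xa5, (byte) 0xff};
    String hex = encode(data);
    System.out.println(hex); // prints "000fa5ff"
    System.out.println(Arrays.equals(data, decode(hex))); // prints "true"
  }
}
```

Because `parseHex` accepts either case, files written with uppercase hex digits should still decode; whether any existing Arrow JSON files contain uppercase hex is not something this sketch checks.
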
**Compatibility preserved deliberately:**
- `JsonStringHashMap` / `JsonStringArrayList` had a `static ObjectMapper` field removed, which would have changed their implicitly-computed `serialVersionUID` and broken Java deserialization of objects written by older Arrow versions (e.g. blobs stored by H2, objects serialized by Spark). The original `serialVersionUID` values are now pinned explicitly to retain wire compatibility.
- These collections' `toString()` can contain `java.time` values (from temporal vectors' `getObject()`). The previous output used `JavaTimeModule`'s numeric form (e.g. `LocalDateTime` → `[2021,1,2,3,4,5]`, `Duration` → `90.000000000`). That exact output is reproduced in `JsonStringSerializer`, so `toString()` is byte-for-byte unchanged. A clearly-marked code block documents how to revert to native ISO-8601 output if desired.

**This contains breaking changes.**

## Breaking changes

| # | Change | Impact / migration |
|---|--------|--------------------|
| 1 | Removed public class `org.apache.arrow.vector.util.ObjectMapperFactory` | External callers should use their own `ObjectMapper` (add `jackson-databind` directly). It only configured `JavaTimeModule`; supply that module if you need `java.time` support. |
| 2 | Removed public inner class `Text.TextSerializer` (and `Text`'s `@JsonSerialize`) | If you serialized `Text` via an external jackson `ObjectMapper`, register a custom serializer. |
| 3 | Jackson annotations removed from `Schema` / `Field` / `DictionaryEncoding` / `ArrowType` | Code that (de)serialized these POJOs with an external `ObjectMapper` relying on Arrow's annotations will no longer work; use `Schema.fromJSON(String)` / `Schema.toJson()` (public API, unchanged signatures) instead. |
| 4 | Transitive dependencies removed | `jackson-databind`, `jackson-annotations`, `jackson-datatype-jsr310`, and `commons-codec` are no longer pulled in transitively via `arrow-vector`. Downstreams that relied on this transitively must declare them directly. |
| 5 | `module-info` `requires` reduced | `arrow-vector`'s module no longer `requires` `com.fasterxml.jackson.databind`, `…annotation`, `…jsr310`, or `org.apache.commons.codec`. |

## Non-breaking behavioral notes

- `toString()` of complex vectors containing `java.time` values is unchanged (legacy numeric format reproduced); `serialVersionUID` of `JsonStringHashMap` / `JsonStringArrayList` is preserved.
- No change to the Arrow IPC binary format or the integration-test JSON wire format (JSON object key ordering is not significant and is unaffected; `JsonFileWriter` emits raw epoch numbers as before).

## Testing

- `arrow-vector`: full suite **1131 tests pass**.
- `arrow-tools`: **13 pass** (exercises the Arrow↔JSON file round-trip).
- `adapter/jdbc`: **152 pass** (exercises `JsonStringHashMap` serialization + H2 blob deserialization).
- `spotless:check` + `checkstyle:check` clean.

Closes #418.
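
For readers unfamiliar with the `serialVersionUID` pinning mentioned under the compatibility notes, here is a minimal JDK-only sketch of the idea. `PinnedList` and `SerialUidDemo` are hypothetical stand-ins, not Arrow classes, and `1L` is a placeholder; the PR pins the values that older Arrow versions computed implicitly, which are not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.UncheckedIOException;
import java.util.ArrayList;

/** Hypothetical collection standing in for a class like JsonStringArrayList. */
final class PinnedList<E> extends ArrayList<E> {
  // Declaring the UID explicitly means later changes to the class's members
  // no longer alter it. 1L is a placeholder, not the value used by Arrow.
  private static final long serialVersionUID = 1L;
}

/** Hypothetical helpers for inspecting the UID and round-tripping an object. */
final class SerialUidDemo {
  /** Returns the UID Java serialization uses for the given Serializable class. */
  static long uidOf(Class<?> type) {
    return ObjectStreamClass.lookup(type).getSerialVersionUID();
  }

  /** Serializes an object to bytes with standard Java serialization. */
  static byte[] write(Object value) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(value);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buffer.toByteArray();
  }

  /** Deserializes bytes; fails with InvalidClassException if the UIDs differ. */
  static Object read(byte[] bytes) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    PinnedList<String> list = new PinnedList<>();
    list.add("a");
    System.out.println(uidOf(PinnedList.class)); // prints "1"
    System.out.println(list.equals(read(write(list)))); // prints "true"
  }
}
```

A check along the lines of `uidOf` could serve as a regression test that the pinned value never drifts, assuming the expected constant is recorded alongside it.
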
