This is an automated email from the ASF dual-hosted git repository. tallison pushed a commit to branch simplify-parse-context-serialization in repository https://gitbox.apache.org/repos/asf/tika.git
commit 9e4251b26bd893cae849a4c862fe99b43fec545d Author: tallison <[email protected]> AuthorDate: Wed Nov 26 16:14:45 2025 -0500 simplify parse context serialization --- PARSECONTEXT_CONFIG_IMPLEMENTATION.md | 152 ++++++++++++++++ .../serialization/FetchEmitTupleDeserializer.java | 9 +- .../tika/serialization/ConfigDeserializer.java | 129 ++++++++++++++ .../serialization/ParseContextDeserializer.java | 7 +- .../tika/serialization/ParseContextSerializer.java | 13 +- .../TestParseContextSerialization.java | 195 +++++++++++++++++++++ 6 files changed, 498 insertions(+), 7 deletions(-) diff --git a/PARSECONTEXT_CONFIG_IMPLEMENTATION.md b/PARSECONTEXT_CONFIG_IMPLEMENTATION.md new file mode 100644 index 000000000..4563548e5 --- /dev/null +++ b/PARSECONTEXT_CONFIG_IMPLEMENTATION.md @@ -0,0 +1,152 @@ +# ParseContext Configuration Implementation + +## Overview + +This implementation unifies the configuration format between `tika-config.json` and per-request configurations (like `FetchEmitTuple`). Users can now use the same friendly parser names (e.g., `"pdf-parser"`, `"html-parser"`) in both contexts. + +## Key Changes + +### 1. ParseContextSerializer (`tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextSerializer.java`) + +**Supports two serialization formats:** + +#### Legacy Format (backward compatible): +Objects set directly in ParseContext via `context.set(SomeClass.class, object)` are serialized under an `"objects"` field with `"_class"` type information: + +```json +{ + "objects": { + "org.apache.tika.metadata.filter.MetadataFilter": { + "_class": "org.apache.tika.metadata.filter.CompositeMetadataFilter", + "filters": [...] + } + } +} +``` + +#### New Friendly-Name Format: +Configurations stored in ConfigContainer are serialized as top-level fields using friendly names matching `tika-config.json`: + +```json +{ + "pdf-parser": { + "ocrStrategy": "AUTO", + "extractInlineImages": true + }, + "html-parser": { + "extractScripts": false + } +} +``` + +**Both formats can coexist in the same JSON.** + +### 2. ParseContextDeserializer (`tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextDeserializer.java`) + +**Deserializes both formats:** + +- **Legacy `"objects"` field**: Deserializes with security validation (class name whitelist) and stores objects directly in ParseContext +- **Friendly-name fields**: Stores as JSON strings in ConfigContainer for parsers to deserialize on-demand + +**Security features:** +- Validates class names against whitelist (only Tika classes, metadata-extractor, and safe Java types) +- Prevents deserialization attacks (CVE-2017-7525 style) + +### 3. ConfigDeserializer (`tika-serialization/src/main/java/org/apache/tika/serialization/ConfigDeserializer.java`) + +**New helper utility for parsers** to retrieve and deserialize their configuration from ConfigContainer. + +**Key features:** +- **Generic merging**: Automatically merges user config with parser defaults using Jackson's `readerForUpdating` +- **No per-config cloneAndUpdate needed**: Eliminates repetitive merge logic across different config classes +- **Optional dependency**: Parsers check for NoClassDefFoundError to fall back if tika-serialization is not on classpath + +**Usage in parsers:** + +```java +PDFParserConfig localConfig; +try { + // ConfigDeserializer automatically merges user config with defaultConfig + localConfig = ConfigDeserializer.getConfig( + context, "pdf-parser", PDFParserConfig.class, defaultConfig); +} catch (NoClassDefFoundError e) { + // tika-serialization not on classpath, fall back to direct ParseContext lookup + PDFParserConfig userConfig = context.get(PDFParserConfig.class); + localConfig = (userConfig != null) ? + defaultConfig.cloneAndUpdate(userConfig) : defaultConfig; +} +``` + +### 4. FetchEmitTupleDeserializer (already updated) + +Uses the secure `ParseContextDeserializer.readParseContext(jsonNode, mapper)` method. + +## User-Facing Changes + +### Before (users had to know ParseContext internals): + +```json +{ + "id": "myId", + "fetcher": "fs", + "fetchKey": "hello_world.xml", + "parseContext": { + "objects": { + "org.apache.tika.sax.HandlerConfig": { + "_class": "org.apache.tika.sax.HandlerConfig", + "type": "xml", + "parseMode": "rmeta" + } + } + } +} +``` + +### After (matches tika-config.json format): + +```json +{ + "id": "myId", + "fetcher": "fs", + "fetchKey": "hello_world.xml", + "parseContext": { + "pdf-parser": { + "ocrStrategy": "AUTO", + "extractInlineImages": true + }, + "html-parser": { + "extractScripts": false + } + } +} +``` + +## Testing + +Comprehensive tests added to `TestParseContextSerialization.java`: + +1. ✅ `testFriendlyNameFormat()` - New friendly-name format serialization/deserialization +2. ✅ `testLegacyObjectsFormat()` - Legacy objects format still works +3. ✅ `testMixedFormat()` - Both formats coexist +4. ✅ `testConfigDeserializerHelper()` - ConfigDeserializer utility works +5. ✅ `testDeserializeFriendlyNameFromJSON()` - Raw JSON parsing +6. ✅ `testDeserializeMixedFromJSON()` - Mixed format JSON parsing +7. ✅ `testBasic()` - Original test still passes + +## Benefits + +1. **Unified Configuration**: Same friendly names across `tika-config.json` and per-request configs +2. **Cleaner API**: Users don't need to know about `_class`, `objects`, or fully-qualified class names +3. **Backward Compatible**: Legacy format still works +4. **Generic Merging**: No more per-config `cloneAndUpdate` methods needed +5. **Secure**: Class name validation prevents deserialization attacks +6. **No Dependencies**: tika-parsers-standard doesn't need tika-serialization dependency +7. **Parser Responsibility**: Each parser deserializes its own config when needed + +## Next Steps + +To use this in parsers: + +1. Update parser code to call `ConfigDeserializer.getConfig()` with try/catch for NoClassDefFoundError +2. Document the friendly parser names in configuration documentation +3. Update examples to show the new friendly-name format diff --git a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/serialization/FetchEmitTupleDeserializer.java b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/serialization/FetchEmitTupleDeserializer.java index 018a18893..7fc49d054 100644 --- a/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/serialization/FetchEmitTupleDeserializer.java +++ b/tika-pipes/tika-pipes-core/src/main/java/org/apache/tika/pipes/core/serialization/FetchEmitTupleDeserializer.java @@ -36,6 +36,7 @@ import com.fasterxml.jackson.core.JsonParser; import com.fasterxml.jackson.databind.DeserializationContext; import com.fasterxml.jackson.databind.JsonDeserializer; import com.fasterxml.jackson.databind.JsonNode; +import com.fasterxml.jackson.databind.ObjectMapper; import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.ParseContext; @@ -48,6 +49,7 @@ public class FetchEmitTupleDeserializer extends JsonDeserializer<FetchEmitTuple> @Override public FetchEmitTuple deserialize(JsonParser jsonParser, DeserializationContext deserializationContext) throws IOException, JacksonException { + ObjectMapper mapper = (ObjectMapper) jsonParser.getCodec(); JsonNode root = jsonParser.readValueAsTree(); String id = readVal(ID, root, null, true); @@ -58,8 +60,13 @@ public class FetchEmitTupleDeserializer extends JsonDeserializer<FetchEmitTuple> long fetchRangeStart = readLong(FETCH_RANGE_START, root, -1l, false); long fetchRangeEnd = readLong(FETCH_RANGE_END, root, -1l, false); Metadata metadata = readMetadata(root); + + // Deserialize ParseContext with security validation JsonNode parseContextNode = root.get(PARSE_CONTEXT); - ParseContext parseContext = parseContextNode == null ? new ParseContext() : ParseContextDeserializer.readParseContext(parseContextNode); + ParseContext parseContext = parseContextNode == null ? + new ParseContext() : + ParseContextDeserializer.readParseContext(parseContextNode, mapper); + FetchEmitTuple.ON_PARSE_EXCEPTION onParseException = readOnParseException(root); return new FetchEmitTuple(id, new FetchKey(fetcherId, fetchKey, fetchRangeStart, fetchRangeEnd), diff --git a/tika-serialization/src/main/java/org/apache/tika/serialization/ConfigDeserializer.java b/tika-serialization/src/main/java/org/apache/tika/serialization/ConfigDeserializer.java new file mode 100644 index 000000000..c583eb21c --- /dev/null +++ b/tika-serialization/src/main/java/org/apache/tika/serialization/ConfigDeserializer.java @@ -0,0 +1,129 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika.serialization; + +import java.io.IOException; + +import com.fasterxml.jackson.databind.ObjectMapper; + +import org.apache.tika.config.ConfigContainer; +import org.apache.tika.parser.ParseContext; + +/** + * Helper utility for parsers to deserialize their configuration from ConfigContainer. + * <p> + * This allows parsers to retrieve their configuration using the same friendly names + * as in tika-config.json (e.g., "pdf-parser", "html-parser") from per-request + * configurations sent via FetchEmitTuple or other serialization mechanisms. + * <p> + * The helper automatically merges user configuration with parser defaults, eliminating + * the need for config-specific cloneAndUpdate methods. + * <p> + * Example usage in a parser: + * <pre> + * // Try to get config from ConfigContainer if tika-serialization is available + * PDFParserConfig localConfig; + * try { + * // ConfigDeserializer automatically merges user config with defaultConfig + * localConfig = ConfigDeserializer.getConfig( + * context, "pdf-parser", PDFParserConfig.class, defaultConfig); + * } catch (NoClassDefFoundError e) { + * // tika-serialization not on classpath, fall back to direct ParseContext lookup + * PDFParserConfig userConfig = context.get(PDFParserConfig.class); + * localConfig = (userConfig != null) ? + * defaultConfig.cloneAndUpdate(userConfig) : defaultConfig; + * } + * </pre> + */ +public class ConfigDeserializer { + + private static final ObjectMapper MAPPER = new ObjectMapper(); + + /** + * Retrieves and deserializes a parser configuration from the ConfigContainer in ParseContext. + * If a default config is provided, the user config will be merged on top of it. + * + * @param context the parse context containing the ConfigContainer + * @param configKey the configuration key (e.g., "pdf-parser", "html-parser") + * @param configClass the configuration class to deserialize into + * @param defaultConfig optional default config to merge with user config (can be null) + * @param <T> the configuration type + * @return the merged configuration, the default config if no user config found, or null if neither exists + * @throws IOException if deserialization fails + */ + public static <T> T getConfig(ParseContext context, String configKey, Class<T> configClass, T defaultConfig) + throws IOException { + if (context == null) { + return defaultConfig; + } + + ConfigContainer configContainer = context.get(ConfigContainer.class); + if (configContainer == null) { + return defaultConfig; + } + + String configJson = configContainer.get(configKey).orElse(null); + if (configJson == null) { + return defaultConfig; + } + + // If there's a default config, merge the user config on top of it + if (defaultConfig != null) { + // Use Jackson's readerForUpdating to merge user config into a copy of the default + return MAPPER.readerForUpdating(defaultConfig).readValue(configJson); + } else { + // No default config, just deserialize the user config + return MAPPER.readValue(configJson, configClass); + } + } + + /** + * Retrieves and deserializes a parser configuration from the ConfigContainer in ParseContext. + * This version does not merge with any default config. + * + * @param context the parse context containing the ConfigContainer + * @param configKey the configuration key (e.g., "pdf-parser", "html-parser") + * @param configClass the configuration class to deserialize into + * @param <T> the configuration type + * @return the deserialized configuration, or null if not found + * @throws IOException if deserialization fails + */ + public static <T> T getConfig(ParseContext context, String configKey, Class<T> configClass) + throws IOException { + return getConfig(context, configKey, configClass, null); + } + + /** + * Checks if a configuration exists in the ConfigContainer. + * + * @param context the parse context + * @param configKey the configuration key to check + * @return true if the configuration exists + */ + public static boolean hasConfig(ParseContext context, String configKey) { + if (context == null) { + return false; + } + + ConfigContainer configContainer = context.get(ConfigContainer.class); + if (configContainer == null) { + return false; + } + + return configContainer.get(configKey).isPresent(); + } +} diff --git a/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextDeserializer.java b/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextDeserializer.java index 2d7acaace..162c687be 100644 --- a/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextDeserializer.java +++ b/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextDeserializer.java @@ -105,7 +105,7 @@ public class ParseContextDeserializer extends JsonDeserializer<ParseContext> { ParseContext parseContext = new ParseContext(); - // Deserialize objects from "objects" field + // Handle legacy "objects" field - deserialize directly into ParseContext if (contextNode.has("objects")) { JsonNode objectsNode = contextNode.get("objects"); for (Map.Entry<String, JsonNode> entry : objectsNode.properties()) { @@ -133,7 +133,9 @@ public class ParseContextDeserializer extends JsonDeserializer<ParseContext> { } } - // Deserialize ConfigContainer from top-level fields (excluding "objects") + // Store all non-"objects" fields as named configurations in ConfigContainer + // This allows parsers to look up their config by friendly name (e.g., "pdf-parser") + // matching the same format used in tika-config.json ConfigContainer configContainer = null; for (Iterator<String> it = contextNode.fieldNames(); it.hasNext(); ) { String fieldName = it.next(); @@ -162,4 +164,5 @@ public class ParseContextDeserializer extends JsonDeserializer<ParseContext> { } return valNode.asText(); } + } diff --git a/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextSerializer.java b/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextSerializer.java index 206a955d0..9497901e6 100644 --- a/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextSerializer.java +++ b/tika-serialization/src/main/java/org/apache/tika/serialization/ParseContextSerializer.java @@ -47,7 +47,8 @@ public class ParseContextSerializer extends JsonSerializer<ParseContext> { Map<String, Object> contextMap = parseContext.getContextMap(); ConfigContainer configContainer = parseContext.get(ConfigContainer.class); - // Write non-ConfigContainer objects under "objects" field + // Serialize objects stored directly in ParseContext (legacy format) + // These are objects set via context.set(SomeClass.class, someObject) boolean hasNonConfigObjects = contextMap.size() > (configContainer != null ? 1 : 0); if (hasNonConfigObjects) { jsonGenerator.writeFieldName("objects"); @@ -74,7 +75,7 @@ public class ParseContextSerializer extends JsonSerializer<ParseContext> { jsonGenerator.writeStringField("_class", value.getClass().getName()); // Serialize object properties using Jackson - com.fasterxml.jackson.databind.JsonNode tree = mapper.valueToTree(value); + JsonNode tree = mapper.valueToTree(value); var fields = tree.fields(); while (fields.hasNext()) { var field = fields.next(); @@ -88,10 +89,14 @@ public class ParseContextSerializer extends JsonSerializer<ParseContext> { jsonGenerator.writeEndObject(); } - // Write ConfigContainer fields directly as top-level properties + // Write ConfigContainer fields as top-level properties (new friendly-name format) + // Each field contains a JSON string representing a parser/component configuration + // using the same friendly names as tika-config.json (e.g., "pdf-parser", "html-parser") if (configContainer != null) { for (String key : configContainer.getKeys()) { - jsonGenerator.writeStringField(key, configContainer.get(key).get()); + jsonGenerator.writeFieldName(key); + // Write the JSON string as raw JSON (not as a quoted string) + jsonGenerator.writeRawValue(configContainer.get(key).get()); } } diff --git a/tika-serialization/src/test/java/org/apache/tika/serialization/TestParseContextSerialization.java b/tika-serialization/src/test/java/org/apache/tika/serialization/TestParseContextSerialization.java index 55546d7d3..acaddcca6 100644 --- a/tika-serialization/src/test/java/org/apache/tika/serialization/TestParseContextSerialization.java +++ b/tika-serialization/src/test/java/org/apache/tika/serialization/TestParseContextSerialization.java @@ -17,6 +17,7 @@ package org.apache.tika.serialization; import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotNull; import static org.junit.jupiter.api.Assertions.assertTrue; import java.io.StringWriter; @@ -25,6 +26,7 @@ import java.util.List; import com.fasterxml.jackson.core.JsonFactory; import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.databind.JsonNode; import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.module.SimpleModule; import org.junit.jupiter.api.Test; @@ -38,6 +40,25 @@ import org.apache.tika.parser.ParseContext; public class TestParseContextSerialization { + private ObjectMapper createMapper() { + ObjectMapper mapper = new ObjectMapper(); + SimpleModule module = new SimpleModule(); + module.addDeserializer(ParseContext.class, new ParseContextDeserializer()); + module.addSerializer(ParseContext.class, new ParseContextSerializer()); + mapper.registerModule(module); + return mapper; + } + + private String serializeParseContext(ParseContext pc) throws Exception { + try (Writer writer = new StringWriter()) { + try (JsonGenerator jsonGenerator = new JsonFactory().createGenerator(writer)) { + ParseContextSerializer serializer = new ParseContextSerializer(); + serializer.serialize(pc, jsonGenerator, null); + } + return writer.toString(); + } + } + @Test public void testBasic() throws Exception { @@ -70,4 +91,178 @@ public class TestParseContextSerialization { assertEquals(1, metadataFilters.size()); assertTrue(metadataFilters.get(0) instanceof DateNormalizingMetadataFilter); } + + @Test + public void testFriendlyNameFormat() throws Exception { + // Test the new friendly-name format matching tika-config.json + ParseContext pc = new ParseContext(); + ConfigContainer configContainer = new ConfigContainer(); + + // Add friendly-named configurations + configContainer.set("pdf-parser", "{\"ocrStrategy\":\"AUTO\",\"extractInlineImages\":true}"); + configContainer.set("html-parser", "{\"extractScripts\":false}"); + + pc.set(ConfigContainer.class, configContainer); + + String json = serializeParseContext(pc); + + // Verify JSON structure + ObjectMapper mapper = createMapper(); + JsonNode root = mapper.readTree(json); + + assertTrue(root.has("pdf-parser"), "Should have pdf-parser field"); + assertTrue(root.has("html-parser"), "Should have html-parser field"); + assertEquals("AUTO", root.get("pdf-parser").get("ocrStrategy").asText()); + assertEquals(false, root.get("html-parser").get("extractScripts").asBoolean()); + + // Verify round-trip + ParseContext deserialized = mapper.readValue(json, ParseContext.class); + ConfigContainer deserializedConfig = deserialized.get(ConfigContainer.class); + assertNotNull(deserializedConfig); + assertTrue(deserializedConfig.get("pdf-parser").isPresent()); + assertTrue(deserializedConfig.get("html-parser").isPresent()); + } + + @Test + public void testLegacyObjectsFormat() throws Exception { + // Test the legacy format with "objects" field + MetadataFilter metadataFilter = new CompositeMetadataFilter( + List.of(new DateNormalizingMetadataFilter())); + ParseContext pc = new ParseContext(); + pc.set(MetadataFilter.class, metadataFilter); + + String json = serializeParseContext(pc); + + // Verify JSON has "objects" field + ObjectMapper mapper = createMapper(); + JsonNode root = mapper.readTree(json); + assertTrue(root.has("objects"), "Should have objects field for legacy format"); + + // Verify round-trip + ParseContext deserialized = mapper.readValue(json, ParseContext.class); + MetadataFilter deserializedFilter = deserialized.get(MetadataFilter.class); + assertNotNull(deserializedFilter); + assertTrue(deserializedFilter instanceof CompositeMetadataFilter); + } + + @Test + public void testMixedFormat() throws Exception { + // Test that both legacy objects and new friendly names can coexist + MetadataFilter metadataFilter = new CompositeMetadataFilter( + List.of(new DateNormalizingMetadataFilter())); + ParseContext pc = new ParseContext(); + pc.set(MetadataFilter.class, metadataFilter); + + ConfigContainer configContainer = new ConfigContainer(); + configContainer.set("pdf-parser", "{\"ocrStrategy\":\"NO_OCR\"}"); + pc.set(ConfigContainer.class, configContainer); + + String json = serializeParseContext(pc); + + // Verify both formats are present + ObjectMapper mapper = createMapper(); + JsonNode root = mapper.readTree(json); + assertTrue(root.has("objects"), "Should have objects field"); + assertTrue(root.has("pdf-parser"), "Should have pdf-parser field"); + + // Verify round-trip + ParseContext deserialized = mapper.readValue(json, ParseContext.class); + + // Check legacy object + MetadataFilter deserializedFilter = deserialized.get(MetadataFilter.class); + assertNotNull(deserializedFilter); + assertTrue(deserializedFilter instanceof CompositeMetadataFilter); + + // Check friendly-name config + ConfigContainer deserializedConfig = deserialized.get(ConfigContainer.class); + assertNotNull(deserializedConfig); + assertTrue(deserializedConfig.get("pdf-parser").isPresent()); + } + + @Test + public void testConfigDeserializerHelper() throws Exception { + // Test the ConfigDeserializer helper utility + ParseContext pc = new ParseContext(); + ConfigContainer configContainer = new ConfigContainer(); + + // Simulate a PDFParserConfig as JSON + String pdfConfig = "{\"extractInlineImages\":true,\"ocrStrategy\":\"AUTO\"}"; + configContainer.set("pdf-parser", pdfConfig); + + pc.set(ConfigContainer.class, configContainer); + + // Test hasConfig + assertTrue(ConfigDeserializer.hasConfig(pc, "pdf-parser")); + + // Test getConfig with a simple JSON deserialization + // We can't use actual PDFParserConfig here since we don't have the dependency, + // but we can verify the JSON is retrieved correctly + String retrievedConfig = pc.get(ConfigContainer.class).get("pdf-parser").orElse(null); + assertNotNull(retrievedConfig); + assertTrue(retrievedConfig.contains("extractInlineImages")); + } + + @Test + public void testDeserializeFriendlyNameFromJSON() throws Exception { + // Test deserializing friendly-name format from raw JSON string + String json = """ + { + "pdf-parser": { + "ocrStrategy": "AUTO", + "extractInlineImages": true + }, + "html-parser": { + "extractScripts": false + } + } + """; + + ObjectMapper mapper = createMapper(); + ParseContext deserialized = mapper.readValue(json, ParseContext.class); + + ConfigContainer config = deserialized.get(ConfigContainer.class); + assertNotNull(config); + assertTrue(config.get("pdf-parser").isPresent()); + assertTrue(config.get("html-parser").isPresent()); + + // Verify the JSON content + String pdfParserJson = config.get("pdf-parser").get(); + assertTrue(pdfParserJson.contains("AUTO")); + assertTrue(pdfParserJson.contains("extractInlineImages")); + } + + @Test + public void testDeserializeMixedFromJSON() throws Exception { + // Test deserializing JSON with both legacy objects and friendly names + String json = """ + { + "objects": { + "org.apache.tika.metadata.filter.MetadataFilter": { + "_class": "org.apache.tika.metadata.filter.CompositeMetadataFilter", + "filters": [ + { + "_class": "org.apache.tika.metadata.filter.DateNormalizingMetadataFilter" + } + ] + } + }, + "pdf-parser": { + "ocrStrategy": "AUTO" + } + } + """; + + ObjectMapper mapper = createMapper(); + ParseContext deserialized = mapper.readValue(json, ParseContext.class); + + // Verify legacy object was deserialized + MetadataFilter filter = deserialized.get(MetadataFilter.class); + assertNotNull(filter); + assertTrue(filter instanceof CompositeMetadataFilter); + + // Verify friendly-name config was stored + ConfigContainer config = deserialized.get(ConfigContainer.class); + assertNotNull(config); + assertTrue(config.get("pdf-parser").isPresent()); + } }
