Copilot commented on code in PR #2864:
URL: https://github.com/apache/tika/pull/2864#discussion_r3351244669
##########
tika-app/src/main/java/org/apache/tika/cli/XmlToJsonConfigConverter.java:
##########
@@ -367,11 +368,51 @@ private static Map<String, Object> convertParserElement(Element parserElement,
config.put("exclude", excludes);
}
+ if ("pdf-parser".equals(componentName)) {
+ // 4.x PDFParserConfig groups OCR settings under a nested "ocr" object
+ // (OcrConfig); the legacy flat ocr* keys were removed.
+ nestOcrParams(config);
+ }
+
Map<String, Object> result = new LinkedHashMap<>();
result.put(componentName, config);
return result;
}
+ // Maps the legacy flat PDFParser ocr* params to their nested OcrConfig keys.
+ private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.ofEntries(
+ Map.entry("ocrStrategy", "strategy"),
+ Map.entry("ocrStrategyAuto", "strategyAuto"),
+ Map.entry("ocrRenderingStrategy", "renderingStrategy"),
+ Map.entry("ocrImageFormat", "imageFormat"),
+ Map.entry("ocrImageType", "imageType"),
+ Map.entry("ocrDPI", "dpi"),
+ Map.entry("ocrImageQuality", "imageQuality"),
+ Map.entry("ocrMaxImagePixels", "maxImagePixels"),
+ Map.entry("ocrMaxPagesToOcr", "maxPagesToOcr"));
+
+ /**
+ * Moves the legacy flat {@code ocr*} PDFParser params (e.g. {@code ocrStrategy},
+ * {@code ocrDPI}) into the nested {@code "ocr"} object used by 4.x
+ * {@code PDFParserConfig} ({@code OcrConfig}). The flat {@code ocr*} JSON keys were
+ * removed in 4.x, so a verbatim copy would no longer load.
+ */
+ private static void nestOcrParams(Map<String, Object> config) {
+ Map<String, Object> ocr = new LinkedHashMap<>();
+ Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
+ while (it.hasNext()) {
+ Map.Entry<String, Object> entry = it.next();
+ String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
+ if (nestedKey != null) {
+ ocr.put(nestedKey, entry.getValue());
+ it.remove();
+ }
+ }
+ if (!ocr.isEmpty()) {
+ config.put("ocr", ocr);
+ }
+ }
Review Comment:
When nesting legacy flat ocr* params, the converter overwrites any existing
"ocr" object in the pdf-parser config: the final config.put("ocr", ocr)
replaces whatever is already stored under that key. Because <param name="ocr"
type="map"> is supported, users can already provide a nested ocr map; if they
also include any legacy ocr* params, the new map replaces the existing one and
the settings in it are lost. Consider merging into the existing "ocr" map and
filling only the keys it is missing from the legacy params.
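
The merge suggested above could look roughly like the following. This is a minimal sketch, not the converter's actual code: the class name `OcrNestingSketch` is hypothetical, the mapping is a shortened subset of `OCR_PARAM_TO_NESTED_KEY` from the diff, and it assumes an existing `"ocr"` value is a `Map` with string keys whose explicit entries should win over legacy flat params. A non-map `"ocr"` value is not handled specially here and would still be replaced.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone class for illustration; in the PR this logic
// would live in XmlToJsonConfigConverter.nestOcrParams.
public class OcrNestingSketch {

    // Shortened subset of the legacy-to-nested key mapping shown in the diff.
    private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.of(
            "ocrStrategy", "strategy",
            "ocrDPI", "dpi",
            "ocrImageType", "imageType");

    @SuppressWarnings("unchecked")
    public static void nestOcrParams(Map<String, Object> config) {
        Map<String, Object> ocr = new LinkedHashMap<>();

        // Start from the user-supplied nested map, if there is one
        // (assumption: it is a Map with String keys).
        Object existing = config.get("ocr");
        if (existing instanceof Map) {
            ocr.putAll((Map<String, Object>) existing);
        }

        Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Object> entry = it.next();
            String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
            if (nestedKey != null) {
                // Fill only keys the nested map does not already define.
                ocr.putIfAbsent(nestedKey, entry.getValue());
                // The flat key is removed either way, since 4.x no longer loads it.
                it.remove();
            }
        }

        if (!ocr.isEmpty()) {
            config.put("ocr", ocr);
        }
    }
}
```

With this variant, a config that carries both a nested `ocr` map containing `strategy` and a legacy `ocrStrategy` param keeps the nested value, while a legacy `ocrDPI` with no nested counterpart is still moved to `ocr.dpi`. Whether the nested or the legacy value should take precedence on conflict is a design choice for the PR author; the sketch follows the "only fill missing keys" suggestion.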