Re: [PR] TIKA-4748 -- clean up ocr configuration within pdfparser [tika]

via GitHub Wed, 03 Jun 2026 12:19:41 -0700


Copilot commented on code in PR #2864:
URL: https://github.com/apache/tika/pull/2864#discussion_r3351244669



##########
tika-app/src/main/java/org/apache/tika/cli/XmlToJsonConfigConverter.java:
##########
@@ -367,11 +368,51 @@ private static Map<String, Object> convertParserElement(Element parserElement,
             config.put("exclude", excludes);
         }
 
+        if ("pdf-parser".equals(componentName)) {
+            // 4.x PDFParserConfig groups OCR settings under a nested "ocr" object
+            // (OcrConfig); the legacy flat ocr* keys were removed.
+            nestOcrParams(config);
+        }
+
         Map<String, Object> result = new LinkedHashMap<>();
         result.put(componentName, config);
         return result;
     }
 
+    // Maps the legacy flat PDFParser ocr* params to their nested OcrConfig keys.
+    private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.ofEntries(
+            Map.entry("ocrStrategy", "strategy"),
+            Map.entry("ocrStrategyAuto", "strategyAuto"),
+            Map.entry("ocrRenderingStrategy", "renderingStrategy"),
+            Map.entry("ocrImageFormat", "imageFormat"),
+            Map.entry("ocrImageType", "imageType"),
+            Map.entry("ocrDPI", "dpi"),
+            Map.entry("ocrImageQuality", "imageQuality"),
+            Map.entry("ocrMaxImagePixels", "maxImagePixels"),
+            Map.entry("ocrMaxPagesToOcr", "maxPagesToOcr"));
+
+    /**
+     * Moves the legacy flat {@code ocr*} PDFParser params (e.g. {@code ocrStrategy},
+     * {@code ocrDPI}) into the nested {@code "ocr"} object used by 4.x
+     * {@code PDFParserConfig} ({@code OcrConfig}). The flat {@code ocr*} JSON keys were
+     * removed in 4.x, so a verbatim copy would no longer load.
+     */
+    private static void nestOcrParams(Map<String, Object> config) {
+        Map<String, Object> ocr = new LinkedHashMap<>();
+        Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
+        while (it.hasNext()) {
+            Map.Entry<String, Object> entry = it.next();
+            String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
+            if (nestedKey != null) {
+                ocr.put(nestedKey, entry.getValue());
+                it.remove();
+            }
+        }
+        if (!ocr.isEmpty()) {
+            config.put("ocr", ocr);
+        }
+    }

Review Comment:
   The converter will overwrite any existing "ocr" object in the pdf-parser 
config when nesting legacy flat ocr* params. Because <param name="ocr" 
type="map"> is supported, users can already provide a nested ocr map; if they 
also include any legacy ocr* params, this implementation replaces the existing 
map and loses those settings. Consider merging into the existing "ocr" map and 
only filling missing keys from legacy params.
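
   A minimal standalone sketch of the suggested merge, not the actual patch. The
   class name `OcrNestingSketch`, the three-entry subset of the key mapping, and
   the sample values are assumptions for illustration. It copies any existing
   "ocr" map first and then uses `putIfAbsent`, so explicitly nested keys win
   over legacy flat ones:

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone class; in the PR this logic would live in
// XmlToJsonConfigConverter.nestOcrParams.
final class OcrNestingSketch {

    // Illustrative subset of the legacy-to-nested key mapping from the diff.
    private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.ofEntries(
            Map.entry("ocrStrategy", "strategy"),
            Map.entry("ocrDPI", "dpi"),
            Map.entry("ocrImageType", "imageType"));

    static void nestOcrParams(Map<String, Object> config) {
        Map<String, Object> ocr = new LinkedHashMap<>();

        // Start from a copy of any nested "ocr" map the user already supplied.
        Object existing = config.get("ocr");
        if (existing instanceof Map<?, ?>) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) existing).entrySet()) {
                ocr.put(String.valueOf(e.getKey()), e.getValue());
            }
        }

        // Move legacy flat keys in, but only where the nested map has no value yet.
        Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Object> entry = it.next();
            String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
            if (nestedKey != null) {
                ocr.putIfAbsent(nestedKey, entry.getValue());
                it.remove();
            }
        }

        if (!ocr.isEmpty()) {
            // Replaces the value under "ocr" with the merged map.
            config.put("ocr", ocr);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> existingOcr = new LinkedHashMap<>();
        existingOcr.put("strategy", "AUTO");

        Map<String, Object> config = new LinkedHashMap<>();
        config.put("ocr", existingOcr);
        config.put("ocrStrategy", "NO_OCR"); // conflicts with the nested value
        config.put("ocrDPI", 300);           // missing from the nested map

        nestOcrParams(config);
        System.out.println(config); // prints {ocr={strategy=AUTO, dpi=300}}
    }
}
```

   Whether nested keys or legacy keys should take precedence on conflict is a
   design choice; this sketch follows the "only fill missing keys" wording of
   the comment. A non-map value under "ocr" would still be replaced here.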



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

