[jira] [Commented] (TIKA-4748) Clean up pdf+ocr config in 4.x

ASF GitHub Bot (Jira) Wed, 03 Jun 2026 12:20:26 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085897#comment-18085897
 ]


ASF GitHub Bot commented on TIKA-4748:
--------------------------------------

Copilot commented on code in PR #2864:
URL: https://github.com/apache/tika/pull/2864#discussion_r3351244669


##########
tika-app/src/main/java/org/apache/tika/cli/XmlToJsonConfigConverter.java:
##########
@@ -367,11 +368,51 @@ private static Map<String, Object> convertParserElement(Element parserElement,
             config.put("exclude", excludes);
         }
 
+        if ("pdf-parser".equals(componentName)) {
+            // 4.x PDFParserConfig groups OCR settings under a nested "ocr" object
+            // (OcrConfig); the legacy flat ocr* keys were removed.
+            nestOcrParams(config);
+        }
+
         Map<String, Object> result = new LinkedHashMap<>();
         result.put(componentName, config);
         return result;
     }
 
+    // Maps the legacy flat PDFParser ocr* params to their nested OcrConfig keys.
+    private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.ofEntries(
+            Map.entry("ocrStrategy", "strategy"),
+            Map.entry("ocrStrategyAuto", "strategyAuto"),
+            Map.entry("ocrRenderingStrategy", "renderingStrategy"),
+            Map.entry("ocrImageFormat", "imageFormat"),
+            Map.entry("ocrImageType", "imageType"),
+            Map.entry("ocrDPI", "dpi"),
+            Map.entry("ocrImageQuality", "imageQuality"),
+            Map.entry("ocrMaxImagePixels", "maxImagePixels"),
+            Map.entry("ocrMaxPagesToOcr", "maxPagesToOcr"));
+
+    /**
+     * Moves the legacy flat {@code ocr*} PDFParser params (e.g. {@code ocrStrategy},
+     * {@code ocrDPI}) into the nested {@code "ocr"} object used by 4.x
+     * {@code PDFParserConfig} ({@code OcrConfig}). The flat {@code ocr*} JSON keys were
+     * removed in 4.x, so a verbatim copy would no longer load.
+     */
+    private static void nestOcrParams(Map<String, Object> config) {
+        Map<String, Object> ocr = new LinkedHashMap<>();
+        Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
+        while (it.hasNext()) {
+            Map.Entry<String, Object> entry = it.next();
+            String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
+            if (nestedKey != null) {
+                ocr.put(nestedKey, entry.getValue());
+                it.remove();
+            }
+        }
+        if (!ocr.isEmpty()) {
+            config.put("ocr", ocr);
+        }
+    }

Review Comment:
   When nesting the legacy flat ocr* params, the converter overwrites any
   existing "ocr" object in the pdf-parser config. Because
   <param name="ocr" type="map"> is supported, users can already provide a
   nested ocr map; if they also include any legacy ocr* params, the final
   config.put("ocr", ocr) replaces the existing map and its settings are lost.
   Consider merging into the existing "ocr" map and filling only the missing
   keys from the legacy params.
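
   The suggested merge could look roughly like the sketch below. It is an
   illustration under assumptions, not the change that landed in the PR: the
   class name OcrParamMerger is hypothetical, the mapping is a three-entry
   subset of OCR_PARAM_TO_NESTED_KEY from the diff, and it assumes an existing
   "ocr" value that is not a Map can be discarded when legacy params are
   present.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone holder for the sketch; in the PR the method lives in
// XmlToJsonConfigConverter.
class OcrParamMerger {

    // Subset of the legacy-to-nested mapping shown in the diff.
    static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.of(
            "ocrStrategy", "strategy",
            "ocrDPI", "dpi",
            "ocrImageFormat", "imageFormat");

    static void nestOcrParams(Map<String, Object> config) {
        Map<String, Object> ocr = new LinkedHashMap<>();

        // Start from the nested "ocr" map the user already supplied, if any,
        // so that its settings survive the conversion.
        Object existing = config.get("ocr");
        if (existing instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) existing).entrySet()) {
                ocr.put(String.valueOf(e.getKey()), e.getValue());
            }
        }

        // Move legacy flat ocr* params into the nested map, but only where the
        // nested map has no value for that key yet (explicit nested keys win).
        Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Object> entry = it.next();
            String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
            if (nestedKey != null) {
                ocr.putIfAbsent(nestedKey, entry.getValue());
                it.remove();
            }
        }

        if (!ocr.isEmpty()) {
            config.put("ocr", ocr);
        }
    }
}
```

   With a config holding both ocr={strategy=AUTO} and the flat params
   ocrStrategy=NO_OCR and ocrDPI=300, this version keeps strategy=AUTO, adds
   dpi=300, and removes the two flat keys. Letting the nested value win is one
   possible precedence; the opposite choice would use put instead of
   putIfAbsent.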





> Clean up pdf+ocr config in 4.x
> ------------------------------
>
>                 Key: TIKA-4748
>                 URL: https://issues.apache.org/jira/browse/TIKA-4748
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> [~birdya22] ran into a two-path config issue on TIKA-4747 in how we set OCR
> options in the PDF config. We should clean up our code to allow only a single
> (non-flat) option.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
