[ https://issues.apache.org/jira/browse/TIKA-4748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085897#comment-18085897 ]
ASF GitHub Bot commented on TIKA-4748:
--------------------------------------
Copilot commented on code in PR #2864:
URL: https://github.com/apache/tika/pull/2864#discussion_r3351244669
##########
tika-app/src/main/java/org/apache/tika/cli/XmlToJsonConfigConverter.java:
##########
@@ -367,11 +368,51 @@ private static Map<String, Object> convertParserElement(Element parserElement,
config.put("exclude", excludes);
}
+ if ("pdf-parser".equals(componentName)) {
+ // 4.x PDFParserConfig groups OCR settings under a nested "ocr" object
+ // (OcrConfig); the legacy flat ocr* keys were removed.
+ nestOcrParams(config);
+ }
+
Map<String, Object> result = new LinkedHashMap<>();
result.put(componentName, config);
return result;
}
+ // Maps the legacy flat PDFParser ocr* params to their nested OcrConfig keys.
+ private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.ofEntries(
+ Map.entry("ocrStrategy", "strategy"),
+ Map.entry("ocrStrategyAuto", "strategyAuto"),
+ Map.entry("ocrRenderingStrategy", "renderingStrategy"),
+ Map.entry("ocrImageFormat", "imageFormat"),
+ Map.entry("ocrImageType", "imageType"),
+ Map.entry("ocrDPI", "dpi"),
+ Map.entry("ocrImageQuality", "imageQuality"),
+ Map.entry("ocrMaxImagePixels", "maxImagePixels"),
+ Map.entry("ocrMaxPagesToOcr", "maxPagesToOcr"));
+
+ /**
+ * Moves the legacy flat {@code ocr*} PDFParser params (e.g. {@code ocrStrategy},
+ * {@code ocrDPI}) into the nested {@code "ocr"} object used by 4.x
+ * {@code PDFParserConfig} ({@code OcrConfig}). The flat {@code ocr*} JSON keys were
+ * removed in 4.x, so a verbatim copy would no longer load.
+ */
+ private static void nestOcrParams(Map<String, Object> config) {
+ Map<String, Object> ocr = new LinkedHashMap<>();
+ Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
+ while (it.hasNext()) {
+ Map.Entry<String, Object> entry = it.next();
+ String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
+ if (nestedKey != null) {
+ ocr.put(nestedKey, entry.getValue());
+ it.remove();
+ }
+ }
+ if (!ocr.isEmpty()) {
+ config.put("ocr", ocr);
+ }
+ }
Review Comment:
When nesting legacy flat ocr* params, the converter overwrites any existing
"ocr" object in the pdf-parser config: config.put("ocr", ocr) replaces whatever
was already stored under that key. Because <param name="ocr" type="map"> is
supported, users can already provide a nested ocr map; if they also include any
legacy ocr* params, this implementation replaces that map and its settings are
lost. Consider merging into the existing "ocr" map and filling only the missing
keys from the legacy params.
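One way the suggested merge could look is sketched below. This is a minimal illustration, not the code in the PR: the class name `OcrMergeSketch` is hypothetical, the key mapping is reduced to two of the entries from the diff, and it assumes that values already present in the nested "ocr" map should take precedence over legacy flat params. A non-map value stored under "ocr" is simply replaced, which is also an assumption.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of XmlToJsonConfigConverter.
final class OcrMergeSketch {

    // Subset of the legacy-to-nested mapping shown in the diff.
    private static final Map<String, String> OCR_PARAM_TO_NESTED_KEY = Map.of(
            "ocrStrategy", "strategy",
            "ocrDPI", "dpi");

    static void nestOcrParams(Map<String, Object> config) {
        Map<String, Object> ocr = new LinkedHashMap<>();

        // Start from the user-supplied nested "ocr" map, if there is one.
        Object existing = config.get("ocr");
        if (existing instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) existing).entrySet()) {
                ocr.put(String.valueOf(e.getKey()), e.getValue());
            }
        }

        // Move legacy flat params into the nested map, filling only keys
        // that the nested map does not already define.
        Iterator<Map.Entry<String, Object>> it = config.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Object> entry = it.next();
            String nestedKey = OCR_PARAM_TO_NESTED_KEY.get(entry.getKey());
            if (nestedKey != null) {
                ocr.putIfAbsent(nestedKey, entry.getValue());
                it.remove();
            }
        }

        if (!ocr.isEmpty()) {
            config.put("ocr", ocr);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> nested = new LinkedHashMap<>();
        nested.put("strategy", "AUTO");

        Map<String, Object> config = new LinkedHashMap<>();
        config.put("ocr", nested);
        config.put("ocrStrategy", "NO_OCR"); // conflicts with nested "strategy"
        config.put("ocrDPI", 300);           // missing from the nested map

        nestOcrParams(config);
        System.out.println(config);
    }
}
```

With this approach a config that mixes both styles keeps the nested "strategy" value and gains "dpi" from the legacy param, and the flat ocr* keys are still removed. Whether nested or legacy values should win on conflict is a design choice for the PR author; the sketch only follows the "fill missing keys" suggestion above.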
> Clean up pdf+ocr config in 4.x
> ------------------------------
>
> Key: TIKA-4748
> URL: https://issues.apache.org/jira/browse/TIKA-4748
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> [~birdya22] ran into a two-path config issue on TIKA-4747 in how we set ocr
> options in the pdfconfig. We should clean up our code to allow only a single
> (non-flat) option.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)