Andrea Cosentino updated CAMEL-23457:
-------------------------------------
Fix Version/s: 4.21.0
> camel-docling: OCR fails to detect footer regions in scanned images
> -------------------------------------------------------------------
>
> Key: CAMEL-23457
> URL: https://issues.apache.org/jira/browse/CAMEL-23457
> Project: Camel
> Issue Type: Bug
> Components: camel-docling
> Reporter: Andrea Cosentino
> Priority: Major
> Fix For: 4.21.0
>
>
> When OCR is enabled and applied to a scanned image with a clearly visible
> footer, the OCR result does not include the footer text. This is captured as
> an open issue in {{OcrExtractionIT.java}} (around line 181): the test
> contains a TODO noting "footer is not found by the ocr by Camel docling".
> h3. Reproduction
> # Send a scanned PDF or image with a known footer (e.g., page number,
> copyright line) to a docling endpoint with {{enableOCR=true}}
> # Inspect the extracted text
> h3. Expected behavior
> The footer text is present in the OCR output, possibly with positional/layout
> information when {{includeLayoutInfo=true}}.
> h3. Actual behavior
> Footer text is missing from the OCR output. The TODO in
> {{OcrExtractionIT.java}} acknowledges this gap.
> h3. Investigation hints
> * Verify whether the issue is in docling's OCR pipeline (for example, region
> detection cutting off the bottom of the page) or in how camel-docling
> configures the OCR call
> * Check whether different {{ocrEngine}} values change the result
> * Check whether {{forceOcr=true}} or {{doOcr=true}} produces a different
> outcome
> * Confirm against the latest docling-serve / docling CLI version
> h3. Acceptance criteria
> * Footer regions are reliably included in OCR output for typical document
> layouts
> * The TODO in {{OcrExtractionIT.java}} is removed and the test asserts on
> footer text
> * If the issue turns out to be upstream-only, file an upstream issue and
> document the workaround/limitation in {{docling-component.adoc}}
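
The second acceptance criterion asks the test to assert on footer text. A minimal sketch of such a check follows, assuming the OCR result is available as a plain String. The names FooterAssertion, containsFooter and normalize, and the sample text, are hypothetical and not taken from OcrExtractionIT.java or camel-docling. Whitespace and case are normalized because OCR output often wraps or re-spaces lines.

```java
import java.util.Locale;

public class FooterAssertion {

    // Collapse all whitespace runs to single spaces and lowercase the text,
    // so line breaks or extra spaces in the OCR output do not affect matching.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: true when the expected footer appears anywhere in
    // the extracted text after normalization. Null or empty input never matches.
    static boolean containsFooter(String ocrOutput, String expectedFooter) {
        if (ocrOutput == null || expectedFooter == null
                || expectedFooter.trim().isEmpty()) {
            return false;
        }
        return normalize(ocrOutput).contains(normalize(expectedFooter));
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for extracted OCR text.
        String withFooter =
                "Invoice 2024-001\nTotal: 42 EUR\n\nPage 1 of 3\n(c) Example  Corp";
        String withoutFooter = "Invoice 2024-001\nTotal: 42 EUR";

        System.out.println(containsFooter(withFooter, "page 1 of 3"));    // prints true
        System.out.println(containsFooter(withoutFooter, "page 1 of 3")); // prints false
    }
}
```

In an integration test, the boolean could be wrapped in the test framework's own assertion once the extracted text has been read from the exchange body. A substring match is deliberately lenient; a stricter check might also verify that the footer appears near the end of the page text.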