Andrea Cosentino created CAMEL-23456:
----------------------------------------

             Summary: camel-docling: Document title and type metadata not 
extracted reliably
                 Key: CAMEL-23456
                 URL: https://issues.apache.org/jira/browse/CAMEL-23456
             Project: Camel
          Issue Type: Bug
          Components: camel-docling
            Reporter: Andrea Cosentino


The {{EXTRACT_METADATA}} operation does not reliably populate the {{title}} and 
{{documentType}} fields on the returned {{DocumentMetadata}}. 
{{MetadataExtractionIT.java}} (lines 74-75) already records this as an open 
issue: the integration test asserts on these fields, but a TODO comment notes 
that the values are missing or incorrect.

h3. Reproduction
# Configure a docling-serve endpoint with {{operation=EXTRACT_METADATA}}
# Send a document that has a clear title and a known type (e.g., a PDF with 
explicit metadata)
# Inspect the {{DocumentMetadata}} object returned in the body

h3. Expected behavior
{{title}} reflects the document's title metadata; {{documentType}} reflects the 
document type as detected by docling.

h3. Actual behavior
Both fields are empty or null; the test in {{MetadataExtractionIT.java}} 
carries a TODO acknowledging this.

h3. Investigation hints
* The metadata is parsed from docling's JSON output in 
{{DoclingProducer.handleExtractMetadata()}}; verify the JSON path used to read 
these fields against the current docling-serve schema
* It is possible the field names or nesting changed in a recent docling release
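
One way to check the first hint is to probe a parsed docling-serve response for the field under several candidate paths and see which one resolves. The sketch below is a minimal, stdlib-only illustration and not the actual {{DoclingProducer}} code: the class name {{MetadataPathProbe}}, the nested-map representation of the JSON, and every path in {{titlePaths}} are assumptions made for the example, not the confirmed docling-serve schema.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of camel-docling.
class MetadataPathProbe {

    /** Walks a dot-separated path through nested maps; empty if a segment is missing or the value is blank. */
    static Optional<String> readPath(Map<String, Object> root, String path) {
        Object current = root;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return Optional.empty();
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        if (current instanceof String && !((String) current).trim().isEmpty()) {
            return Optional.of((String) current);
        }
        return Optional.empty();
    }

    /** Returns the value found under the first candidate path that resolves to a non-blank string. */
    static Optional<String> firstPresent(Map<String, Object> root, List<String> candidatePaths) {
        for (String path : candidatePaths) {
            Optional<String> value = readPath(root, path);
            if (value.isPresent()) {
                return value;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed response shape, for illustration only.
        Map<String, Object> response = Map.of(
                "document", Map.of("metadata", Map.of("title", "Annual Report")));

        // Assumed candidate paths; replace with the paths from the real schema.
        List<String> titlePaths = List.of("title", "metadata.title", "document.metadata.title");

        System.out.println(firstPresent(response, titlePaths).orElse("<missing>"));
    }
}
```

With the assumed response above, only the third candidate path resolves, so the program prints {{Annual Report}}. Running the same probe against a captured response from the current docling-serve release would show whether the path used in {{handleExtractMetadata()}} still matches.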

h3. Acceptance criteria
* {{title}} and {{documentType}} are populated when present in the source 
document
* The TODO comments in {{MetadataExtractionIT.java}} are removed and the 
assertions pass
* If the upstream docling format does not expose these fields, document the 
limitation and remove the fields from {{DocumentMetadata}} (rather than 
silently leaving them empty)
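
To keep empty fields from slipping through silently, a test could report exactly which of the two fields is missing. The following is a rough sketch under stated assumptions: {{MetadataStub}} is a hypothetical stand-in for the two fields discussed here, not the real {{DocumentMetadata}} class, and {{missingFields}} is an illustrative helper name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of camel-docling.
class MetadataCompletenessCheck {

    /** Hypothetical stand-in carrying only the two fields from this issue. */
    static final class MetadataStub {
        final String title;
        final String documentType;

        MetadataStub(String title, String documentType) {
            this.title = title;
            this.documentType = documentType;
        }
    }

    /** Treats null and whitespace-only strings as missing. */
    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    /** Lists the names of the fields that are null or blank, in a fixed order. */
    static List<String> missingFields(MetadataStub metadata) {
        List<String> missing = new ArrayList<>();
        if (isBlank(metadata.title)) {
            missing.add("title");
        }
        if (isBlank(metadata.documentType)) {
            missing.add("documentType");
        }
        return missing;
    }

    public static void main(String[] args) {
        // Title present, document type absent.
        System.out.println(missingFields(new MetadataStub("Annual Report", null)));
    }
}
```

The example prints {{[documentType]}}. An integration test could assert that the returned list is empty, so a failure message names the field that was not populated.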




--
This message was sent by Atlassian Jira
(v8.20.10#820010)