Dear Stanbol devs,

I have a question regarding the following scenario -- text extraction from
different file types.

- I am using the Tika engine to extract text from common documents
- I have implemented a Tesseract OCR engine to extract text from images

- Now I would like to use both in a single chain and do something with the
extracted text (the output is metadata).

The problem is that Tika extracts an empty text/plain ContentPart for
images, and my Tesseract engine extracts its own. Therefore I end up with
multiple text/plain ContentParts.

*How does one conveniently access all the text/plain content parts
extracted by previous engines?* ContentItemHelper.getBlob() only returns
the first ContentPart that matches the media type. I guess I could iterate
over all content parts manually and check their types, but this does not
seem very convenient.
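
To make the manual approach concrete, here is a rough sketch of what I have
in mind. It deliberately does not use the Stanbol API: Part and
nonEmptyTexts are hypothetical stand-ins I made up for illustration, so this
only shows the filtering logic (match the media type, skip empty parts), not
how a real engine would read the parts of a ContentItem.

```java
import java.util.ArrayList;
import java.util.List;

public class TextPartSketch {

    /** Hypothetical stand-in for one content part; NOT a Stanbol type. */
    static final class Part {
        final String mimeType;
        final String text;

        Part(String mimeType, String text) {
            this.mimeType = mimeType;
            this.text = text;
        }
    }

    /** Returns the text of every part with the given media type, skipping empty ones. */
    static List<String> nonEmptyTexts(List<Part> parts, String mimeType) {
        List<String> result = new ArrayList<>();
        for (Part part : parts) {
            boolean typeMatches = mimeType.equalsIgnoreCase(part.mimeType);
            boolean hasText = part.text != null && !part.text.trim().isEmpty();
            if (typeMatches && hasText) {
                result.add(part.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Part> parts = new ArrayList<>();
        parts.add(new Part("image/png", null));            // the original image
        parts.add(new Part("text/plain", ""));             // empty part, like the one Tika adds for images
        parts.add(new Part("text/plain", "scanned text")); // part added by the OCR engine
        System.out.println(nonEmptyTexts(parts, "text/plain")); // prints [scanned text]
    }
}
```

Every engine later in the chain would need the same loop, which is why I
am hoping there is a helper for this.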

Also, ideally I would like to have only a single text/plain ContentPart, as
the meaning is identical with both engines... The scenario feels very
typical, which makes me wonder if I am misunderstanding something very
obvious.

Your help is most appreciated, thank you for your time!
Michal
