Dear Stanbol devs,
I have a question regarding the following scenario -- text extraction from
different file types.
- I am using the Tika engine to extract text from common documents
- I have implemented a Tesseract OCR engine to extract text from images
- Now I would like to use both in a single chain and do something with the
extracted text (the output is metadata).
The problem is that Tika produces an empty text/plain ContentPart for images,
and my Tesseract engine produces its own. Therefore I end up with multiple
text/plain content parts.
*How does one conveniently access all the text/plain content parts
extracted by previous engines?* ContentItemHelper.getBlob() only returns
the first ContentPart that matches the media type. I guess I could iterate
over all content parts manually and check their types, but this does not
seem very convenient.
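To make the manual approach concrete, here is a rough sketch of what I have in mind. It deliberately does not use the Stanbol API: `Part` is a hypothetical stand-in for a content part (just a media type and the already-decoded text), and `TextPartCollector`, `baseType`, `textParts` and `mergedText` are names made up for illustration. It only shows the filtering logic: skip parts that are not text/plain, skip the empty ones, and join whatever is left.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TextPartCollector {

    /** Hypothetical stand-in for a content part; NOT a Stanbol type. */
    static final class Part {
        final String mimeType;
        final String text;

        Part(String mimeType, String text) {
            this.mimeType = mimeType;
            this.text = text;
        }
    }

    /** Strips parameters such as "; charset=UTF-8" and normalises case. */
    static String baseType(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon < 0 ? mimeType : mimeType.substring(0, semicolon);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns every text/plain part with non-blank text, in original order. */
    static List<Part> textParts(List<Part> parts) {
        List<Part> result = new ArrayList<>();
        for (Part part : parts) {
            if ("text/plain".equals(baseType(part.mimeType))
                    && part.text != null && !part.text.trim().isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    /** Joins the text of all non-empty text/plain parts, one per line. */
    static String mergedText(List<Part> parts) {
        StringBuilder merged = new StringBuilder();
        for (Part part : textParts(parts)) {
            if (merged.length() > 0) {
                merged.append('\n');
            }
            merged.append(part.text.trim());
        }
        return merged.toString();
    }
}
```

In a real engine the list would be filled by walking the content parts of the ContentItem and reading each matching Blob, which is exactly the boilerplate I was hoping to avoid.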
Also, ideally I would like to have only a single text/plain ContentPart, as
its meaning is the same for both engines. The scenario feels very typical,
which makes me wonder whether I am misunderstanding something very basic.
Your help is most appreciated, thank you for your time!