Dear Stanbol devs, I have a question regarding the following scenario -- text extraction from different file types.
- I am using the Tika engine to extract text from common documents.
- I have implemented a Tesseract OCR engine to extract text from images.
- Now I would like to use both in a single chain and do something with the extracted text (the output is metadata).

The problem is that Tika produces an empty text/plain ContentPart for images, and my Tesseract engine adds its own. I therefore end up with multiple text/plain ContentParts.

*How does one conveniently access all the text/plain content parts extracted by previous engines?*

ContentItemHelper.getBlob() only returns the first ContentPart that matches the media type. I guess I could iterate over all content parts manually and check their types, but this does not seem very convenient. Also, ideally I would like to have only a single text/plain ContentPart, since the text means the same thing regardless of which engine produced it...

The scenario feels very typical, which makes me wonder whether I am misunderstanding something obvious.

Your help is most appreciated, thank you for your time!

Michal
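
P.S. To make the manual approach concrete, here is a minimal sketch of what I mean by iterating over all parts and checking their types. Note that Part and TextPartsSketch are hypothetical stand-ins made up for illustration, not Stanbol classes; in a real engine the parts would come from the ContentItem rather than from a plain list.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types only: nothing here is part of the Stanbol API.
class TextPartsSketch {

    /** Hypothetical stand-in for a content part: a media type plus its text. */
    static final class Part {
        final String mimeType;
        final String text;

        Part(String mimeType, String text) {
            this.mimeType = mimeType;
            this.text = text;
        }
    }

    /** Returns every part whose media type is text/plain, in original order. */
    static List<Part> textParts(List<Part> parts) {
        List<Part> result = new ArrayList<>();
        for (Part p : parts) {
            if ("text/plain".equals(p.mimeType)) {
                result.add(p);
            }
        }
        return result;
    }

    /** Joins the non-empty text/plain parts into one string, skipping empty ones. */
    static String mergeText(List<Part> parts) {
        StringBuilder merged = new StringBuilder();
        for (Part p : textParts(parts)) {
            String text = p.text == null ? "" : p.text.trim();
            if (text.isEmpty()) {
                continue; // e.g. the empty text part produced for an image
            }
            if (merged.length() > 0) {
                merged.append('\n');
            }
            merged.append(text);
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        List<Part> parts = new ArrayList<>();
        parts.add(new Part("text/plain", ""));         // empty part, as Tika yields for images
        parts.add(new Part("image/png", "<binary>"));  // the original image, skipped
        parts.add(new Part("text/plain", "OCR text")); // part added by the OCR engine
        System.out.println(mergeText(parts));
    }
}
```

This works, but every downstream engine would have to repeat the same filtering and merging, which is why I am hoping there is a more convenient way.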