Beletsky Andrey created TIKA-2508:
-------------------------------------
Summary: ParsingReader uses hardcoded content handler
Key: TIKA-2508
URL: https://issues.apache.org/jira/browse/TIKA-2508
Project: Tika
Issue Type: Improvement
Affects Versions: 1.16
Reporter: Beletsky Andrey
ParsingReader uses a hardcoded content handler, which makes it unusable in the
following case:
I want to parse an image with TesseractParser using the HOCR output format, but
I can't read the result through this reader because its content handler is
hardcoded to BodyContentHandler, which uses WriteOutContentHandler, which in
turn uses ToTextContentHandler by default. This chain of content handlers
strips all HOCR tags and their attributes from the result.
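
To illustrate the effect, here is a minimal sketch that uses only the JDK's SAX classes, not Tika's own handlers. The class names (HandlerComparison, TextOnlyHandler, MarkupHandler) and the sample markup are made up for this example; the text-only handler merely stands in for what a plain-text handler chain does to HOCR-style output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; none of these names exist in Tika.
class HandlerComparison {

    // Keeps character data only, so every tag and attribute is lost.
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    // Re-emits elements and attributes along with the text.
    // Simplification: attribute values and text are not re-escaped.
    static class MarkupHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    static void parse(String xml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }

    static String textOnly(String xml) throws Exception {
        TextOnlyHandler h = new TextOnlyHandler();
        parse(xml, h);
        return h.out.toString();
    }

    static String withMarkup(String xml) throws Exception {
        MarkupHandler h = new MarkupHandler();
        parse(xml, h);
        return h.out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up HOCR-like fragment with a bounding box in the title attribute.
        String hocr = "<p class=\"ocr_par\"><span class=\"ocrx_word\" "
                + "title=\"bbox 10 20 50 40\">Hello</span></p>";
        System.out.println(textOnly(hocr));
        System.out.println(withMarkup(hocr));
    }
}
```

With the text-only handler, only the word "Hello" survives; the bounding box in the title attribute, which is the reason I want HOCR in the first place, is gone.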
*Expected Result:*
I would refactor this reader to make it more useful in cases like this. I
suppose making the content handler configurable would solve this issue, but
you know better; there may be pitfalls I don't know about.
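
As a rough sketch of what I mean by "configurable", the snippet below shows a reader that accepts a handler factory instead of fixing the handler itself. It is written against the JDK only and is not a patch for ParsingReader: ConfigurableReaderSketch, MarkupWriterHandler, open, and readAll are hypothetical names, the JDK SAX parser stands in for a Tika parser, and error handling is deliberately simplified.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.util.function.Function;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not part of Tika.
class ConfigurableReaderSketch {

    // Example handler that writes elements, attributes, and text to a Writer.
    // Simplification: attribute values and text are not re-escaped.
    static class MarkupWriterHandler extends DefaultHandler {
        private final Writer out;

        MarkupWriterHandler(Writer out) {
            this.out = out;
        }

        private void write(String s) throws SAXException {
            try {
                out.write(s);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            StringBuilder tag = new StringBuilder("<").append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                tag.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            write(tag.append('>').toString());
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            write("</" + qName + ">");
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            write(new String(ch, start, length));
        }
    }

    // Parses on a background thread and exposes the handler's output as a Reader.
    // The caller decides which handler receives the SAX events.
    static Reader open(String xml, Function<Writer, DefaultHandler> handlerFactory)
            throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            // Closing the writer signals end-of-stream to the reading side.
            try (Writer w = writer) {
                DefaultHandler handler = handlerFactory.apply(w);
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), handler);
            } catch (Exception e) {
                // Simplification: a real reader should surface this error to the caller.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        for (int n; (n = r.read(buf)) != -1; ) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

A caller would then pick the handler at construction time, for example `ConfigurableReaderSketch.open(xml, ConfigurableReaderSketch.MarkupWriterHandler::new)`, while a default factory could keep today's plain-text behavior for existing users.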
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)