Bob Paulin created TIKA-1904:
--------------------------------
Summary: Tika 2.0 - Create Proxy Parser and Detectors
Key: TIKA-1904
URL: https://issues.apache.org/jira/browse/TIKA-1904
Project: Tika
Issue Type: Improvement
Affects Versions: 2.0
Reporter: Bob Paulin
Assignee: Bob Paulin
There are several parsers and detectors in Tika 2.0 that instantiate parsers and
detectors living in different modules. As of now, these modules are therefore
dependent on other modules. This includes:
tika-parser-office-module -> tika-parser-web-module, tika-parser-text-module,
tika-parser-package-module
tika-parser-ebook-module -> tika-parser-text-module
tika-parser-journal-module -> tika-parser-pdf-module
Many of these dependencies could be made optional by introducing the concept of
proxy parsers and detectors. These would enable the functionality if all the
dependencies are included in the project, but would not throw a
ClassNotFoundException if the dependent module was not included (e.g. the parse
function would do nothing).
Example: ChmParser currently does the following:
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException {// throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    HtmlParser htmlParser = new HtmlParser();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));// -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}
Instead, the HtmlParser could be proxied in the constructor:
{code}
private final Parser htmlProxyParser;

public ChmParser() {
    this.htmlProxyParser = new ProxyParser("org.apache.tika.parser.html.HtmlParser");
}
{code}
and parsePage would then delegate to the proxy:
{code}
private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException {// throws IOException
    InputStream stream = null;
    Metadata metadata = new Metadata();
    ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));// -1
    ParseContext parser = new ParseContext();
    try {
        stream = new ByteArrayInputStream(byteObject);
        htmlProxyParser.parse(stream, handler, metadata, parser);
    } catch (SAXException e) {
        throw new RuntimeException(e);
    } catch (IOException e) {
        // Pushback overflow from tagsoup
    }
}
{code}
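For illustration, a rough sketch of how such a proxy might resolve its target by name and fall back to a no-op. This is not the proposed implementation: to keep it runnable without any Tika module on the classpath, SimpleParser is a hypothetical one-method stand-in for Tika's Parser interface, and UpperCaseParser, isAvailable() and the class name ProxyParserSketch are likewise made up for the example. Only the idea of constructing ProxyParser from a class name comes from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ProxyParserSketch {

    /** Hypothetical, simplified stand-in for Tika's Parser interface. */
    interface SimpleParser {
        void parse(InputStream stream, StringBuilder out) throws Exception;
    }

    /** Delegates to the named class if it can be loaded; otherwise parse() does nothing. */
    static final class ProxyParser implements SimpleParser {
        private final SimpleParser delegate; // null when the target is unavailable

        ProxyParser(String className) {
            SimpleParser found = null;
            try {
                Class<?> clazz = Class.forName(className);
                found = (SimpleParser) clazz.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
                // Optional module missing or unusable: keep the delegate null (no-op).
            }
            this.delegate = found;
        }

        /** Hypothetical helper so callers can check whether the target was found. */
        boolean isAvailable() {
            return delegate != null;
        }

        @Override
        public void parse(InputStream stream, StringBuilder out) throws Exception {
            if (delegate != null) {
                delegate.parse(stream, out);
            }
        }
    }

    /** Example delegate, present on the classpath, used only to exercise the proxy. */
    public static final class UpperCaseParser implements SimpleParser {
        public UpperCaseParser() {
        }

        @Override
        public void parse(InputStream stream, StringBuilder out) throws Exception {
            out.append(new String(stream.readAllBytes(), StandardCharsets.UTF_8).toUpperCase());
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);

        // Target class is on the classpath: the proxy delegates to it.
        ProxyParser present = new ProxyParser(UpperCaseParser.class.getName());
        StringBuilder out1 = new StringBuilder();
        present.parse(new ByteArrayInputStream(data), out1);
        System.out.println(present.isAvailable() + " [" + out1 + "]");

        // Target class cannot be found: no exception, parse() is a no-op.
        ProxyParser missing = new ProxyParser("org.example.missing.NoSuchParser");
        StringBuilder out2 = new StringBuilder();
        missing.parse(new ByteArrayInputStream(data), out2);
        System.out.println(missing.isAvailable() + " [" + out2 + "]");
    }
}
```

A real ProxyParser would presumably implement org.apache.tika.parser.Parser and might resolve the class through Tika's own service loading rather than a plain Class.forName; that choice is left open here. Catching LinkageError as well as ReflectiveOperationException is a design choice in this sketch, so that a target whose own dependencies are missing also degrades to a no-op.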