Francesco Capponi created NUTCH-2276:
----------------------------------------
Summary: Tika Boilerpipe Parser in combo with RSS items doesn't work
Key: NUTCH-2276
URL: https://issues.apache.org/jira/browse/NUTCH-2276
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.11, 1.12
Environment: feed parser for RSS;
Tika parser with Boilerpipe (using ArticleExtractor) for HTML
Reporter: Francesco Capponi
Sometimes the text (description) of an RSS item is so short, or has such
characteristics, that Tika with Boilerpipe decides to discard the entire text,
resulting in an empty string.
This happens because, when the feed plugin selects a parser, it calls:
Parser parser = parserFactory.getParsers(contentType, link)[0];
Since the content is HTML, this returns the Tika Boilerpipe article extractor.
Since the description text of an RSS item is, as far as I know, always HTML,
instead of asking for the contentType we could set a different MIME type for
this specific case:
String contentType = contentMeta.get(Response.CONTENT_TYPE);
-> String contentType = "text/html-short";
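To illustrate the idea, here is a minimal, self-contained sketch of selecting a parser by MIME type. It does not use the Nutch or Tika APIs: the class FeedItemParserSketch, the PARSERS map, the 10-word threshold, and both parser functions are hypothetical stand-ins. The "text/html" entry only imitates an extractor that drops short text, and "text/html-short" is the MIME type proposed above, which would still need to be mapped to a suitable parser in the real configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not Nutch code: models parser selection by MIME type.
class FeedItemParserSketch {

    // Crude tag stripper used by both stand-in parsers.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Stand-in for an article extractor: discards text it considers too short.
    // The 10-word threshold is an assumption made for this illustration only.
    static String articleExtractor(String html) {
        String text = stripTags(html);
        return text.split("\\s+").length < 10 ? "" : text;
    }

    // Stand-in for a plain HTML parser that keeps all text.
    static String plainHtml(String html) {
        return stripTags(html);
    }

    // MIME type -> parser, loosely mirroring a MIME-type-to-plugin mapping.
    static final Map<String, Function<String, String>> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/html", FeedItemParserSketch::articleExtractor);
        PARSERS.put("text/html-short", FeedItemParserSketch::plainHtml);
    }

    static String parse(String contentType, String html) {
        Function<String, String> parser = PARSERS.get(contentType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser for " + contentType);
        }
        return parser.apply(html);
    }

    public static void main(String[] args) {
        String description = "<p>Short summary</p>";
        // Content type taken from the response: the extractor drops the text.
        System.out.println("[" + parse("text/html", description) + "]");       // prints []
        // Dedicated content type for feed items: the text is kept.
        System.out.println("[" + parse("text/html-short", description) + "]"); // prints [Short summary]
    }
}
```

With the dedicated MIME type, the short description survives, while regular HTML pages would continue to go through the article extractor.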
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)