Nutch powerpoint parser plugin fails to parse ppt with images
-------------------------------------------------------------
Key: NUTCH-463
URL: https://issues.apache.org/jira/browse/NUTCH-463
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8.1
Environment: Windows
Reporter: Wilson Fong
When a PowerPoint presentation contains images, the parser appears to treat
the image data as text and tries to index it, which causes maxFieldLength to
be reached.
The lines from the crawl log file for the powerpoint in question:
Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 10000 reached, ignoring following tokens
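To illustrate why this matters, here is a minimal sketch of how a per-field
token cap behaves when junk tokens come first. It is not Nutch or Lucene code:
the class FieldLengthDemo and the method keepTokens are hypothetical names, and
the whitespace tokenizer is an assumption made to keep the example small. It
only shows that once the cap is reached, any real slide text after the
misread image data is dropped.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldLengthDemo {

    // Hypothetical helper: keeps at most maxFieldLength whitespace-separated
    // tokens and ignores the rest, mimicking a per-field token cap.
    public static List<String> keepTokens(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (kept.size() >= maxFieldLength) {
                break; // cap reached, ignoring following tokens
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumption: image data misread as text shows up as junk tokens
        // ahead of the real slide text.
        StringBuilder extracted = new StringBuilder();
        for (int i = 0; i < 12; i++) {
            extracted.append("x").append(i).append(' ');
        }
        extracted.append("quarterly customer report");

        // A cap of 10 stands in for the 10000 seen in the log above.
        List<String> kept = keepTokens(extracted.toString(), 10);
        System.out.println("tokens kept: " + kept.size());
        System.out.println("contains 'customer': " + kept.contains("customer"));
    }
}
```

With a cap of 10 and 12 junk tokens in front, none of the real words survive,
which matches the symptom reported here: the document is indexed, but mostly
with data that is not its text.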