Nutch powerpoint parser plugin fails to parse ppt with images
-------------------------------------------------------------
Key: NUTCH-463
URL: https://issues.apache.org/jira/browse/NUTCH-463
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8.1
Environment: Windows
Reporter: Wilson Fong

With powerpoint presentations that have images, the parser seems to treat the images as if they were text and tries to index them, resulting in maxFieldLength being reached.

The lines from the crawl log file for the powerpoint in question:

Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 10000 reached, ignoring following tokens

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers