[ https://issues.apache.org/jira/browse/NUTCH-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilson Fong updated NUTCH-463:
------------------------------

    Description:
With powerpoint presentations that have images, the parser seems to treat images as if they are text and tries to index it resulting in maxFieldLength being reached.

The lines from the crawl log file for the powerpoint in question:

Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 10000 reached, ignoring following tokens

The parser should extract only the text and skip the images.

  was:
With powerpoint presentations that have images, the parser seems to treat images as if they are text and tries to index it resulting in maxFieldLength being reached.

The lines from the crawl log file for the powerpoint in question:

Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 10000 reached, ignoring following tokens

> Nutch powerpoint parser plugin fails to parse ppt with images
> -------------------------------------------------------------
>
>                 Key: NUTCH-463
>                 URL: https://issues.apache.org/jira/browse/NUTCH-463
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>         Environment: Windows
>            Reporter: Wilson Fong
>
> With powerpoint presentations that have images, the parser seems to treat
> images as if they are text and tries to index it resulting in maxFieldLength
> being reached.
> The lines from the crawl log file for the powerpoint in question:
> Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
> Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer
> [EMAIL PROTECTED] (null)
> maxFieldLength 10000 reached, ignoring following tokens
>
> The parser should extract only the text and skip the images.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers