[ https://issues.apache.org/jira/browse/NUTCH-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche closed NUTCH-463.
-------------------------------

    Resolution: Won't Fix

Parsing delegated to Tika

> Nutch powerpoint parser plugin fails to parse ppt with images
> -------------------------------------------------------------
>
>                 Key: NUTCH-463
>                 URL: https://issues.apache.org/jira/browse/NUTCH-463
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>         Environment: Windows
>            Reporter: W Fong
>
> With powerpoint presentations that have images, the parser seems to treat
> images as if they are text and tries to index it resulting in maxFieldLength
> being reached.
> The lines from the crawl log file for the powerpoint in question:
> Indexing [http://127.0.0.1/] with analyzer
> org.apache.nutch.analysis.NutchDocumentAnalyzer@1ce85c4 (null)
> Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer
> org.apache.nutch.analysis.NutchDocumentAnalyzer@1ce85c4 (null)
> maxFieldLength 10000 reached, ignoring following tokens
>
> The parser should extract only the text and skip the images.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira