[jira] [Commented] (NUTCH-463) Nutch powerpoint parser plugin fails to parse ppt with images

Lewis John McGibbney (JIRA) Tue, 09 Aug 2011 08:16:56 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081695#comment-13081695 ]


Lewis John McGibbney commented on NUTCH-463:
--------------------------------------------

Can we close this issue?

.ppt detection and parsing were delegated to Tika as of the official Nutch 1.3 
release, so the parse-powerpoint plugin code has been deprecated.


> Nutch powerpoint parser plugin fails to parse ppt with images
> -------------------------------------------------------------
>
>                 Key: NUTCH-463
>                 URL: https://issues.apache.org/jira/browse/NUTCH-463
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>         Environment: Windows
>            Reporter: W Fong
>
> With PowerPoint presentations that contain images, the parser seems to treat 
> the images as if they were text and tries to index them, so maxFieldLength 
> is reached.
> The lines from the crawl log file for the powerpoint in question:
>  Indexing [http://127.0.0.1/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1ce85c4 (null)
>  Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1ce85c4 (null)
> maxFieldLength 10000 reached, ignoring following tokens
>  
> The parser should extract only the text and skip the images.
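
To illustrate why this matters, here is a minimal sketch of how a per-field
token cap could drop the real slide text when image data is tokenized first.
It is not Nutch or Lucene code: the class FieldLimitSketch, the method
limitTokens and the "junk" tokens are hypothetical names made up for this
example, and the cap of 10000 is taken from the log line above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Nutch or Lucene implementation.
class FieldLimitSketch {

    // Keep at most maxFieldLength tokens and drop the rest, which mimics
    // the "ignoring following tokens" behaviour reported in the log.
    static List<String> limitTokens(List<String> tokens, int maxFieldLength) {
        int end = Math.min(tokens.size(), maxFieldLength);
        return new ArrayList<>(tokens.subList(0, end));
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        // Assume the parser emitted image bytes as 10000 junk tokens...
        for (int i = 0; i < 10000; i++) {
            tokens.add("junk" + i);
        }
        // ...followed by the actual slide text.
        tokens.add("customer");
        tokens.add("overview");

        List<String> indexed = limitTokens(tokens, 10000);
        System.out.println(indexed.size());
        System.out.println(indexed.contains("customer"));
    }
}
```

In this sketch the slide text sits past the cap, so none of it would be
indexed. Raising the cap would only hide the symptom; the image data should
not be tokenized in the first place.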

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
