[ https://issues.apache.org/jira/browse/NUTCH-463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilson Fong updated NUTCH-463:
------------------------------

    Description: 
With PowerPoint presentations that contain images, the parser seems to treat 
the images as text and tries to index them, which results in maxFieldLength 
being reached.
The lines from the crawl log file for the PowerPoint file in question:

 Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
 Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 10000 reached, ignoring following tokens
 
The parser should extract only the text and skip the images.
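
To illustrate the requested behaviour, here is a minimal sketch in Java. It 
is not the plugin's actual code: the Record and Kind types and the 
extractText method are hypothetical names invented for this example, and a 
real fix would have to work against whatever record types the PowerPoint 
parsing library exposes. The idea is only that records are filtered by type 
before their content reaches the indexer, so picture data never counts 
against maxFieldLength.

```java
import java.util.ArrayList;
import java.util.List;

public class SlideTextFilter {
    // Hypothetical record kinds; these are NOT the real PowerPoint record types.
    enum Kind { TEXT, PICTURE }

    // Hypothetical container for one record read from a slide.
    static final class Record {
        final Kind kind;
        final String payload;

        Record(Kind kind, String payload) {
            this.kind = kind;
            this.payload = payload;
        }
    }

    /** Joins the payloads of TEXT records with single spaces and skips all other records. */
    static String extractText(List<Record> records) {
        StringBuilder sb = new StringBuilder();
        for (Record r : records) {
            if (r.kind != Kind.TEXT) {
                continue; // picture data is dropped, not handed to the indexer
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(r.payload);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Record> slide = new ArrayList<>();
        slide.add(new Record(Kind.TEXT, "Customer overview"));
        slide.add(new Record(Kind.PICTURE, "raw image bytes"));
        slide.add(new Record(Kind.TEXT, "Q3 results"));
        System.out.println(extractText(slide)); // prints "Customer overview Q3 results"
    }
}
```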


  was:
With PowerPoint presentations that contain images, the parser seems to treat 
the images as text and tries to index them, which results in maxFieldLength 
being reached.
The lines from the crawl log file for the PowerPoint file in question:

 Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
 Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
maxFieldLength 10000 reached, ignoring following tokens
 




> Nutch powerpoint parser plugin fails to parse ppt with images
> -------------------------------------------------------------
>
>                 Key: NUTCH-463
>                 URL: https://issues.apache.org/jira/browse/NUTCH-463
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>         Environment: Windows
>            Reporter: Wilson Fong
>
> With PowerPoint presentations that contain images, the parser seems to treat 
> the images as text and tries to index them, which results in maxFieldLength 
> being reached.
> The lines from the crawl log file for the PowerPoint file in question:
>  Indexing [http://127.0.0.1/] with analyzer [EMAIL PROTECTED] (null)
>  Indexing [http://127.0.0.1/wiki/images/0/01/Customer.ppt] with analyzer [EMAIL PROTECTED] (null)
> maxFieldLength 10000 reached, ignoring following tokens
>  
> The parser should extract only the text and skip the images.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

