[ http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_64520 ]

David Spencer commented on NUTCH-21:
------------------------------------

This may be of some use:

I needed a PPT parser in the context of Lucene, so I copied the code from here, 
commented out a few nutch-specific things (e.g. the logging calls), and tested 
it on some local PPT files. I'm using POI-2.5.1-final.

The code is not perfect, nor is the PPT I have :) but it's pretty good.
When it works it works well.
When it fails it sometimes says there is no content, even though the doc I spot 
checked seemed to have textual content. I have only spot checked a few docs, 
but I did run it through my disk:

In a test run:
[a] I had 195 PPT files
[b] In 36 files it said there was no body
[c] With one file it threw an exception
[d] With 158 files it found content
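
For anyone who wants to repeat this kind of run, here is a minimal sketch of a 
tally harness. It is not the code I ran: PptBatchTally, TextExtractor, Outcome, 
classify and tally are all hypothetical names, and the extractor in main is a 
fake stand-in that you would replace with the real POI-based parser. It only 
shows how the three outcomes above ([b] no body, [c] exception, [d] content) 
could be counted.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical harness; none of these names come from the parser code itself.
class PptBatchTally {

    /** The three outcomes reported in the test run. */
    enum Outcome { CONTENT, NO_BODY, EXCEPTION }

    /** Stand-in for the real extractor (assumed: returns the text of one PPT file). */
    interface TextExtractor {
        String extract(String path) throws Exception;
    }

    /** Classify one file by what the extractor returns or throws. */
    static Outcome classify(TextExtractor extractor, String path) {
        try {
            String text = extractor.extract(path);
            boolean empty = (text == null || text.trim().isEmpty());
            return empty ? Outcome.NO_BODY : Outcome.CONTENT;
        } catch (Exception e) {
            // e.g. the IOException POI throws on a file it cannot read
            return Outcome.EXCEPTION;
        }
    }

    /** Count outcomes over a list of paths; every outcome starts at zero. */
    static Map<Outcome, Integer> tally(TextExtractor extractor, List<String> paths) {
        Map<Outcome, Integer> counts = new EnumMap<>(Outcome.class);
        for (Outcome o : Outcome.values()) {
            counts.put(o, 0);
        }
        for (String path : paths) {
            counts.merge(classify(extractor, path), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Fake extractor for illustration only: the file name decides the outcome.
        TextExtractor fake = path -> {
            if (path.startsWith("bad")) {
                throw new java.io.IOException("simulated read failure");
            }
            return path.startsWith("empty") ? "" : "slide text";
        };
        System.out.println(tally(fake, Arrays.asList("a.ppt", "empty.ppt", "bad.ppt")));
    }
}
```

Note that a NO_BODY result only means the extractor returned nothing; as with 
[b], that can be correct for image-only slides, so those files still need a 
manual spot check.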

Wrt [b], this is not necessarily wrong, e.g. if a file contains only images. 
However, the 1 file I spot checked apparently did have textual content.

Wrt [d], I didn't spot check many files but the ones I did seemed fine.

Personally I would advocate using this esp if someone verifies this within 
nutch - but I'm confident it will work as I didn't change much to use it in 
Lucene.



This was the "bug" that happened in 1 file
Caused by: java.io.IOException: Cannot remove block[ 18805 ]; out of range
        at org.apache.poi.poifs.storage.BlockListImpl.remove(BlockListImpl.java:103)
        at org.apache.poi.poifs.storage.BlockAllocationTableReader.<init>(BlockAllocationTableReader.java:92)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:83)
        at com.tropo.ppt.PPT2Text.init(PPT2Text.java:92)



-- Dave







> parser plugin for MS PowerPoint slides
> --------------------------------------
>
>          Key: NUTCH-21
>          URL: http://issues.apache.org/jira/browse/NUTCH-21
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>  Attachments: build.xml.patch.txt, parse-mspowerpoint.zip
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356
> submitted by:
> Stephan Strittmatter
