[ http://issues.apache.org/jira/browse/NUTCH-21?page=comments#action_64531 ]

David Spencer commented on NUTCH-21:
------------------------------------

I figured out why I was getting docs with zero body.
Here's a stack trace; note that I changed the package to com.tropo.ppt for 
my own use.

java.lang.StringIndexOutOfBoundsException: String index out of range: 1756156169
        at java.lang.String.checkBounds(String.java:287)
        at java.lang.String.<init>(String.java:370)
        at com.tropo.ppt.ContentReaderListener.extractSlides(ContentReaderListener.java:353)
        at com.tropo.ppt.ContentReaderListener.processPOIFSReaderEvent(ContentReaderListener.java:121)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.processProperties(POIFSReader.java:259)
        at org.apache.poi.poifs.eventfilesystem.POIFSReader.read(POIFSReader.java:95)
        at com.tropo.ppt.PPT2Text.init(PPT2Text.java:92)
        at com.tropo.ppt.PPT2Text.getProperties(PPT2Text.java:132)
        at com.tropo.ppt.PPT2Text.main(PPT2Text.java:144)

---
What's happening is in ContentReaderListener.java.

In processPOIFSReaderEvent() there's an empty catch(Throwable) block that hides 
the error.
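
As a rough sketch of the alternative, the catch block could at least log what it 
caught. The class and method names below are hypothetical stand-ins, not the 
plugin's actual code, and java.util.logging is only an assumption about which 
logger would be used:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the listener; not the plugin's actual class.
class LoggingListenerSketch {
    private static final Logger LOG =
            Logger.getLogger(LoggingListenerSketch.class.getName());

    // Number of failed steps, so a caller can tell "empty body" from "parse failed".
    private int failures = 0;

    // Hypothetical wrapper: runs one extraction step and reports any failure
    // instead of silently discarding it.
    void process(Runnable extractionStep) {
        try {
            extractionStep.run();
        } catch (Throwable t) {
            failures++;
            LOG.log(Level.WARNING, "slide extraction failed", t);
        }
    }

    int getFailures() {
        return failures;
    }
}
```

With something like this the StringIndexOutOfBoundsException would have shown up 
in the log instead of as a silently empty document.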

In extractSlides() it happily goes through some data but then, for some reason, 
'size' ends up larger than pptdata.length (the trace above reports an index of 
1756156169), so the String constructor throws.

One hack to "fix" this is to replace this:

final String strTempContent = new String(pptdata, (int) i + 6,
            (int) (size) + 2);

with this:

String strTempContent;

try {
    strTempContent = new String(pptdata, (int) i + 6, (int) (size) + 2);
}
catch( StringIndexOutOfBoundsException ouch) {
    strTempContent = "";
}

When I do this I get data out of a ppt file that previously seemed to have a 
zero length body...
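
Another option, instead of catching the exception, would be to check the range 
before building the String. This is only a sketch: it assumes pptdata is a byte 
array and that the offset and length are longs (as the int casts suggest), and 
the class and method names are made up for illustration. It uses the same 
platform-default-charset constructor as the original code.

```java
// Hypothetical helper; not part of the plugin.
class SafeSliceSketch {

    // Returns the text stored in data[offset .. offset + length), or "" when
    // that range does not fit inside the array.
    static String sliceOrEmpty(byte[] data, long offset, long length) {
        // Compare in long arithmetic so a huge length cannot overflow an int.
        if (offset < 0 || length < 0
                || offset > data.length
                || length > data.length - offset) {
            return "";
        }
        // Same constructor the original code calls (platform default charset).
        return new String(data, (int) offset, (int) length);
    }
}
```

The call in extractSlides() would then look something like 
sliceOrEmpty(pptdata, i + 6, size + 2). Like the try/catch hack, this only skips 
the bad record; it does not explain why 'size' is wrong in the first place.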





> parser plugin for MS PowerPoint slides
> --------------------------------------
>
>          Key: NUTCH-21
>          URL: http://issues.apache.org/jira/browse/NUTCH-21
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Reporter: Stefan Groschupf
>     Priority: Trivial
>  Attachments: build.xml.patch.txt, parse-mspowerpoint.zip
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1109321&group_id=59548&atid=491356
> submitted by:
> Stephan Strittmatter



