Hi list, (Sorry if this isn't the proper list to post this)
I was experimenting with Nutch (the version I checked out from Subversion this morning) and it failed to index a site I tried. Apparently the page could not be gunzipped. The log showed:

  050303 211020 fetched 4471 bytes of compressed content (expanded to 0 bytes) from http://www.lesauna.net/index.php3

I played with GZIPUtils a bit and noticed that the unzipBestEffort(byte[] in, int sizeLimit) method has an empty catch block. When I added some logging there, I got the following exception:

  java.lang.IndexOutOfBoundsException
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:89)
    at org.apache.nutch.util.GZIPUtils.unzipBestEffort(GZIPUtils.java:70)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:166)
    at org.apache.nutch.protocol.http.Http.getContent(Http.java:186)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:120)

I think the problem occurs when:

1. the page is gzipped, and
2. http.content.limit is set to -1.

In that case (sizeLimit - written) is negative in this method, so the write to the ByteArrayOutputStream fails with the exception above. Because the catch block is empty, the failure goes unnoticed and the page ends up expanded to 0 bytes.

I hope this helps.

Regards,
Sébastien.
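
P.S. Here is a minimal sketch of the kind of guard that might avoid this, assuming a negative sizeLimit is meant as "no limit". It is not the actual GZIPUtils code: the class name GzipSketch is made up for illustration, and the real method may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

// Hypothetical stand-in for GZIPUtils, for illustration only.
final class GzipSketch {

    /**
     * Expands gzipped bytes, returning whatever could be decoded if the
     * input is truncated or corrupt. At most sizeLimit bytes are returned.
     */
    static byte[] unzipBestEffort(byte[] in, int sizeLimit) {
        // Assumption: a negative limit (e.g. http.content.limit = -1)
        // means "no limit", so it must never reach the length arithmetic.
        int limit = sizeLimit < 0 ? Integer.MAX_VALUE : sizeLimit;

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int written = 0;

        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(in))) {
            while (written < limit) {
                int size = gz.read(buf);
                if (size <= 0) {
                    break; // end of stream
                }
                // Never negative, because written < limit here.
                int toWrite = Math.min(size, limit - written);
                out.write(buf, 0, toWrite);
                written += toWrite;
            }
        } catch (IOException e) {
            // Best effort: keep what was decoded, but report the failure
            // instead of swallowing it silently.
            System.err.println("gunzip stopped early: " + e);
        }
        return out.toByteArray();
    }
}
```

With this, GzipSketch.unzipBestEffort(data, -1) expands the whole page, while a positive limit still truncates the output. The catch block only handles IOException, so an unexpected runtime exception would propagate rather than being hidden.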
