[Nutch-general] GZipped pages

Sébastien LE CALLONNEC Thu, 03 Mar 2005 13:00:06 -0800

Hi list, 

(Sorry if this isn't the proper list to post this)


I was experimenting with Nutch (the version I checked out from
Subversion this morning) and it failed to index a site I tried:
apparently the page was not gunzipped successfully.  The log I was
getting was:

050303 211020 fetched 4471 bytes of compressed content (expanded to 0 bytes) from http://www.lesauna.net/index.php3

I played with GZIPUtils a bit and noticed that the
unzipBestEffort(byte[] in, int sizeLimit) method contains an empty
catch block.  When I added some logging there, I got the following
exception:

java.lang.IndexOutOfBoundsException
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:89)
        at org.apache.nutch.util.GZIPUtils.unzipBestEffort(GZIPUtils.java:70)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:166)
        at org.apache.nutch.protocol.http.Http.getContent(Http.java:186)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:120)

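I think the problem occurs when:

1. the page is gzipped, and
2. http.content.limit is set to -1.  In that case (sizeLimit - written)
is negative in this method; presumably that negative value ends up as
the length passed to ByteArrayOutputStream.write, which gives the
exception above.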
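
To make the suspected failure concrete, here is a minimal sketch of the
kind of guard that would avoid it.  It is not the actual GZIPUtils
code: the class GzipLimitSketch and the methods unzipWithLimit, gzip and
negativeLengthThrows are hypothetical names made up for illustration,
and treating a negative sizeLimit as "no limit" is an assumption based
on http.content.limit being set to -1.  Unlike a best-effort method, it
lets I/O errors propagate instead of swallowing them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not the Nutch GZIPUtils source.
final class GzipLimitSketch {

    // Unzips at most sizeLimit bytes. Assumption: a negative sizeLimit
    // means "no limit", so the truncation branch is skipped entirely.
    static byte[] unzipWithLimit(byte[] in, int sizeLimit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(in))) {
            byte[] buf = new byte[4096];
            int written = 0;
            int size;
            while ((size = gz.read(buf)) > 0) {
                // Without the "sizeLimit >= 0" check, sizeLimit = -1 would make
                // (sizeLimit - written) negative and write() would throw.
                if (sizeLimit >= 0 && written + size > sizeLimit) {
                    out.write(buf, 0, sizeLimit - written);
                    break;
                }
                out.write(buf, 0, size);
                written += size;
            }
        }
        return out.toByteArray();
    }

    // Helper to build gzipped test input.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Shows that a negative length is rejected by ByteArrayOutputStream.write.
    static boolean negativeLengthThrows() {
        try {
            new ByteArrayOutputStream().write(new byte[16], 0, -1);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = gzip("hello gzip".getBytes(StandardCharsets.UTF_8));
        System.out.println("negative length throws: " + negativeLengthThrows());
        System.out.println("bytes with limit -1: " + unzipWithLimit(packed, -1).length);
        System.out.println("bytes with limit 4: " + unzipWithLimit(packed, 4).length);
    }
}
```

With a guard of this kind, a limit of -1 expands the whole page, while
a non-negative limit still truncates the output at that many bytes.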


I hope this helps.

Regards, 
Sébastien.


        

        
                