nutch fetch issue - empty content

Viral Shah Tue, 09 Sep 2008 16:55:14 -0700

Hello --

We are using Nutch to crawl HTML content for Wikipedia articles. We're using a somewhat old nightly build of Nutch.

We use a static list of URLs as input. To do this, we've injected our list of URLs, set db.update.additions.allowed to false, and set the crawl depth to 1.

- We iterate over the output segment files using 'SequenceFile.Reader' and pull out the 'string' as well as the 'binary' form of the content.

        
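                import sys
                from java.lang import String
                from org.apache.hadoop.fs import Path
                from org.apache.hadoop.io import SequenceFile

                # filesystem and job are assumed to be created earlier in the script
                reader = SequenceFile.Reader(filesystem, Path(sys.argv[1]), job)
                key = reader.getKeyClass()()        # instantiate the segment's key class
                content = reader.getValueClass()()  # instantiate the segment's value class
                while reader.next(key, content):
                        # decode the raw bytes as UTF-8 for the 'string' form
                        content_text = String(content.getContent(), "UTF-8").toString()
                        content_binary = content.getContent()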

- I get empty content for some URLs, but their status in the crawldb is set to 'db_fetched'. The value of content_text is "" and that of content_binary is array('b',[]).

- This is completely random in terms of when it happens and the URLs involved.

- This failure is completely silent as far as I can tell, as nothing can be seen in the logs regarding this error.
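
To make the symptom easier to reproduce, the affected records could be collected in one pass. The following is only a minimal sketch in plain Python, not part of our actual script: find_empty_records is a hypothetical helper, the sample URLs are made up, and it assumes each record has already been reduced to a (url, bytes) pair, for example from str(key) and content.getContent() in the loop above.

```python
def find_empty_records(records):
    """Return the keys whose content payload is missing or zero-length.

    `records` is any iterable of (key, payload) pairs, where payload is a
    bytes-like object or None.
    """
    empty = []
    for key, payload in records:
        if payload is None or len(payload) == 0:
            empty.append(key)
    return empty


# Hypothetical sample data standing in for records read from a segment.
sample = [
    ("http://en.wikipedia.org/wiki/A", b"<html>...</html>"),
    ("http://en.wikipedia.org/wiki/B", b""),
]
print(find_empty_records(sample))
```

The resulting list of URLs could then be compared across runs to check whether the same URLs come back empty each time.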

Again, we are crawling Wikipedia, where it is easy to verify both the content and whether that content is accessible. We have tried manually fetching the problem URLs and everything looked fine.


Thank you,
Viral Shah
