We are using nutch version nutch-2008-07-22_04-01-29. We have a crawldb with over 1 million urls.
We have noticed that some of the urls in search results do not have titles. After comparing urls with titles against urls without titles, we found that the urls without titles have empty parsetext.

Why would some urls have empty parsetext? Is there some place I can look to determine why parsetext is missing?

Is the only way to reparse those urls with empty parsetext to remove the crawl_parse directory for the corresponding segment and run the nutch parse command?

Is there something I should do to guarantee all urls get a parsetext and, hopefully, a title?

Thanks in advance for any assistance or pointers to other resources or ideas.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services
