Richard Braman wrote:
any idea why
http://24.75.221.234:8080/search.jsp?query=e-file+site%3Awww.irs.gov&hitsPerPage=10&hitsPerSite=0&clustering=
returns a list of hits where the title of the page is not shown, but
the URL is shown instead? The pages do have titles.
The "explain" button also shows a null title and the cache does not
include these files. Are you sure they were fetched? Perhaps they only
have links. What version of Nutch are you using? 0.8 does not support
indexing pages with only links, but I think 0.7 may have. If 0.8, then
I'd suspect the parser. Try re-parsing these pages (e.g., by crawling
only these pages in a test crawl). Maybe put some print statements in
the parser to see what's going on?
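
As a rough illustration of the print-statement idea, here is a standalone
sketch that pulls the <title> out of raw HTML and prints it, so you can
confirm what a page really contains before digging into the parser. This
is not Nutch's parser code: the class name TitleCheck, the method
extractTitle, and the sample HTML are all made up for the example, and a
regex is only a quick stand-in for real HTML parsing.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for debugging only; not part of Nutch.
public class TitleCheck {
    // Case-insensitive and DOTALL so <TITLE> tags and multi-line titles match.
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the whitespace-normalized title text, or null if none is found. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        String title = m.group(1).replaceAll("\\s+", " ").trim();
        return title.isEmpty() ? null : title;
    }

    public static void main(String[] args) {
        // Made-up sample page; substitute the saved content of a page under test.
        String sample =
            "<html><head><TITLE>\n  Example page\n</TITLE></head><body></body></html>";
        // The kind of print statement one might temporarily add while debugging.
        System.out.println("title = " + extractTitle(sample));
    }
}
```

If this prints a title for a saved copy of the page but the index still
shows null, that points at the fetch or parse step rather than at the page.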
Doug