Richard Braman wrote:
Any idea why
http://24.75.221.234:8080/search.jsp?query=e-file+site%3Awww.irs.gov&hitsPerPage=10&hitsPerSite=0&clustering=
returns a list of hits where the title of the page is not shown, but
instead the URL is shown? The pages do have titles.

The "explain" button also shows a null title and the cache does not include these files. Are you sure they were fetched? Perhaps they only have links. What version of Nutch are you using? 0.8 does not support indexing pages with only links, but I think 0.7 may have. If 0.8, then I'd suspect the parser. Try re-parsing these pages (e.g., by crawling only these pages in a test crawl). Maybe put some print statements in the parser to see what's going on?

Doug
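
One way to check the "do the pages really have titles" question outside of Nutch is to run a saved copy of a page through an independent HTML parser and see what title, if any, comes out. The following is a minimal sketch using only Python's standard library; it is not Nutch's parser, and the names `TitleFinder` and `extract_title` are made up for this illustration. If it finds a title where the index shows none, that points toward the fetch or parse step rather than the page itself.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TitleFinder(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False   # currently between <title> and </title>
        self._done = False       # first title already captured
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace; report None for a missing or empty title.
        text = " ".join("".join(self._chunks).split())
        return text or None


def extract_title(html: str) -> Optional[str]:
    """Return the page title, or None if the page has no usable <title>."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title
```

To use it, save the HTML of one of the affected pages to a file, read it into a string, and pass it to `extract_title`. A result of `None` would suggest the page as served really has no title text (for example, a frameset or redirect page), in which case the parser may not be at fault.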


_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
