Without having looked closely at the source code, I assume this comes down to how the getOutlinks method in the DOMContentUtils class of the parse-html plugin handles such links.
This method extracts outlinks from the DOM tree built from the HTML page. These are then inserted into the crawldb for subsequent fetching. Unless someone else has more specific direction, I suggest debugging that method to see what it does with such anchors, that is, what the final content (if any) of those anchors is.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com

-----Original Message-----
From: Denis Pimenov [mailto:[EMAIL PROTECTED]
Sent: 24 January 2007 15:36
To: [email protected]
Subject: Re: nutch scrawls only relative links

Denis Pimenov wrote:

I used +^.* in crawl-urlfilter.txt, but it doesn't work: it still doesn't crawl relative links, only absolute ones.

> Hello
>
> I am a newbie in nutch... It seems to me that crawling does not
> work for relative URLs by default. How can I fix it?
>
> For example, I have a relative link on the start page,
> <a href="/test/my.jsp">, which is not crawled (although browsers
> open it with the proper prefix), but if I have the link
> <a href="http://mydomain.com:8080/test/my.jsp"> it is crawled well.
> Is there any configuration file or something else to fix that? I have
> seen this question in the mail archive but it wasn't answered.
>
> Denis Pimenov

Denis Pimenov

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
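
For reference while debugging, here is a minimal sketch of what resolving a relative href against the page URL is expected to produce. It is not the Nutch code and makes no claim about what getOutlinks actually does; it only uses the standard java.net.URI class. The class name RelativeLinkDemo and the start page URL http://mydomain.com:8080/index.html are assumptions made up for illustration.

```java
import java.net.URI;

// Illustrative only: this is NOT the Nutch implementation.
class RelativeLinkDemo {

    // Resolve an href found on a page against that page's own URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical start page; only the host and port come from the thread.
        String base = "http://mydomain.com:8080/index.html";

        // Relative link from the original question.
        System.out.println(resolve(base, "/test/my.jsp"));
        // prints http://mydomain.com:8080/test/my.jsp

        // An absolute link passes through unchanged.
        System.out.println(resolve(base, "http://mydomain.com:8080/test/my.jsp"));
    }
}
```

If the outlinks that reach the crawldb for the relative anchors do not look like the first line printed here, the problem is probably in the resolution step; if they do, the URL filters would be the next place to look.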
