Re: [Nutch-general] nutch scrawls only relative links

Alan Tanaman Wed, 24 Jan 2007 10:35:13 -0800

Without having looked closely at the source code, I assume this comes down
to how the getOutlinks method of the DOMContentUtils class in the parse-html
plugin handles such links.


This method extracts outlinks from the DOM tree created from the HTML page.
These are then inserted into the crawldb for subsequent fetching.
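
To make clear what "handling" means here: a relative href only becomes a
fetchable outlink once it has been resolved against the URL of the page it
was found on. The snippet below is a rough sketch of that step using only the
standard java.net.URI class. It is not the Nutch code itself; the class name
RelativeLinkSketch, the helper methods, and the start-page URL
http://mydomain.com:8080/index.jsp are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Nutch or its parse-html plugin.
public class RelativeLinkSketch {

    // Resolve one href against the URL of the page it was found on.
    // Absolute hrefs come back unchanged; relative ones inherit the
    // scheme, host and port of the base.
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href.trim()).toString();
    }

    // Resolve a list of hrefs, skipping any that are not valid URI references.
    static List<String> resolveAll(String base, List<String> hrefs) {
        List<String> outlinks = new ArrayList<>();
        for (String href : hrefs) {
            try {
                outlinks.add(resolve(base, href));
            } catch (IllegalArgumentException e) {
                // Malformed href: drop it rather than abort the whole page.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        // Assumed start page URL, for illustration only.
        String base = "http://mydomain.com:8080/index.jsp";
        List<String> hrefs = List.of(
                "/test/my.jsp",
                "http://mydomain.com:8080/test/my.jsp");
        // Both entries print http://mydomain.com:8080/test/my.jsp
        for (String outlink : resolveAll(base, hrefs)) {
            System.out.println(outlink);
        }
    }
}
```

If getOutlinks ends up with a different string for the relative anchor than
for the absolute one, or drops it altogether, that would point to where the
problem is.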

Unless someone else has more specific direction, I suggest you try debugging
that method to see what it does with such anchors, i.e. what the final
outlink URL is (if any) for each of them.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com

-----Original Message-----
From: Denis Pimenov [mailto:[EMAIL PROTECTED] 
Sent: 24 January 2007 15:36
To: [email protected]
Subject: Re: nutch scrawls only relative links

Denis Pimenov wrote:

I used +^.* in crawl-urlfilter.txt, but it doesn't work: relative links
still aren't crawled, only absolute ones.
> Hello
>
> I am a newbie in nutch... It seems to me that crawling does not
> work for relative URLs by default. How do I fix it?
>
> For example, a relative link on the start page, <a
> href="/test/my.jsp">, is not crawled (although browsers open it with the
> proper prefix), but a link such as <a
> href="http://mydomain.com:8080/test/my.jsp"> is crawled fine. Is
> there any configuration file or something else to fix that? I have
> seen this question in the mail archive, but it wasn't answered.
>
> Denis Pimenov
>
>
Denis Pimenov



