Hi,

What is the best place to implement this feature?

The candidates in the source, under htdig (indentation denotes call order):

ExternalParser::parse // at "case 'u'", "case 'm'"
HTML::parse // at "if(dofollow)", several places.
  Retriever::got_href()
    Server::push()

The catch is that canonicalization is done in got_href(), but implementing
the feature requires the parent document's URL in canonical form.

The easy way is to add an argument to got_href() that passes the canonical
parent URL, and to implement the feature inside got_href().

However, the canonical base URL needs to be pre-parsed so that the
substring-matching algorithm can use it easily (or does it?), so maybe the
canonicalization code itself should be modified to do this once and expose
the parsed result as public data of some class.

HTML::do_tag also knows nothing about the parent's name, does it?

I think the special pre-parsing should be done in HTML::parse, with the
result stored in a public data member of HTML. got_href() would then,
after it has canonicalized the new URL, call a new member function of
Retriever that accepts or prunes the URL according to the feature to be
implemented.

The special case of the 'first' URL on a server must also be handled,
although it should never arise (it is injected directly via push() and
not through got_href(), isn't it?).

Opinions on how best to do this?

tia,

        Peter

PS: wrt the feature, redescribed:

If a document with URL /a/b/c contains an href that is an exact leading
substring (a parent path) of /a/b/c, such as /a/b or /a, then that href
should be ignored and removed from the URLs to be parsed (push()-ed).
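
To make the rule concrete, here is a minimal sketch of the test on plain
strings. The name is_ancestor_path is made up for illustration and is not
part of htdig; it assumes both paths are already canonical and on the same
server, and it treats "/" as an ancestor of every deeper path, which may or
may not be what is wanted.

```cpp
#include <string>

// Hypothetical helper, not an htdig function.
// Drops trailing slashes so "/a/b/" is treated like "/a/b"; "/" is kept.
static std::string strip_trailing_slashes(std::string path)
{
    while (path.size() > 1 && path.back() == '/')
        path.pop_back();
    return path;
}

// Returns true when `href` is a proper parent path of `parent`,
// e.g. "/a/b" or "/a" for "/a/b/c". Assumes canonical paths.
bool is_ancestor_path(const std::string &href, const std::string &parent)
{
    const std::string h = strip_trailing_slashes(href);
    const std::string p = strip_trailing_slashes(parent);

    if (h.empty() || h.size() >= p.size())
        return false;               // empty, same length, or deeper
    if (p.compare(0, h.size(), h) != 0)
        return false;               // not a leading substring
    // The match must end on a '/' boundary, so "/a/b" does not
    // count as a parent of "/a/bc".
    return h == "/" || p[h.size()] == '/';
}
```

With this, is_ancestor_path("/a/b", "/a/b/c") holds, while "/a/b/c" against
itself and "/a/b" against "/a/bc" do not.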

Questions:  

* What would be a good name for a config option that turns this feature on?
* Should the pruned URL appear in the URL list even though it is not
followed?
* What is a good strategy for matching a string (list) of tokens separated
by '/' backwards? Something like this?

match last char || fail
while more parts
  match last part || fail
  last = prev(last)

* other ? 
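
One possible reading of the backwards match above, as a rough sketch and
not a proposal for the final code: split both paths into their '/'-separated
parts once, then compare the href's parts from the last one back to the
first against the parts at the same positions in the parent. The names
split_path and matches_backwards are invented for this example and do not
exist in htdig; canonical paths are assumed.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical pre-parsing step: "/a/b/c" becomes {"a", "b", "c"}.
// Empty parts (leading, trailing or doubled '/') are skipped.
std::vector<std::string> split_path(const std::string &path)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    while (start < path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos)
            end = path.size();
        if (end > start)
            parts.push_back(path.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}

// True when every part of `href` equals the part at the same position
// in `parent` and `href` has fewer parts. Parts are compared from the
// last part of `href` back to the first.
bool matches_backwards(const std::vector<std::string> &href,
                       const std::vector<std::string> &parent)
{
    if (href.size() >= parent.size())
        return false;                   // not a proper parent path
    for (std::size_t i = href.size(); i-- > 0; )
        if (href[i] != parent[i])
            return false;               // fail on first mismatch
    return true;
}
```

The parent's parts would be computed once per document and reused for every
href found in it. Going backwards may reject a non-matching href a little
sooner when sibling paths share a long common head, but I have not measured
whether that matters.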



