Gilles Detillieux wrote:

> According to Vincent Queru:
> > Some time ago, I read that someone wanted to index not only the HTML
> > source but also the URLs that the robot comes across when indexing a
> > site.
> >
> > I DO NOT want to index the URLs, but unfortunately they get indexed.
> > Is there something I missed here?
>
> htdig doesn't make a point of indexing the URLs itself, but if any pages
> it indexes contain URLs as the link description text in a hypertext link,
> then that link's description text gets indexed.  E.g., in this link:
>
>   <a href="http://www.htdig.org/files/">http://www.htdig.org/files/</a>
>
> the second occurrence of the URL will be treated as plain text, as
> well as a link description, and will be indexed.  There's no easy,
> automatic way of avoiding this.  Your best bet is to hunt down such
> files and change them.  You could set description_factor to 0, and that
> will prevent the description from being indexed for the referenced page,
> but it will do this for all link descriptions, which may be overkill and
> undesired, plus htdig will still index the description as plain text for
> the page containing the reference, so you won't get rid of it entirely.
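
As a side note on the "hunt down such files" suggestion above, here is a
minimal sketch of how such links could be located. It is not part of htdig;
the names UrlTextLinkFinder and find_url_text_links are made up for this
illustration, and the regular expression is only an assumption about what
counts as "a URL used as description text". It uses Python's standard
html.parser module.

```python
import re
from html.parser import HTMLParser

# Assumption: a description "is a URL" if it is a single http/https/ftp token.
URL_RE = re.compile(r"^(?:https?|ftp)://\S+$", re.IGNORECASE)


class UrlTextLinkFinder(HTMLParser):
    """Hypothetical helper: collects links whose visible text is itself a URL."""

    def __init__(self):
        super().__init__()
        self.hits = []           # (href, description) pairs that matched
        self._in_anchor = False  # True while inside an <a> element
        self._href = None        # href of the <a> currently open, if any
        self._text = []          # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            description = "".join(self._text).strip()
            if URL_RE.match(description):
                self.hits.append((self._href, description))
            self._in_anchor = False


def find_url_text_links(html):
    """Return (href, description) pairs where the description is a URL."""
    finder = UrlTextLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Calling find_url_text_links() on the contents of each HTML file should give a
list of the links that would need their description text changed. For a
dynamically generated site, the same check would have to be run on the
generated output rather than on the source files.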

OK, I set description_factor to 0 and it works fine, because the site I index
is unusual: it consists of a single page full of links that all point to the
same page, with only the arguments changing (it is a dynamic, PHP-coded site).

But I still have one more question: I had included a META NAME="robots"
VALUE="noindex" tag in the page containing the links, but they still got
indexed. Is that normal?

Furthermore, it is not the link description that got indexed but the link
itself (i.e., the URL contained in the A HREF tag).

