The main URL is in the form of:

http://www.somewhere.com

The page that is getting overlooked is:

http://www.somewhere.com/MyDir/index.html

The page in question (second URL above) contains a table with about 70 links
to other subdirs and pages beneath MyDir:

http://www.somewhere.com/MyDir/A/index.html
http://www.somewhere.com/MyDir/B/index.html
..

The link at the top level index page (first URL above) is a relative URL, not
absolute (href=/MyDir/index.html).

My "limit_urls_to" keyword is simply set to the "start_url":

start_url: http://www.somewhere.com
limit_urls_to: ${start_url}

Question: will htdig convert the relative URLs to absolute URLs using the FQDN,
or do I need to add "MyDir" or something to limit_urls_to?
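
To make the question concrete, here is a minimal sketch in Python of standard relative-URL resolution followed by a limit check. It is not ht://Dig's own code: the helper names resolve and within_limits are hypothetical, and treating limit_urls_to as a list of substrings that a URL must contain is an assumption about how the attribute is matched.

```python
from urllib.parse import urljoin

# Values taken from the config above.
start_url = "http://www.somewhere.com"
limit_urls_to = [start_url]  # ${start_url} expands to the same string


def resolve(base, href):
    """Resolve an href found on the page `base` into an absolute URL."""
    return urljoin(base, href)


def within_limits(url, patterns):
    """Assumption: a URL is allowed if it contains at least one pattern."""
    return any(pattern in url for pattern in patterns)


# The link on the front page: href=/MyDir/index.html
overlooked = resolve(start_url + "/", "/MyDir/index.html")

# One of the ~70 links in the table on that page, written relative to it.
sub_page = resolve(overlooked, "A/index.html")

for url in (overlooked, sub_page):
    print(url, within_limits(url, limit_urls_to))
```

If the crawler resolves links this way, both URLs already contain the start_url string, so adding "MyDir" to limit_urls_to should not be needed for them to pass the limit check.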

Thanks in advance!

Gabriel Fenteany wrote:

>  Hmmm.  I have links in tables (and for that matter in pull-down menus too)
>   that are followed and indexed just fine.  (ht://Dig is the greatest thing
>   since sliced bread!)  Are the linked files in the same directory as the
>   index file entry page, or in a sub-directory of it?  If not (if they do not
>   contain the full start_url string), then their domain paths have to be
>   defined in limit_urls_to as well; you can include a list of URLs of
>   arbitrary length in both limit_urls_to and start_url, with items separated
>   by whitespace.  Do you have a robots.txt file or do you use robots
>   metatags?  Maybe you inadvertently excluded certain files or directories?
>
>   I'd start a completely new dig with htdig -i (and -a if you want alternate
>   work files used) after checking these things and then see again.  If that
>   fails, I'd explicitly add the page to start_url and see what happens.  The
>   more info the better.
>
>   Good luck.
>
>   Gabriel
>
>   >
>   >
>   > Hi gang - this one is absolutely making me crazy.  I'm not sure if this
>   > has come up on the list before, but I have no other options at this
>   > point:
>   >
>   > I am trying to index a site that has around 800 or so documents.  For
>   > some reason, htdig fails to see the links.
>   >
>   > My .conf file resembles:
>   >
>   > start_url:            http://www.somewhere.com
>   > limit_urls_to:     ${start_url}
>   > exclude_urls:    /cgi-bin/ .cgi
>   >
>   > Okay, so far, no voodoo there.  I have this one HTML file that has a
>   > large table in it, about 70 x 3 (200+ cells). Yep, you guessed it, htdig
>   > fails to see ANY of the links in the table. According to my logs, it
>   > never even retrieves it (i.e. there is no "Retrieval command for
>   > http://whatever" entry for this particular file).
>   >
>   > The file is definitely linked within the site - off the front page for
>   > that matter (as well as a few other places).  The log files show that it
>   > sees the actual link TO the file (from other files), but it never
>   > attempts to retrieve it.... :-(
>   >
>   > Any ideas on this one?  I'm about to take my own life - hehe.
>   >
>   > Cheers.
>   > Scott
>   >
>   > ------------------------------------
>   > To unsubscribe from the htdig mailing list, send a message to
>   > [EMAIL PROTECTED] containing the single word "unsubscribe" in
>   > the SUBJECT of the message.
>   >
>
>   --
>   Gabriel Fenteany, Ph.D.
>   Post-doctoral Fellow &
>   WWW VL: Cell Biology Maintainer
>   http://vl.bwh.harvard.edu
