The main URL is in the form of:
http://www.somewhere.com
The page that is getting overlooked is:
http://www.somewhere.com/MyDir/index.html
The page in question (the second URL above) contains a table with about 70
links to other subdirs and pages beneath MyDir:
http://www.somewhere.com/MyDir/A/index.html
http://www.somewhere.com/MyDir/B/index.html
..
The link on the top-level index page (first URL above) is a relative URL, not
absolute (href=/MyDir/index.html).
My "limit_urls_to" keyword is simply set to the "start_url":
start_url: http://www.somewhere.com
limit_urls_to: ${start_url}
Question: will htdig convert the relative URLs to absolute URLs using the FQDN,
or do I need to add "MyDir" or something to limit_urls_to?
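
To make the question concrete, here is a rough Python sketch of the behaviour I
would expect. It is not htdig's actual code. It assumes that limit_urls_to and
exclude_urls are treated as plain substring patterns, and the helper names
resolve and is_allowed are made up for illustration:

```python
from urllib.parse import urljoin

# Values copied from the config quoted in this message.
START_URL = "http://www.somewhere.com"
LIMIT_URLS_TO = [START_URL]
EXCLUDE_URLS = ["/cgi-bin/", ".cgi"]


def resolve(base: str, href: str) -> str:
    """Turn an href found on the page `base` into an absolute URL."""
    return urljoin(base, href)


def is_allowed(url: str) -> bool:
    """Assumed filter: the URL must contain a limit pattern and no exclude pattern."""
    within_limit = any(pattern in url for pattern in LIMIT_URLS_TO)
    excluded = any(pattern in url for pattern in EXCLUDE_URLS)
    return within_limit and not excluded


# The href from the front page, resolved against the start URL.
absolute = resolve(START_URL, "/MyDir/index.html")
print(absolute, is_allowed(absolute))
# prints: http://www.somewhere.com/MyDir/index.html True
```

If htdig works along these lines, the resolved URL already contains the
start_url string, so nothing extra should be needed in limit_urls_to.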
Thanks in advance!
Gabriel Fenteany wrote:
> Hmmm. I have links in tables (and for that matter in pull-down menus too)
> that are followed and indexed just fine. (ht://Dig is the greatest thing
> since sliced bread!) Are the linked files in the same or a sub-directory
> to the index file entry page? If not (if they do not contain all the URL
> strings of the start_url) then the domain paths have to be defined in
> limit_urls_to also; you can include a list of URLs of arbitrary length in
> both limit_urls_to and start_url, each item separated by a whitespace. Do
> you have a robots.txt file or do you use robots metatags? Maybe you
> inadvertently excluded certain files or directories?
>
> I'd start a completely new dig with htdig -i (and -a if you want alternate
> work files used) after checking these things and then see again. If that
> fails, I'd explicitly add the page to start_url and see what happens. The
> more info the better.
>
> Good luck.
>
> Gabriel
>
> >
> >
> > Hi gang - this one is absolutely making me crazy. I'm not sure if this
> > has come up on the list before, but I have no other options at this
> > point:
> >
> > I am trying to index a site that has around 800 or so documents. For
> > some reason, htdig fails to see the links.
> >
> > My .conf file resembles:
> >
> > start_url: http://www.somewhere.com
> > limit_urls_to: ${start_url}
> > exclude_urls: /cgi-bin/ .cgi
> >
> > Okay, so far, no voodoo there. I have this one HTML file that has a
> > large table in it, about 70 x 3 (200+ cells). Yep, you guessed it, htdig
> > fails to see ANY of the links in the table. According to my logs, it
> > never even retrieves it... (i.e. there is no "Retrieval command for
> > http://whatever" for this particular file).
> >
> > The file is definitely linked within the site - off the front page for
> > that matter (as well as a few other places). The log files show that it
> > sees the actual link TO the file (from other files), but it never
> > attempts to retrieve it.... :-(
> >
> > Any ideas on this one? I'm about to take my own life - hehe.
> >
> > Cheers.
> > Scott
> >
> > ------------------------------------
> > To unsubscribe from the htdig mailing list, send a message to
> > [EMAIL PROTECTED] containing the single word "unsubscribe" in
> > the SUBJECT of the message.
> >
>
> --
> Gabriel Fenteany, Ph.D.
> Post-doctoral Fellow &
> WWW VL: Cell Biology Maintainer
> http://vl.bwh.harvard.edu