Re: [htdig] my rundig -vvvv results (help?) w/urls

Gilles Detillieux Wed, 05 Dec 2001 14:04:27 -0800

According to [EMAIL PROTECTED]:
> My start urls will be various sites belonging to separate users but 
> on the same server... e.g.,
> http://my.school.edu/~user1  (that will get the index page from the 
> stus public_html folder in that home dir)
> http://my.school.edu/~user2, http://my.school.edu/~user3 and so on, 
> up to 50 users.
> 
> I do NOT have access permissions to the server except to those public 
> http pages and to run my htdig which is completely installed in my 
> home dir with htsearch in my cgi-bin and using cgi-wrap to run htdig.
> So, correct me if I am wrong but I have to access/index the sites by 
> http (i think that's what you guys have called it), i.e., i can't set 
> it up, say, just one host url or something.
> Am i clear? and am I correct?


I think you're confusing two separate issues here.  One issue is the
transport that's used to get the documents from the server into htdig,
and the other is the means by which htdig will figure out or be told
which documents to get.  Let's look at these two in isolation.

1) With htdig 3.1.5, only http URLs are allowed, and so the only transport
allowed is the HTTP protocol, by which a client (e.g. htdig) requests
documents from a web server over the network.  However, if htdig is
running on the same server as the HTTP server, and you know how http
URLs map onto directories on this server, you can bypass the HTTP server
using the local_urls and local_user_urls attributes and get files directly
from the filesystem.  You're still using http URLs, but this local_urls
mechanism allows htdig to side-step the HTTP server for static files,
which speeds things up.

The 3.2 betas complicate this a little bit, because they support other
transports as well, such as file:// URLs, news: URLs, and with an external
transport defined, ftp:// URLs too.  However, the local_urls mechanism
still works the same way, allowing htdig to side-step these transports
and go to the local filesystem.

Now, with what you're doing, I get the impression that you are indeed
running htdig on the web server, even though you don't have complete
access to it.  If that's the case, you may still be able to define
local_user_urls to get at the web pages directly, provided you know
where the user directories are, and they use a consistent location for
home directories on this server.  All you need is read access to the
users' web pages, which you ought to have, since web pages normally
tend to be world-readable.  So, even though you're using http URLs, you
might not have to use the HTTP server much to get at the files, as long
as the files fit the restrictions that local_urls handling imposes
(read the docs).  However, htdig will fall back to HTTP if the local
fetching fails, so either way it should be able to get the pages.
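
As a rough sketch of what that could look like in htdig.conf: the
/home/ prefix below is only a guess at where home directories live on
your server, and public_html comes from your description.  Check the
exact syntax against the local_user_urls entry in the attribute docs.

```
# Hypothetical mapping: http://my.school.edu/~user1/page.html
# would be read from /home/user1/public_html/page.html
local_user_urls:  http://my.school.edu/=/home/,/public_html/
```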

2) There are many means of getting URLs into htdig, and the ones that
are most appropriate for you depend on what you're indexing, and how
these pages are linked.  This is covered at fairly great length in
the FAQ, especially questions 5.25 and 5.18, but also in other related
questions (follow the links).

This is independent of which transport htdig uses, but depends on
how pages are linked to each other.  The more "coverage" you have in
links from one page to another, the fewer individual pages you have
to feed into htdig's start_url.  On a well constructed site, you
should only have to give htdig the site's main page as the start_url,
and it'll find everything from there.  Because you're indexing user
pages, which frequently aren't all listed in a central index page,
and not necessarily all that well crosslinked, you'll likely need to
do more.  At the very least, you'd probably need to list each user's
home page in start_url, and then let htdig spider its way down to
other pages linked from their home pages.
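
In htdig.conf, that might look something like the following sketch.
The user names are the placeholders from your example, and the trailing
slashes are my assumption about how the server serves these
directories.

```
# One entry per user home page; a backslash continues the line.
start_url:      http://my.school.edu/~user1/ \
                http://my.school.edu/~user2/ \
                http://my.school.edu/~user3/
# limit_urls_to defaults to ${start_url}, so the dig stays within
# these users' pages unless you override it.
limit_urls_to:  ${start_url}
```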

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
