On Wednesday, June 25, 2003, at 04:19 PM, Dan Muey wrote:

I'll try to explain better what I was asking:

Say I have htdig_one.conf
start_url: http://www.mydomain.com/

and http://www.mydomain.com/robots.txt has:

Disallow: /members/

Then http://www.mydomain.com/members/ will not get spidered/indexed into the database for htdig_one.conf


Ok pretty standard and simple. Now the question:

I want to set up a separate database for http://www.mydomain.com/members/ so I do this:
(I realize the data is still accessible, so the separate
database doesn't secure the data; I simply need the data separated.)


        htdig_two.conf
        start_url: http://www.mydomain.com/members/
                that will create the db ...
        
        but http://www.mydomain.com/robots.txt still has:
        
        Disallow: /members/

in it.

So will htdig_two.conf still be able to spider/index http://www.mydomain.com/members/,
or will the http://www.mydomain.com/robots.txt file stop htdig in its tracks in this case?

As described, htdig will not index /members/. It always checks for /robots.txt and respects any stated exclusions. However, the robots exclusion protocol allows you to specify disallows on an agent-by-agent basis. So if you use the user_agent attribute to specify a different agent name in the second configuration file, you can then define a robots.txt file that allows the agent you defined access to the members directory. I don't recall the exact syntax for the robots.txt file, but if you check a tutorial on the subject you should find that it is pretty straightforward. The user_agent attribute is described at http://www.htdig.org/attrs.html#user_agent.
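
For illustration, here is a rough sketch of what such a robots.txt might look like, checked with Python's standard urllib.robotparser rather than with htdig itself. The agent name "htdig-members" is hypothetical (it would be whatever you set with user_agent in htdig_two.conf), and "htdig" is assumed to be the agent name the first configuration presents. htdig's own robots.txt matching may differ in detail from Python's parser, so treat this only as a sanity check of the syntax.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent name for the second configuration. It would be set in
# htdig_two.conf with a line such as:  user_agent: htdig-members
MEMBERS_AGENT = "htdig-members"

# Assumed agent name presented by the first configuration.
DEFAULT_AGENT = "htdig"

# Candidate robots.txt: the named agent gets an empty Disallow (nothing is
# off limits), while every other agent is kept out of /members/.
ROBOTS_TXT = """\
User-agent: htdig-members
Disallow:

User-agent: *
Disallow: /members/
"""

MEMBERS_URL = "http://www.mydomain.com/members/"
ROOT_URL = "http://www.mydomain.com/"


def allowed(agent, url):
    """Return True if ROBOTS_TXT lets the given agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in (MEMBERS_AGENT, DEFAULT_AGENT):
        for url in (ROOT_URL, MEMBERS_URL):
            print(agent, url, allowed(agent, url))
```

With a file like this, the second configuration's agent should be let into /members/ while the first configuration (and any other crawler) is still kept out.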


If you are simply trying to exclude /members/ from one database and are not really concerned about what other crawlers are doing, then the easiest thing would probably be to use the exclude_urls attribute to drop URLs that contain /members/ in the first database.

Jim


