RE: [htdig] Searching cgi's plus

Dan Muey Mon, 30 Jun 2003 10:27:56 -0700

Thanks for helping clarify that for me. I'll look 
into the user_agent/allow/disallow stuff.


Thanks for the pointers!

Dan

> On Wednesday, June 25, 2003, at 04:19 PM, Dan Muey wrote:
> 
> > I'll try to explain better what I was asking:
> >
> > Say I have htdig_one.conf
> >     start_url: http://www.mydomain.com/
> >     
> >     and http://www.mydomain.com/robots.txt has:
> >     
> >     Disallow: /members/
> >     
> > Then http://www.mydomain.com/members/ will not get spidered/indexed
> > into the database for htdig_one.conf
> >
> > Ok pretty standard and simple. Now the question:
> >
> > I want to set up a separate database for
> > http://www.mydomain.com/members/ so I do this:
> >     (I realize the data is still accessible, so the separate
> >     database doesn't secure the data; I simply need the data separated.)
> >
> >     htdig_two.conf
> >     start_url: http://www.mydomain.com/members/
> >             that will create the db ...
> >     
> >     but http://www.mydomain.com/robots.txt still has:
> >     
> >     Disallow: /members/
> >
> >     in it.
> >
> > So will htdig_two.conf still be able to spider/index
> > http://www.mydomain.com/members/
> > Or will the http://www.mydomain.com/robots.txt file stop htdig in its
> > tracks in this case?
> 
> As described, htdig will not index /members/. It always checks for
> /robots.txt and respects any stated exclusions. However, the robots
> exclusion protocol allows you to specify disallows on an agent-by-agent
> basis. So if you use the user_agent attribute to specify a
> different agent name in the second configuration file, you can then
> define a robots.txt file that allows the agent you defined access to
> the members directory. I don't recall the exact syntax for the
> robots.txt file, but if you check a tutorial on the subject you should
> find that it is pretty straightforward. The user_agent attribute is
> described at http://www.htdig.org/attrs.html#user_agent.
> 
> If you are simply trying to exclude /members/ from one database and are
> not really concerned about what other crawlers are doing, then the
> easiest thing would probably be to use the exclude_urls attribute to 
> drop URLs that contain /members/ in the first database.
> 
> Jim
> 
> 
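
For anyone finding this thread later, here is a rough sketch of the
per-agent robots.txt idea Jim describes, checked with Python's standard
urllib.robotparser rather than with htdig itself. The agent name
"membersbot" and the helper allowed() are made up for illustration; the
second config would need to present that same name to the server. htdig's
attribute reference also lists a robotstxt_name attribute, so it is worth
confirming which attribute your version matches against robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "membersbot" is an invented agent name for the
# second configuration; every other crawler is kept out of /members/.
ROBOTS_TXT = """\
User-agent: membersbot
Disallow:

User-agent: *
Disallow: /members/
"""


def allowed(agent: str, url: str) -> bool:
    """Return True if the rules above let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    members = "http://www.mydomain.com/members/"
    for agent in ("htdig", "membersbot"):
        print(agent, allowed(agent, members))
```

An empty "Disallow:" line means nothing is off limits for that agent, so
only the catch-all record blocks /members/. Python's matching rules may
differ in detail from htdig's, so treat this as a sanity check on the
robots.txt layout, not proof of how htdig will behave.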

