On Fri, 6 Apr 2001, Kir Kolyshkin wrote:
> This can be done relatively easily by adding a "servers" SQL table. Before
> indexing a URL, it would be searched for in that table (by substring). This is
> pretty simple; I hope you've got the idea.
>
> Any volunteer to implement this? :)
I thought I already volunteered ;)
Matt.
--
> Matt Sullivan wrote:
> >
> > Hi,
> >
> > Has anyone thought about the capability of restricting indexing to URL trees?
> >
> > One of the problems I currently have with ASPSeek as it stands is that given an
> > initial URL of say "http://xyz/tree/" the database is seeded with "http://xyz/"
> > in table "sites" and "http://xyz/tree/" in table "urlword". Now if a page
> > beneath "http://xyz/tree/" refers to a page above "http://xyz/tree/" then the
> > whole site may be indexed which is not necessarily a good thing.
> >
> > Here is an example of this in action:
> >
> > Clear the db:
> >
> > io:/root 4:02pm # index -C
> > Loading configuration from /etc/aspseek/charsets.conf
> > Loading configuration from /etc/aspseek/stopwords.conf
> > Loading configuration from /etc/aspseek/aspseek.conf
> > You are going to delete database 'aspseek' content
> > Are you sure?(YES/no)YES
> > Clearing files [..................................................] done.
> > Clearing the SQL database ... done.
> >
> > Seed the db:
> >
> > io:/root 4:06pm # echo "http://www.telerama.com/about/employment.php3" | index -i -f -
> > Loading configuration from /etc/aspseek/charsets.conf
> > Loading configuration from /etc/aspseek/stopwords.conf
> > Loading configuration from /etc/aspseek/aspseek.conf
> > index process finished.
> >
> > Index:
> >
> > io:/root 4:06pm # index
> > Loading configuration from /etc/aspseek/charsets.conf
> > Loading configuration from /etc/aspseek/stopwords.conf
> > Loading configuration from /etc/aspseek/aspseek.conf
> > Adding URL: http://www.telerama.com/about/employment.php3
> > Adding URL: http://www.telerama.com/ssi/telerama.css <-----+
> > Adding URL: http://www.telerama.com/ <------------|- bugger!
> > Adding URL: http://www.telerama.com/freetrial
> > Adding URL: http://www.telerama.com/services/
> > Adding URL: http://www.telerama.com/members/
> > Adding URL: http://www.telerama.com/helpdesk/
> > Adding URL: http://www.telerama.com/about/
> > Adding URL: http://www.telerama.com/search/
> > ...
> >
> > What I wanted to happen above was for the indexing to be restricted to only
> > pages beneath "http://www.telerama.com/about/".
> >
> > Why is this not always a good thing? Well, not everyone's concept of a web site
> > is "http://xyz/" - for example, Geocities users' sites are subdirectories of the
> > main Geocities site. If we were to accept a submission of a URL from a
> > Geocities user which resulted in *all* of Geocities being indexed, we could have
> > a problem :) BTW Geocities isn't actually a problem due to the way they place
> > their advertising on each user's page - but it could be - I think it is a good
> > example of what I am driving at.
> >
> > For those users using ASPSeek to index single small sites I doubt this
> > is a big issue - you could get round it with URL masks etc. - however, on a
> > larger scale (50,000+ unique domain names) this becomes a real problem to
> > manage.
> >
> > What I believe is needed is an additional column in table "urlword", say
> > "tree_id", which would contain the "url_id" of the initial URL. Each URL
> > encountered during indexing from the initial URL would carry this tree id. Part
> > of the decision-making process when adding encountered URLs would then be to
> > check the new URL against its tree id URL to ensure that the new URL
> > does not branch out of the current tree.
> >
> > This could quite easily be an optional feature, e.g. URLs seeded with tree_id
> > set to 0 would be allowed to branch, while those seeded with tree_id set to
> > their own url_id would honour the tree concept.
> >
> > I'm quite happy to do the math and provide a patch for this - I need the SQL
> > table modifications to be supported however.
> >
> > Is there interest?
> >
> > Matt.
>
> -- [EMAIL PROTECTED] http://kir.sever.net ICQ 7551596 --
> If you can't stand the heat, sit down or leave the sauna
> Now listening to Scooter "I'm Your Pusher"
>
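
For illustration, the tree_id check proposed in the quoted message could look
roughly like the sketch below. This is not ASPSeek code. The function names
(tree_root, within_tree, should_add) and the in-memory "seeds" mapping of
url_id to seed URL are hypothetical stand-ins for whatever the indexer and the
"urlword" table would really provide. It also assumes that the tree is the
directory containing the seed URL, and it does not normalise "../" segments.

```python
from urllib.parse import urlsplit


def tree_root(seed_url):
    """Return (scheme, host, directory) for the tree a seed URL belongs to."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    # Keep everything up to and including the last "/", so that
    # "/about/employment.php3" yields the directory "/about/".
    directory = path[: path.rfind("/") + 1]
    return parts.scheme.lower(), parts.netloc.lower(), directory


def within_tree(seed_url, candidate_url):
    """True if candidate_url lies at or beneath the seed URL's directory."""
    scheme, host, directory = tree_root(seed_url)
    cand = urlsplit(candidate_url)
    cand_path = cand.path or "/"
    return (
        cand.scheme.lower() == scheme
        and cand.netloc.lower() == host
        and cand_path.startswith(directory)
    )


def should_add(tree_id, seeds, candidate_url):
    """Decide whether a URL found during indexing may be added.

    tree_id == 0 means the seed was allowed to branch freely; any other
    value is the (hypothetical) url_id of the seed that defines the tree.
    """
    if tree_id == 0:
        return True
    return within_tree(seeds[tree_id], candidate_url)
```

With the seed from the example run, "http://www.telerama.com/about/" would be
accepted, while "http://www.telerama.com/" and
"http://www.telerama.com/ssi/telerama.css" would be rejected. A seed stored
with tree_id 0 would keep today's behaviour.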