This can be done relatively easily by adding a "servers" SQL table. Before a
URL is indexed, it would be looked up in that table (substring match). This is
pretty simple; I hope you get the idea.
Any volunteers to implement this? :)
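
To make the idea concrete, here is a rough sketch in Python rather than
ASPSeek's own code. Everything in it is an assumption for illustration: the
column name "url", the helper names make_db and is_allowed, the use of SQLite
in place of the real SQL backend, and reading "substring" as "the stored entry
is a prefix of the URL".

```python
import sqlite3


def make_db(prefixes):
    """Create a throwaway in-memory DB with a hypothetical "servers" table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE servers (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO servers (url) VALUES (?)", [(p,) for p in prefixes]
    )
    return conn


def is_allowed(conn, url):
    """Return True if some row in "servers" is a prefix of the given URL."""
    row = conn.execute(
        # Compare the leading part of the candidate URL with each stored entry.
        "SELECT 1 FROM servers WHERE substr(?, 1, length(url)) = url LIMIT 1",
        (url,),
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = make_db(["http://www.telerama.com/about/"])
    for candidate in (
        "http://www.telerama.com/about/employment.php3",
        "http://www.telerama.com/",
    ):
        print(candidate, is_allowed(conn, candidate))
```

The indexer would call something like is_allowed() before queueing each
newly found URL and skip the ones that fail the check.
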
Matt Sullivan wrote:
>
> Hi,
>
> Has anyone thought about the capability of restricting indexing to URL trees?
>
> One of the problems I currently have with ASPSeek as it stands is that given an
> initial URL of say "http://xyz/tree/" the database is seeded with "http://xyz/"
> in table "sites" and "http://xyz/tree/" in table "urlword". Now if a page
> beneath "http://xyz/tree/" refers to a page above "http://xyz/tree/" then the
> whole site may be indexed which is not necessarily a good thing.
>
> Here is an example of this in action:
>
> Clear the db:
>
> io:/root 4:02pm # index -C
> Loading configuration from /etc/aspseek/charsets.conf
> Loading configuration from /etc/aspseek/stopwords.conf
> Loading configuration from /etc/aspseek/aspseek.conf
> You are going to delete database 'aspseek' content
> Are you sure?(YES/no)YES
> Clearing files [..................................................] done.
> Clearing the SQL database ... done.
>
> Seed the db:
>
> io:/root 4:06pm # echo "http://www.telerama.com/about/employment.php3" | index -i -f -
> Loading configuration from /etc/aspseek/charsets.conf
> Loading configuration from /etc/aspseek/stopwords.conf
> Loading configuration from /etc/aspseek/aspseek.conf
> index process finished.
>
> Index:
>
> io:/root 4:06pm # index
> Loading configuration from /etc/aspseek/charsets.conf
> Loading configuration from /etc/aspseek/stopwords.conf
> Loading configuration from /etc/aspseek/aspseek.conf
> Adding URL: http://www.telerama.com/about/employment.php3
> Adding URL: http://www.telerama.com/ssi/telerama.css <-----+
> Adding URL: http://www.telerama.com/ <------------|- bugger!
> Adding URL: http://www.telerama.com/freetrial
> Adding URL: http://www.telerama.com/services/
> Adding URL: http://www.telerama.com/members/
> Adding URL: http://www.telerama.com/helpdesk/
> Adding URL: http://www.telerama.com/about/
> Adding URL: http://www.telerama.com/search/
> ...
>
> What I wanted to happen above was for the indexing to be restricted to only
> pages beneath "http://www.telerama.com/about/".
>
> Why is this not always a good thing? Well, not everyone's concept of a web site
> is "http://xyz/" - for example, Geocities users' sites are subdirectories of the
> main Geocities site. If we were to accept a submission of a URL from a
> Geocities user which resulted in *all* of Geocities being indexed, we could have
> a problem :) BTW Geocities isn't actually a problem due to the way they place
> their advertising on each user's page - but it could be - I think it is a good
> example of what I am driving at.
>
> For those using ASPSeek to index single small sites I doubt this is a big
> issue - you could get round it with URL masks etc. - however, on a larger
> scale (50,000+ unique domain names) this becomes a real problem to manage.
>
> What I believe is needed is an additional column in table "urlword", say,
> "tree_id", which would contain the "url_id" of the initial URL. Each URL
> encountered during indexing from the initial URL would carry this tree id. Part
> of the decision-making process when adding encountered URLs would then be to
> check the new URL against the URL of its tree id, to ensure that the new URL
> does not branch out of the current tree.
>
> This could quite easily be an optional feature, e.g. URLs seeded with tree_id
> set to 0 would be allowed to branch, while those seeded with tree_id set to
> their own url_id would honour the tree concept.
>
> I'm quite happy to do the math and provide a patch for this - I need the SQL
> table modifications to be supported however.
>
> Is there interest?
>
> Matt.
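
For comparison, the tree_id check described above could look roughly like the
sketch below. It is only an illustration under assumptions: a plain dict
stands in for the "urlword" table, tree_prefix and may_add are made-up names,
and the rule that the tree root is the directory part of the seed URL is
inferred from the employment.php3 example (where the wanted tree was
"http://www.telerama.com/about/").

```python
from urllib.parse import urlsplit


def tree_prefix(seed_url):
    """Return the directory part of the seed URL, used as the tree root."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    directory = path[: path.rfind("/") + 1]  # keep everything up to the last "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def may_add(new_url, tree_id, urls):
    """Decide whether new_url may be added under the given tree_id.

    urls maps url_id -> seed URL (a stand-in for table "urlword").
    tree_id 0 means the URL is free to branch anywhere.
    """
    if tree_id == 0:
        return True
    return new_url.startswith(tree_prefix(urls[tree_id]))


if __name__ == "__main__":
    urls = {1: "http://www.telerama.com/about/employment.php3"}
    for candidate in (
        "http://www.telerama.com/about/",
        "http://www.telerama.com/ssi/telerama.css",
    ):
        print(candidate, may_add(candidate, 1, urls))
```

With a seed carrying tree_id 0 the check is skipped entirely, which keeps the
current behaviour for anyone who does not want the restriction.
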
-- [EMAIL PROTECTED] http://kir.sever.net ICQ 7551596 --
If you can't stand the heat, sit down or leave the sauna
Now listening to Scooter "I'm Your Pusher"