Hi,

Has anyone thought about the capability of restricting indexing to URL trees? 

One of the problems I currently have with ASPSeek as it stands is that, given an
initial URL of, say, "http://xyz/tree/", the database is seeded with "http://xyz/"
in table "sites" and "http://xyz/tree/" in table "urlword".  If a page
beneath "http://xyz/tree/" then links to a page above "http://xyz/tree/", the
whole site may be indexed, which is not necessarily a good thing.

Here is an example of this in action: 

Clear the db:

io:/root 4:02pm # index -C
Loading configuration from /etc/aspseek/charsets.conf
Loading configuration from /etc/aspseek/stopwords.conf
Loading configuration from /etc/aspseek/aspseek.conf
You are going to delete database 'aspseek' content
Are you sure?(YES/no)YES
Clearing files [..................................................] done.
Clearing the SQL database ... done.

Seed the db:

io:/root 4:06pm # echo "http://www.telerama.com/about/employment.php3" | index -i -f -
Loading configuration from /etc/aspseek/charsets.conf
Loading configuration from /etc/aspseek/stopwords.conf
Loading configuration from /etc/aspseek/aspseek.conf
index process finished.

Index:

io:/root 4:06pm # index 
Loading configuration from /etc/aspseek/charsets.conf
Loading configuration from /etc/aspseek/stopwords.conf
Loading configuration from /etc/aspseek/aspseek.conf
Adding URL: http://www.telerama.com/about/employment.php3
Adding URL: http://www.telerama.com/ssi/telerama.css  <-----+
Adding URL: http://www.telerama.com/           <------------|- bugger!
Adding URL: http://www.telerama.com/freetrial
Adding URL: http://www.telerama.com/services/
Adding URL: http://www.telerama.com/members/
Adding URL: http://www.telerama.com/helpdesk/
Adding URL: http://www.telerama.com/about/
Adding URL: http://www.telerama.com/search/
...

What I wanted to happen above was for the indexing to be restricted to only
pages beneath "http://www.telerama.com/about/".

Why is this not always a good thing?  Well, not everyone's concept of a web site
is "http://xyz/" - for example, Geocities users' sites are subdirectories of the
main Geocities site.  If we were to accept a submission of an URL from a
Geocities user which resulted in *all* of Geocities being indexed, we could have
a problem :)  BTW, Geocities isn't actually a problem due to the way they place
their advertising on each user's page, but it could be, and I think it is a good
example of what I am driving at.

For those using ASPSeek to index single small sites I doubt this
is a big issue - you could get round it with URL masks etc. - however, on a
larger scale (50,000+ unique domain names) this becomes a real problem to
manage.

What I believe is needed is an additional column in table "urlword", say
"tree_id", which would contain the "url_id" of the initial URL.  Each URL
encountered during indexing from the initial URL would carry this tree id. Part
of the decision-making process when adding encountered URLs would then be to
check the new URL against its tree id URL to ensure that the new URL
does not branch out of the current tree.

This could quite easily be an optional feature, e.g. URLs seeded with tree_id
set to 0 would be allowed to branch, while those seeded with tree_id set to
their own url_id would honour the tree concept.
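
To make the idea concrete, here is a rough sketch of the check in Python. It is
an illustration only, not ASPSeek code: the function names and the url_by_id
mapping (standing in for a lookup of "url_id" in "urlword") are made up for the
example. It assumes URLs are already absolute and normalised, and that a seed
which does not end in "/" names a file, so its tree is the containing directory.

```python
import posixpath
from urllib.parse import urlsplit


def tree_prefix(seed_url):
    """Return (scheme, host, directory path ending in '/') for a seed URL."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    # A seed such as /about/employment.php3 belongs to the tree /about/
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return parts.scheme.lower(), parts.netloc.lower(), path


def in_tree(candidate_url, seed_url):
    """True if candidate_url is on the same host and beneath the seed's tree."""
    cand = urlsplit(candidate_url)
    scheme, host, prefix = tree_prefix(seed_url)
    same_site = (cand.scheme.lower(), cand.netloc.lower()) == (scheme, host)
    return same_site and (cand.path or "/").startswith(prefix)


def should_add(candidate_url, tree_id, url_by_id):
    """Decide whether a newly encountered URL may be added.

    tree_id == 0 means the seed was allowed to branch (current behaviour);
    otherwise tree_id is the url_id of the seed that defines the tree.
    """
    if tree_id == 0:
        return True
    return in_tree(candidate_url, url_by_id[tree_id])
```

With the seed from the example above stored as url_id 1,
should_add("http://www.telerama.com/", 1, {1: seed}) would reject the site
root, while the same call with tree_id 0 would accept it as happens today.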

I'm quite happy to do the math and provide a patch for this; however, I would
need the SQL table modifications to be supported.

Is there interest? 


Matt.
