I'm trying to restrict the found documents to the ones in a particular directory. Our aspseek search engine is at http://www.jhuccp.org/cgi-bin/s.cgi.
If you enter a search term like 'advocacy', you should get a return of about 424 documents. To do this, aspseek uses this URL:

    http://www.jhuccp.org/cgi-bin/s.cgi?q=advocacy&cs=&ps=20&o=0

We want to limit the found documents to the ones that have 'advocacy' in them in the /popreporter/ directory. To do this, I created this record in MySQL:

    www:/usr/local/aspseek/etc# mysql -u aspseek12 -p aspseek12
    mysql> select * from subsets;
    +-----------+-------------------------------------+
    | subset_id | mask                                |
    +-----------+-------------------------------------+
    |         2 | http://www.jhuccp.org/popreporter/% |
    +-----------+-------------------------------------+

When I run index -B, I get:

    www:/usr/local/aspseek/etc# su - -s /bin/bash aspseek
    aspseek@www:~$ sbin/index -B
    Loading configuration from /usr/local/aspseek/etc/db.conf
    Loading configuration from /usr/local/aspseek/etc/ucharset.conf
    Loading configuration from /usr/local/aspseek/etc/stopwords.conf
    Loading configuration from /usr/local/aspseek/etc/aspseek.conf
    Generating subset http://www.jhuccp.org/popreporter/% ... done (96 URLs)
    index process finished.
    aspseek@www:~$

This seems to indicate that I've got the subset set up correctly. Then, to test this, I manually edit the URL in the browser's location box to:

    http://www.jhuccp.org/cgi-bin/s.cgi?q=advocacy&cs=&ps=20&o=0&ul=http://www.jhuccp.org/popreporter/%

I've tried variations on this, such as putting the URL in quotes, just using '/popreporter/' etc. Still no joy. When I submit it, it returns the same 424 documents as before; no restriction to the /popreporter/ directory is done.
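One variation that may be worth ruling out is percent-encoding: a bare '%' in a query string is itself the URL escape character, so the standard way to send it literally is '%25'. The following is only a sketch of how the encoded test URL could be built. The helper name build_search_url is made up for illustration, and whether s.cgi expects the 'ul' value in this encoded form is an assumption, not something confirmed here.

```python
from urllib.parse import urlencode


def build_search_url(base, query, url_limit):
    """Build a search URL with every parameter percent-encoded.

    Hypothetical helper: the parameter names (q, cs, ps, o, ul) are the
    ones seen in the URLs above; how s.cgi interprets 'ul' is assumed.
    """
    params = {"q": query, "cs": "", "ps": 20, "o": 0, "ul": url_limit}
    # urlencode escapes ':' '/' and '%' in values, so a trailing '%'
    # becomes '%25' rather than being read as the start of an escape.
    return base + "?" + urlencode(params)


url = build_search_url(
    "http://www.jhuccp.org/cgi-bin/s.cgi",
    "advocacy",
    "http://www.jhuccp.org/popreporter/%",
)
print(url)
```

If the encoded URL still returns all 424 documents, the encoding of the '%' is probably not the cause.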
I've read in some of the posts to this list that the subset should be set up without the '%', so I also tried that:

    aspseek@www:~$ mysql -u aspseek12 -p aspseek12
    Enter password:
    mysql> select * from subsets;
    +-----------+------------------------------------+
    | subset_id | mask                               |
    +-----------+------------------------------------+
    |         1 | http://www.jhuccp.org/popreporter/ |
    +-----------+------------------------------------+
    1 row in set (0.00 sec)

Then I run:

    aspseek@www:~$ sbin/index -a -m -u "http://www.jhuccp.org/popreporter/%"
    Loading configuration from /usr/local/aspseek/etc/db.conf
    Loading configuration from /usr/local/aspseek/etc/ucharset.conf
    Loading configuration from /usr/local/aspseek/etc/stopwords.conf
    Loading configuration from /usr/local/aspseek/etc/aspseek.conf
    Adding URL: http://www.jhuccp.org/popreporter/current.shtml
    Adding URL: http://www.jhuccp.org/popreporter/subscribe.shtml
    Adding URL: http://www.jhuccp.org/popreporter/index.shtml
    Adding URL: http://www.jhuccp.org/popreporter/2002/02-25.shtml
    <snip>
    Adding URL: http://www.jhuccp.org/popreporter/2001/06-11.shtml
    Adding URL: http://www.jhuccp.org/popreporter/2001/06-04.shtml
    Saving real-time database ... done.
    Saving delta files [..................................................] done.
    Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
    Saving real-time ... done
    Saving redirects ... done
    Splitting href delta file ... done
    Saving href delta files ... done
    Saving direct href delta files ... done
    Calculating ranks [................................................] done.
    Saving lastmods ... done
    Generating word site ... done
    Generating subset http://www.jhuccp.org/popreporter/ ... done (0 URLs)
    index process finished.
    aspseek@www:~$

The dlog.log says, "Subset http://www.jhuccp.org/ not found". Yet, the index command suggests that it found plenty.

Could someone please set me straight on how this should work? Thank you very much for your help.

-Kevin Zembower
