Subsets

Alexander Barkov Tue, 28 Aug 2001 05:32:08 -0700
  Hello!


A long time ago (Fri, 12 Nov 1999) Eric Mings wrote this message:

 ...whether udmsearch can be modified to allow for 
multiple url tables to store information based upon a user criteria 
(such  as storing different subsets of sites indexed) in a table of 
their choice.

The whole message is here:
http://www.mail-archive.com/udmsearch%40web.izhcom.ru/msg00072.html




Here is an idea for implementing this easily with MySQL using its 
MERGE tables. I tested it with the 3.2.x sources and it works
just fine. It should work with 3.1.x too. 
(These notes will also be added to the documentation.)


The example below uses "single" mode and two subsets. 
Other modes and more subsets are configured 
in the same way.



Installation steps.

1. Create two databases, "www" and "dbms", and create the
standard mnoGoSearch database structure in each of them.

2. Create a database "collection" and create its structure
using MERGE tables (the 3.2.x structure is shown):

CREATE TABLE dict (
  url_id int(11) NOT NULL default '0',
  word varchar(32) NOT NULL default '',
  intag int(11) NOT NULL default '0',
  KEY url_id(url_id),
  KEY word_url(word)
) TYPE=MERGE UNION=(www.dict,dbms.dict);


CREATE TABLE url (
  rec_id int(11) NOT NULL auto_increment,
  status int(11) NOT NULL default '0',
  url char(128) binary NOT NULL default '',
  content_type char(48) NOT NULL default '',
  title char(128) NOT NULL default '',
  txt char(255) NOT NULL default '',
  docsize int(11) NOT NULL default '0',
  next_index_time int(11) NOT NULL default '0',
  last_mod_time int(11) NOT NULL default '0',
  referrer int(11) NOT NULL default '0',
  tag char(16) NOT NULL default '0',
  hops int(11) NOT NULL default '0',
  category char(16) NOT NULL default '',
  keywords char(255) NOT NULL default '',
  description char(100) NOT NULL default '',
  crc32 int(11) NOT NULL default '0',
  lang char(32) NOT NULL default '',
  charset char(32) NOT NULL default '',
  PRIMARY KEY (rec_id),
  UNIQUE KEY url(url),
  KEY key_crc(crc32)
) TYPE=MERGE UNION=(www.url,dbms.url);




3. Create two indexer.conf files. Only the commands related to this task are shown here:

 www.conf:

    UseCRC32URLID yes
    DBAddr mysql://foo:bar@localhost/www/
    Server http://www.apache.org/

 dbms.conf:

    UseCRC32URLID yes
    DBAddr mysql://foo:bar@localhost/dbms/
    Server http://www.apache.org/



See create/mysql/url-raid.txt for an explanation of the
UseCRC32URLID indexer.conf command.


4.  Index both subsets:

     indexer www.conf
     indexer dbms.conf


5. Edit search.htm and set DBAddr to the database you want to search:

DBAddr mysql://foo:bar@localhost/collection/
or
DBAddr mysql://foo:bar@localhost/www/
or
DBAddr mysql://foo:bar@localhost/dbms/


That's all....


Now you can search three databases: "www", "dbms" and "collection".
"www" and "dbms" are the subsets and "collection" is the whole database.
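
Since more subsets are configured in the same way, the UNION list in
step 2 just grows by one table per subset database. A minimal Python
sketch of building that clause for any list of subsets follows. The
helper name merge_union_clause is hypothetical and not part of
mnoGoSearch; the TYPE=MERGE spelling is copied from the statements above.

```python
def merge_union_clause(table, subset_dbs):
    """Build the table options for a MERGE table over `table` in each subset database.

    Hypothetical helper for illustration only; it just formats text and
    does not talk to MySQL.
    """
    if not subset_dbs:
        raise ValueError("at least one subset database is required")
    members = ",".join(f"{db}.{table}" for db in subset_dbs)
    return f"TYPE=MERGE UNION=({members})"


if __name__ == "__main__":
    # The two MERGE tables from step 2, for the subsets "www" and "dbms".
    for table in ("dict", "url"):
        print(table, merge_union_clause(table, ["www", "dbms"]))
```

The returned text replaces the last line of each CREATE TABLE statement
in step 2; the column and key definitions stay the same.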



Advantage of this method
------------------------

Searching a subset is fast.
The search does not need this JOIN between the two tables "dict" and "url"
with a tag condition:

   SELECT <fields> 
   FROM dict,url 
   WHERE 
        dict.url_id=url.rec_id
      AND
        url.tag='xxx' 
      AND 
        dict.word='word';


This query is used instead:

   SELECT <fields> 
   FROM dict 
   WHERE 
        word='word';


At the same time, searching the whole "collection" database
should not be slower than searching the same data from both
subsets stored in a single database without MERGE tables.


Disadvantage
------------
 
Because auto_increment values are independent in each part
of a MERGE table, indexer has to generate unique URL ids 
itself. CRC32 is used for this. It is nearly unique; however, 
according to our tests it gives about 250 non-unique pairs for 
3.5 million unique URLs. As a result, only one URL of a 
pair sharing the same URL_ID will be found.
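
To see which URLs in a list would share an id, something like the
following Python sketch could be used. It assumes the id is a plain
CRC32 over the URL bytes, stored as a signed 32-bit integer; the exact
computation inside indexer may differ. The function names are
hypothetical.

```python
import zlib
from collections import defaultdict


def crc32_url_id(url):
    """Return CRC32 of the URL as a signed 32-bit integer.

    Assumption: the id is CRC32 over the UTF-8 bytes of the URL; this is
    an illustration, not necessarily what indexer computes.
    """
    value = zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF
    # Map the unsigned value into the signed 32-bit range.
    return value - 0x100000000 if value >= 0x80000000 else value


def find_id_collisions(urls):
    """Return {id: [urls]} for every id shared by more than one distinct URL."""
    by_id = defaultdict(set)
    for url in urls:
        by_id[crc32_url_id(url)].add(url)
    return {uid: sorted(group) for uid, group in by_id.items() if len(group) > 1}


if __name__ == "__main__":
    urls = ["http://www.apache.org/", "http://www.apache.org/docs/"]
    for uid, group in find_id_collisions(urls).items():
        print(uid, group)
```

Running such a check over the URL list before indexing would show which
documents would be hidden behind another URL with the same id.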