Hello!

This patch fixes a bug where national characters
in the code range 128-255 were treated as word
separators when searchd is used.

We also forgot to include searchd.conf-dist and
its description (searchd.html) in the distribution;
I'm attaching them too.
Index: src/searchd.c
===================================================================
RCS file: /usr/src/CVS/mnogosearch32/src/searchd.c,v
retrieving revision 1.10
diff -u -r1.10 searchd.c
--- src/searchd.c       2001/09/24 08:13:00     1.10
+++ src/searchd.c       2001/09/26 12:46:03
@@ -105,6 +105,8 @@
        
        Agent=UdmAllocAgent(Conf,0,0);
        Agent->Conf->vars=UdmAllocVarList();
+       UdmAddStrVar(Agent->Conf->vars,"LocalCharset",Agent->Conf->local_charset?Agent->Conf->local_charset:"utf-8",UDM_VARSRC_GLOBAL);
+       UDM_FREE(Agent->Conf->local_charset);
 
        while(!done){
                size_t dlen=0,ndocs,i;
Index: src/searchtool.c
===================================================================
RCS file: /usr/src/CVS/mnogosearch32/src/searchtool.c,v
retrieving revision 1.28
diff -u -r1.28 searchtool.c
--- src/searchtool.c    2001/09/12 12:43:16     1.28
+++ src/searchtool.c    2001/09/26 12:44:51
@@ -500,7 +500,7 @@
        AddLimits(Agent,vars);
        
        Agent->Conf->local_charset=strdup(UdmFindStrVar(Agent->Conf->vars,"LocalCharset","utf-8"));
-       Agent->Conf->browser_charset=strdup(UdmFindStrVar(Agent->Conf->vars,"BrowserCharset","utf-8"));
+       Agent->Conf->browser_charset=strdup(UdmFindStrVar(Agent->Conf->vars,"BrowserCharset",Agent->Conf->local_charset));
        Agent->search_mode=SearchMode(UdmFindStrVar(Agent->Conf->vars,"m",NULL));
 
        /* Now set sections weight factors */
#
# Database parameters
#
DBAddr mysql://foo:bar@localhost/mnogosearch/

#DBMode single
#DBMode multi
#DBMode crc
#DBMode crc-multi
#DBMode cache


# Set non-standard /var directory
# for cache mode and built-in database.
#
#VarDir /mnt/d/mnogosearch/var/


# Load stopwords
# 
#Include stopwords.conf



# Load synonyms. File names are either absolute or 
# relative to /etc directory of mnoGoSearch installation.
#
#Synonym synonym/english.syn
#Synonym synonym/russian.syn



# Ispell files. File names are either absolute or 
# relative to /etc directory of mnoGoSearch installation.
#
#Spell en us-ascii ispell/british.xlg
#Affix en us-ascii ispell/english.aff
#
#Spell ru koi8-r ispell/russian.dict
#Affix ru koi8-r ispell/russian.aff


Chapter . SearchD support

Starting with mnoGoSearch version 3.2, searchd support is available.

. Why use searchd

  • Faster searching when using ISpell and synonyms: the related files are loaded into memory once when searchd is started, while search.cgi loads them before every query.

  • In cache mode, the word index and the web server can be distributed between different machines.

  • Several search databases can be merged.

. Starting searchd

To start using searchd:

  • Copy PREFIX/etc/searchd.conf-dist to searchd.conf.

  • Edit searchd.conf.

  • Add the following command to search.htm:

    SearchdAddr hostname or SearchdAddr hostname:port, e.g.


    SearchdAddr localhost
    or
    SearchdAddr localhost:7500

    The default port is 7003.

  • Start searchd:

    /usr/local/mnogosearch/sbin/searchd &
    

To suppress output to stderr, use the -l option; output will then go through syslog only (unless syslog support was disabled during installation with --disable-syslog). If syslog is disabled, stderr can be redirected to a file:

/usr/local/mnogosearch/sbin/searchd 2>/var/log/searchd.log &

Like indexer, searchd can be given a configuration file as an argument, either as a path relative to the /etc directory of the mnoGoSearch installation:


searchd searchd1.conf

or with an absolute path:

searchd /usr/local/mnogosearch/etc/searchd1.conf

. Merging several databases

Several SearchdAddr commands may be specified in search.htm. In this case search.cgi will send queries via TCP/IP to several searchd instances and merge the results. Version 3.2.0 supports up to 256 databases. The DBMode and the database type (SQL or built-in) may differ between the searchd instances.

search.cgi starts by sending the query to every searchd, triggering parallel searches on all of them. It then waits for the results, merges them, and selects the best matches.

This makes it possible to create a database distributed across several machines. Note that the databases must not intersect, i.e. the same document must not be present in several of the merged databases; otherwise it will appear duplicated in the search results.

. Distributed indexing

Indexing distribution can be done by means of hostname filtering.

Imagine a search engine has to be created, e.g. for the .de domain. The search administrator has 28 machines available, named for example:


a.hostname.de
b.hostname.de
...
...
z.hostname.de

An indexer.conf is created for every machine. E.g. on the machine a.hostname.de:


# For hostnames starting with www:
Realm http://www.a*.de/

# For hostnames without www:
Realm http://a*.de/

Repeat this action for every machine.

Then indexer (one or several processes) is run on every machine, each indexing its own area.

search.cgi is installed on every machine, and the following lines are added to each corresponding template:


SearchdAddr a.hostname.de
SearchdAddr b.hostname.de
....
SearchdAddr z.hostname.de

Thus search.cgi will send parallel queries to every machine and return best results to user.

In the current version each area is indexed independently. If the server http://a.hostname.de/ has a link to the server http://b.hostname.de/, this link will not be transferred from the machine responsible for a to the machine responsible for b.

Since distribution is done by hostname, if one of the machines is not operational, the information from all the web servers indexed on that machine becomes unavailable.

It is planned to implement in future versions communication between "neighbouring" hosts (i.e. the hosts will be able to transfer links to each other), as well as other types of distribution, e.g. by a hash function of the document's URL. One site's pages would then be evenly distributed over all machines of the cluster, so if one machine is unavailable, all sites remain available on the other machines.
