- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name: Maxime
Subject: Re: questions and suggestions on dpsearch

1. After a document is parsed, only its raw text remains (for HTML, for 
example, all tags are stripped). If a value for IndexDocSizeLimit is specified, 
only about IndexDocSizeLimit bytes of this text are indexed (slightly more, so 
that a word is not cut in the middle). MaxDocSize, by contrast, sets the 
maximum size of the raw document (including HTML tags, for example).
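
To illustrate the "a bit more, to not break a word" behaviour, here is a rough 
Python sketch. It is not dpsearch's actual code: the function name 
index_prefix is hypothetical, and treating a limit of 0 as "no limit" is an 
assumption made only for this example.

```python
def index_prefix(text: bytes, limit: int) -> bytes:
    """Return about the first `limit` bytes of already-stripped text,
    extended to the end of the word that straddles the limit."""
    # Assumption for this sketch: a non-positive limit means "no limit".
    if limit <= 0 or len(text) <= limit:
        return text
    end = limit
    # Move past the limit until whitespace, so the last word stays whole.
    while end < len(text) and not text[end:end + 1].isspace():
        end += 1
    return text[:end]
```

With the text b"alpha beta gamma" and a limit of 8, the cut would fall inside 
"beta", so the sketch keeps b"alpha beta" (10 bytes instead of 8).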

2. Yes, cache mode is the most suitable for big databases. It stores the word 
data in compressed form, so less disk space is used and little memory is 
required to load the data from disk. There is no "dictionary" in 
dataparksearch.

3. Yes, use the StopwordFile command to load a list of stopwords; see 
http://www.dataparksearch.org/dpsearch-indexcmd.en.html#stopwordfile_cmd
You can find some stopword files in the etc/stopwords subdirectory of the 
dpsearch distribution.

4. Yes, for example:
NotIndexIf regex body blog

This command disables indexing of any document in which the substring "blog" 
appears anywhere in the body section.
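
The effect of such a rule can be pictured with a small Python sketch. This 
only mimics the idea and is not how dpsearch implements it: the function name 
should_index is hypothetical, and the case-sensitive match is an assumption 
(dpsearch's own case handling may differ).

```python
import re


def should_index(body: str, pattern: str = "blog") -> bool:
    """Return False when the regex matches anywhere in the body text."""
    # re.search scans the whole string, so a plain substring pattern
    # such as "blog" matches wherever it occurs.
    return re.search(pattern, body) is None
```

Here a body such as "my weblog entry" would be skipped, while a body without 
"blog" in it would still be indexed.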
- - - - - - - - - - - - - - - - - - - - - - - - - - - -

Read the full topic here:
http://www.dataparksearch.org/cgi-bin/simpleforum.cgi?fid=02;topic_id=1162208106
