- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Name: Maxime
Subject: Re: questions and suggestions on dpsearch
1. After a document is parsed, only raw text remains (for HTML, for example, all tags are stripped). If a value for IndexDocSizeLimit is specified, only about that many bytes of this text are indexed (slightly more, so that a word is not cut in the middle). MaxDocSize sets the maximum size of the raw document (including HTML tags, for example).

2. Yes, cache mode is the most suitable for big databases. It stores the word data in compressed form, so less space is used on disk and little memory is required to load the data from disk. There is no "dictionary" thing in dataparksearch.

3. Yes, use the StopwordFile command to load a list of stopwords, see http://www.dataparksearch.org/dpsearch-indexcmd.en.html#stopwordfile_cmd
You may find some stopword files in the etc/stopwords subdirectory of the dpsearch distribution.

4. Yes, for example:

    NotIndexIf regex body blog

This command disables indexing of any document in which the substring "blog" appears anywhere in the body section.

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
Read the full topic here: http://www.dataparksearch.org/cgi-bin/simpleforum.cgi?fid=02;topic_id=1162208106
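
To illustrate the "a bit more, to not break a word" behaviour from point 1, here is a minimal Python sketch. It is not dpsearch's actual code: the function name truncate_at_word_boundary is made up for this example, words are assumed to be separated by ASCII whitespace, and a limit of 0 is assumed to mean "no limit set".

```python
def truncate_at_word_boundary(text: bytes, limit: int) -> bytes:
    """Keep roughly the first `limit` bytes of already-stripped text.

    Hypothetical helper, for illustration only. If the cut would land
    inside a word, the cut is moved forward to the end of that word,
    so the result may be slightly longer than `limit`.
    """
    # Assumption: limit <= 0 stands for "no limit specified".
    if limit <= 0 or len(text) <= limit:
        return text
    end = limit
    # Advance past the limit until the next whitespace byte (or end of text).
    while end < len(text) and not text[end:end + 1].isspace():
        end += 1
    return text[:end]


sample = b"alpha beta gamma"
print(truncate_at_word_boundary(sample, 8))  # b'alpha beta'
```

With a limit of 8 the cut falls inside "beta", so the sketch keeps the whole word and returns 10 bytes instead of 8. The MaxDocSize check is a separate, earlier step applied to the raw document and is not modelled here.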
