Greetings all,
Attached is a patch for field-restricted searches. Could people
please test it before I commit it, especially anyone who has an
external parser and can test the new handling of <meta...> tags?
Questions that arose from doing this:
1. Could/should the tests for '|| (t >= 161 && t <= 255)' be made part
of HtIsWordChar? I assume they are for accented letters.
2. Should 'prefix_match_character' really be a string, not a char?
3. What (if anything) should I do with the 'author' field, other than
put the words in the word list?
4. This *doesn't* allow field-restricted parenthesised queries like
'Harry and Potter and title:(fan and club)'. Is that OK?
5. Should 'exact:' inhibit fuzzy rules, as the comment in htsearch.cc
suggests? If not, what should it do?
6. I've made some annotations to STATUS. Could someone else please
check these too, and delete the entries if they are really resolved?
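To make question 1 concrete, here is a rough sketch of folding the range test into a single predicate. It is not the real HtIsWordChar: the name isWordCharSketch and its "extra" parameter are hypothetical, and it assumes 8-bit ISO-8859-1 input and the default "C" locale, so that isalnum only accepts ASCII letters and digits.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for HtIsWordChar (not the real ht://Dig function).
// 'extra' plays the role of any additional configured word characters.
static bool isWordCharSketch(unsigned char t, const std::string& extra = "")
{
    // ASCII letters and digits (assuming the default "C" locale).
    if (std::isalnum(t))
        return true;
    // The range test from question 1, assumed to cover the upper half
    // of ISO-8859-1 (accented letters plus some symbols).
    if (t >= 161 && t <= 255)
        return true;
    // Any extra characters the caller wants treated as part of a word.
    return extra.find(static_cast<char>(t)) != std::string::npos;
}
```

Callers could then drop their own '|| (t >= 161 && t <= 255)' clauses, at the cost of also accepting the non-letter symbols in that range everywhere the predicate is used.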
Thanks,
Lachlan
diff -rc -xCVS cvs/htdig/STATUS profile/STATUS
*** cvs/htdig/STATUS Tue Jan 21 09:29:05 2003
--- profile/STATUS Sun Feb 9 09:38:08 2003
***************
*** 1,7 ****
STATUS of ht://Dig branch 3-2-x
RELEASES:
! 3.2.0b5: Next release, tentatively 1 Feb 2003.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
--- 1,7 ----
STATUS of ht://Dig branch 3-2-x
RELEASES:
! 3.2.0b5: Next release, First quarter 2003???
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
***************
*** 22,27 ****
--- 22,28 ----
so this must be some sort of weird htsearch bug) PR#618737.
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
+ Can anyone reproduce this? I can't! -- Lachlan
PENDING PATCHES (available but need work):
* Additional support for Win32.
***************
*** 29,35 ****
* Mifluz merge.
NEEDED FEATURES:
- * Field-restricted searching. (e.g. PR#460833)
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
--- 30,35 ----
***************
*** 45,52 ****
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
- * Add thorough documentation on htsearch restrict/exclude behavior
- (including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
--- 45,50 ----
***************
*** 60,65 ****
--- 58,64 ----
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
+ I've tried. Someone "official" please check and remove this -- Lachlan
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
diff -rc -xCVS cvs/htdig/htcommon/DocumentRef.h profile/htcommon/DocumentRef.h
*** cvs/htdig/htcommon/DocumentRef.h Sat Feb 2 09:49:28 2002
--- profile/htcommon/DocumentRef.h Thu Feb 6 23:35:42 2003
***************
*** 54,59 ****
--- 54,60 ----
char *DocURL() {return docURL;}
time_t DocTime() {return docTime;}
char *DocTitle() {return docTitle;}
+ char *DocAuthor() {return docAuthor;}
char *DocHead() {return docHead;}
int DocHeadIsSet() {return docHeadIsSet;}
char *DocMetaDsc() {return docMetaDsc;}
***************
*** 76,81 ****
--- 77,83 ----
void DocURL(const char *u) {docURL = u;}
void DocTime(time_t t) {docTime = t;}
void DocTitle(const char *t) {docTitle = t;}
+ void DocAuthor(const char *a) {docAuthor = a;}
void DocHead(const char *h) {docHeadIsSet = 1; docHead = h;}
void DocMetaDsc(const char *md) {docMetaDsc = md;}
void DocAccessed(time_t t) {docAccessed = t;}
***************
*** 121,126 ****
--- 123,130 ----
String docMetaDsc;
// This is the title of the document.
String docTitle;
+ // This is the author of the document, as specified in meta information
+ String docAuthor;
// This is a list of Strings, the text of links pointing to this document.
// (e.g. <a href="docURL">description</a>
List descriptions;
diff -rc -xCVS cvs/htdig/htcommon/HtWordReference.h profile/htcommon/HtWordReference.h
*** cvs/htdig/htcommon/HtWordReference.h Sat Feb 2 09:49:28 2002
--- profile/htcommon/HtWordReference.h Sun Feb 9 09:45:23 2003
***************
*** 20,25 ****
--- 20,26 ----
//
// Flags
+ // (If extra flags are added, also update htsearch.cc:colonPrefix)
//
#define FLAG_TEXT 0
#define FLAG_CAPITAL 1
***************
*** 30,35 ****
--- 31,46 ----
#define FLAG_AUTHOR 32
#define FLAG_LINK_TEXT 64
#define FLAG_URL 128
+
+ // For field-restricted search, at least one of these flags must be set
+ // in document. (255 = OR of the above...)
+ #define FLAGS_MATCH_ONE (255 | FLAG_PLAIN)
+
+ // The following are not stored in the database, but are used by WeightWord
+ #define FLAG_PLAIN 4096
+ #define FLAG_EXACT 8192
+ #define FLAG_HIDDEN 16384
+ #define FLAG_IGNORE 32768
// The remainder are undefined
class HtWordReference : public WordReference
diff -rc -xCVS cvs/htdig/htcommon/defaults.cc profile/htcommon/defaults.cc
*** cvs/htdig/htcommon/defaults.cc Thu Feb 6 23:22:36 2003
--- profile/htcommon/defaults.cc Sat Feb 8 13:35:57 2003
***************
*** 151,158 ****
search form</a> documentation for details on this. \
" }, \
{ "author_factor", "1", \
! "number", "htsearch", "", "??", "Searching:Ranking", "author_factor: 1", " \
! TO BE COMPLETED<br> \
See also <a href=\"#heading_factor\">heading_factor</a>. \
" }, \
{ "authorization", "", \
--- 151,159 ----
search form</a> documentation for details on this. \
" }, \
{ "author_factor", "1", \
! "number", "htsearch", "", "3.2.0b4", "Searching:Ranking", "author_factor: 1", " \
! Weighting applied to words in a <meta name=\"author\" ... > \
! tag.<br> \
See also <a href=\"#heading_factor\">heading_factor</a>. \
" }, \
{ "authorization", "", \
***************
*** 640,982 ****
application/pdf /usr/local/bin/parse_doc.pl \\<br> \
application/msword->text/plain \"/usr/local/bin/mswordtotxt -w\" \\<br> \
application/x-gunzip->user-defined /usr/local/bin/ungzipper", " \
! This attribute is used to specify a list of \
! content-type/parsers that are to be used to parse \
! documents that cannot by parsed by any of the internal \
! parsers. The list of external parsers is examined \
! before the builtin parsers are checked, so this can be \
! used to override the internal behavior without \
! recompiling htdig.<br> \
! The external parsers are specified as pairs of \
! strings. The first string of each pair is the \
! content-type that the parser can handle while the \
! second string of each pair is the path to the external \
! parsing program. If quoted, it may contain parameters, \
! separated by spaces.<br> \
! External parsing can also be done with external \
! converters, which convert one content-type to \
! another. To do this, instead of just specifying \
! a single content-type as the first string \
! of a pair, you specify two types, in the form \
! <em>type1</em><strong>-></strong><em>type2</em>, \
! as a single string with no spaces. The second \
! string will define an external converter \
! rather than an external parser, to convert \
! the first type to the second. If the second \
! type is <strong>user-defined</strong>, then \
! it's up to the converter script to put out a \
! \"Content-Type: <em>type</em>\" header followed \
! by a blank line, to indicate to htdig what type it \
! should expect for the output, much like what a CGI \
! script would do. The resulting content-type must \
! be one that htdig can parse, either internally, \
! or with another external parser or converter.<br> \
! Only one external parser or converter can be \
! specified for any given content-type. However, \
! an external converter for one content-type can be \
! chained to the internal parser for the same type, \
! by appending <strong>-internal</strong> to the \
! second type string (e.g. text/html->text/html-internal) \
! to perform external preprocessing on documents of \
! this type before internal parsing. \
! There are two internal parsers, for text/html and \
! text/plain.<p> \
! The parser program takes four command-line \
! parameters, not counting any parameters already \
! given in the command string:<br> \
! <em>infile content-type URL configuration-file</em><br> \
! <table border=\"1\"> \
! <tr> \
! <th> \
! Parameter \
! </th> \
! <th> \
! Description \
! </th> \
! <th> \
! Example \
! </th> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! infile \
! </td> \
! <td> \
! A temporary file with the contents to be parsed. \
! </td> \
! <td> \
! /var/tmp/htdext.14242 \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! content-type \
! </td> \
! <td> \
! The MIME-type of the contents. \
! </td> \
! <td> \
! text/html \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! URL \
! </td> \
! <td> \
! The URL of the contents. \
! </td> \
! <td> \
! http://www.htdig.org/attrs.html \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! configuration-file \
! </td> \
! <td> \
! The configuration-file in effect. \
! </td> \
! <td> \
! /etc/htdig/htdig.conf \
! </td> \
! </tr> \
! </table><p> \
! The external parser is to write information for \
! htdig on its standard output. Unless it is an \
! external converter, which will output a document \
! of a different content-type, then its output must \
! follow the format described here.<br> \
! The output consists of records, each record terminated \
! with a newline. Each record is a series of (unless \
! expressively allowed to be empty) non-empty tab-separated \
! fields. The first field is a single character \
! that specifies the record type. The rest of the fields \
! are determined by the record type. \
! <table border=\"1\"> \
! <tr> \
! <th> \
! Record type \
! </th> \
! <th> \
! Fields \
! </th> \
! <th> \
! Description \
! </th> \
! </tr> \
! <tr> \
! <th rowspan=\"3\" valign=\"top\"> \
! w \
! </th> \
! <td valign=\"top\"> \
! word \
! </td> \
! <td> \
! A word that was found in the document. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! location \
! </td> \
! <td> \
! A number indicating the normalized location of \
! the word within the document. The number has to \
! fall in the range 0-1000 where 0 means the top of \
! the document. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! heading level \
! </td> \
! <td> \
! A heading level that is used to compute the \
! weight of the word depending on its context in \
! the document itself. The level is in the range of \
! 0-10 and are defined as follows: \
! <dl compact> \
! <dt> \
! 0 \
! </dt> \
! <dd> \
! Normal text \
! </dd> \
! <dt> \
! 1 \
! </dt> \
! <dd> \
! Title text \
! </dd> \
! <dt> \
! 2 \
! </dt> \
! <dd> \
! Heading 1 text \
! </dd> \
! <dt> \
! 3 \
! </dt> \
! <dd> \
! Heading 2 text \
! </dd> \
! <dt> \
! 4 \
! </dt> \
! <dd> \
! Heading 3 text \
! </dd> \
! <dt> \
! 5 \
! </dt> \
! <dd> \
! Heading 4 text \
! </dd> \
! <dt> \
! 6 \
! </dt> \
! <dd> \
! Heading 5 text \
! </dd> \
! <dt> \
! 7 \
! </dt> \
! <dd> \
! Heading 6 text \
! </dd> \
! <dt> \
! 8 \
! </dt> \
! <dd> \
! <em>unused</em> \
! </dd> \
! <dt> \
! 9 \
! </dt> \
! <dd> \
! <em>unused</em> \
! </dd> \
! <dt> \
! 10 \
! </dt> \
! <dd> \
! Keywords \
! </dd> \
! </dl> \
! </td> \
! </tr> \
! <tr> \
! <th rowspan=\"2\" valign=\"top\"> \
! u \
! </th> \
! <td valign=\"top\"> \
! document URL \
! </td> \
! <td> \
! A hyperlink to another document that is \
! referenced by the current document. It must be \
! complete and non-relative, using the URL parameter to \
! resolve any relative references found in the document. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! hyperlink description \
! </td> \
! <td> \
! For HTML documents, this would be the text \
! between the <a href...> and </a> \
! tags. \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> \
! t \
! </th> \
! <td valign=\"top\"> \
! title \
! </td> \
! <td> \
! The title of the document \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> \
! h \
! </th> \
! <td valign=\"top\"> \
! head \
! </td> \
! <td> \
! The top of the document itself. This is used to \
! build the excerpt. This should only contain \
! normal ASCII text \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> \
! a \
! </th> \
! <td valign=\"top\"> \
! anchor \
! </td> \
! <td> \
! The label that identifies an anchor that can be \
! used as a target in an URL. This really only \
! makes sense for HTML documents. \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> \
! i \
! </th> \
! <td valign=\"top\"> \
! image URL \
! </td> \
! <td> \
! An URL that points at an image that is part of \
! the document. \
! </td> \
! </tr> \
! <tr> \
! <th rowspan=\"3\" valign=\"top\"> \
! m \
! </th> \
! <td valign=\"top\"> \
! http-equiv \
! </td> \
! <td> \
! The HTTP-EQUIV attribute of a \
! <a href=\"meta.html\"><em>META</em> tag</a>. \
! May be empty. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! name \
! </td> \
! <td> \
! The NAME attribute of this \
! <a href=\"meta.html\"><em>META</em> tag</a>. \
! May be empty. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> \
! contents \
! </td> \
! <td> \
! The CONTENTS attribute of this \
! <a href=\"meta.html\"><em>META</em> tag</a>. \
! May be empty. \
! </td> \
! </tr> \
! </table> \
! <p><em>See also FAQ questions <a \
! href=\"FAQ.html#q4.8\">4.8</a> and <a \
! href=\"FAQ.html#q4.9\">4.9</a> for more \
! examples.</em></p> \
" }, \
{ "external_protocols", "", \
"quoted string list", "htdig", "", "3.2.0b1", "External:Protocols", "external_protocols: https /usr/local/bin/handler.pl \\<br> \
--- 641,849 ----
application/pdf /usr/local/bin/parse_doc.pl \\<br> \
application/msword->text/plain \"/usr/local/bin/mswordtotxt -w\" \\<br> \
application/x-gunzip->user-defined /usr/local/bin/ungzipper", " \
! This attribute is used to specify a list of \
! content-type/parsers that are to be used to parse \
! documents that cannot be parsed by any of the internal \
! parsers. The list of external parsers is examined \
! before the builtin parsers are checked, so this can be \
! used to override the internal behavior without \
! recompiling htdig.<br> \
! The external parsers are specified as pairs of \
! strings. The first string of each pair is the \
! content-type that the parser can handle while the \
! second string of each pair is the path to the external \
! parsing program. If quoted, it may contain parameters, \
! separated by spaces.<br> \
! External parsing can also be done with external \
! converters, which convert one content-type to \
! another. To do this, instead of just specifying \
! a single content-type as the first string \
! of a pair, you specify two types, in the form \
! <em>type1</em><strong>-></strong><em>type2</em>, \
! as a single string with no spaces. The second \
! string will define an external converter \
! rather than an external parser, to convert \
! the first type to the second. If the second \
! type is <strong>user-defined</strong>, then \
! it's up to the converter script to put out a \
! \"Content-Type: <em>type</em>\" header followed \
! by a blank line, to indicate to htdig what type it \
! should expect for the output, much like what a CGI \
! script would do. The resulting content-type must \
! be one that htdig can parse, either internally, \
! or with another external parser or converter.<br> \
! Only one external parser or converter can be \
! specified for any given content-type. However, \
! an external converter for one content-type can be \
! chained to the internal parser for the same type, \
! by appending <strong>-internal</strong> to the \
! second type string (e.g. text/html->text/html-internal) \
! to perform external preprocessing on documents of \
! this type before internal parsing. \
! There are two internal parsers, for text/html and \
! text/plain.<p> \
! The parser program takes four command-line \
! parameters, not counting any parameters already \
! given in the command string:<br> \
! <em>infile content-type URL configuration-file</em><br> \
! <table border=\"1\"> \
! <tr> \
! <th> Parameter </th> \
! <th> Description </th> \
! <th> Example </th> \
! </tr> \
! <tr> \
! <td valign=\"top\"> infile </td> \
! <td> A temporary file with the contents to be parsed. </td> \
! <td> /var/tmp/htdext.14242 </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> content-type </td> \
! <td> The MIME-type of the contents. </td> \
! <td> text/html </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> URL </td> \
! <td> The URL of the contents. </td> \
! <td> http://www.htdig.org/attrs.html </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> configuration-file </td> \
! <td> The configuration-file in effect. </td> \
! <td> /etc/htdig/htdig.conf </td> \
! </tr> \
! </table><p> \
! The external parser is to write information for \
! htdig on its standard output. Unless it is an \
! external converter, which will output a document \
! of a different content-type, then its output must \
! follow the format described here.<br> \
! The output consists of records, each record terminated \
! with a newline. Each record is a series of (unless \
! expressly allowed to be empty) non-empty tab-separated \
! fields. The first field is a single character \
! that specifies the record type. The rest of the fields \
! are determined by the record type. \
! <table border=\"1\"> \
! <tr> \
! <th> Record type </th> \
! <th> Fields </th> \
! <th> Description </th> \
! </tr> \
! <tr> \
! <th rowspan=\"3\" valign=\"top\"> w </th> \
! <td valign=\"top\"> word </td> \
! <td> A word that was found in the document. </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> location </td> \
! <td> \
! A number indicating the normalized location of \
! the word within the document. The number has to \
! fall in the range 0-1000 where 0 means the top of \
! the document. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> heading level </td> \
! <td> \
! A heading level that is used to compute the \
! weight of the word depending on its context in \
! the document itself. The level is in the range \
! 0-11; the levels are defined as follows: \
! <dl compact> \
! <dt> 0 </dt> <dd> Normal text </dd> \
! <dt> 1 </dt> <dd> Title text </dd> \
! <dt> 2 </dt> <dd> Heading 1 text </dd> \
! <dt> 3 </dt> <dd> Heading 2 text </dd> \
! <dt> 4 </dt> <dd> Heading 3 text </dd> \
! <dt> 5 </dt> <dd> Heading 4 text </dd> \
! <dt> 6 </dt> <dd> Heading 5 text </dd> \
! <dt> 7 </dt> <dd> Heading 6 text </dd> \
! <dt> 8 </dt> <dd> text alternative to images </dd> \
! <dt> 9 </dt> <dd> Keywords </dd> \
! <dt> 10 </dt> <dd> Meta-description </dd> \
! <dt> 11 </dt> <dd> Author </dd> \
! </dl> \
! </td> \
! </tr> \
! <tr> \
! <th rowspan=\"2\" valign=\"top\"> u </th> \
! <td valign=\"top\"> document URL </td> \
! <td> \
! A hyperlink to another document that is \
! referenced by the current document. It must be \
! complete and non-relative, using the URL parameter to \
! resolve any relative references found in the document. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> hyperlink description </td> \
! <td> \
! For HTML documents, this would be the text \
! between the <a href...> and </a> \
! tags. \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> t </th> \
! <td valign=\"top\"> title </td> \
! <td> The title of the document </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> h </th> \
! <td valign=\"top\"> head </td> \
! <td> \
! The top of the document itself. This is used to \
! build the excerpt. This should only contain \
! normal ASCII text \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> a </th> \
! <td valign=\"top\"> anchor </td> \
! <td> \
! The label that identifies an anchor that can be \
! used as a target in an URL. This really only \
! makes sense for HTML documents. \
! </td> \
! </tr> \
! <tr> \
! <th valign=\"top\"> i </th> \
! <td valign=\"top\"> image URL </td> \
! <td> \
! An URL that points at an image that is part of \
! the document. \
! </td> \
! </tr> \
! <tr> \
! <th rowspan=\"3\" valign=\"top\"> m </th> \
! <td valign=\"top\"> http-equiv </td> \
! <td> \
! The HTTP-EQUIV attribute of a \
! <a href=\"meta.html\"><em>META</em> tag</a>. \
! May be empty. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> name </td> \
! <td> \
! The NAME attribute of this \
! <a href=\"meta.html\"><em>META</em> tag</a>. \
! May be empty. \
! </td> \
! </tr> \
! <tr> \
! <td valign=\"top\"> contents </td> \
! <td> \
! The CONTENTS attribute of this \
! <a href=\"meta.html\"><em>META</em> tag</a>. \
! May be empty. \
! </td> \
! </tr> \
! </table> \
! <p><em>See also FAQ questions <a href=\"FAQ.html#q4.8\">4.8</a> and \
! <a href=\"FAQ.html#q4.9\">4.9</a> for more examples.</em></p> \
" }, \
{ "external_protocols", "", \
"quoted string list", "htdig", "", "3.2.0b1", "External:Protocols", "external_protocols: https /usr/local/bin/handler.pl \\<br> \
diff -rc -xCVS cvs/htdig/htdig/ExternalParser.cc profile/htdig/ExternalParser.cc
*** cvs/htdig/htdig/ExternalParser.cc Mon Dec 30 23:42:58 2002
--- profile/htdig/ExternalParser.cc Sat Feb 8 10:58:38 2003
***************
*** 201,207 ****
write(fd, contents->get(), contents->length());
close(fd);
! unsigned int minimum_word_length = config->Value("minimum_word_length", 3);
String line;
char *token1, *token2, *token3;
int loc = 0, hd = 0;
--- 201,207 ----
write(fd, contents->get(), contents->length());
close(fd);
! // unsigned int minimum_word_length = config->Value("minimum_word_length", 3);
String line;
char *token1, *token2, *token3;
int loc = 0, hd = 0;
***************
*** 452,470 ****
{
if (keywordsMatch->CompareWord(name))
{
! char *w = strtok(content, " ,\t\r");
! while (w)
! {
! if (strlen(w) >= minimum_word_length)
! retriever.got_word(w, 1, 9);
! w = strtok(0, " ,\t\r");
! }
}
if (metadatetags->CompareWord(name) &&
config->Boolean("use_doc_date", 0))
{
retriever.got_time(content);
}
else if (mystrcasecmp(name, "htdig-email") == 0)
{
retriever.got_meta_email(content);
--- 452,479 ----
{
if (keywordsMatch->CompareWord(name))
{
! int wordindex = 1;
! addKeywordString (retriever, content, wordindex);
! // // can this be merged with Parser::addKeywordString ?
! // char *w = strtok(content, " ,\t\r");
! // while (w)
! // {
! // if (strlen(w) >= minimum_word_length)
! // retriever.got_word(w, 1, 9);
! // w = strtok(0, " ,\t\r");
! // }
}
if (metadatetags->CompareWord(name) &&
config->Boolean("use_doc_date", 0))
{
retriever.got_time(content);
}
+ else if (mystrcasecmp(name, "author") == 0)
+ {
+ int wordindex = 1;
+ retriever.got_author(content);
+ addString (retriever, content, wordindex, 11);
+ }
else if (mystrcasecmp(name, "htdig-email") == 0)
{
retriever.got_meta_email(content);
***************
*** 495,507 ****
// Now add the words to the word list
// (slot 10 is the new slot for this)
//
! char *w = strtok(content, " \t\r");
! while (w)
! {
! if (strlen(w) >= minimum_word_length)
! retriever.got_word(w, 1, 10);
! w = strtok(0, " \t\r");
! }
}
}
}
--- 504,519 ----
// Now add the words to the word list
// (slot 10 is the new slot for this)
//
! int wordindex = 1;
! addString (retriever, content, wordindex, 10);
! // // can this be merged with Parser::addString ?
! // char *w = strtok(content, " \t\r");
! // while (w)
! // {
! // if (strlen(w) >= minimum_word_length)
! // retriever.got_word(w, 1, 10);
! // w = strtok(0, " \t\r");
! // }
}
}
}
diff -rc -xCVS cvs/htdig/htdig/HTML.cc profile/htdig/HTML.cc
*** cvs/htdig/htdig/HTML.cc Wed Feb 5 22:17:33 2003
--- profile/htdig/HTML.cc Sat Feb 8 11:09:48 2003
***************
*** 45,52 ****
static StringMatch metadatetags;
static StringMatch descriptionMatch;
static StringMatch keywordsMatch;
! static int keywordsCount;
! static int max_keywords;
//*****************************************************************************
--- 45,52 ----
static StringMatch metadatetags;
static StringMatch descriptionMatch;
static StringMatch keywordsMatch;
! //static int keywordsCount;
! //static int max_keywords;
//*****************************************************************************
***************
*** 113,121 ****
StringList keywordNames(config->Find("keywords_meta_tag_names"), " \t");
keywordsMatch.IgnoreCase();
keywordsMatch.Pattern(keywordNames.Join('|'));
! max_keywords = config->Value("max_keywords", -1);
! if (max_keywords < 0)
! max_keywords = (int) ((unsigned int) ~1 >> 1);
// skip_start/end mark sections of text to be ignored by ht://Dig
// Make sure there are equal numbers of each, and warn of deprecated
--- 113,122 ----
StringList keywordNames(config->Find("keywords_meta_tag_names"), " \t");
keywordsMatch.IgnoreCase();
keywordsMatch.Pattern(keywordNames.Join('|'));
! // (now in Parser)
! // max_keywords = config->Value("max_keywords", -1);
! // if (max_keywords < 0)
! // max_keywords = (int) ((unsigned int) ~1 >> 1);
// skip_start/end mark sections of text to be ignored by ht://Dig
// Make sure there are equal numbers of each, and warn of deprecated
***************
*** 180,186 ****
base = 0;
noindex = 0;
nofollow = 0;
! minimumWordLength = config->Value("minimum_word_length", 3);
}
--- 181,187 ----
base = 0;
noindex = 0;
nofollow = 0;
! // minimumWordLength = config->Value("minimum_word_length", 3);
}
***************
*** 495,501 ****
head << word;
}
! if (word.length() >= (int)minimumWordLength && !noindex)
{
retriever.got_word((char*)word, wordindex++, in_heading);
}
--- 496,502 ----
head << word;
}
! if (word.length() >= (int)minimum_word_length && !noindex)
{
retriever.got_word((char*)word, wordindex++, in_heading);
}
***************
*** 755,769 ****
if (!noindex)
{
String tmp = transSGML(keywords);
! char *w = HtWordToken(tmp);
! while (w)
! {
! if (strlen(w) >= minimumWordLength
! && ++keywordsCount <= max_keywords)
! retriever.got_word(w, wordindex++, 9);
! w = HtWordToken(0);
! }
! w = '\0';
}
}
--- 756,762 ----
if (!noindex)
{
String tmp = transSGML(keywords);
! addKeywordString (retriever, tmp, wordindex);
}
}
***************
*** 827,859 ****
// Now add the words to the word list
// Slot 10 is the current slot for this
//
-
if (!noindex)
{
String tmp = transSGML(attrs["content"]);
! char *w = HtWordToken(tmp);
! while (w)
! {
! if (strlen(w) >= minimumWordLength)
! retriever.got_word(w, wordindex++,10);
! w = HtWordToken(0);
! }
! w = '\0';
}
}
if (keywordsMatch.CompareWord(cache) && !noindex)
{
String tmp = transSGML(attrs["content"]);
! char *w = HtWordToken(tmp);
! while (w)
! {
! if (strlen(w) >= minimumWordLength
! && ++keywordsCount <= max_keywords)
! retriever.got_word(w, wordindex++, 9);
! w = HtWordToken(0);
! }
! w = '\0';
}
else if (mystrcasecmp(cache, "htdig-email") == 0)
{
--- 820,843 ----
// Now add the words to the word list
// Slot 10 is the current slot for this
//
if (!noindex)
{
String tmp = transSGML(attrs["content"]);
! addString (retriever, tmp, wordindex, 10);
}
}
if (keywordsMatch.CompareWord(cache) && !noindex)
{
String tmp = transSGML(attrs["content"]);
! addKeywordString (retriever, tmp, wordindex);
! }
! else if (mystrcasecmp(cache, "author") == 0)
! {
! String author = transSGML(attrs["content"]);
! retriever.got_author(author);
! if (!noindex)
! addString (retriever, author, wordindex, 11);
}
else if (mystrcasecmp(cache, "htdig-email") == 0)
{
***************
*** 988,1001 ****
description << tmp << " ";
if (!noindex && !in_title && head.length() < max_head_length)
head << tmp << " ";
! char *w = HtWordToken(tmp);
! while (w && !noindex)
! {
! if (strlen(w) >= minimumWordLength)
! retriever.got_word(w, wordindex++, 8); // slot for img_alt
! w = HtWordToken(0);
! }
! w = '\0';
}
if (!attrs["src"].empty())
{
--- 972,979 ----
description << tmp << " ";
if (!noindex && !in_title && head.length() < max_head_length)
head << tmp << " ";
! if (!noindex)
! addString (retriever, tmp, wordindex, 8); // slot for img_alt
}
if (!attrs["src"].empty())
{
diff -rc -xCVS cvs/htdig/htdig/HTML.h profile/htdig/HTML.h
*** cvs/htdig/htdig/HTML.h Tue Jan 21 09:40:14 2003
--- profile/htdig/HTML.h Sat Feb 8 10:58:41 2003
***************
*** 52,58 ****
int in_heading;
int noindex;
int nofollow;
! unsigned int minimumWordLength;
URL *base;
QuotedStringList skip_start;
QuotedStringList skip_end;
--- 52,58 ----
int in_heading;
int noindex;
int nofollow;
! // unsigned int minimumWordLength;
URL *base;
QuotedStringList skip_start;
QuotedStringList skip_end;
diff -rc -xCVS cvs/htdig/htdig/Parsable.cc profile/htdig/Parsable.cc
*** cvs/htdig/htdig/Parsable.cc Sat Feb 2 09:49:29 2002
--- profile/htdig/Parsable.cc Sat Feb 8 11:08:40 2003
***************
*** 31,36 ****
--- 31,41 ----
max_head_length = config->Value("max_head_length", 0);
max_description_length = config->Value("max_description_length", 50);
max_meta_description_length = config->Value("max_meta_description_length", 0);
+
+ max_keywords = config->Value("max_keywords", -1);
+ if (max_keywords < 0)
+ max_keywords = (int) ((unsigned int) ~1 >> 1);
+ minimum_word_length = config->Value("minimum_word_length", 3);
}
***************
*** 52,55 ****
--- 57,96 ----
{
delete contents;
contents = new String(data, length);
+ }
+
+ //*****************************************************************************
+ // void Parsable::addString(Retriever& retriever, char *s, int& wordindex, int slot)
+ // Add all words in string s in "heading level" slot, incrementing wordindex
+ // along the way. String s is corrupted.
+ //
+ void
+ Parsable::addString(Retriever& retriever, char *s, int& wordindex, int slot)
+ {
+ char *w = HtWordToken(s);
+ while (w)
+ {
+ if (strlen(w) >= minimum_word_length)
+ retriever.got_word(w, wordindex++, slot);
+ w = HtWordToken(0);
+ }
+ w = '\0';
+ }
+
+ //*****************************************************************************
+ // void Parsable::addKeywordString(Retriever& retriever, char *s, int& wordindex)
+ // Add all words in string s as keywords, incrementing wordindex
+ // along the way. String s is corrupted.
+ //
+ void
+ Parsable::addKeywordString(Retriever& retriever, char *s, int& wordindex)
+ {
+ char *w = HtWordToken(s);
+ while (w)
+ {
+ if (strlen(w) >= minimum_word_length && ++keywordsCount <= max_keywords)
+ retriever.got_word(w, wordindex++, 9);
+ w = HtWordToken(0);
+ }
+ w = '\0';
}
diff -rc -xCVS cvs/htdig/htdig/Parsable.h profile/htdig/Parsable.h
*** cvs/htdig/htdig/Parsable.h Sat Feb 2 09:49:29 2002
--- profile/htdig/Parsable.h Sat Feb 8 11:07:56 2003
***************
*** 40,51 ****
--- 40,55 ----
// the data that we contain.
//
virtual void setContents(char *data, int length);
+ void addString(Retriever& retriever, char *s, int& wordindex, int slot);
+ void addKeywordString(Retriever& retriever, char *s, int& wordindex);
protected:
String *contents;
int max_head_length;
int max_description_length;
int max_meta_description_length;
+ int max_keywords, keywordsCount;
+ unsigned int minimum_word_length;
};
#endif
diff -rc -xCVS cvs/htdig/htdig/Retriever.cc profile/htdig/Retriever.cc
*** cvs/htdig/htdig/Retriever.cc Mon Dec 30 23:42:58 2002
--- profile/htdig/Retriever.cc Sat Feb 8 11:09:24 2003
***************
*** 77,82 ****
--- 77,83 ----
factor[9] = FLAG_KEYWORDS;
// META description factor
factor[10] = FLAG_DESCRIPTION;
+ factor[11] = FLAG_AUTHOR;
doc = new Document();
minimumWordLength = config->Value("minimum_word_length", 3);
***************
*** 1279,1287 ****
{
if (debug > 3)
cout << "word: " << word << '@' << location << endl;
! if (heading >= 11 || heading < 0) // Current limits for headings
heading = 0; // Assume it's just normal text
! if (trackWords && strlen(word) >= minimumWordLength)
{
String w = word;
HtWordReference wordRef;
--- 1280,1288 ----
{
if (debug > 3)
cout << "word: " << word << '@' << location << endl;
! if (heading >= (int)(sizeof(factor)/sizeof(factor[0])) || heading < 0)
heading = 0; // Assume it's just normal text
! if (trackWords && strlen(word) >= (unsigned int)minimumWordLength)
{
String w = word;
HtWordReference wordRef;
***************
*** 1353,1358 ****
--- 1354,1372 ----
cout << "\ntitle: " << title << endl;
current_title = title;
}
+
+
+ //*****************************************************************************
+ // void Retriever::got_author(const char *author)
+ //
+ void
+ Retriever::got_author(const char *author)
+ {
+ if (debug > 1)
+ cout << "\nauthor: " << author << endl;
+ current_ref->DocAuthor(author);
+ }
+
//*****************************************************************************
// void Retriever::got_time(const char *time)
diff -rc -xCVS cvs/htdig/htdig/Retriever.h profile/htdig/Retriever.h
*** cvs/htdig/htdig/Retriever.h Tue Feb 12 17:12:05 2002
--- profile/htdig/Retriever.h Sat Feb 8 09:52:14 2003
***************
*** 64,69 ****
--- 64,70 ----
void got_word(const char *word, int location, int heading);
void got_href(URL &url, const char *description, int hops = 1);
void got_title(const char *title);
+ void got_author(const char *author);
void got_time(const char *time);
void got_head(const char *head);
void got_meta_dsc(const char *md);
***************
*** 115,121 ****
//
// These are weights for the words. The index is the heading level.
//
! long int factor[11];
int currenthopcount;
//
--- 116,122 ----
//
// These are weights for the words. The index is the heading level.
//
! long int factor[12];
int currenthopcount;
//
diff -rc -xCVS cvs/htdig/htdoc/TODO.html profile/htdoc/TODO.html
*** cvs/htdig/htdoc/TODO.html Sat Feb 2 09:49:29 2002
--- profile/htdoc/TODO.html Sat Feb 8 12:55:54 2003
***************
*** 10,16 ****
TODO list
</h1>
<p>
! ht://Dig Copyright © 1995-2001 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
--- 10,16 ----
TODO list
</h1>
<p>
! ht://Dig Copyright © 1995-2003 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
***************
*** 35,41 ****
<li type="bullet">
Phrase searching
</li>
! <li type="square">
Field-based searching
</li>
<li type="bullet">
--- 35,41 ----
<li type="bullet">
Phrase searching
</li>
! <li type="circle">
Field-based searching
</li>
<li type="bullet">
***************
*** 136,141 ****
</li>
</ul>
<hr size="4" noshade>
! Last modified: $Date: 2002/02/01 22:49:29 $
</body>
</html>
--- 136,141 ----
</li>
</ul>
<hr size="4" noshade>
! Last modified: $Date: 2003/02/08 $
</body>
</html>
diff -rc -xCVS cvs/htdig/htdoc/hts_general.html profile/htdoc/hts_general.html
*** cvs/htdig/htdoc/hts_general.html Sat Feb 2 09:49:32 2002
--- profile/htdoc/hts_general.html Sat Feb 8 12:57:15 2003
***************
*** 10,16 ****
htsearch
</h1>
<p>
! ht://Dig Copyright © 1995-2001 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
--- 10,16 ----
htsearch
</h1>
<p>
! ht://Dig Copyright © 1995-2003 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
diff -rc -xCVS cvs/htdig/htdoc/hts_method.html profile/htdoc/hts_method.html
*** cvs/htdig/htdoc/hts_method.html Sat Feb 2 09:49:32 2002
--- profile/htdoc/hts_method.html Sat Feb 8 13:39:14 2003
***************
*** 10,16 ****
htsearch
</h1>
<p>
! ht://Dig Copyright © 1995-2001 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
--- 10,16 ----
htsearch
</h1>
<p>
! ht://Dig Copyright © 1995-2003 <a href="THANKS.html">The ht://Dig Group</a><br>
Please see the file <a href="COPYING">COPYING</a> for
license information.
</p>
***************
*** 24,30 ****
in global terms what goes on when htsearch searches.
</p>
<p>
! htsearch gets a list of words from the HTML form that invoked
it. If htsearch was invoked with boolean expression parsing
enabled, it will do a quick syntax check on the input words.
If there are syntax errors, it will display the syntax error
--- 24,31 ----
in global terms what goes on when htsearch searches.
</p>
<p>
! htsearch gets a list of (case-insensitive) words from the HTML
! form that invoked
it. If htsearch was invoked with boolean expression parsing
enabled, it will do a quick syntax check on the input words.
If there are syntax errors, it will display the syntax error
***************
*** 36,46 ****
If the boolean parser was not enabled, the list of words is
converted into a boolean expression by putting either "and"s
or "or"s between the words. (This depends on the search
! type.)
</p>
<p>
! In both cases, each of the words in the list is now expanded
! using the search algorithms that were specified in the
<a href="attrs.html#search_algorithm">search_algorithm</a>
attribute. For example, the endings algorithm will convert a
word like "person" into "person or persons". In this fashion,
--- 37,64 ----
If the boolean parser was not enabled, the list of words is
converted into a boolean expression by putting either "and"s
or "or"s between the words. (This depends on the search
! type.) Phrases within double quotes (") specify that the words
! must occur sequentially within the document.
</p>
<p>
! If a word is immediately preceded by a field specifier
! (title:, heading:, author:, keyword:, descr:, link:, url:)
! then it will only match documents in which the word occurred
! within that field. For example, descr:foo only matches documents
! containing <meta name="description" content="... foo ...">.
! The link: field refers to the text in the hyperlinks to a document,
! rather than text within the document itself. Similarly, url:
! will eventually refer to the actual URL of the document, not any
! of its contents.
! The prefixes exact: and hidden: are also accepted.
! The former will cause the
! <a href="attrs.html#search_algorithm">fuzzy search algorithm</a>
! not to be applied to this word, while the latter causes the word
! not to be displayed in the query string of the results page.
! </p>
! <p>
! Each of the words in the list (but not within a phrase) is now
! expanded using the search algorithms that were specified in the
<a href="attrs.html#search_algorithm">search_algorithm</a>
attribute. For example, the endings algorithm will convert a
word like "person" into "person or persons". In this fashion,
***************
*** 78,84 ****
</p>
<hr size="4" noshade>
! Last modified: $Date: 2002/02/01 22:49:32 $
</body>
</html>
--- 96,102 ----
</p>
<hr size="4" noshade>
! Last modified: $Date: 2003/02/08 $
</body>
</html>
diff -rc -xCVS cvs/htdig/htsearch/WeightWord.cc profile/htsearch/WeightWord.cc
*** cvs/htdig/htsearch/WeightWord.cc Sat Feb 2 09:49:35 2002
--- profile/htsearch/WeightWord.cc Sun Feb 9 09:15:05 2003
***************
*** 33,38 ****
--- 33,40 ----
isExact = 0;
isHidden = 0;
isIgnore = 0;
+
+ flags = FLAGS_MATCH_ONE;
}
***************
*** 45,50 ****
--- 47,53 ----
records = ww->records;
isExact = ww->isExact;
isHidden = ww->isHidden;
+ flags = ww->flags;
word = ww->word;
isIgnore = 0;
}
***************
*** 59,64 ****
--- 62,92 ----
isExact = 0;
isHidden = 0;
isIgnore = 0;
+
+ // allow a match with any field
+ flags = FLAGS_MATCH_ONE;
+
+ set(word);
+ this->weight = weight;
+ }
+
+ //***************************************************************************
+ // WeightWord::WeightWord(char *word, double weight, unsigned int f)
+ //
+ WeightWord::WeightWord(char *word, double weight, unsigned int f)
+ {
+ records = 0;
+
+ flags = f;
+ // if no fields specified, allow a match with any field
+ if (!(flags & FLAGS_MATCH_ONE))
+ flags |= FLAGS_MATCH_ONE;
+
+ // ideally, these flags should all just be stored in a uint...
+ isExact = ((flags & FLAG_EXACT) != 0);
+ isHidden = ((flags & FLAG_HIDDEN) != 0);
+ isIgnore = ((flags & FLAG_IGNORE) != 0);
+
set(word);
this->weight = weight;
}
***************
*** 77,82 ****
--- 105,111 ----
//
void WeightWord::set(char *word)
{
+ #if 0
isExact = 0;
isHidden = 0;
while (strchr(word, ':'))
***************
*** 104,109 ****
--- 133,139 ----
}
}
+ #endif
this->word = word;
this->word.lowercase();
}
diff -rc -xCVS cvs/htdig/htsearch/WeightWord.h profile/htsearch/WeightWord.h
*** cvs/htdig/htsearch/WeightWord.h Sat Feb 2 09:49:35 2002
--- profile/htsearch/WeightWord.h Sun Feb 9 08:18:57 2003
***************
*** 19,24 ****
--- 19,25 ----
#include "htString.h"
#include "WordRecord.h"
+ #include "HtWordReference.h" // for FLAG_...
class WeightWord : public Object
{
***************
*** 28,33 ****
--- 29,35 ----
//
WeightWord();
WeightWord(char *word, double weight);
+ WeightWord(char *word, double weight, unsigned int flags);
WeightWord(WeightWord *);
virtual ~WeightWord();
***************
*** 37,45 ****
String word;
double weight;
WordRecord *records;
! int isExact;
! int isHidden;
! int isIgnore;
};
#endif
--- 39,48 ----
String word;
double weight;
WordRecord *records;
! unsigned int flags;
! short int isExact;
! short int isHidden;
! short int isIgnore;
};
#endif
diff -rc -xCVS cvs/htdig/htsearch/htsearch.cc profile/htsearch/htsearch.cc
*** cvs/htdig/htsearch/htsearch.cc Wed Feb 5 22:05:58 2003
--- profile/htsearch/htsearch.cc Sun Feb 9 09:44:31 2003
***************
*** 63,68 ****
--- 63,87 ----
StringList collectionList; // List of databases to search on
+ // recognised word prefixes (for field-restricted search and per-word fuzzy
+ // algorithms) in *descending* alphabetical order.
+ // Don't use a dictionary structure, as setup time outweighs the saving.
+ struct {char *name; unsigned int flag; } colonPrefix [] =
+ {
+ { "url", FLAG_URL },
+ { "title", FLAG_TITLE },
+ { "text", FLAG_PLAIN }, // FLAG_TEXT is 0, i.e. *no* flag...
+ { "link", FLAG_LINK_TEXT },
+ { "keyword", FLAG_KEYWORDS },
+ { "hidden", FLAG_HIDDEN },
+ { "heading", FLAG_HEADING },
+ { "exact", FLAG_EXACT },
+ { "descr", FLAG_DESCRIPTION },
+ // { "cap", FLAG_CAPITAL },
+ { "author", FLAG_AUTHOR },
+ { "", 0 },
+ };
+
//*****************************************************************************
// int main()
//
***************
*** 512,517 ****
--- 531,537 ----
unsigned char t;
String word;
const String prefix_suffix = config->Find("prefix_match_character");
+
while (*pos)
{
while (1)
***************
*** 534,549 ****
tempWords.Add(new WeightWord(s, -1.0));
break;
}
! else if (HtIsWordChar(t) || t == ':' ||
! (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255))
{
! word = 0;
! while (t && (HtIsWordChar(t) ||
! t == ':' || (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255)))
{
! word << (char) t;
! t = *pos++;
! }
pos--;
if (boolean && (mystrcasecmp(word.get(), "+") == 0
--- 554,595 ----
tempWords.Add(new WeightWord(s, -1.0));
break;
}
! else if (HtIsWordChar(t) ||
! (strchr(prefix_suffix, t) != NULL) ||
! (t >= 161 && t <= 255))
{
! unsigned int fieldFlag = 0;
! word = 0;
! do // while recognised prefix, followed by ':'
{
! while (t && (HtIsWordChar(t) ||
! (strchr(prefix_suffix, t) != NULL) ||
! (t >= 161 && t <= 255)))
! {
! word << (char) t;
! t = *pos++;
! }
! if (t == ':') // e.g. "author:word" to search
! { // only in author
! word.lowercase();
! t = *pos++;
! if (t && (HtIsWordChar (t) ||
! (strchr(prefix_suffix, t) != NULL) ||
! (t >= 161 && t <= 255)))
! {
! int i, cmp;
! const char *w = word.get();
! // linear search of known prefixes; the "" entry ends the scan.
! for (i = 0; (cmp = mystrcasecmp (w, colonPrefix[i].name)) < 0; i++)
! ;
! if (cmp == 0) // if prefix found...
! {
! fieldFlag |= colonPrefix [i].flag;
! word = 0;
! }
! }
! }
! } while (!word.length());
pos--;
if (boolean && (mystrcasecmp(word.get(), "+") == 0
***************
*** 565,571 ****
{
// Add word to excerpt matching list
originalPattern << word << "|";
! WeightWord *ww = new WeightWord(word, 1.0);
if(HtWordNormalize(word) & WORD_NORMALIZE_NOTOK)
ww->isIgnore = 1;
tempWords.Add(ww);
--- 611,617 ----
{
// Add word to excerpt matching list
originalPattern << word << "|";
! WeightWord *ww = new WeightWord(word, 1.0, fieldFlag);
if(HtWordNormalize(word) & WORD_NORMALIZE_NOTOK)
ww->isIgnore = 1;
tempWords.Add(ww);
***************
*** 646,651 ****
--- 692,699 ----
{
WeightWord *ww = (WeightWord *) tempWords[i];
if (ww->weight > 0 && !ww->isIgnore && !in_phrase)
+ // I think that should be:
+ // if (ww->weight > 0 && !ww->isIgnore && !in_phrase && !ww->isExact)
{
//
// Apply all the algorithms to the word.
***************
*** 699,707 ****
--- 747,757 ----
{
if (debug > 1)
cout << " " << word->get();
+ // (should be a "copy with changed weight" constructor...)
newWw = new WeightWord(word->get(), fuzzy->getWeight());
newWw->isExact = ww->isExact;
newWw->isHidden = ww->isHidden;
+ newWw->flags = ww->flags;
weightWords.Add(newWw);
}
if (debug > 1)
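
(Note on the colonPrefix lookup above: a minimal standalone sketch of the same idea, namely a table in descending alphabetical order with a "" sentinel, scanned linearly until the comparison stops being negative. The DEMO_FLAG_ values are placeholders rather than the real ones from HtWordReference.h, strcasecmp stands in for mystrcasecmp, and lookupPrefix / stripPrefixes are hypothetical names. Unlike the tokenizer in the patch, this sketch simply leaves an unknown prefix attached to the word.)

```cpp
#include <string>
#include <strings.h>   // strcasecmp (POSIX), standing in for mystrcasecmp

// Placeholder values for illustration only.
const unsigned int DEMO_FLAG_TITLE  = 1u << 1;
const unsigned int DEMO_FLAG_AUTHOR = 1u << 5;
const unsigned int DEMO_FLAG_URL    = 1u << 7;

struct Prefix { const char *name; unsigned int flag; };

// Descending alphabetical order. The "" sentinel sorts below every
// non-empty name, so the scan below always terminates.
static const Prefix demoPrefixes[] =
{
    { "url",    DEMO_FLAG_URL },
    { "title",  DEMO_FLAG_TITLE },
    { "author", DEMO_FLAG_AUTHOR },
    { "",       0 },
};

// Return the flag for a known prefix name, or 0 if it is unknown.
unsigned int lookupPrefix(const char *name)
{
    int i = 0, cmp;
    while ((cmp = strcasecmp(name, demoPrefixes[i].name)) < 0)
        i++;
    return cmp == 0 ? demoPrefixes[i].flag : 0u;
}

// Strip leading "known:" prefixes from term, OR-ing their flags
// together. A trailing colon or an unknown prefix stops the loop.
std::string stripPrefixes(std::string term, unsigned int &flags)
{
    flags = 0;
    for (;;)
    {
        std::string::size_type colon = term.find(':');
        if (colon == std::string::npos || colon + 1 >= term.size())
            break;
        unsigned int f = lookupPrefix(term.substr(0, colon).c_str());
        if (f == 0)
            break;
        flags |= f;
        term.erase(0, colon + 1);
    }
    return term;
}
```

(For example, stripPrefixes("Title:author:club", flags) returns "club" with both the title and author bits set, which mirrors how several prefixes accumulate in fieldFlag.)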
diff -rc -xCVS cvs/htdig/htsearch/parser.cc profile/htsearch/parser.cc
*** cvs/htdig/htsearch/parser.cc Mon Dec 30 23:42:59 2002
--- profile/htsearch/parser.cc Sun Feb 9 09:18:38 2003
***************
*** 238,244 ****
{
if(!wordList) wordList = new List;
if(debug) cerr << "scoring phrase" << endl;
! score(wordList, weight);
}
break;
}
--- 238,244 ----
{
if(!wordList) wordList = new List;
if(debug) cerr << "scoring phrase" << endl;
! score(wordList, weight, FLAGS_MATCH_ONE); // look in all fields
}
break;
}
***************
*** 381,387 ****
p[maximum_word_length] = '\0';
List* result = words[p];
! score(result, current->weight);
delete result;
}
--- 381,387 ----
p[maximum_word_length] = '\0';
List* result = words[p];
! score(result, current->weight, current->flags);
delete result;
}
***************
*** 510,517 ****
}
//*****************************************************************************
void
! Parser::score(List *wordList, double weight)
{
HtConfiguration* config= HtConfiguration::config();
DocMatch *dm;
--- 510,520 ----
}
//*****************************************************************************
+ // Allocate scores based on words in wordList.
+ // Fields within which the word must appear are specified in flags
+ // (see HtWordReference.h).
void
! Parser::score(List *wordList, double weight, unsigned int flags)
{
HtConfiguration* config= HtConfiguration::config();
DocMatch *dm;
***************
*** 550,555 ****
--- 553,568 ----
//
// ******* Compute the score for the document
//
+
+ // If word not in one of the required fields, skip the entry.
+ // Plain text sets no flag in dbase, so treat it separately.
+ if (!(wr->Flags() & flags) && (wr->Flags() || !(flags & FLAG_PLAIN)))
+ {
+ if (debug > 2)
+ cerr << "Flags " << wr->Flags() << " lack " << flags << endl;
+ continue;
+ }
+
wscore = 0.0;
if (wr->Flags() == FLAG_TEXT) wscore += text_factor;
if (wr->Flags() & FLAG_CAPITAL) wscore += caps_factor;
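
(Note on the field test added to Parser::score above: the condition is easier to check when written the other way round, as "keep this entry if...". The sketch below is only that restatement. The DEMO_FLAG_ values are placeholders, not the real ones from HtWordReference.h, and fieldMatches is a hypothetical name.)

```cpp
// Placeholder values for illustration only.
const unsigned int DEMO_FLAG_TITLE  = 1u << 1;
const unsigned int DEMO_FLAG_AUTHOR = 1u << 5;
const unsigned int DEMO_FLAG_PLAIN  = 1u << 12;

// True if a word stored with wordFlags satisfies a query restricted
// to the fields in wanted. Plain text is stored with no flag at all,
// so it can only match when the query explicitly allows plain text.
bool fieldMatches(unsigned int wordFlags, unsigned int wanted)
{
    if (wordFlags & wanted)
        return true;                    // shares at least one field
    return wordFlags == 0 && (wanted & DEMO_FLAG_PLAIN) != 0;
}
```

(The patch skips an entry exactly when this predicate would be false: a title-only query rejects plain-text occurrences, and a text-only query rejects words that carry any field flag.)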
diff -rc -xCVS cvs/htdig/htsearch/parser.h profile/htsearch/parser.h
*** cvs/htdig/htsearch/parser.h Mon Dec 30 23:42:59 2002
--- profile/htsearch/parser.h Thu Feb 6 21:19:02 2003
***************
*** 56,62 ****
void perform_or();
void perform_phrase(List * &);
! void score(List *, double weight);
List *tokens;
List *result;
--- 56,62 ----
void perform_or();
void perform_phrase(List * &);
! void score(List *, double weight, unsigned int flags);
List *tokens;
List *result;
diff -rc -xCVS cvs/htdig/test/t_htsearch profile/test/t_htsearch
*** cvs/htdig/test/t_htsearch Tue Jan 21 09:40:18 2003
--- profile/test/t_htsearch Sun Feb 9 09:28:30 2003
***************
*** 106,111 ****
--- 106,139 ----
"method=boolean&words=also+or+%22distribution" \
'Expected quotes at the end'
+ try "Unrestricted search for 'group'" \
+ "method=and&words=group" \
+ '4 matches' 'script.html' 'bad_local.htm' 'site3.html' 'site4.html'
+
+ try "Field-restricted search for 'author:group'" \
+ "method=and&words=author:group" \
+ '1 match' 'script.html'
+
+ try "Field-restricted search for 'text:group'" \
+ "method=and&words=text:group" \
+ '3 matches' 'bad_local.htm' 'site3.html' 'site4.html'
+
+ try "Checking prefix parsing using 'text: group'" \
+ "method=and&words=text:%20group" \
+ '1 match' 'script.html'
+
+ try "Checking prefix parsing using 'text::group'" \
+ "method=and&words=text::group" \
+ '1 match' 'script.html'
+
+ try "Checking prefix parsing using 'unknown:group'" \
+ "method=any&words=unknown:group" \
+ '5 matches' 'script.html' 'bad_local.htm' 'site3.html' 'site4.html' 'set1/"'
+
+ try "Field-restricted search for 'descr:cost'" \
+ "method=and&words=descr:cost" \
+ '1 match' 'script.html'
+
config=$testdir/conf/htdig.conf3
try "Testing boolean_keywords and search_rewrite_urls" \