[htdig-dev] Field restricted searching patch

Lachlan Andrew Sat, 08 Feb 2003 16:26:32 -0800

Greetings all,

Attached is a patch for field-restricted searches.  Could people 
please test it before I commit it, especially those who have 
external parsers and can test the new handling of <meta...> tags?


Questions that arose from doing this:
1. Could/should the tests for '|| (t >= 161 && t <= 255)' be made part 
of HtIsWordChar?  I assume they are for accented letters.
2. Should 'prefix_match_character' really be a string, not a char?
3. What (if anything) should I do with the 'author' field, other than 
put the words in the word list?
4. This *doesn't* allow field-restricted parenthesised queries like 
'Harry and Potter and title:(fan and club)'.  Is that OK?
5. Should 'exact:' inhibit fuzzy rules, as the comment in htsearch.cc 
suggests?  If not, what should it do?
6. I've made some annotations to STATUS.  Could someone else please 
check these too, and delete the entries if they are really resolved?
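
To make questions 1 and 4 concrete, here is a rough sketch of what I 
have in mind.  It is illustration only and is not part of the patch: 
isWordCharSketch, splitFieldPrefix and kNoField are made-up names, the 
"author:" and "url:" prefixes are assumptions, and only the two flag 
values are taken from HtWordReference.h.  It also assumes an 8-bit 
Latin-1 style encoding, in which not every code in 161..255 is 
actually a letter.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Flag values as they appear in HtWordReference.h.
const int kFlagAuthor = 32;    // FLAG_AUTHOR
const int kFlagUrl    = 128;   // FLAG_URL
const int kNoField    = -1;    // hypothetical sentinel: no field restriction

// Hypothetical stand-in for HtIsWordChar with the range test folded in
// (question 1).  Assumes 161..255 should all count as word characters.
inline bool isWordCharSketch(char c)
{
    unsigned char t = static_cast<unsigned char>(c);
    // t <= 255 always holds for an unsigned char, so only the lower
    // bound of the range needs testing.
    return std::isalnum(t) != 0 || t >= 161;
}

// Hypothetical helper (question 4): split "author:tolkien" into
// (FLAG_AUTHOR, "tolkien").  A missing or unknown prefix returns the
// token unchanged with kNoField.  A parenthesised group after the
// prefix is not handled: "author:(a" simply yields the word "(a".
inline std::pair<int, std::string> splitFieldPrefix(const std::string& token)
{
    std::string::size_type colon = token.find(':');
    if (colon == std::string::npos)
        return std::make_pair(kNoField, token);

    std::string field = token.substr(0, colon);
    std::string rest  = token.substr(colon + 1);
    if (field == "author")
        return std::make_pair(kFlagAuthor, rest);
    if (field == "url")
        return std::make_pair(kFlagUrl, rest);
    return std::make_pair(kNoField, token);
}
```

With a helper like this, each word of a query is restricted on its 
own, which is why a prefix in front of a bracketed sub-expression has 
nothing to attach to.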

Thanks,
Lachlan

diff -rc -xCVS cvs/htdig/STATUS profile/STATUS
*** cvs/htdig/STATUS	Tue Jan 21 09:29:05 2003
--- profile/STATUS	Sun Feb  9 09:38:08 2003
***************
*** 1,7 ****
  STATUS of ht://Dig branch 3-2-x
  
  RELEASES:
!    3.2.0b5: Next release, tentatively 1 Feb 2003.
     3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
     3.2.0b3: Released:  22 Feb 2001.
     3.2.0b2: Released:  11 Apr 2000.
--- 1,7 ----
  STATUS of ht://Dig branch 3-2-x
  
  RELEASES:
!    3.2.0b5: Next release, First quarter 2003???
     3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
     3.2.0b3: Released:  22 Feb 2001.
     3.2.0b2: Released:  11 Apr 2000.
***************
*** 22,27 ****
--- 22,28 ----
      so this must be some sort of weird htsearch bug) PR#618737.
  * META descriptions are somehow added to the database as FLAG_TITLE,
     not FLAG_DESCRIPTION. (PR#618738)
+    Can anyone reproduce this?  I can't! -- Lachlan
  
  PENDING PATCHES (available but need work):
  * Additional support for Win32.
***************
*** 29,35 ****
  * Mifluz merge.
  
  NEEDED FEATURES:
- * Field-restricted searching. (e.g. PR#460833)
  * Quim's new htsearch/qtest query parser framework.
  * File/Database locking. PR#405764.
  
--- 30,35 ----
***************
*** 45,52 ****
  
  DOCUMENTATION:
  * List of supported platforms/compilers is ancient. (PR#405279)
- * Add thorough documentation on htsearch restrict/exclude behavior
-    (including '|' and regex).
  * Document all of htsearch's mappings of input parameters to config attributes
     to template variables. (Relates to PR#405278.)
    Should we make sure these config attributes are all documented in
--- 45,50 ----
***************
*** 60,65 ****
--- 58,64 ----
     PRs# 405280 #405281.
  * TODO.html has not been updated for current TODO list and
     completions.
+    I've tried.  Someone "official" please check and remove this -- Lachlan
  * Htfuzzy could use more documentation on what each fuzzy algorithm
     does. PR#405714.
  * Document the list of all installed files and default
diff -rc -xCVS cvs/htdig/htcommon/DocumentRef.h profile/htcommon/DocumentRef.h
*** cvs/htdig/htcommon/DocumentRef.h	Sat Feb  2 09:49:28 2002
--- profile/htcommon/DocumentRef.h	Thu Feb  6 23:35:42 2003
***************
*** 54,59 ****
--- 54,60 ----
      char		*DocURL()			{return docURL;}
      time_t		DocTime()			{return docTime;}
      char		*DocTitle()			{return docTitle;}
+     char		*DocAuthor()			{return docAuthor;}
      char		*DocHead()			{return docHead;}
      int			DocHeadIsSet()			{return docHeadIsSet;}
      char                *DocMetaDsc()                   {return docMetaDsc;}
***************
*** 76,81 ****
--- 77,83 ----
      void		DocURL(const char *u)		{docURL = u;}
      void		DocTime(time_t t)		{docTime = t;}
      void		DocTitle(const char *t)		{docTitle = t;}
+     void		DocAuthor(const char *a)	{docAuthor = a;}
      void		DocHead(const char *h)		{docHeadIsSet = 1; docHead = h;}
      void                DocMetaDsc(const char *md)      {docMetaDsc = md;}
      void		DocAccessed(time_t t)		{docAccessed = t;}
***************
*** 121,126 ****
--- 123,130 ----
      String              docMetaDsc;
      // This is the title of the document.
      String		docTitle;
+     // This is the author of the document, as specified in meta information
+     String		docAuthor;
      // This is a list of Strings, the text of links pointing to this document.
      // (e.g. <a href="docURL">description</a>
      List		descriptions;
diff -rc -xCVS cvs/htdig/htcommon/HtWordReference.h profile/htcommon/HtWordReference.h
*** cvs/htdig/htcommon/HtWordReference.h	Sat Feb  2 09:49:28 2002
--- profile/htcommon/HtWordReference.h	Sun Feb  9 09:45:23 2003
***************
*** 20,25 ****
--- 20,26 ----
  
  //
  // Flags
+ // (If extra flags are added, also update htsearch.cc:colonPrefix.)
  // 
  #define FLAG_TEXT 0
  #define FLAG_CAPITAL 1
***************
*** 30,35 ****
--- 31,46 ----
  #define FLAG_AUTHOR 32
  #define FLAG_LINK_TEXT 64
  #define FLAG_URL 128
+ 
+ // For field-restricted search, at least one of these flags must be set
+ // in document.  (255 = OR of the above...)
+ #define FLAGS_MATCH_ONE (255 | FLAG_PLAIN)
+ 
+ // The following are not stored in the database, but are used by WeightWord
+ #define FLAG_PLAIN 4096
+ #define FLAG_EXACT 8192
+ #define FLAG_HIDDEN 16384
+ #define FLAG_IGNORE 32768
  // The remainder are undefined
  
  class HtWordReference : public WordReference
diff -rc -xCVS cvs/htdig/htcommon/defaults.cc profile/htcommon/defaults.cc
*** cvs/htdig/htcommon/defaults.cc	Thu Feb  6 23:22:36 2003
--- profile/htcommon/defaults.cc	Sat Feb  8 13:35:57 2003
***************
*** 151,158 ****
  	search form</a> documentation for details on this. \
  " }, \
  { "author_factor", "1",  \
! 	"number", "htsearch", "", "??", "Searching:Ranking", "author_factor: 1", " \
! 	TO BE COMPLETED<br> \
  	See also <a href=\"#heading_factor\">heading_factor</a>. \
  " }, \
  { "authorization", "",  \
--- 151,159 ----
  	search form</a> documentation for details on this. \
  " }, \
  { "author_factor", "1",  \
! 	"number", "htsearch", "", "3.2.0b4", "Searching:Ranking", "author_factor: 1", " \
! 	Weighting applied to words in a &lt;meta name=\"author\" ... &gt; \
! 	tag.<br> \
  	See also <a href=\"#heading_factor\">heading_factor</a>. \
  " }, \
  { "authorization", "",  \
***************
*** 640,982 ****
  	application/pdf /usr/local/bin/parse_doc.pl \\<br> \
  	application/msword-&gt;text/plain \"/usr/local/bin/mswordtotxt -w\" \\<br> \
  	application/x-gunzip-&gt;user-defined /usr/local/bin/ungzipper", " \
! 			This attribute is used to specify a list of \
! 			content-type/parsers that are to be used to parse \
! 			documents that cannot by parsed by any of the internal \
! 			parsers. The list of external parsers is examined \
! 			before the builtin parsers are checked, so this can be \
! 			used to override the internal behavior without \
! 			recompiling htdig.<br> \
! 			 The external parsers are specified as pairs of \
! 			strings. The first string of each pair is the \
! 			content-type that the parser can handle while the \
! 			second string of each pair is the path to the external \
! 			parsing program. If quoted, it may contain parameters, \
! 			separated by spaces.<br> \
! 			 External parsing can also be done with external \
! 			converters, which convert one content-type to \
! 			another. To do this, instead of just specifying \
! 			a single content-type as the first string \
! 			of a pair, you specify two types, in the form \
! 			<em>type1</em><strong>-&gt;</strong><em>type2</em>, \
! 			as a single string with no spaces. The second \
! 			string will define an external converter \
! 			rather than an external parser, to convert \
! 			the first type to the second. If the second \
! 			type is <strong>user-defined</strong>, then \
! 			it's up to the converter script to put out a \
! 			\"Content-Type:&nbsp;<em>type</em>\" header followed \
! 			by a blank line, to indicate to htdig what type it \
! 			should expect for the output, much like what a CGI \
! 			script would do. The resulting content-type must \
! 			be one that htdig can parse, either internally, \
! 			or with another external parser or converter.<br> \
! 			 Only one external parser or converter can be \
! 			specified for any given content-type. However, \
! 			an external converter for one content-type can be \
! 			chained to the internal parser for the same type, \
! 			by appending <strong>-internal</strong> to the \
! 			second type string (e.g. text/html->text/html-internal) \
! 			to perform external preprocessing on documents of \
! 			this type before internal parsing. \
! 			There are two internal parsers, for text/html and \
! 			text/plain.<p> \
! 			 The parser program takes four command-line \
! 			parameters, not counting any parameters already \
! 			given in the command string:<br> \
! 			<em>infile content-type URL configuration-file</em><br> \
! 			<table border=\"1\"> \
! 			  <tr> \
! 				<th> \
! 				  Parameter \
! 				</th> \
! 				<th> \
! 				  Description \
! 				</th> \
! 				<th> \
! 				  Example \
! 				</th> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  infile \
! 				</td> \
! 				<td> \
! 				  A temporary file with the contents to be parsed. \
! 				</td> \
! 				<td> \
! 				  /var/tmp/htdext.14242 \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  content-type \
! 				</td> \
! 				<td> \
! 				  The MIME-type of the contents. \
! 				</td> \
! 				<td> \
! 				  text/html \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  URL \
! 				</td> \
! 				<td> \
! 				  The URL of the contents. \
! 				</td> \
! 				<td> \
! 				  http://www.htdig.org/attrs.html \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  configuration-file \
! 				</td> \
! 				<td> \
! 				  The configuration-file in effect. \
! 				</td> \
! 				<td> \
! 				  /etc/htdig/htdig.conf \
! 				</td> \
! 			  </tr> \
! 			</table><p> \
! 			The external parser is to write information for \
! 			htdig on its standard output. Unless it is an \
! 			external converter, which will output a document \
! 			of a different content-type, then its output must \
! 			follow the format described here.<br> \
! 			 The output consists of records, each record terminated \
! 			with a newline. Each record is a series of (unless \
! 			expressively allowed to be empty) non-empty tab-separated \
! 			fields. The first field is a single character \
! 			that specifies the record type. The rest of the fields \
! 			are determined by the record type. \
! 			<table border=\"1\"> \
! 			  <tr> \
! 				<th> \
! 				  Record type \
! 				</th> \
! 				<th> \
! 				  Fields \
! 				</th> \
! 				<th> \
! 				  Description \
! 				</th> \
! 			  </tr> \
! 			  <tr> \
! 				<th rowspan=\"3\" valign=\"top\"> \
! 				  w \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  word \
! 				</td> \
! 				<td> \
! 				  A word that was found in the document. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  location \
! 				</td> \
! 				<td> \
! 				  A number indicating the normalized location of \
! 				  the word within the document. The number has to \
! 				  fall in the range 0-1000 where 0 means the top of \
! 				  the document. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  heading level \
! 				</td> \
! 				<td> \
! 				  A heading level that is used to compute the \
! 				  weight of the word depending on its context in \
! 				  the document itself. The level is in the range of \
! 				  0-10 and are defined as follows: \
! 				  <dl compact> \
! 					<dt> \
! 					  0 \
! 					</dt> \
! 					<dd> \
! 					  Normal text \
! 					</dd> \
! 					<dt> \
! 					  1 \
! 					</dt> \
! 					<dd> \
! 					  Title text \
! 					</dd> \
! 					<dt> \
! 					  2 \
! 					</dt> \
! 					<dd> \
! 					  Heading 1 text \
! 					</dd> \
! 					<dt> \
! 					  3 \
! 					</dt> \
! 					<dd> \
! 					  Heading 2 text \
! 					</dd> \
! 					<dt> \
! 					  4 \
! 					</dt> \
! 					<dd> \
! 					  Heading 3 text \
! 					</dd> \
! 					<dt> \
! 					  5 \
! 					</dt> \
! 					<dd> \
! 					  Heading 4 text \
! 					</dd> \
! 					<dt> \
! 					  6 \
! 					</dt> \
! 					<dd> \
! 					  Heading 5 text \
! 					</dd> \
! 					<dt> \
! 					  7 \
! 					</dt> \
! 					<dd> \
! 					  Heading 6 text \
! 					</dd> \
! 					<dt> \
! 					  8 \
! 					</dt> \
! 					<dd> \
! 					  <em>unused</em> \
! 					</dd> \
! 					<dt> \
! 					  9 \
! 					</dt> \
! 					<dd> \
! 					  <em>unused</em> \
! 					</dd> \
! 					<dt> \
! 					  10 \
! 					</dt> \
! 					<dd> \
! 					  Keywords \
! 					</dd> \
! 				  </dl> \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<th rowspan=\"2\" valign=\"top\"> \
! 				  u \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  document URL \
! 				</td> \
! 				<td> \
! 				  A hyperlink to another document that is \
! 				  referenced by the current document.  It must be \
! 				  complete and non-relative, using the URL parameter to \
! 				  resolve any relative references found in the document. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  hyperlink description \
! 				</td> \
! 				<td> \
! 				  For HTML documents, this would be the text \
! 				  between the &lt;a href...&gt; and &lt;/a&gt; \
! 				  tags. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<th valign=\"top\"> \
! 				  t \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  title \
! 				</td> \
! 				<td> \
! 				  The title of the document \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<th valign=\"top\"> \
! 				  h \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  head \
! 				</td> \
! 				<td> \
! 				  The top of the document itself. This is used to \
! 				  build the excerpt. This should only contain \
! 				  normal ASCII text \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<th valign=\"top\"> \
! 				  a \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  anchor \
! 				</td> \
! 				<td> \
! 				  The label that identifies an anchor that can be \
! 				  used as a target in an URL. This really only \
! 				  makes sense for HTML documents. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<th valign=\"top\"> \
! 				  i \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  image URL \
! 				</td> \
! 				<td> \
! 				  An URL that points at an image that is part of \
! 				  the document. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<th rowspan=\"3\" valign=\"top\"> \
! 				  m \
! 				</th> \
! 				<td valign=\"top\"> \
! 				  http-equiv \
! 				</td> \
! 				<td> \
! 				  The HTTP-EQUIV attribute of a \
! 				  <a href=\"meta.html\"><em>META</em> tag</a>. \
! 				  May be empty. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  name \
! 				</td> \
! 				<td> \
! 				  The NAME attribute of this \
! 				  <a href=\"meta.html\"><em>META</em> tag</a>. \
! 				  May be empty. \
! 				</td> \
! 			  </tr> \
! 			  <tr> \
! 				<td valign=\"top\"> \
! 				  contents \
! 				</td> \
! 				<td> \
! 				  The CONTENTS attribute of this \
! 				  <a href=\"meta.html\"><em>META</em> tag</a>. \
! 				  May be empty. \
! 				</td> \
! 			  </tr> \
! 			</table> \
! 	<p><em>See also FAQ questions <a \
! 	href=\"FAQ.html#q4.8\">4.8</a> and <a \
! 	href=\"FAQ.html#q4.9\">4.9</a> for more \
! 	examples.</em></p> \
  " }, \
  { "external_protocols", "", \
  	"quoted string list", "htdig", "", "3.2.0b1", "External:Protocols", "external_protocols: https /usr/local/bin/handler.pl \\<br> \
--- 641,849 ----
  	application/pdf /usr/local/bin/parse_doc.pl \\<br> \
  	application/msword-&gt;text/plain \"/usr/local/bin/mswordtotxt -w\" \\<br> \
  	application/x-gunzip-&gt;user-defined /usr/local/bin/ungzipper", " \
! 	This attribute is used to specify a list of \
! 	content-type/parsers that are to be used to parse \
! 	documents that cannot be parsed by any of the internal \
! 	parsers. The list of external parsers is examined \
! 	before the builtin parsers are checked, so this can be \
! 	used to override the internal behavior without \
! 	recompiling htdig.<br> \
! 	 The external parsers are specified as pairs of \
! 	strings. The first string of each pair is the \
! 	content-type that the parser can handle while the \
! 	second string of each pair is the path to the external \
! 	parsing program. If quoted, it may contain parameters, \
! 	separated by spaces.<br> \
! 	 External parsing can also be done with external \
! 	converters, which convert one content-type to \
! 	another. To do this, instead of just specifying \
! 	a single content-type as the first string \
! 	of a pair, you specify two types, in the form \
! 	<em>type1</em><strong>-&gt;</strong><em>type2</em>, \
! 	as a single string with no spaces. The second \
! 	string will define an external converter \
! 	rather than an external parser, to convert \
! 	the first type to the second. If the second \
! 	type is <strong>user-defined</strong>, then \
! 	it's up to the converter script to put out a \
! 	\"Content-Type:&nbsp;<em>type</em>\" header followed \
! 	by a blank line, to indicate to htdig what type it \
! 	should expect for the output, much like what a CGI \
! 	script would do. The resulting content-type must \
! 	be one that htdig can parse, either internally, \
! 	or with another external parser or converter.<br> \
! 	 Only one external parser or converter can be \
! 	specified for any given content-type. However, \
! 	an external converter for one content-type can be \
! 	chained to the internal parser for the same type, \
! 	by appending <strong>-internal</strong> to the \
! 	second type string (e.g. text/html->text/html-internal) \
! 	to perform external preprocessing on documents of \
! 	this type before internal parsing. \
! 	There are two internal parsers, for text/html and \
! 	text/plain.<p> \
! 	 The parser program takes four command-line \
! 	parameters, not counting any parameters already \
! 	given in the command string:<br> \
! 	<em>infile content-type URL configuration-file</em><br> \
! 	<table border=\"1\"> \
! 	  <tr> \
! 		<th> Parameter </th> \
! 		<th> Description </th> \
! 		<th> Example </th> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> infile </td> \
! 		<td> A temporary file with the contents to be parsed.  </td> \
! 		<td> /var/tmp/htdext.14242 </td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> content-type </td> \
! 		<td> The MIME-type of the contents.  </td> \
! 		<td> text/html </td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> URL </td> \
! 		<td> The URL of the contents.  </td> \
! 		<td> http://www.htdig.org/attrs.html </td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> configuration-file </td> \
! 		<td> The configuration-file in effect.  </td> \
! 		<td> /etc/htdig/htdig.conf </td> \
! 	  </tr> \
! 	</table><p> \
! 	The external parser is to write information for \
! 	htdig on its standard output. Unless it is an \
! 	external converter, which will output a document \
! 	of a different content-type, then its output must \
! 	follow the format described here.<br> \
! 	 The output consists of records, each record terminated \
! 	with a newline. Each record is a series of (unless \
! 	expressly allowed to be empty) non-empty tab-separated \
! 	fields. The first field is a single character \
! 	that specifies the record type. The rest of the fields \
! 	are determined by the record type. \
! 	<table border=\"1\"> \
! 	  <tr> \
! 		<th> Record type </th> \
! 		<th> Fields </th> \
! 		<th> Description </th> \
! 	  </tr> \
! 	  <tr> \
! 		<th rowspan=\"3\" valign=\"top\"> w </th> \
! 		<td valign=\"top\"> word </td> \
! 		<td> A word that was found in the document.  </td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> location </td> \
! 		<td> \
! 		  A number indicating the normalized location of \
! 		  the word within the document. The number has to \
! 		  fall in the range 0-1000 where 0 means the top of \
! 		  the document. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> heading level </td> \
! 		<td> \
! 		  A heading level that is used to compute the \
! 		  weight of the word depending on its context in \
! 		  the document itself. The level is in the range of \
! 		  0-11 and are defined as follows: \
! 		  <dl compact> \
! 			<dt> 0 </dt> <dd> Normal text </dd> \
! 			<dt> 1 </dt> <dd> Title text </dd> \
! 			<dt> 2 </dt> <dd> Heading 1 text </dd> \
! 			<dt> 3 </dt> <dd> Heading 2 text </dd> \
! 			<dt> 4 </dt> <dd> Heading 3 text </dd> \
! 			<dt> 5 </dt> <dd> Heading 4 text </dd> \
! 			<dt> 6 </dt> <dd> Heading 5 text </dd> \
! 			<dt> 7 </dt> <dd> Heading 6 text </dd> \
! 			<dt> 8 </dt> <dd> text alternative to images </dd> \
! 			<dt> 9 </dt> <dd> Keywords </dd> \
! 			<dt> 10 </dt> <dd> Meta-description </dd> \
! 			<dt> 11 </dt> <dd> Author </dd> \
! 		  </dl> \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<th rowspan=\"2\" valign=\"top\"> u </th> \
! 		<td valign=\"top\"> document URL </td> \
! 		<td> \
! 		  A hyperlink to another document that is \
! 		  referenced by the current document.  It must be \
! 		  complete and non-relative, using the URL parameter to \
! 		  resolve any relative references found in the document. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> hyperlink description </td> \
! 		<td> \
! 		  For HTML documents, this would be the text \
! 		  between the &lt;a href...&gt; and &lt;/a&gt; \
! 		  tags. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<th valign=\"top\"> t </th> \
! 		<td valign=\"top\"> title </td> \
! 		<td> The title of the document </td> \
! 	  </tr> \
! 	  <tr> \
! 		<th valign=\"top\"> h </th> \
! 		<td valign=\"top\"> head </td> \
! 		<td> \
! 		  The top of the document itself. This is used to \
! 		  build the excerpt. This should only contain \
! 		  normal ASCII text \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<th valign=\"top\"> a </th> \
! 		<td valign=\"top\"> anchor </td> \
! 		<td> \
! 		  The label that identifies an anchor that can be \
! 		  used as a target in an URL. This really only \
! 		  makes sense for HTML documents. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<th valign=\"top\"> i </th> \
! 		<td valign=\"top\"> image URL </td> \
! 		<td> \
! 		  An URL that points at an image that is part of \
! 		  the document. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<th rowspan=\"3\" valign=\"top\"> m </th> \
! 		<td valign=\"top\"> http-equiv </td> \
! 		<td> \
! 		  The HTTP-EQUIV attribute of a \
! 		  <a href=\"meta.html\"><em>META</em> tag</a>. \
! 		  May be empty. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> name </td> \
! 		<td> \
! 		  The NAME attribute of this \
! 		  <a href=\"meta.html\"><em>META</em> tag</a>. \
! 		  May be empty. \
! 		</td> \
! 	  </tr> \
! 	  <tr> \
! 		<td valign=\"top\"> contents </td> \
! 		<td> \
! 		  The CONTENTS attribute of this \
! 		  <a href=\"meta.html\"><em>META</em> tag</a>. \
! 		  May be empty. \
! 		</td> \
! 	  </tr> \
! 	</table> \
! 	<p><em>See also FAQ questions <a href=\"FAQ.html#q4.8\">4.8</a> and \
! 	<a href=\"FAQ.html#q4.9\">4.9</a> for more examples.</em></p> \
  " }, \
  { "external_protocols", "", \
  	"quoted string list", "htdig", "", "3.2.0b1", "External:Protocols", "external_protocols: https /usr/local/bin/handler.pl \\<br> \
diff -rc -xCVS cvs/htdig/htdig/ExternalParser.cc profile/htdig/ExternalParser.cc
*** cvs/htdig/htdig/ExternalParser.cc	Mon Dec 30 23:42:58 2002
--- profile/htdig/ExternalParser.cc	Sat Feb  8 10:58:38 2003
***************
*** 201,207 ****
      write(fd, contents->get(), contents->length());
      close(fd);
  
!     unsigned int minimum_word_length = config->Value("minimum_word_length", 3);
      String	line;
      char	*token1, *token2, *token3;
      int		loc = 0, hd = 0;
--- 201,207 ----
      write(fd, contents->get(), contents->length());
      close(fd);
  
! //  unsigned int minimum_word_length = config->Value("minimum_word_length", 3);
      String	line;
      char	*token1, *token2, *token3;
      int		loc = 0, hd = 0;
***************
*** 452,470 ****
  		  {
  		    if (keywordsMatch->CompareWord(name))
  		    {
! 		      char	*w = strtok(content, " ,\t\r");
! 		      while (w)
! 		      {
! 			if (strlen(w) >= minimum_word_length)
! 			  retriever.got_word(w, 1, 9);
! 			w = strtok(0, " ,\t\r");
! 		      }
  		    }
  		    if (metadatetags->CompareWord(name) &&
  					config->Boolean("use_doc_date", 0))
  		    {
  		      retriever.got_time(content);
  		    }
  		    else if (mystrcasecmp(name, "htdig-email") == 0)
  		    {
  		      retriever.got_meta_email(content);
--- 452,479 ----
  		  {
  		    if (keywordsMatch->CompareWord(name))
  		    {
! 			int wordindex = 1;
! 			addKeywordString (retriever, content, wordindex);
! //			// can this be merged with Parser::addKeywordString ?
! //		      char	*w = strtok(content, " ,\t\r");
! //		      while (w)
! //		      {
! //			if (strlen(w) >= minimum_word_length)
! //			  retriever.got_word(w, 1, 9);
! //			w = strtok(0, " ,\t\r");
! //		      }
  		    }
  		    if (metadatetags->CompareWord(name) &&
  					config->Boolean("use_doc_date", 0))
  		    {
  		      retriever.got_time(content);
  		    }
+ 		    else if (mystrcasecmp(name, "author") == 0)
+ 		    {
+ 			int wordindex = 1;
+ 			retriever.got_author(content);
+ 			addString (retriever, content, wordindex, 11);
+ 		    }
  		    else if (mystrcasecmp(name, "htdig-email") == 0)
  		    {
  		      retriever.got_meta_email(content);
***************
*** 495,507 ****
  		      // Now add the words to the word list
  		      // (slot 10 is the new slot for this)
  		      //
! 		      char	  *w = strtok(content, " \t\r");
! 		      while (w)
! 		      {
! 			if (strlen(w) >= minimum_word_length)
! 			  retriever.got_word(w, 1, 10);
! 			w = strtok(0, " \t\r");
! 		      }
  		    }
  		  }
  		}
--- 504,519 ----
  		      // Now add the words to the word list
  		      // (slot 10 is the new slot for this)
  		      //
! 		      int wordindex = 1;
! 		      addString (retriever, content, wordindex, 10);
! //		      // can this be merged with Parser::addString ?
! //		      char	  *w = strtok(content, " \t\r");
! //		      while (w)
! //		      {
! //			if (strlen(w) >= minimum_word_length)
! //			  retriever.got_word(w, 1, 10);
! //			w = strtok(0, " \t\r");
! //		      }
  		    }
  		  }
  		}
diff -rc -xCVS cvs/htdig/htdig/HTML.cc profile/htdig/HTML.cc
*** cvs/htdig/htdig/HTML.cc	Wed Feb  5 22:17:33 2003
--- profile/htdig/HTML.cc	Sat Feb  8 11:09:48 2003
***************
*** 45,52 ****
  static StringMatch	metadatetags;
  static StringMatch	descriptionMatch;
  static StringMatch	keywordsMatch;
! static int		keywordsCount;
! static int		max_keywords;
  
  
  //*****************************************************************************
--- 45,52 ----
  static StringMatch	metadatetags;
  static StringMatch	descriptionMatch;
  static StringMatch	keywordsMatch;
! //static int		keywordsCount;
! //static int		max_keywords;
  
  
  //*****************************************************************************
***************
*** 113,121 ****
      StringList keywordNames(config->Find("keywords_meta_tag_names"), " \t");
      keywordsMatch.IgnoreCase();
      keywordsMatch.Pattern(keywordNames.Join('|'));
!     max_keywords = config->Value("max_keywords", -1);
!     if (max_keywords < 0)
! 	max_keywords = (int) ((unsigned int) ~1 >> 1);
  
      // skip_start/end mark sections of text to be ignored by ht://Dig
      // Make sure there are equal numbers of each, and warn of deprecated
--- 113,122 ----
      StringList keywordNames(config->Find("keywords_meta_tag_names"), " \t");
      keywordsMatch.IgnoreCase();
      keywordsMatch.Pattern(keywordNames.Join('|'));
! //    (now in Parser)
! //    max_keywords = config->Value("max_keywords", -1);
! //    if (max_keywords < 0)
! //	max_keywords = (int) ((unsigned int) ~1 >> 1);
  
      // skip_start/end mark sections of text to be ignored by ht://Dig
      // Make sure there are equal numbers of each, and warn of deprecated
***************
*** 180,186 ****
      base = 0;
      noindex = 0;
      nofollow = 0;
!     minimumWordLength = config->Value("minimum_word_length", 3);
  }
  
  
--- 181,187 ----
      base = 0;
      noindex = 0;
      nofollow = 0;
! //    minimumWordLength = config->Value("minimum_word_length", 3);
  }
  
  
***************
*** 495,501 ****
  		  head << word;
  	    }
  
! 	    if (word.length() >= (int)minimumWordLength && !noindex)
  	    {
  	      retriever.got_word((char*)word, wordindex++, in_heading);
  	    }
--- 496,502 ----
  		  head << word;
  	    }
  
! 	    if (word.length() >= (int)minimum_word_length && !noindex)
  	    {
  	      retriever.got_word((char*)word, wordindex++, in_heading);
  	    }
***************
*** 755,769 ****
  		if (!noindex)
  		  {
  		    String tmp = transSGML(keywords);
! 		    char	*w = HtWordToken(tmp);
! 		    while (w)
! 		      {
! 			if (strlen(w) >= minimumWordLength
! 				&& ++keywordsCount <= max_keywords)
! 			  retriever.got_word(w, wordindex++, 9);
! 			w = HtWordToken(0);
! 		      }
! 		    w = '\0';
  		  }
  	    }
  	
--- 756,762 ----
  		if (!noindex)
  		  {
  		    String tmp = transSGML(keywords);
! 		    addKeywordString (retriever, tmp, wordindex);
  		  }
  	    }
  	
***************
*** 827,859 ****
  		   // Now add the words to the word list
  		   // Slot 10 is the current slot for this
  		   //
- 
  		   if (!noindex)
  		     {
  		       String tmp = transSGML(attrs["content"]);
! 		       char        *w = HtWordToken(tmp);
! 		       while (w)
! 			 {
! 			   if (strlen(w) >= minimumWordLength)
! 			     retriever.got_word(w, wordindex++,10);
! 			   w = HtWordToken(0);
! 			 }
! 		       w = '\0';
  		     }
  		}
  
  		if (keywordsMatch.CompareWord(cache) && !noindex)
  		{
  		    String tmp = transSGML(attrs["content"]);
! 		    char	*w = HtWordToken(tmp);
! 		    while (w)
! 		    {
! 			if (strlen(w) >= minimumWordLength
! 				&& ++keywordsCount <= max_keywords)
! 			  retriever.got_word(w, wordindex++, 9);
! 			w = HtWordToken(0);
! 		    }
! 		    w = '\0';
  		}
  		else if (mystrcasecmp(cache, "htdig-email") == 0)
  		{
--- 820,843 ----
  		   // Now add the words to the word list
  		   // Slot 10 is the current slot for this
  		   //
  		   if (!noindex)
  		     {
  		       String tmp = transSGML(attrs["content"]);
! 		       addString (retriever, tmp, wordindex, 10);
  		     }
  		}
  
  		if (keywordsMatch.CompareWord(cache) && !noindex)
  		{
  		    String tmp = transSGML(attrs["content"]);
! 		    addKeywordString (retriever, tmp, wordindex);
! 		}
! 		else if (mystrcasecmp(cache, "author") == 0)
! 		{
! 		    String author = transSGML(attrs["content"]);
! 		    retriever.got_author(author);
! 		    if (!noindex)
! 			addString (retriever, author, wordindex, 11);
  		}
  		else if (mystrcasecmp(cache, "htdig-email") == 0)
  		{
***************
*** 988,1001 ****
  		    description << tmp << " ";
  		if (!noindex && !in_title && head.length() < max_head_length)
  		    head << tmp << " ";
! 		char *w = HtWordToken(tmp);
! 		while (w && !noindex)
! 		  {
! 		    if (strlen(w) >= minimumWordLength)
! 		      retriever.got_word(w, wordindex++, 8); // slot for img_alt
! 		    w = HtWordToken(0);
! 		  }
! 		w = '\0';
  	      }
  	    if (!attrs["src"].empty())
  	      {
--- 972,979 ----
  		    description << tmp << " ";
  		if (!noindex && !in_title && head.length() < max_head_length)
  		    head << tmp << " ";
! 		if (!noindex)
! 		    addString (retriever, tmp, wordindex, 8);	// slot for  img_alt
  	      }
  	    if (!attrs["src"].empty())
  	      {
diff -rc -xCVS cvs/htdig/htdig/HTML.h profile/htdig/HTML.h
*** cvs/htdig/htdig/HTML.h	Tue Jan 21 09:40:14 2003
--- profile/htdig/HTML.h	Sat Feb  8 10:58:41 2003
***************
*** 52,58 ****
      int			in_heading;
      int			noindex;
      int                 nofollow;
!     unsigned int	minimumWordLength;
      URL			*base;
      QuotedStringList	skip_start;
      QuotedStringList	skip_end;
--- 52,58 ----
      int			in_heading;
      int			noindex;
      int                 nofollow;
! //    unsigned int	minimumWordLength;
      URL			*base;
      QuotedStringList	skip_start;
      QuotedStringList	skip_end;
diff -rc -xCVS cvs/htdig/htdig/Parsable.cc profile/htdig/Parsable.cc
*** cvs/htdig/htdig/Parsable.cc	Sat Feb  2 09:49:29 2002
--- profile/htdig/Parsable.cc	Sat Feb  8 11:08:40 2003
***************
*** 31,36 ****
--- 31,41 ----
      max_head_length = config->Value("max_head_length", 0);
      max_description_length = config->Value("max_description_length", 50);
      max_meta_description_length = config->Value("max_meta_description_length", 0);
+ 
+     max_keywords = config->Value("max_keywords", -1);
+     if (max_keywords < 0)
+ 	max_keywords = (int) ((unsigned int) ~1 >> 1);
+     minimum_word_length = config->Value("minimum_word_length", 3);
  }
  
  
***************
*** 52,55 ****
--- 57,96 ----
  {
      delete contents;
      contents = new String(data, length);
+ }
+ 
+ //*****************************************************************************
+ // void Parsable::addString(Retriever& retriever, char *s, int& wordindex, int slot)
+ //   Add all words in string s in "heading level" slot, incrementing  wordindex
+ //   along the way.  String  s  is corrupted.
+ //
+ void
+ Parsable::addString(Retriever& retriever, char *s, int& wordindex, int slot)
+ {
+     char *w = HtWordToken(s);
+     while (w)
+     {
+ 	if (strlen(w) >= minimum_word_length)
+ 	    retriever.got_word(w, wordindex++, slot);
+ 	w = HtWordToken(0);
+     }
+     w = '\0';
+ }
+ 
+ //*****************************************************************************
+ // void Parsable::addKeywordString(Retriever& retriever, char *s, int& wordindex)
+ //   Add all words in string  s  as keywords, incrementing  wordindex
+ //   along the way.  String  s  is corrupted.
+ //
+ void
+ Parsable::addKeywordString(Retriever& retriever, char *s, int& wordindex)
+ {
+     char	*w = HtWordToken(s);
+     while (w)
+     {
+ 	if (strlen(w) >= minimum_word_length && ++keywordsCount <= max_keywords)
+ 	    retriever.got_word(w, wordindex++, 9);
+ 	w = HtWordToken(0);
+     }
+     w = '\0';
  }
diff -rc -xCVS cvs/htdig/htdig/Parsable.h profile/htdig/Parsable.h
*** cvs/htdig/htdig/Parsable.h	Sat Feb  2 09:49:29 2002
--- profile/htdig/Parsable.h	Sat Feb  8 11:07:56 2003
***************
*** 40,51 ****
--- 40,55 ----
      // the data that we contain.
      //
      virtual void	setContents(char *data, int length);
+     void addString(Retriever& retriever, char *s, int& wordindex, int slot);
+     void addKeywordString(Retriever& retriever,  char *s, int& wordindex);
  	
  protected:
      String		*contents;
      int			max_head_length;
      int			max_description_length;
      int			max_meta_description_length;
+     int			max_keywords, keywordsCount;
+     unsigned int	minimum_word_length;
  };
  
  #endif
diff -rc -xCVS cvs/htdig/htdig/Retriever.cc profile/htdig/Retriever.cc
*** cvs/htdig/htdig/Retriever.cc	Mon Dec 30 23:42:58 2002
--- profile/htdig/Retriever.cc	Sat Feb  8 11:09:24 2003
***************
*** 77,82 ****
--- 77,83 ----
      factor[9] = FLAG_KEYWORDS;
      // META description factor
      factor[10] = FLAG_DESCRIPTION;
+     factor[11] = FLAG_AUTHOR;
  	
      doc = new Document();
      minimumWordLength = config->Value("minimum_word_length", 3);
***************
*** 1279,1287 ****
  {
      if (debug > 3)
  	cout << "word: " << word << '@' << location << endl;
!     if (heading >= 11 || heading < 0) // Current limits for headings
        heading = 0;  // Assume it's just normal text
!     if (trackWords && strlen(word) >= minimumWordLength)
      {
        String w = word;
        HtWordReference wordRef;
--- 1280,1288 ----
  {
      if (debug > 3)
  	cout << "word: " << word << '@' << location << endl;
!     if (heading >= (int)(sizeof(factor)/sizeof(factor[0])) || heading < 0)
        heading = 0;  // Assume it's just normal text
!     if (trackWords && strlen(word) >= (unsigned int)minimumWordLength)
      {
        String w = word;
        HtWordReference wordRef;
***************
*** 1353,1358 ****
--- 1354,1372 ----
  	cout << "\ntitle: " << title << endl;
      current_title = title;
  }
+ 
+ 
+ //*****************************************************************************
+ // void Retriever::got_author(const char *author)
+ //
+ void
+ Retriever::got_author(const char *author)
+ {
+     if (debug > 1)
+ 	cout << "\nauthor: " << author << endl;
+     current_ref->DocAuthor(author);
+ }
+ 
  
  //*****************************************************************************
  // void Retriever::got_time(const char *time)
diff -rc -xCVS cvs/htdig/htdig/Retriever.h profile/htdig/Retriever.h
*** cvs/htdig/htdig/Retriever.h	Tue Feb 12 17:12:05 2002
--- profile/htdig/Retriever.h	Sat Feb  8 09:52:14 2003
***************
*** 64,69 ****
--- 64,70 ----
      void		got_word(const char *word, int location, int heading);
      void		got_href(URL &url, const char *description, int hops = 1);
      void		got_title(const char *title);
+     void		got_author(const char *author);
      void		got_time(const char *time);
      void		got_head(const char *head);
      void		got_meta_dsc(const char *md);
***************
*** 115,121 ****
      //
      // These are weights for the words.  The index is the heading level.
      //
!     long int		factor[11];
      int			currenthopcount;
  
      //
--- 116,122 ----
      //
      // These are weights for the words.  The index is the heading level.
      //
!     long int		factor[12];
      int			currenthopcount;
  
      //
diff -rc -xCVS cvs/htdig/htdoc/TODO.html profile/htdoc/TODO.html
*** cvs/htdig/htdoc/TODO.html	Sat Feb  2 09:49:29 2002
--- profile/htdoc/TODO.html	Sat Feb  8 12:55:54 2003
***************
*** 10,16 ****
  	  TODO list
  	</h1>
  	<p>
! 	  ht://Dig Copyright &copy; 1995-2001 <a href="THANKS.html">The ht://Dig Group</a><br>
  	  Please see the file <a href="COPYING">COPYING</a> for
  	  license information.
  	</p>
--- 10,16 ----
  	  TODO list
  	</h1>
  	<p>
! 	  ht://Dig Copyright &copy; 1995-2002 <a href="THANKS.html">The ht://Dig Group</a><br>
  	  Please see the file <a href="COPYING">COPYING</a> for
  	  license information.
  	</p>
***************
*** 35,41 ****
  		<li type="bullet">
  		Phrase searching
  		</li>
! 		<li type="square">
  		Field-based searching
  		</li>
  		<li type="bullet">
--- 35,41 ----
  		<li type="bullet">
  		Phrase searching
  		</li>
! 		<li type="circle">
  		Field-based searching
  		</li>
  		<li type="bullet">
***************
*** 136,141 ****
  	  </li>
  	</ul>
  	<hr size="4" noshade>
! 		Last modified: $Date: 2002/02/01 22:49:29 $
    </body>
  </html>
--- 136,141 ----
  	  </li>
  	</ul>
  	<hr size="4" noshade>
! 		Last modified: $Date: 2003/02/08 $
    </body>
  </html>
diff -rc -xCVS cvs/htdig/htdoc/hts_general.html profile/htdoc/hts_general.html
*** cvs/htdig/htdoc/hts_general.html	Sat Feb  2 09:49:32 2002
--- profile/htdoc/hts_general.html	Sat Feb  8 12:57:15 2003
***************
*** 10,16 ****
  	  htsearch
  	</h1>
  	<p>
! 	  ht://Dig Copyright &copy; 1995-2001 <a href="THANKS.html">The ht://Dig Group</a><br>
  	  Please see the file <a href="COPYING">COPYING</a> for
  	  license information.
  	</p>
--- 10,16 ----
  	  htsearch
  	</h1>
  	<p>
! 	  ht://Dig Copyright &copy; 1995-2003 <a href="THANKS.html">The ht://Dig Group</a><br>
  	  Please see the file <a href="COPYING">COPYING</a> for
  	  license information.
  	</p>
diff -rc -xCVS cvs/htdig/htdoc/hts_method.html profile/htdoc/hts_method.html
*** cvs/htdig/htdoc/hts_method.html	Sat Feb  2 09:49:32 2002
--- profile/htdoc/hts_method.html	Sat Feb  8 13:39:14 2003
***************
*** 10,16 ****
  	  htsearch
  	</h1>
  	<p>
! 	  ht://Dig Copyright &copy; 1995-2001 <a href="THANKS.html">The ht://Dig Group</a><br>
  	  Please see the file <a href="COPYING">COPYING</a> for
  	  license information.
  	</p>
--- 10,16 ----
  	  htsearch
  	</h1>
  	<p>
! 	  ht://Dig Copyright &copy; 1995-2003 <a href="THANKS.html">The ht://Dig Group</a><br>
  	  Please see the file <a href="COPYING">COPYING</a> for
  	  license information.
  	</p>
***************
*** 24,30 ****
  	  in global terms what goes on when htsearch searches.
  	</p>
  	<p>
! 	  htsearch gets a list of words from the HTML form that invoked
  	  it. If htsearch was invoked with boolean expression parsing
  	  enabled, it will do a quick syntax check on the input words.
  	  If there are syntax errors, it will display the syntax error
--- 24,31 ----
  	  in global terms what goes on when htsearch searches.
  	</p>
  	<p>
! 	  htsearch gets a list of (case insensitive) words from the HTML
! 	  form that invoked
  	  it. If htsearch was invoked with boolean expression parsing
  	  enabled, it will do a quick syntax check on the input words.
  	  If there are syntax errors, it will display the syntax error
***************
*** 36,46 ****
  	  If the boolean parser was not enabled, the list of words is
  	  converted into a boolean expression by putting either "and"s
  	  or "or"s between the words. (This depends on the search
! 	  type.)
  	</p>
  	<p>
! 	  In both cases, each of the words in the list is now expanded
! 	  using the search algorithms that were specified in the
  	  <a href="attrs.html#search_algorithm">search_algorithm</a>
  	  attribute. For example, the endings algorithm will convert a
  	  word like "person" into "person or persons". In this fashion,
--- 37,64 ----
  	  If the boolean parser was not enabled, the list of words is
  	  converted into a boolean expression by putting either "and"s
  	  or "or"s between the words. (This depends on the search
! 	  type.)  Phrases within double quotes (") specify that the words
! 	  must occur sequentially within the document.
  	</p>
  	<p>
! 	  If a word is immediately preceded by a field specifier
! 	  (title:, heading:, author:, keyword:, descr:, link:, url:)
! 	  then it will only match documents in which the word occurred
! 	  within that field.  For example, descr:foo only matches documents
! 	  containing &lt;meta name="description" content="... foo ..."&gt;.
! 	  The link: field refers to the text in the hyperlinks to a document,
! 	  rather than text within the document itself.  Similarly url:
! 	  (will eventually) refer to the actual URL of the document, not any
! 	  of its contents.
! 	  The prefixes exact: and hidden: are also accepted.
! 	  The former (will) cause the
! 	  <a href="attrs.html#search_algorithm">fuzzy search algorithm</a>
! 	  not to be applied to this word, while the latter causes the word
! 	  not to be displayed in the query string of the results page.
! 	</p>
! 	<p>
! 	  Each of the words in the list (but not within a phrase) is now
! 	  expanded using the search algorithms that were specified in the
  	  <a href="attrs.html#search_algorithm">search_algorithm</a>
  	  attribute. For example, the endings algorithm will convert a
  	  word like "person" into "person or persons". In this fashion,
***************
*** 78,84 ****
  	</p>
  	<hr size="4" noshade>
  
! 	Last modified: $Date: 2002/02/01 22:49:32 $
  
    </body>
  </html>
--- 96,102 ----
  	</p>
  	<hr size="4" noshade>
  
! 	Last modified: $Date: 2003/02/08 $
  
    </body>
  </html>
diff -rc -xCVS cvs/htdig/htsearch/WeightWord.cc profile/htsearch/WeightWord.cc
*** cvs/htdig/htsearch/WeightWord.cc	Sat Feb  2 09:49:35 2002
--- profile/htsearch/WeightWord.cc	Sun Feb  9 09:15:05 2003
***************
*** 33,38 ****
--- 33,40 ----
      isExact = 0;
      isHidden = 0;
      isIgnore = 0;
+ 
+     flags = FLAGS_MATCH_ONE;
  }
  
  
***************
*** 45,50 ****
--- 47,53 ----
      records = ww->records;
      isExact = ww->isExact;
      isHidden = ww->isHidden;
+     flags = ww->flags;
      word = ww->word;
      isIgnore = 0;
  }
***************
*** 59,64 ****
--- 62,92 ----
      isExact = 0;
      isHidden = 0;
      isIgnore = 0;
+ 
+     // allow a match with any field
+     flags = FLAGS_MATCH_ONE;
+ 
+     set(word);
+     this->weight = weight;
+ }
+ 
+ //***************************************************************************
+ // WeightWord::WeightWord(char *word, double weight, unsigned int f)
+ //
+ WeightWord::WeightWord(char *word, double weight, unsigned int f)
+ {
+     records = 0;
+ 
+     flags = f;
+     // if no fields specified, allow a match with any field
+     if (!(flags & FLAGS_MATCH_ONE))
+ 	flags ^= FLAGS_MATCH_ONE;
+ 
+     // ideally, these flags should all just be stored in a uint...
+     isExact = ((flags & FLAG_EXACT) != 0);
+     isHidden = ((flags & FLAG_HIDDEN) != 0);
+     isIgnore = ((flags & FLAG_IGNORE) != 0);
+ 
      set(word);
      this->weight = weight;
  }
***************
*** 77,82 ****
--- 105,111 ----
  //
  void WeightWord::set(char *word)
  {
+ #if 0
      isExact = 0;
      isHidden = 0;
      while (strchr(word, ':'))
***************
*** 104,109 ****
--- 133,139 ----
  	}
  		
      }
+ #endif
      this->word = word;
      this->word.lowercase();
  }
diff -rc -xCVS cvs/htdig/htsearch/WeightWord.h profile/htsearch/WeightWord.h
*** cvs/htdig/htsearch/WeightWord.h	Sat Feb  2 09:49:35 2002
--- profile/htsearch/WeightWord.h	Sun Feb  9 08:18:57 2003
***************
*** 19,24 ****
--- 19,25 ----
  
  #include "htString.h"
  #include "WordRecord.h"
+ #include "HtWordReference.h"	// for FLAG_...
  
  class WeightWord : public Object
  {
***************
*** 28,33 ****
--- 29,35 ----
      //
      WeightWord();
      WeightWord(char *word, double weight);
+     WeightWord(char *word, double weight, unsigned int flags);
      WeightWord(WeightWord *);
      
      virtual		~WeightWord();
***************
*** 37,45 ****
      String		word;
      double		weight;
      WordRecord		*records;
!     int			isExact;
!     int			isHidden;
!     int			isIgnore;
  };
  
  #endif
--- 39,48 ----
      String		word;
      double		weight;
      WordRecord		*records;
!     unsigned int	flags;
!     short int		isExact;
!     short int		isHidden;
!     short int		isIgnore;
  };
  
  #endif
diff -rc -xCVS cvs/htdig/htsearch/htsearch.cc profile/htsearch/htsearch.cc
*** cvs/htdig/htsearch/htsearch.cc	Wed Feb  5 22:05:58 2003
--- profile/htsearch/htsearch.cc	Sun Feb  9 09:44:31 2003
***************
*** 63,68 ****
--- 63,87 ----
  
  StringList              collectionList; // List of databases to search on
  
+ // recognised word prefixes (for field-restricted search and per-word fuzzy
+ // algorithms) in *descending* alphabetical order.
+ // Don't use a dictionary structure, as setup time outweighs the saving.
+ struct {char *name; unsigned int flag; } colonPrefix [] =
+ {
+     { "url",     FLAG_URL },
+     { "title",   FLAG_TITLE },
+     { "text",    FLAG_PLAIN },		// FLAG_TEXT is 0, i.e. *no* flag...
+     { "link",    FLAG_LINK_TEXT },
+     { "keyword", FLAG_KEYWORDS },
+     { "hidden",  FLAG_HIDDEN },
+     { "heading", FLAG_HEADING },
+     { "exact",   FLAG_EXACT },
+     { "descr",   FLAG_DESCRIPTION },
+ //    { "cap",     FLAG_CAPITAL },
+     { "author",  FLAG_AUTHOR },
+     { "",  0 },
+ };
+ 
  //*****************************************************************************
  // int main()
  //
***************
*** 512,517 ****
--- 531,537 ----
      unsigned char	t;
      String		word;
      const String	prefix_suffix = config->Find("prefix_match_character");
+ 
      while (*pos)
      {
  	while (1)
***************
*** 534,549 ****
  		tempWords.Add(new WeightWord(s, -1.0));
  		break;
  	    }
! 	    else if (HtIsWordChar(t) || t == ':' ||
! 			 (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255))
  	    {
! 		word = 0;
! 		while (t && (HtIsWordChar(t) ||
! 			     t == ':' || (strchr(prefix_suffix, t) != NULL) || (t >= 161 && t <= 255)))
  		{
! 		    word << (char) t;
! 		    t = *pos++;
! 		}
  
  		pos--;
  		if (boolean && (mystrcasecmp(word.get(), "+") == 0
--- 554,595 ----
  		tempWords.Add(new WeightWord(s, -1.0));
  		break;
  	    }
! 	    else if (HtIsWordChar(t) ||
! 		    	(strchr(prefix_suffix, t) != NULL) ||
! 			(t >= 161 && t <= 255))
  	    {
! 		unsigned int fieldFlag = 0;
! 		word =  0;
! 		do	// while recognised prefix, followed by ':'
  		{
! 		    while (t && (HtIsWordChar(t) ||
! 				 (strchr(prefix_suffix, t) != NULL) ||
! 				 (t >= 161 && t <= 255)))
! 		    {
! 			word << (char) t;
! 			t = *pos++;
! 		    }
! 		    if (t == ':')	// e.g. "author:word" to search
! 		    {			// only in author 
! 			word.lowercase();
! 			t = *pos++;
! 			if (t && (HtIsWordChar (t) ||
! 				     (strchr(prefix_suffix, t) != NULL) ||
! 				     (t >= 161 && t <= 255)))
! 			{
! 			    int i, cmp;
! 			    const char *w = word.get();
! 			    // linear search of known prefixes; "" is the end sentinel.
! 			    for (i = 0; (cmp = mystrcasecmp (w, colonPrefix[i].name)) < 0; i++)
! 				;
! 			    if (cmp == 0)	// if prefix found...
! 			    {
! 				fieldFlag |= colonPrefix [i].flag;
! 				word = 0;
! 			    }
! 			}
! 		    }
! 		} while (!word.length());
  
  		pos--;
  		if (boolean && (mystrcasecmp(word.get(), "+") == 0
***************
*** 565,571 ****
  		{
  		    // Add word to excerpt matching list
  		    originalPattern << word << "|";
! 		    WeightWord	*ww = new WeightWord(word, 1.0);
  		    if(HtWordNormalize(word) & WORD_NORMALIZE_NOTOK)
  			ww->isIgnore = 1;
  		    tempWords.Add(ww);
--- 611,617 ----
  		{
  		    // Add word to excerpt matching list
  		    originalPattern << word << "|";
! 		    WeightWord	*ww = new WeightWord(word, 1.0, fieldFlag);
  		    if(HtWordNormalize(word) & WORD_NORMALIZE_NOTOK)
  			ww->isIgnore = 1;
  		    tempWords.Add(ww);
***************
*** 646,651 ****
--- 692,699 ----
      {
  	WeightWord	*ww = (WeightWord *) tempWords[i];
  	if (ww->weight > 0 && !ww->isIgnore && !in_phrase)
+ //		I think that should be:
+ //	if (ww->weight > 0 && !ww->isIgnore && !in_phrase && !ww->isExact)
  	{
  	    //
  	    // Apply all the algorithms to the word.
***************
*** 699,707 ****
--- 747,757 ----
  	{
  	    if (debug > 1)
  	      cout << " " << word->get();
+ 	    // (should be a "copy with changed weight" constructor...)
  	    newWw = new WeightWord(word->get(), fuzzy->getWeight());
  	    newWw->isExact = ww->isExact;
  	    newWw->isHidden = ww->isHidden;
+ 	    newWw->flags = ww->flags;
  	    weightWords.Add(newWw);
  	}
  	if (debug > 1)
diff -rc -xCVS cvs/htdig/htsearch/parser.cc profile/htsearch/parser.cc
*** cvs/htdig/htsearch/parser.cc	Mon Dec 30 23:42:59 2002
--- profile/htsearch/parser.cc	Sun Feb  9 09:18:38 2003
***************
*** 238,244 ****
  	      {
                  if(!wordList) wordList = new List;
  		if(debug) cerr << "scoring phrase" << endl;
! 		score(wordList, weight);
  	      }
  	      break;
  	    }
--- 238,244 ----
  	      {
                  if(!wordList) wordList = new List;
  		if(debug) cerr << "scoring phrase" << endl;
! 		score(wordList, weight, FLAGS_MATCH_ONE); // look in all fields
  	      }
  	      break;
  	    }
***************
*** 381,387 ****
  	p[maximum_word_length] = '\0';
  
      List* result = words[p];
!     score(result, current->weight);
      delete result;
  }
  
--- 381,387 ----
  	p[maximum_word_length] = '\0';
  
      List* result = words[p];
!     score(result, current->weight, current->flags);
      delete result;
  }
  
***************
*** 510,517 ****
  }
  
  //*****************************************************************************
  void
! Parser::score(List *wordList, double weight)
  {
  	HtConfiguration* config= HtConfiguration::config();
      DocMatch	*dm;
--- 510,520 ----
  }
  
  //*****************************************************************************
+ // Allocate scores based on words in  wordList.
+ // Fields within which the word must appear are specified in  flags
+ // (see HtWordReference.h).
  void
! Parser::score(List *wordList, double weight, unsigned int flags)
  {
  	HtConfiguration* config= HtConfiguration::config();
      DocMatch	*dm;
***************
*** 550,555 ****
--- 553,568 ----
  	//
  	// *******  Compute the score for the document
  	//
+ 
+ 	// If word not in one of the required fields, skip the entry.
+ 	// Plain text sets no flag in dbase, so treat it separately.
+ 	if (!(wr->Flags() & flags) && (wr->Flags() || !(flags & FLAG_PLAIN)))
+ 	{
+ 	    if (debug > 2)
+ 		cerr << "Flags " << wr->Flags() << " lack " << flags << endl;
+ 	    continue;
+ 	}
+ 
  	wscore = 0.0;
  	if (wr->Flags() == FLAG_TEXT)		wscore += text_factor;
  	if (wr->Flags() & FLAG_CAPITAL)		wscore += caps_factor;
diff -rc -xCVS cvs/htdig/htsearch/parser.h profile/htsearch/parser.h
*** cvs/htdig/htsearch/parser.h	Mon Dec 30 23:42:59 2002
--- profile/htsearch/parser.h	Thu Feb  6 21:19:02 2003
***************
*** 56,62 ****
      void		perform_or();
      void		perform_phrase(List * &);
  
!     void		score(List *, double weight);
  
      List		*tokens;
      List		*result;
--- 56,62 ----
      void		perform_or();
      void		perform_phrase(List * &);
  
!     void		score(List *, double weight, unsigned int flags);
  
      List		*tokens;
      List		*result;
diff -rc -xCVS cvs/htdig/test/t_htsearch profile/test/t_htsearch
*** cvs/htdig/test/t_htsearch	Tue Jan 21 09:40:18 2003
--- profile/test/t_htsearch	Sun Feb  9 09:28:30 2003
***************
*** 106,111 ****
--- 106,139 ----
      "method=boolean&words=also+or+%22distribution" \
      'Expected quotes at the end'
  
+ try "Unrestricted search for 'group'" \
+     "method=and&words=group" \
+     '4 matches' 'script.html' 'bad_local.htm' 'site3.html' 'site4.html'
+ 
+ try "Field-restricted search for 'author:group'" \
+     "method=and&words=author:group" \
+     '1 match' 'script.html'
+ 
+ try "Field-restricted search for 'text:group'" \
+     "method=and&words=text:group" \
+     '3 matches' 'bad_local.htm' 'site3.html' 'site4.html'
+ 
+ try "Checking prefix parsing using 'text: group'" \
+     "method=and&words=text:%20group" \
+     '1 match' 'script.html'
+ 
+ try "Checking prefix parsing using 'text::group'" \
+     "method=and&words=text::group" \
+     '1 match' 'script.html'
+ 
+ try "Checking prefix parsing using 'unknown:group'" \
+     "method=any&words=unknown:group" \
+     '5 matches' 'script.html' 'bad_local.htm' 'site3.html' 'site4.html' 'set1/"'
+ 
+ try "Field-restricted search for 'descr:cost'" \
+     "method=and&words=descr:cost" \
+     '1 match' 'script.html'
+ 
  config=$testdir/conf/htdig.conf3
  
  try "Testing boolean_keywords and search_rewrite_urls" \
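
For anyone reviewing the patch, the two less obvious pieces (the prefix lookup in htsearch.cc and the field test in Parser::score) can be tried in isolation. The following is only a minimal sketch, not code from the patch: the DEMO_* names, their bit values, and the helpers lookupPrefix and fieldMatches are hypothetical stand-ins, and plain strcmp replaces mystrcasecmp on the assumption that the word has already been lowercased.

```cpp
#include <cassert>
#include <cstring>

// Placeholder bit values for illustration only; the real FLAG_* values
// are defined in HtWordReference.h and may differ.
enum : unsigned {
    DEMO_TITLE     = 1u << 0,
    DEMO_AUTHOR    = 1u << 1,
    DEMO_PLAIN     = 1u << 2,   // query-side only: "match unflagged body text"
    DEMO_MATCH_ONE = DEMO_TITLE | DEMO_AUTHOR | DEMO_PLAIN
};

struct Prefix { const char *name; unsigned flag; };

// Kept in descending alphabetical order and terminated by "", which
// compares below every non-empty word and so always stops the scan.
static const Prefix demoPrefixes[] = {
    { "title",  DEMO_TITLE  },
    { "text",   DEMO_PLAIN  },
    { "author", DEMO_AUTHOR },
    { "",       0 },
};

// Return the flag for a lower-case prefix, or 0 if it is not recognised.
unsigned lookupPrefix(const char *w)
{
    assert(w != nullptr);
    int i = 0, cmp;
    // Walk down the table while w still sorts below the current entry.
    while ((cmp = std::strcmp(w, demoPrefixes[i].name)) < 0)
        i++;
    return cmp == 0 ? demoPrefixes[i].flag : 0;
}

// True if a stored word (wordFlags) satisfies the query mask (queryFlags).
// Body text is stored with no flag at all, so it can only match when the
// query mask includes the "plain" bit.
bool fieldMatches(unsigned wordFlags, unsigned queryFlags)
{
    if (wordFlags & queryFlags)
        return true;
    return wordFlags == 0 && (queryFlags & DEMO_PLAIN) != 0;
}
```

With these stand-ins, a query for author:group gives lookupPrefix("author") == DEMO_AUTHOR, so fieldMatches rejects an unflagged body-text occurrence and accepts one flagged as author. An unrecognised prefix returns 0, which corresponds to the case where the WeightWord constructor falls back to FLAGS_MATCH_ONE.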
