Here are patches for conv_doc.pl and parse_doc.pl.

In both scripts, I've changed "break" to "last".  Perl doesn't
have a break statement.  (The -w flag would have caught this.)
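
For anyone unfamiliar with the idiom, here is a minimal sketch of
leaving a loop with "last" once the first title line is found.  The
sub name first_title and the sample input are made up for
illustration; they are not taken from either script.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: return the first "Title:" value, HTML-escaped.
sub first_title {
    my @lines = @_;
    my $title = "";
    for my $line (@lines) {
        if ($line =~ /^Title:\s*(.*)/) {
            $title = $1;
            $title =~ s/&/\&amp\;/g;
            $title =~ s/</\&lt\;/g;
            $title =~ s/>/\&gt\;/g;
            last;    # exit the loop at the first match
        }
    }
    return $title;
}

# Made-up sample input: the second "Title:" line must be ignored.
print first_title("Author: nobody\n",
                  "Title: A <b>test</b>\n",
                  "Title: ignored\n"), "\n";
```

With "last" the loop stops at the first matching line, so a later
"Title:" line cannot overwrite the value.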

In parse_doc.pl, I replaced the code for parsing a line into
a list of words with a much simpler expression using "split".
The intent of the original code was hard to grasp, but it was
spitting out "words" that included multiple punctuation
characters, and filling up my word database with gibberish.
I also streamlined parse_doc.pl in a few other places,
without (I hope) changing the output.  I made the changes
to parse_doc.pl before I realized that it had been mostly
superseded by conv_doc.pl, but I'm including the patch anyway,
for whatever it's worth.
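
As a rough illustration of the "split" approach, here is a
stand-alone sketch.  The sub name extract_words, the sample line,
and the minimum length of 3 are my assumptions for the example;
only the tr/split/grep combination comes from the patch below.

```perl
#!/usr/bin/perl -w
use strict;

# Assumed value for illustration; the real script sets its own.
my $minimum_word_length = 3;

# Hypothetical helper: turn one line of text into a list of words.
sub extract_words {
    my ($line) = @_;
    # Delete the punctuation characters listed in the patch, so that
    # "e-mail" becomes "email" rather than two separate words.
    (my $text = $line) =~ tr{-\255._/!#$%^&'}{}d;
    # Split on runs of non-word characters and drop short words.
    return grep { length($_) >= $minimum_word_length } split /\W+/, $text;
}

# Made-up sample line with punctuation at word boundaries.
my @words = extract_words(qq{The e-mail said: "it's (almost) done!!" -- ok?\n});
print join(" ", @words), "\n";
```

Because "split /\W+/" treats any run of non-word characters as a
single separator, stray punctuation never ends up inside a word.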

-- 
Warren Jones

---------------------------- snip snip ---------------------------- 

Index: conv_doc.pl
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/contrib/conv_doc.pl,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 conv_doc.pl
*** conv_doc.pl 1999/12/15 22:03:07     1.1.1.1
--- conv_doc.pl 2000/01/12 22:02:43
***************
*** 131,137 ****
                  s/</\&lt\;/g;
                  s/>/\&gt\;/g;
                  $title = $_;
!                 break;
              }
          }
          close INFO;
--- 131,137 ----
                  s/</\&lt\;/g;
                  s/>/\&gt\;/g;
                  $title = $_;
!                 last;
              }
          }
          close INFO;
***************
*** 190,196 ****
  open(CAT, "$cvtcmd |") || die "$cvtr doesn't want to be opened using pipe.\n";
  while (<CAT>) {
      while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!         $_ .= <CAT> || break;
          s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
      }
      s/[\255]/-/g;                       # replace dashes with hyphens
--- 190,196 ----
  open(CAT, "$cvtcmd |") || die "$cvtr doesn't want to be opened using pipe.\n";
  while (<CAT>) {
      while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!         $_ .= <CAT> || last;
          s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
      }
      s/[\255]/-/g;                       # replace dashes with hyphens


Index: parse_doc.pl
===================================================================
RCS file: /home/wjones/src/CVS.repo/htdig/contrib/parse_doc.pl,v
retrieving revision 1.1.1.1
diff -c -r1.1.1.1 parse_doc.pl
*** parse_doc.pl        1999/11/29 20:02:20     1.1.1.1
--- parse_doc.pl        2000/01/12 21:48:18
***************
*** 70,76 ****
  @allwords = ();
  @temp = ();
  $x = 0;
- @fields = ();
  $calc = 0;
  $dehyphenate = 0;
  $title = "";
--- 70,75 ----
***************
*** 122,128 ****
                                  $title =~ s/&/\&amp\;/g;
                                  $title =~ s/</\&lt\;/g;
                                  $title =~ s/>/\&gt\;/g;
!                                 break;
                          }
                  }
                  close INFO;
--- 121,127 ----
                                  $title =~ s/&/\&amp\;/g;
                                  $title =~ s/</\&lt\;/g;
                                  $title =~ s/>/\&gt\;/g;
!                                 last;
                          }
                  }
                  close INFO;
***************
*** 153,174 ****
  open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
  while (<CAT>) {
          while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!                 $_ .= <CAT> || break;
                  s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
          }
          $head .= " " . $_;
!         s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+\s+|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+$/ /g;    # replace reading-chars with space (only at end or begin of word, but allow multiple characters)
! #       s/\s[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]\s|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]$/ /g;    # replace reading-chars with space (only at end or begin of word)
! #       s/[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]/ /g;      # rigorously replace all by <[EMAIL PROTECTED]>
! #       s/[\-\255]/ /g;                                 # replace hyphens with space
!         s/[\255]/-/g;                                   # replace dashes with hyphens
!         @fields = split;                                # split up line
!         next if (@fields == 0);                         # skip if no fields (does it speed up?)
!         for ($x=0; $x<@fields; $x++) {                  # check each field if string length >= 3
!                 if (length($fields[$x]) >= $minimum_word_length) {
!                         push @allwords, $fields[$x];    # add to list
!                 }
!         }
  }
  
  close CAT;
--- 152,166 ----
  open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
  while (<CAT>) {
          while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
!                 $_ .= <CAT> || last;
                  s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
          }
          $head .= " " . $_;
!       # Delete valid punctuation.  These are the default values
!       # for valid_punctuation, and should be changed if other values
!       # are specified in the config file.
!       tr{-\255._/!#$%^&'}{}d;
!       push @allwords, grep { length >= $minimum_word_length } split /\W+/;
  }
  
  close CAT;
***************
*** 207,215 ****
  
  #############################################
  # now the words
! for ($x=0; $x<@allwords; $x++) {
!         $calc=int(1000*$x/@allwords);           # calculate rel. position (0-1000)
!         print "w\t$allwords[$x]\t$calc\t0\n";   # print out word, rel. pos. and text type (0)
  }
  
  $calc=@allwords;
--- 199,208 ----
  
  #############################################
  # now the words
! $x = 0;
! for ( @allwords ) {
!     # print out word, rel. pos. and text type (0)
!     printf "w\t%s\t%d\t0\n", $_, 1000*$x++/@allwords;
  }
  
  $calc=@allwords;
