According to David Adams:
> > I am using version 3.1.5 of htdig and I am wondering about two
> > things:
> >
> > 1) Is it somehow possible to convert the annoying %20 into a space
> > in the bold written title of each search result? Documents like
> > this
> > 'PDF Document
> > The%20Creation%20Of%20A%20Knowledge%20Management%20Process%20Model.pdf'
> > are hard to read. I would very much appreciate a way to convert
> > the %20 within this string (not the URL) into a space.
> >
> 
> If you are using doc2html.pl to index PDF files, then you need this addition
> which I emailed to this list a few days ago:
> 
> After
> 
>   $Name = $URL;
>   $Name =~ s#^.*/##;
> 
> add
> 
>   $Name =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;
> 
> and re-index.
> 
> A similar change should work with the other external parser/converter Perl
> scripts.

If I'm not mistaken, the change above is also needed in pdf2html.pl and
swf2html.pl, if you use these.  I'm making those changes in the cvs tree
for the htdig contrib subdirectory.  The patch below will make the
corresponding change to conv_doc.pl and parse_doc.pl, as well as another
bug fix.  I'm also changing these in the cvs tree.

> > 2) When I have more than 100 results, how can I access results >
> > 100?
> >
> 
> There is a config file attribute to change the number of results returned
> per page; increase it to (say) 20.

That attribute is matches_per_page, which can also be overridden with the
"matchesperpage" CGI input parameter in the search form.  The other option
is to increase the number of pages, using maximum_pages, but then you may
want to change the page buttons to text links so they all look consistent,
as the distribution comes with only 10 page button images.
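For example (the values here are illustrative; see the attrs.html page below for defaults), in your htdig configuration file:

```
matches_per_page:  20
maximum_pages:     10
```

A search form can also override the per-page count for a single query with a hidden input such as `<input type="hidden" name="matchesperpage" value="50">`.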

See http://www.htdig.org/attrs.html and http://www.htdig.org/hts_form.html

Here's the patch to conv_doc.pl and parse_doc.pl, which you can apply to
the 3.1.5 or 3.2.0b3 sources (or later snapshots), from the main source
directory, with the command "patch -p0 < this-message".

--- contrib/conv_doc.pl.orig    Tue Feb 15 15:17:56 2000
+++ contrib/conv_doc.pl Thu Jul 12 09:38:29 2001
@@ -39,6 +39,9 @@
 # Added:        test for null device on Win32 env.   <[EMAIL PROTECTED]>
 # 2000/01/12
 # Changed:      "break" to "last" (no break in Perl) <[EMAIL PROTECTED]>
+# 2001/07/12
+# Changed:      fix "last" handling in dehyphenation <[EMAIL PROTECTED]>
+# Added:        handle %xx codes in title from URL   <[EMAIL PROTECTED]>
 #########################################
 #
 # set this to your MS Word to text converter
@@ -182,6 +185,7 @@ print "<HTML>\n<head>\n";
 # print out the title, if it's set, and not just a file name, or make one up
 if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
     @parts = split(/\//, $ARGV[2]);         # get the file basename
+    $parts[-1] =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;
     $title = "$type Document $parts[-1]";   # use it in title
 }
 print "<title>$title</title>\n";
@@ -192,8 +196,9 @@ print "</head>\n<body>\n";
 open(CAT, "$cvtcmd |") || die "$cvtr doesn't want to be opened using pipe.\n";
 while (<CAT>) {
     while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
-        $_ .= <CAT> || last;
-        s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
+        $_ .= <CAT>;
+        last if eof;
+        s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s
     }
     s/[\255]/-/g;                       # replace dashes with hyphens
     s/\f/\n/g;                          # replace form feed
--- contrib/parse_doc.pl.orig   Tue Feb 15 15:28:38 2000
+++ contrib/parse_doc.pl        Thu Jul 12 09:38:21 2001
@@ -38,6 +38,9 @@
 # Changed:      "break" to "last" (no break in Perl) <[EMAIL PROTECTED]>
 # Changed:      code for parsing a line into a list of
 #               words, to use "split", other streamlining.
+# 2001/07/12
+# Changed:      fix "last" handling in dehyphenation <[EMAIL PROTECTED]>
+# Added:        handle %xx codes in title from URL   <[EMAIL PROTECTED]>
 #########################################
 #
 # set this to your MS Word to text converter
@@ -157,8 +160,9 @@ die "Hmm. $parser is absent or unwilling
 open(CAT, "$parsecmd") || die "Hmmm. $parser doesn't want to be opened using pipe.\n";
 while (<CAT>) {
         while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) {
-                $_ .= <CAT> || last;
-                s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/
+                $_ .= <CAT>;
+                last if eof;
+                s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s
         }
         $head .= " " . $_;
 #       s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+\s+|^[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+$/ /g;    # replace reading-chars with space (only at end or begin of word, but allow multiple characters)
@@ -191,6 +195,7 @@ if ($title !~ /^$/ && $title !~ /^[A-G]:
         print "t\t$title\n";
 } else {                                        # otherwise generate a title
         @temp = split(/\//, $ARGV[2]);          # get the filename, get rid of basename
+        $temp[-1] =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;
         print "t\t$type Document $temp[-1]\n";  # print it
 }
 
--- contrib/doc2html/doc2html.pl.orig   Tue Jun  5 04:12:02 2001
+++ contrib/doc2html/doc2html.pl        Thu Jul 12 09:41:51 2001
@@ -184,6 +184,7 @@ sub init {
   $URL = $ARGV[2] || '?';
   $Name = $URL;
   $Name =~ s#^.*/##;
+  $Name =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;
   
   if ($Verbose and not $LOG) { print STDERR "\n$Prog: [$MIME_type] " }
   if ($LOG) { print STDERR "$URL [$MIME_type] " }

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
