According to Josep Llauradó Selvas:
> I get the next lines as output of htdig (with -vvv):
...
> Header line: HTTP/1.1 200 OK
> Header line: Date: Fri, 26 Jan 2001 09:43:09 GMT
> Header line: Server: Apache/1.3.9 (Unix) Debian/GNU PHP/4.0.3pl1
> Header line: Last-Modified: Mon, 30 Oct 2000 15:29:29 GMT
> Translated Mon, 30 Oct 2000 15:29:29 GMT to 2000-10-30 15:29:29 (100)
> And converted to Mon, 30 Oct 2000 15:29:29
> Header line: ETag: "44222-5400-39fd93d9"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 21504
> Header line: Connection: close
> Header line: Content-Type: application/msword; charset=iso-8859-1
>  not HTML
> 
> ------------------------------------------------------------------------------
> 
> As I can understand, the htdig don't get Word documents, 'cos closes the
> connection to the server saying that the document isn't an HTML document.
> 
> In my config file I have the following lines:
> 
> external_parsers: application/msword->text/html /usr/share/htdig/parse_doc.pl \
>                   application/postscript->text/html /usr/share/htdig/parse_doc.pl \
>                   application/pdf->text/html /usr/share/htdig/parse_doc.pl
> 
> but the parser seems to miss the external parsers or something like
> that. I have tried the parse_doc and the conv_doc perl scripts and don't
> run, also the syntax:
> 
>       'application/msword->text/html' and the 'application/msword'
> 
> without any satisfactory result. What I'm doing wrong?
> 
> I'm using Debian GNU/Linux 2.2 Potato with htdig-3.1.5-2

First of all, get rid of parse_doc.pl, and use conv_doc.pl or doc2html.pl.
That won't solve this particular problem, but it will avoid a whole lot
of other problems down the road.

The syntax 'application/msword->text/html' is for external converters,
like conv_doc.pl and doc2html.pl, which output HTML code, rather than
an external parser which outputs its own unique codes for parsed text.
See http://htdig.sourceforge.net/attrs.html#external_parsers for details.
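
To make the two forms of a type entry concrete, here is a minimal sketch of
splitting one external_parsers type specification at the "->" separator.
It uses std::string rather than ht://Dig's own String class, and the helper
name split_type_spec is hypothetical, not something from the htdig source.
An empty second element means a plain external parser entry; a non-empty
one names the type an external converter is expected to produce.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of ht://Dig): split a type specification
// such as "application/msword->text/html" into {source type, target type}.
// A specification without "->" yields an empty target type.
std::pair<std::string, std::string> split_type_spec(const std::string &spec)
{
    const std::string::size_type sep = spec.find("->");
    if (sep == std::string::npos)
        return {spec, std::string()};          // plain external parser
    return {spec.substr(0, sep),               // text before "->"
            spec.substr(sep + 2)};             // text after "->"
}
```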

The problem with htdig not recognising the external parser definitions
is a known bug, which seems to show up on recent Debian configurations.
There are two fixes for it.  One is to configure Apache not to add the
"charset=" string to the Content-Type header, which is what confuses
htdig.

Here is the suggestion from Elijah Kagan <[EMAIL PROTECTED]>:

  There are two parameters in Apache config file that tell it to add a
  charset field by default. They are: AddDefaultCharset and
  AddDefaultCharsetName. The first one should be set to off to prevent
  Apache from replying with a charset field set after the content type.

  After disabling AddDefaultCharset htdig worked as expected.

The other fix is to apply a patch to htdig/ExternalParser.cc to make
it correctly handle these headers.  Applying this patch is a bit more
work than the quick fix above, but it gets to the root of the problem.
This patch also fixes a few other bugs in the external parser support.
It's backported from the 3.2.0b3 development code.  The patch archive
seems to be down right now, so I'll repost the patch here...
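
As a standalone illustration of the header handling involved (a sketch only,
not code taken from the patch): the lookup fails because the whole header
value, including "; charset=...", is used as the key. Cutting the value at
the first ';' and lowercasing it gives a key that matches the configured
type. The helper name normalize_mime is hypothetical, it uses std::string
instead of ht://Dig's String class, and the trimming of trailing blanks is
my own addition.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of ht://Dig): reduce a Content-Type value
// such as "application/msword; charset=iso-8859-1" to a bare, lowercase
// MIME type suitable as a lookup key.
std::string normalize_mime(const std::string &content_type)
{
    // Keep only the part before the first ';' (find() returns npos when
    // there is no ';', in which case substr() keeps the whole string).
    std::string mime = content_type.substr(0, content_type.find(';'));

    // Drop trailing whitespace left in front of the ';'.
    while (!mime.empty() &&
           std::isspace(static_cast<unsigned char>(mime.back())))
        mime.pop_back();

    // MIME types are case-insensitive, so compare them in lowercase.
    std::transform(mime.begin(), mime.end(), mime.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return mime;
}
```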

According to Elijah Kagan:
> I run htdig 3.1.5.
> I tried both the Debian package and a compiled one with the same result.
> I am absolutely sure there is something stupid I forgot to put into the
> configuration.

OK, after getting to the bottom of this (I think!), I have backported
the 3.2.0b3 development code for htdig/ExternalParser.cc to version
3.1.5, to fix this and other problems.  Please give this patch file
a try and let me know if it works.  You will probably get a warning
about the wait() function being implicitly declared, unless you manually
define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system
has <wait.h> or <sys/wait.h>).  Also, if your system has the mkstemp()
function, you may want to define HAVE_MKSTEMP manually as well, as this
will enhance security.  I didn't have time to figure out how to patch
aclocal.m4 and configure to add tests for all of these.
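
For anyone wondering what HAVE_MKSTEMP buys you, here is a minimal sketch of
the mkstemp() approach on a POSIX system. It is not the patch code: the
helper name write_temp_file is hypothetical, and it uses std::string where
htdig uses its own String class. mkstemp() replaces the XXXXXX in the
template with a unique suffix and creates the file atomically, so the name
can't be guessed in advance the way a name built from getpid() can.

```cpp
#include <stdlib.h>   // mkstemp
#include <unistd.h>   // write, close, unlink
#include <string>
#include <vector>

// Hypothetical helper (not part of ht://Dig): write `data` to a uniquely
// named temporary file in `dir`. Returns the file's path, or an empty
// string on failure.
std::string write_temp_file(const std::string &dir, const std::string &data)
{
    const std::string tmpl = dir + "/htdex.XXXXXX";
    std::vector<char> path(tmpl.begin(), tmpl.end());
    path.push_back('\0');                  // mkstemp() edits this buffer

    const int fd = mkstemp(path.data());   // creates and opens the file
    if (fd < 0)
        return std::string();

    const char *p = data.data();
    size_t left = data.size();
    while (left > 0)
    {
        const ssize_t n = write(fd, p, left);
        if (n < 0)                         // give up and clean up on error
        {
            close(fd);
            unlink(path.data());
            return std::string();
        }
        p += n;
        left -= static_cast<size_t>(n);
    }
    close(fd);
    return std::string(path.data());
}
```

The caller is responsible for unlinking the returned path once the external
program has finished with it.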

The patch fixes the following problems in external_parsers support in
3.1.5:
  - it got confused by "; charset=..." in the Content-Type header,
    as described in "http://www.htdig.org/mail/2000/09/index.html#75".
  - security problems with using popen(), and therefore the shell,
    to parse URL and content-type strings from untrusted sources
    (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
  - used predictable temporary file name, which could be exploited
    via symlinks - fixed if mkstemp() exists & HAVE_MKSTEMP is defined.
  - binary output from an external converter could get mangled.
  - error messages were sometimes ambiguous or missing altogether.
  - didn't open temporary file in binary mode for non-Unix systems
    (attempts were made to fix this, but it's not clear yet whether
     the security fixes and pipe/fork/exec will port well to Cygwin).

Here's the patch, which you can apply in the main source directory for
htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig        Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc     Mon Jan 15 17:16:50 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+//                 Allows external programs to parse unknown document formats.
+//                 The parser is expected to return the document in a 
+//                 specific format. The format is documented 
+//                 in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package   <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 17:16:50 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include <ctype.h>
 #include <stdio.h>
-#include "good_strtok.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary      *parsers = 0;
 static Dictionary      *toTypes = 0;
@@ -32,9 +51,18 @@ extern String                configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+  String mime;
+  int sep;
+
     if (canParse(contentType))
     {
-       currentParser = ((String *)parsers->Find(contentType))->get();
+        String mime = contentType;
+       mime.lowercase();
+       sep = mime.indexOf(';');
+       if (sep != -1)
+         mime = mime.sub(0, sep).get();
+       
+       currentParser = ((String *)parsers->Find(mime))->get();
     }
     ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+  int                  sep;
+
     if (!parsers)
     {
        parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
        QuotedStringList        qsl(config["external_parsers"], " \t");
        String                  from, to;
        int                     i;
-       int                     sep;
 
        for (i = 0; qsl[i]; i += 2)
        {
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
                to = from.sub(sep+2).get();
                from = from.sub(0, sep).get();
            }
+           from.lowercase();
+           sep = from.indexOf(';');
+           if (sep != -1)
+             from = from.sub(0, sep).get();
+
            parsers->Add(from, new String(qsl[i + 1]));
            toTypes->Add(from, new String(to));
        }
     }
-    return parsers->Exists(contentType);
+
+    String mime = contentType;
+    mime.lowercase();
+    sep = mime.indexOf(';');
+    if (sep != -1)
+      mime = mime.sub(0, sep).get();
+    return parsers->Exists(mime);
 }
 
 //*****************************************************************************
@@ -132,52 +172,128 @@ ExternalParser::parse(Retriever &retriev
     // Write the contents to a temporary file.
     //
     String      path = getenv("TMPDIR");
+    int                fd;
     if (path.length() == 0)
       path = "/tmp";
-    path << "/htdext." << getpid();
+#ifndef HAVE_MKSTEMP
+    path << "/htdext." << getpid(); // This is unfortunately predictable
 
-    FILE       *fl = fopen(path, "w");
-    if (!fl)
+#ifdef O_BINARY
+    fd = open((char*)path, O_WRONLY|O_CREAT|O_EXCL|O_BINARY);
+#else
+    fd = open((char*)path, O_WRONLY|O_CREAT|O_EXCL);
+#endif
+#else
+    path << "/htdex.XXXXXX";
+    fd = mkstemp((char*)path);
+    // can we force binary mode somehow under Cygwin, if it has mkstemp?
+#endif
+    if (fd < 0)
     {
-       return;
+      if (debug)
+       cout << "External parser error: Can't create temp file "
+            << (char *)path << endl;
+      return;
     }
     
-    fwrite(contents->get(), 1, contents->length(), fl);
-    fclose(fl);
-    
-    //
-    // Now start the external parser.
-    //
-    String     command = currentParser;
-    command << ' ' << path << ' ' << contentType << " \"" << base.get() <<
-       "\" " << configFile;
-
-    FILE       *input = popen(command, "r");
-    if (!input)
-    {
-       unlink(path);
-       return;
-    }
-
-    unsigned int minimum_word_length
-      = config.Value("minimum_word_length", 3);
+    write(fd, contents->get(), contents->length());
+    close(fd);
 
+    unsigned int minimum_word_length = config.Value("minimum_word_length", 3);
     String     line;
     char       *token1, *token2, *token3;
-    int                loc, hd;
+    int                loc = 0, hd = 0;
     URL                url;
-    String     convertToType = ((String *)toTypes->Find(contentType))->get();
+    String mime = contentType;
+    mime.lowercase();
+    int        sep = mime.indexOf(';');
+    if (sep != -1)
+      mime = mime.sub(0, sep).get();
+    String     convertToType = ((String *)toTypes->Find(mime))->get();
     int                get_hdr = (mystrcasecmp(convertToType, "user-defined") == 0);
     int                get_file = (convertToType.length() != 0);
     String     newcontent;
-    while (readLine(input, line))
+
+    StringList cpargs(currentParser);
+    char   **parsargs = new char * [cpargs.Count() + 5];
+    int    argi;
+    for (argi = 0; argi < cpargs.Count(); argi++)
+       parsargs[argi] = (char *)cpargs[argi];
+    parsargs[argi++] = path.get();
+    parsargs[argi++] = contentType.get();
+    parsargs[argi++] = (char *)base.get();
+    parsargs[argi++] = configFile.get();
+    parsargs[argi++] = 0;
+
+    int    stdout_pipe[2];
+    int           fork_result = -1;
+    int           fork_try;
+
+    if (pipe(stdout_pipe) == -1)
+    {
+      if (debug)
+       cout << "External parser error: Can't create pipe!" << endl;
+      unlink((char*)path);
+      delete [] parsargs;
+      return;
+    }
+
+    for (fork_try = 4; --fork_try >= 0;)
+    {
+      fork_result = fork(); // Fork so we can execute in the child process
+      if (fork_result != -1)
+       break;
+      if (fork_try)
+       sleep(3);
+    }
+    if (fork_result == -1)
+    {
+      if (debug)
+       cout << "Fork Failure in ExternalParser" << endl;
+      unlink((char*)path);
+      delete [] parsargs;
+      return;
+    }
+
+    if (fork_result == 0) // Child process
+    {
+       close(STDOUT_FILENO); // Close handle STDOUT to replace with pipe
+       dup(stdout_pipe[1]);
+       close(stdout_pipe[0]);
+       close(stdout_pipe[1]);
+       close(STDIN_FILENO); // Close STDIN to replace with file
+       open((char*)path, O_RDONLY);
+
+       // Call External Parser
+       execv(parsargs[0], parsargs);
+
+       exit(EXIT_FAILURE);
+    }
+
+    // Parent Process
+    delete [] parsargs;
+    close(stdout_pipe[1]); // Close STDOUT for writing
+#ifdef O_BINARY
+    FILE *input = fdopen(stdout_pipe[0], "rb");
+#else
+    FILE *input = fdopen(stdout_pipe[0], "r");
+#endif
+    if (input == NULL)
+    {
+      if (debug)
+       cout << "Fdopen Failure in ExternalParser" << endl;
+      unlink((char*)path);
+      return;
+    }
+
+    while ((!get_file || get_hdr) && readLine(input, line))
     {
        if (get_hdr)
        {
            line.chop('\r');
            if (line.length() == 0)
                get_hdr = FALSE;
-           else if (mystrncasecmp(line, "content-type:", 13) == 0)
+           else if (mystrncasecmp((char*)line, "content-type:", 13) == 0)
            {
                token1 = line.get() + 13;
                while (*token1 && isspace(*token1))
@@ -187,24 +303,9 @@ ExternalParser::parse(Retriever &retriev
            }
            continue;
        }
-       if (get_file)
-       {
-           if (newcontent.length() == 0 &&
-               !canParse(convertToType) &&
-               mystrncasecmp(convertToType, "text/", 5) != 0 &&
-               mystrncasecmp(convertToType, "application/pdf", 15) != 0)
-           {
-               if (mystrcasecmp(convertToType, "user-defined") == 0)
-                   cerr << "External parser error: no Content-Type given\n";
-               else
-                   cerr << "External parser error: can't parse Content-Type \""
-                        << convertToType << "\"\n";
-               cerr << " URL: " << base.get() << "\n";
-               break;
-           }
-           newcontent << line << '\n';
-           continue;
-       }
+#ifdef O_BINARY
+       line.chop('\r');
+#endif
        token1 = strtok(line, "\t");
        if (token1 == NULL)
            token1 = "";
@@ -223,7 +324,7 @@ ExternalParser::parse(Retriever &retriev
                        (hd = atoi(token3)) >= 0 && hd < 12)
                  retriever.got_word(token1, loc, hd);
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected word in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
                
            case 'u':   // href
@@ -236,7 +337,7 @@ ExternalParser::parse(Retriever &retriev
                  retriever.got_href(url, token2);
                }
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected URL in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
                
            case 't':   // title
@@ -244,7 +345,7 @@ ExternalParser::parse(Retriever &retriev
                if (token1 != NULL)
                  retriever.got_title(token1);
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected title in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
                
            case 'h':   // head
@@ -252,7 +353,7 @@ ExternalParser::parse(Retriever &retriev
                if (token1 != NULL)
                  retriever.got_head(token1);
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected text in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
                
            case 'a':   // anchor
@@ -260,7 +361,7 @@ ExternalParser::parse(Retriever &retriev
                if (token1 != NULL)
                  retriever.got_anchor(token1);
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected anchor in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
                
            case 'i':   // image url
@@ -268,7 +369,7 @@ ExternalParser::parse(Retriever &retriev
                if (token1 != NULL)
                  retriever.got_image(token1);
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected image URL in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
 
            case 'm':   // meta
@@ -310,7 +411,7 @@ ExternalParser::parse(Retriever &retriev
                    if (mystrcasecmp(httpEquiv, "refresh") == 0
                        && *content != '\0')
                    {
-                     char *q = mystrcasestr(content, "url=");
+                     char *q = (char*)mystrcasestr(content, "url=");
                      if (q && *q)
                      {
                        q += 4; // skiping "URL="
@@ -365,7 +466,7 @@ ExternalParser::parse(Retriever &retriev
                        meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
                      if (debug > 1)
                        cout << "META Description: " << content << endl;
-                     retriever.got_meta_dsc(meta_dsc);
+                     retriever.got_meta_dsc((char*)meta_dsc);
 
                      //
                      // Now add the words to the word list
@@ -382,17 +483,42 @@ ExternalParser::parse(Retriever &retriev
                  }
                }
                else
-                 cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: expected metadata in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
              }
 
            default:
-               cerr<< "External parser error in line:"<<line<<"\n" << " URL: " << base.get() << "\n";
+                 cerr<< "External parser error: unknown field in line "<<line<<"\n" << " URL: " << base.get() << "\n";
                break;
        }
+    } // while(readLine)
+    if (get_file)
+    {
+       if (!canParse(convertToType) &&
+           mystrncasecmp((char*)convertToType, "text/", 5) != 0 &&
+           mystrncasecmp((char*)convertToType, "application/pdf", 15) != 0)
+       {
+           if (mystrcasecmp((char*)convertToType, "user-defined") == 0)
+               cerr << "External parser error: no Content-Type given\n";
+           else
+               cerr << "External parser error: can't parse Content-Type \""
+                    << convertToType << "\"\n";
+           cerr << " URL: " << base.get() << "\n";
+       }
+       else
+       {
+           char        buffer[2048];
+           int         length;
+           while ((length = fread(buffer, 1, sizeof(buffer), input)) > 0)
+               newcontent.append(buffer, length);
+       }
     }
-    pclose(input);
-    unlink(path);
+    fclose(input);
+    // close(stdout_pipe[0]); // This is closed for us by the fclose()
+    int rpid, status;
+    while ((rpid = wait(&status)) != fork_result && rpid != -1)
+       ;
+    unlink((char*)path);
 
     if (newcontent.length() > 0)
     {
@@ -407,19 +533,19 @@ ExternalParser::parse(Retriever &retriev
            currentParser = ((String *)parsers->Find(contentType))->get();
            parsable = this;
        }
-       else if (mystrncasecmp(contentType, "text/html", 9) == 0)
+       else if (mystrncasecmp((char*)contentType, "text/html", 9) == 0)
        {
            if (!html)
                html = new HTML();
            parsable = html;
        }
-       else if (mystrncasecmp(contentType, "text/plain", 10) == 0)
+       else if (mystrncasecmp((char*)contentType, "text/plain", 10) == 0)
        {
            if (!plaintext)
                plaintext = new Plaintext();
            parsable = plaintext;
        }
-       else if (mystrncasecmp(contentType, "application/pdf", 15) == 0)
+       else if (mystrncasecmp((char*)contentType, "application/pdf", 15) == 0)
        {
            if (!pdf)
                pdf = new PDF();
@@ -432,7 +558,7 @@ ExternalParser::parse(Retriever &retriev
            parsable = plaintext;
            if (debug)
                cout << "External parser error: \"" << contentType <<
-                       "\" not a recognized type.  Assuming text\n";
+                       "\" not a recognized type.  Assuming text/plain\n";
        }
        parsable->setContents(newcontent.get(), newcontent.length());
        parsable->parse(retriever, base);

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

