OK, I hinted at it last week, and worked on it a bit Friday and quite a
bit more today.  The following patch introduces htdump/htload utilities to
the 3.1.5 version of htdig.  To keep it easier to install (i.e. to avoid
messing with autoconf or the makefiles), I set it up as an extension to
the htdig program, selected by symbolic links to htdig (or copies of it)
with the names htdump and htload.

htdump will dump out an ASCII version of db.docdb into db.docs, and
htload will load in an ASCII version of the database from db.docs
into db.docdb.  They don't do anything about the wordlist, because
db.wordlist is already in ASCII form, and they don't do anything about
db.docs.index and db.words.db because htmerge can regenerate these from
db.docdb and db.wordlist.

In the process, I also fixed the problem with META descriptions containing
newlines, returns or tabs (bug #405771), because fields in the ASCII
version of the database shouldn't contain any of these characters. They
are now replaced with spaces.

I also changed the output of htdig -t to be the same format as htdump,
as it is in 3.2.0b3, to get all the DocumentRef fields out.  I also
don't sort the file because this is most likely unnecessary and could
potentially cause problems (this too is consistent with the changes
in 3.2).  I added a -m option to htdig for compatibility with 3.2.0b3,
because it meshed nicely with the other changes I made to htdig.cc
and String.cc.  Finally, I added a readLine() method to String.cc,
and also fixed what was reported to be a problem with the String '='
operator while I was in there.

Please note: this doesn't mean you can now htdump a 3.1.5 database and
htload it into 3.2.0b3 format, nor vice-versa.  The reason is the format
and content of db.wordlist is very different from the db.worddump file
that htdump 3.2.0b3 produces.  3.2's worddump has much more information
about the words, including the positions of all words, even repeated
ones.  It wouldn't be possible to convert a 3.1.5 db.wordlist into a
db.worddump file for 3.2.0b3 and have phrase searching work, because of
the missing information, so you really need to redig.  However, it should
be possible to write a filter that would convert a db.worddump into a
db.wordlist, converting the format and mapping flags to the appropriate
weight, so you can dig with 3.2 and carry the db back to 3.1.5.  I haven't
written this filter, though, and I don't plan to.

As always, you can apply this patch in the htdig-3.1.5 main source
directory using the command "patch -p0 < this-message-file".

--- htcommon/DocumentDB.cc.noload       Thu Feb 24 20:29:10 2000
+++ htcommon/DocumentDB.cc      Mon Apr  9 15:20:18 2001
@@ -3,7 +3,13 @@
 //
 // Implementation of DocumentDB
 //
+// $Id: DocumentDB.cc,v 1.11 1999/02/17 05:03:52 ghutchis Exp $
 //
+// Part of the ht://Dig package   <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
 //
 
 #include "DocumentDB.h"
@@ -183,35 +189,25 @@ int DocumentDB::Delete(char *u)
 
 
 //*****************************************************************************
-// int DocumentDB::CreateSearchDB(char *filename)
-//   Create an extract from our database which can be used by the
-//   search engine.  The extract will consist of lines with fields
-//   separated by tabs.  The fields are:
-//        docID
-//        docURL
-//        docTime
-//        docHead
-//        docMetaDsc
-//        descriptions (separated by tabs)
+// int DocumentDB::DumpDB(char *filename, int verbose)
+//   Create an extract from our database which can be used by an
+//   external application. The extract will consist of lines with fields
+//   separated by tabs. 
 //
-//   The extract will be sorted by docID.
+//   The extract will likely not be sorted by anything in particular
 //
-int DocumentDB::CreateSearchDB(char *filename)
+int DocumentDB::DumpDB(char *filename, int verbose)
 {
     DocumentRef                *ref;
     List               *descriptions, *anchors;
     char               *key;
     String             data;
     FILE               *fl;
-    String             command = SORT_PROG;
-    String             tmpdir = getenv("TMPDIR");
 
-    command << " -n -o" << filename;
-    if (tmpdir.length())
-    {
-       command << " -T " << tmpdir;
+    if((fl = fopen(filename, "w")) == 0) {
+      perror(form("DocumentDB::DumpDB: opening %s for writing", filename));
+      return NOTOK;
     }
-    fl = popen(command, "w");
 
     dbf->Start_Get();
     while ((key = dbf->Get_Next()))
@@ -227,11 +223,16 @@ int DocumentDB::CreateSearchDB(char *fil
            fprintf(fl, "\ta:%d", ref->DocState());
            fprintf(fl, "\tm:%d", (int) ref->DocTime());
            fprintf(fl, "\ts:%d", ref->DocSize());
-           fprintf(fl, "\th:%s", ref->DocHead());
+           fprintf(fl, "\tH:%s", ref->DocHead());
            fprintf(fl, "\th:%s", ref->DocMetaDsc());
            fprintf(fl, "\tl:%d", (int) ref->DocAccessed());
            fprintf(fl, "\tL:%d", ref->DocLinks());
-           fprintf(fl, "\tI:%d", ref->DocImageSize());
+           fprintf(fl, "\tb:%d", ref->DocBackLinks());
+           fprintf(fl, "\tc:%d", ref->DocHopCount());
+           fprintf(fl, "\tg:%d", ref->DocSig());
+           fprintf(fl, "\te:%s", ref->DocEmail());
+           fprintf(fl, "\tn:%s", ref->DocNotification());
+           fprintf(fl, "\tS:%s", ref->DocSubject());
            fprintf(fl, "\td:");
            descriptions = ref->Descriptions();
            String      *description;
@@ -261,13 +262,129 @@ int DocumentDB::CreateSearchDB(char *fil
        }
     }
 
-    int        sortRC = pclose(fl);
-    if (sortRC)
+    fclose(fl);
+
+    return OK;
+}
+
+//*****************************************************************************
+// int DocumentDB::LoadDB(char *filename, int verbose)
+//   Load an extract to our database from an ASCII file
+//   The extract will consist of lines with fields separated by tabs. 
+//   The lines need not be sorted in any fashion.
+//
+int DocumentDB::LoadDB(char *filename, int verbose)
+{
+    FILE       *input;
+    DocumentRef ref;
+    StringList descriptions, anchors;
+    char       *token, field;
+    String     data;
+
+    if((input = fopen(filename, "r")) == 0) {
+      perror(form("DocumentDB::LoadDB: opening %s for reading", filename));
+      return NOTOK;
+    }
+
+    while (data.readLine(input))
     {
-       cerr << "Document sort failed\n\n";
-       exit(1);
+       token = strtok(data, "\t");
+       if (token == NULL)
+         continue;
+
+       ref.DocID(atoi(token));
+       
+       if (verbose)
+         cout << "\t loading document ID: " << ref.DocID() << endl;
+
+       while ( (token = strtok(0, "\t")) )
+         {
+           field = *token;
+           token += 2;
+
+           if (verbose > 2)
+               cout << "\t field: " << field;
+
+           switch(field)
+             {
+               case 'u': // URL
+                 ref.DocURL(token);
+                 break;
+               case 't': // Title
+                 ref.DocTitle(token);
+                 break;
+               case 'a': // State
+                 ref.DocState((ReferenceState)atoi(token));
+                 break;
+               case 'm': // Modified
+                 ref.DocTime(atoi(token));
+                 break;
+               case 's': // Size
+                 ref.DocSize(atoi(token));
+                 break;
+               case 'H': // Head
+                 ref.DocHead(token);
+                 break;
+               case 'h': // Meta Description
+                 ref.DocMetaDsc(token);
+                 break;
+               case 'l': // Accessed
+                 ref.DocAccessed(atoi(token));
+                 break;
+               case 'L': // Links
+                 ref.DocLinks(atoi(token));
+                 break;
+               case 'b': // BackLinks
+                 ref.DocBackLinks(atoi(token));
+                 break;
+               case 'c': // HopCount
+                 ref.DocHopCount(atoi(token));
+                 break;
+               case 'g': // Signature
+                 ref.DocSig(atoi(token));
+                 break;
+               case 'e': // E-mail
+                 ref.DocEmail(token);
+                 break;
+               case 'n': // Notification
+                 ref.DocNotification(token);
+                 break;
+               case 'S': // Subject
+                 ref.DocSubject(token);
+                 break;
+               case 'd': // Descriptions
+                 descriptions.Create(token, '\001');
+                 ref.Descriptions(descriptions);
+                 break;
+               case 'A': // Anchors
+                 anchors.Create(token, '\001');
+                 ref.DocAnchors(anchors);
+                 break;
+               default:
+                 break;
+             }
+
+         }
+       
+
+       // We must be careful if the document already exists
+       // So we'll delete the old document and add the new one
+       if (Exists(ref.DocURL()))
+         {
+           Delete(ref.DocURL());
+         }
+       Add(ref);
+
+       // If we add a record with an ID past nextDocID, update it
+       if (ref.DocID() > nextDocID)
+         nextDocID = ref.DocID() + 1;
+
+       descriptions.Destroy();
+       anchors.Destroy();
     }
-    return 0;
+
+    fclose(input);
+    return OK;
 }
 
 
--- htcommon/DocumentDB.h.noload        Thu Feb 24 20:29:10 2000
+++ htcommon/DocumentDB.h       Mon Apr  9 14:00:15 2001
@@ -8,31 +8,18 @@
 //
 // $Id: DocumentDB.h,v 1.5 1999/01/25 01:53:42 hp Exp $
 //
-// $Log: DocumentDB.h,v $
-// Revision 1.5  1999/01/25 01:53:42  hp
-// Provide a clean upgrade from old databses without "url_part_aliases" and
-// "common_url_parts" through the new option "uncoded_db_compatible".
-//
-// Revision 1.4  1999/01/14 01:09:11  ghutchis
-// Small speed improvements based on gprof.
-//
-// Revision 1.3  1999/01/14 00:30:10  ghutchis
-// Added IncNextDocID to allow big changes in NextDocID, such as when merging
-// databases.
-//
-// Revision 1.2  1998/01/05 00:47:27  turtle
-// reformatting
-//
-// Revision 1.1.1.1  1997/02/03 17:11:07  turtle
-// Initial CVS
-//
+// Part of the ht://Dig package   <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
 //
 #ifndef _DocumentDB_h_
 #define _DocumentDB_h_
 
 #include "DocumentRef.h"
-#include <List.h>
-#include <Database.h>
+#include "List.h"
+#include "Database.h"
 
 
 class DocumentDB
@@ -45,11 +32,6 @@ public:
     ~DocumentDB();
 
     //
-    // The database used for searching is generated from our internal database:
-    //
-    int                        CreateSearchDB(char *filename);
-
-    //
     // Standard database operations
     //
     int                        Open(char *filename);
@@ -75,6 +57,13 @@ public:
     // We will need to be able to iterate over the complete database.
     //
     List               *URLs();        // This returns a list of all the URLs
+
+    // Dump the database out to an ASCII text file
+    int                        DumpDB(char *filename, int verbose = 0);
+
+    // Read in the database from an ASCII text file
+    // (created by DumpDB)
+    int                        LoadDB(char *filename, int verbose = 0);
 
     //
     // Set compatibility mode (try to support when database
--- htdig/htdig.cc.noload       Thu Feb 24 20:29:10 2000
+++ htdig/htdig.cc      Mon Apr  9 15:53:03 2001
@@ -4,7 +4,13 @@
 // Indexes the web sites specified in the config file
 // generating several databases to be used by htmerge
 //
+// $Id: htdig.cc,v 1.3.2.6 1999/12/06 21:06:01 grdetil Exp $
 //
+// Part of the ht://Dig package   <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
 //
 
 #include "Document.h"
@@ -33,6 +39,7 @@ StringMatch             badquerystr;
 FILE                   *urls_seen = NULL;
 FILE                   *images_seen = NULL;
 String                 configFile = DEFAULT_CONFIG_FILE;
+String                 minimalFile = 0;
 
 void usage();
 void reportError(char *msg);
@@ -49,13 +56,32 @@ main(int ac, char **av)
     int                        initial = 0;
     int                        alt_work_area = 0;
     int                        create_text_database = 0;
+    int                        create_text_database_only = 0;
+    int                        load_text_database = 0;
+    char               *arg0, *s;
     char               *max_hops = 0;
     RetrieverLog       flag  = Retriever_noLog;
+
+    // Find argument 0 basename, to see who we're called as
+    arg0 = av[0];
+    s = strrchr(arg0, '/');
+    if (s != NULL)
+       arg0 = s+1;
+    // For Cygwin on Win32 systems...
+    s = strrchr(arg0, '\\');
+    if (s != NULL)
+       arg0 = s+1;
+
+    // Select function based on argument 0
+    if (mystrncasecmp(arg0, "htdump", 6) == 0)
+       create_text_database_only = create_text_database = 1;
+    else if (mystrncasecmp(arg0, "htload", 6) == 0)
+       load_text_database = 1;
        
     //
     // Parse command line arguments
     //
-    while ((c = getopt(ac, av, "lsc:vith:u:a")) != -1)
+    while ((c = getopt(ac, av, "lsm:c:vith:u:a")) != -1)
     {
         int pos;
        switch (c)
@@ -67,6 +93,11 @@ main(int ac, char **av)
                debug++;
                break;
            case 'i':
+               if (create_text_database_only)
+               {
+                   cerr << "htdump: -i option not allowed for dumping\n";
+                   break;
+               }
                initial++;
                break;
            case 't':
@@ -86,6 +117,10 @@ main(int ac, char **av)
            case 'a':
                alt_work_area++;
                break;
+           case 'm':
+               minimalFile = optarg;
+               max_hops = "0";
+               break;
            case 'l':
                flag = Retriever_logUrl;
                break;
@@ -219,13 +254,21 @@ main(int ac, char **av)
     String             filename = config["doc_db"];
     if (initial)
        unlink(filename);
-    if (docs.Open(filename) < 0)
+    if (create_text_database_only)
+    {
+       if (docs.Read(filename) < 0)
+       {
+           reportError(form("Unable to open document database '%s'",
+                        filename.get()));
+       }
+    }
+    else if (docs.Open(filename) < 0)
     {
        reportError(form("Unable to open/create document database '%s'",
                         filename.get()));
     }
 
-    if (initial)
+    if (initial && !load_text_database)
     {
        filename = config["word_list"];
        unlink(filename);
@@ -238,20 +281,54 @@ main(int ac, char **av)
     // URLs?
     //
     Retriever  retriever(flag);
-    List       *list = docs.URLs();
-    retriever.Initial(*list);
-    delete list;
-
-    // Add start_url to the initial list of the retriever.
-    // Don't check a URL twice!
-    // Beware order is important, if this bugs you could change 
-    // previous line retriever.Initial(*list, 0) to Initial(*list,1)
-    retriever.Initial(config["start_url"], 1);
+    if (minimalFile.length() == 0)
+      {
+       List    *list = docs.URLs();
+       retriever.Initial(*list);
+       delete list;
+
+       // Add start_url to the initial list of the retriever.
+       // Don't check a URL twice!
+       // Beware order is important, if this bugs you could change 
+       // previous line retriever.Initial(*list, 0) to Initial(*list,1)
+       retriever.Initial(config["start_url"], 1);
+      }
+
+    // Handle list of URLs given as minimal file (-m file), or on a
+    // given file name (stdin, if optional "-" argument given).
+    if (minimalFile.length() != 0 || optind < ac)
+      {
+       FILE    *input;
+       String  str;
+       if (minimalFile.length() != 0)
+         {
+           if (strcmp(minimalFile.get(), "-") == 0)
+               input = stdin;
+           else
+               input = fopen(minimalFile.get(), "r");
+         }
+       else if (strcmp(av[optind], "-") == 0)
+           input = stdin;
+       else
+           input = fopen(av[optind], "r");
+       if (input)
+         {
+           while (str.readLine(input))
+             {
+               str.chop("\r\n\t ");
+               if (str.length() > 0)
+                   retriever.Initial(str, 1);
+             }
+           if (input != stdin)
+               fclose(input);
+         }
+      }
 
     //
     // Go do it!
     //
-    retriever.Start();
+    if (!create_text_database_only && !load_text_database)
+       retriever.Start();
 
     //
     // All done with parsing.
@@ -265,7 +342,16 @@ main(int ac, char **av)
        filename = config["doc_list"];
        if (initial)
            unlink(filename);
-       docs.CreateSearchDB(filename);
+       docs.DumpDB(filename, debug);
+    }
+
+    //
+    // For htload, read in a text version of the document database.
+    //
+    if (load_text_database)
+    {
+       filename = config["doc_list"];
+       docs.LoadDB(filename, debug);
     }
 
     //
@@ -291,7 +377,8 @@ main(int ac, char **av)
 //
 void usage()
 {
-    cout << "usage: htdig [-l][-v][-i][-c configfile][-t]\n";
+    cout << "usage: htdig [-v][-i][-c configfile][-t][-h hopcount][-s] \\\n";
+    cout << "           [-u username:password][-a][-l][-m minimalfile][file]\n";
     cout << "This program is part of ht://Dig " << VERSION << "\n\n";
     cout << "Options:\n";
 
@@ -334,6 +421,17 @@ void usage()
     cout << "\t\tReads in the progress of any previous interrupted digs\n";
     cout << "\t\tfrom the log file and write the progress out if\n";
     cout << "\t\tinterrupted by a signal.\n\n";
+
+    cout << "\t-m minimalfile  (or just a file name at end of arguments)\n";
+    cout << "\t\tTells htdig to read URLs from the supplied file and index\n";
+    cout << "\t\tthem in place of (or in addition to) the existing URLs in\n";
+    cout << "\t\tthe database and the start_url.  With the -m, only the\n";
+    cout << "\t\tURLs specified are added to the database.  A file name of\n";
+    cout << "\t\t'-' indicates the standard input.\n\n";
+
+    cout << "or usage: htdump [-v][-c configfile][-a]\n";
+    cout << "or usage: htload [-v][-i][-c configfile][-a]\n";
+    cout << "\t\tto dump/load docdb to/from ASCII text database.\n\n";
 
     exit(0);
 }
--- htdig/HTML.cc.noload        Sat May 13 21:40:10 2000
+++ htdig/HTML.cc       Mon Apr  9 16:17:09 2001
@@ -849,9 +849,13 @@ HTML::do_tag(Retriever &retriever, Strin
                  {
                    //
                    // We need to do two things. First grab the description
+                   // and clean it up
                    //
                    meta_dsc = transSGML(conf["content"]);
-                  if (meta_dsc.length() > max_meta_description_length)
+                   meta_dsc.replace('\n', ' ');
+                   meta_dsc.replace('\r', ' ');
+                   meta_dsc.replace('\t', ' ');
+                   if (meta_dsc.length() > max_meta_description_length)
                     meta_dsc = meta_dsc.sub(0, max_meta_description_length).get();
                   if (debug > 1)
                     cout << "META Description: " << conf["content"] << endl;
--- htdoc/htdig.html.noload     Thu Feb 24 20:29:10 2000
+++ htdoc/htdig.html    Mon Apr  9 17:09:43 2001
@@ -10,7 +10,7 @@
          htdig
        </h1>
        <p>
-         ht://Dig Copyright &copy; 1995-2000 The ht://Dig Group<br>
+         ht://Dig Copyright &copy; 1995-2001 <a href="THANKS.html">The ht://Dig 
+Group</a><br>
          Please see the file <a href="COPYING">COPYING</a> for
          license information.
        </p>
@@ -89,6 +89,14 @@
                         progress out if interrupted by a signal.
                  </dd>
                  <dt>
+                       -m [url_file]
+                 </dt>
+                 <dd>
+                       Minimal. Only index the URLs in the file provided and
+                       no others. The url_file can be a "-", causing htdig
+                       to read the URLs from the STDIN.
+                 </dd>
+                 <dt>
                        -s
                  </dt>
                  <dd>
@@ -103,6 +111,42 @@
                        information can be extracted from it for purposes other
                        than searching. One could gather some interesting
                        statistics from this database.
+                       <p>Each line in the file starts with the document id 
+                       followed by a list of
+                       <strong>\t<em>fieldname</em>:<em>value</em></strong>.
+                       The fields always appear in the order listed below:
+                       </p>
+                       <table border=0>
+                       <tr> <th>fieldname</th><th>value</th></tr>
+                       <tr> <td>u</td><td>URL</td></tr>
+                       <tr> <td>t</td><td>Title</td></tr>
+                       <tr> <td>a</td><td>State (0 = normal, 1 = not found, 2
+                       = not indexed, 3 = obsolete)</td></tr>
+                       <tr> <td>m</td><td>Last modification time as reported
+                       by the server</td></tr> 
+                       <tr> <td>s</td><td>Size in bytes</td></tr>
+                       <tr> <td>H</td><td>Excerpt</td></tr>
+                       <tr> <td>h</td><td>Meta description</td></tr>
+                       <tr> <td>l</td><td>Time of last retrieval</td></tr>
+                       <tr> <td>L</td><td>Count of the links in the document
+                       (outgoing links)</td></tr>
+                       <tr> <td>b</td><td>Count of the links to the document
+                       (incoming links or backlinks)</td></tr>
+                       <tr> <td>c</td><td>HopCount of this document</td></tr>
+                       <tr> <td>g</td><td>Signature of the document used for
+                       duplicate-detection</td></tr>
+                       <tr> <td>e</td><td>E-mail address to use for a
+                       notification message from htnotify</td></tr>
+                       <tr> <td>n</td><td>Date to send out a notification
+                       e-mail message</td></tr>
+                       <tr> <td>S</td><td>Subject for a notification e-mail
+                       message</td></tr>
+                       <tr> <td>d</td><td>The text of links pointing to this
+                       document. (e.g. &lt;a
+                       href=&quot;docURL&quot;&gt;description&lt;/a&gt;)</td></tr>
+                       <tr> <td>A</td><td>Anchors in the document (i.e. &lt;A
+                       NAME=...)</td></tr>
+                       </table>
                  </dd>
                  <dt>
                        -u <em>username:password</em>
@@ -122,7 +166,35 @@
                        program. Using more than 2 is probably only useful for
                        debugging purposes. The default verbose mode (using
                        only one -v) gives a nice progress report while
-                       digging.
+                       digging. This progress report can be a bit
+                       cryptic, so here is a brief explanation. A line
+                       is shown for each URL, with 3 numbers before the
+                       URL and some symbols after the URL. The first
+                       number is the number of documents parsed so
+                       far, the second is the DocID for this document,
+                       and the third is the hop count of the document
+                       (number of hops from one of the start_url
+                       documents). After the URL, it shows a "*" for
+                       a link in the document that it already visited,
+                       a "+" for a new link it just queued, and a "-"
+                       for a link it rejected for any of a number of
+                       reasons. To find out what those reasons are,
+                       you need to run htdig with at least 3 -v options,
+                       i.e. -vvv. If there are no "*", "+" or "-" symbols
+                       after the URL, it doesn't mean the document was
+                       not parsed or was empty, but only that no links
+                       to other documents were found within it. With
+                       more verbose output, these symbols will get
+                       interspersed in several lines of debugging output.
+                 </dd>
+                 <dt>
+                       url_file (at end of arguments, after options)
+                 </dt>
+                 <dd>
+                       Get the list of URLs to start indexing from the file
+                       provided. This will override the default start_url.
+                       The url_file can be a "-", causing htdig to read
+                       the URLs from the STDIN.
                  </dd>
                </dl>
          </dd>
@@ -159,11 +231,8 @@
          </dd>
        </dl>
        <hr size="4" noshade>
-       <address>
-         <a href="author.html">Andrew Scherpbier &lt;[EMAIL PROTECTED]&gt;</a>
-       </address>
 
-Last modified: $Date: 2000/02/17 22:05:21 $
+       Last modified: $Date: 2001/04/09 17:09:37 $
 
   </body>
 </html>
--- htlib/String.cc.noload      Thu Feb 24 20:29:11 2000
+++ htlib/String.cc     Mon Apr  9 14:05:07 2001
@@ -3,6 +3,12 @@
 //
 // $Id: String.cc,v 1.16.2.3 1999/11/26 21:59:26 grdetil Exp $
 //
+// Part of the ht://Dig package   <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
+//
 #if RELEASE
 static char    RCSid[] = "$Id: String.cc,v 1.16.2.3 1999/11/26 21:59:26 grdetil Exp $";
 #endif
@@ -91,9 +97,16 @@ String::~String()
 
 void String::operator = (const String &s)
 {
-    allocate_space(s.length());
-    Length = s.length();
-    copy_data_from(s.Data, Length);
+    if (s.length() > 0) 
+    {
+       allocate_space(s.length());
+       Length = s.length();
+       copy_data_from(s.Data, Length);
+    }
+    else
+    {
+       Length = 0;
+    }
 }
 
 void String::operator = (char *s)
@@ -622,3 +635,38 @@ void String::debug(ostream &o)
 }
 
 
+int String::readLine(FILE *in)
+{
+    Length = 0;
+    allocate_fix_space(2048);
+
+    while (fgets(Data + Length, Allocated - Length, in))
+    {
+       Length += strlen(Data + Length);
+       if (Length == 0)
+           continue;
+       if (Data[Length - 1] == '\n')
+       {
+           //
+           // A full line has been read.  Return it.
+           //
+           chop('\n');
+           return 1;
+       }
+       if (Allocated > Length + 1)
+       {
+           //
+           // Not all available space filled. Probably EOF?
+           //
+           continue;
+       }
+       //
+       // Only a partial line was read. Increase available space in 
+       // string and read some more.
+       //
+       reallocate_space(Allocated << 1);
+    }
+    chop('\n');
+
+    return Length > 0;
+}
--- htlib/htString.h.noload     Thu Feb 24 20:29:11 2000
+++ htlib/htString.h    Mon Apr  9 15:14:48 2001
@@ -3,11 +3,18 @@
 //
 // $Id: htString.h,v 1.5 1999/02/01 04:02:25 hp Exp $
 //
+// Part of the ht://Dig package   <http://www.htdig.org/>
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// <http://www.gnu.org/copyleft/gpl.html>
+//
 #ifndef __String_h
 #define __String_h
 
 #include "Object.h"
 #include <stdarg.h>
+#include <stdio.h>
 
 class ostream;
 
@@ -138,6 +145,8 @@ public:
     friend int         operator >= (String &a, String &b);
 
     friend ostream     &operator << (ostream &o, String &s);
+
+    int                        readLine(FILE *in);
 
     void               lowercase();
     void               uppercase();

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

Reply via email to