monitoring the parser status during a pluck; estimates of how many pages

Bill Janssen Wed, 12 Jun 2002 19:09:44 -0700

We've got two feature requests pending that I've been playing with.
The first is to provide better feedback to something like the desktop
tool.  Currently it watches the messages (designed for a human) go by
and tries to figure out something from them.  I'd like to do better.
The second is to provide a way to estimate the total number of docs
that need to be gathered, by remembering how many were gathered in a
previous pluck of the same home URL, and displaying that figure.


I thought I'd kill two birds with one stone.  Here's how it works: on the
command line (or as the config option "status_file"), you can specify
the name of a file.  While the parser is running, it will keep
updating that file, with a single line, which will contain 3 integers,
space separated.  The first is the number collected so far, the second
is the number of links still to process, and the third is the estimate
of the total, if any, or zero if none.  The parser will keep
overwriting this line, so that the file will always contain only one
line.  An observer (like the desktop) can just watch this file change,
and get up-to-date information about the status so far.  Here's a
simple Python script, for example, that watches such a file:

#!/usr/bin/env python

import os, sys, time

filename = sys.argv[1]

if os.path.exists(filename):
    fp = open(filename, 'r')
    while 1:
        # Rewind on every pass: the parser keeps overwriting the single line.
        fp.seek(0)
        line = fp.readline()
        if line:
            tokens = line.split()
            # Skip anything that is not exactly three fields.
            if len(tokens) == 3:
                sys.stdout.write("%d collected, %d to go" %
                                 (int(tokens[0]),
                                  int(tokens[1])))
                estimate = int(tokens[2])
                # An estimate of zero means "no estimate available".
                if estimate > 0:
                    sys.stdout.write(", %d estimated" % estimate)
                sys.stdout.write("\n")
        time.sleep(1)
    
A C routine to do the same thing is extremely similar.
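
For the writing side, here is a minimal sketch of how a parser might keep the
status file down to one line.  This is an illustration only, not the parser's
actual code: the names format_status and write_status are hypothetical, and
I'm assuming the file is simply truncated and rewritten in place, so that an
observer holding the file open (as in the script above) keeps seeing updates.

```python
def format_status(collected, pending, estimate=0):
    # Three space-separated integers on one line:
    # number collected, links still to process, estimated total (0 = none).
    return "%d %d %d\n" % (collected, pending, estimate)


def write_status(path, collected, pending, estimate=0):
    # Opening with "w" truncates the existing file in place, so the file
    # never holds more than the one current line.
    with open(path, "w") as fp:
        fp.write(format_status(collected, pending, estimate))
```

An observer may occasionally catch the file empty or half-written between the
truncate and the write, which is why the watcher skips any line that doesn't
have exactly three fields.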

Now, if you run a second pluck with the same status file, the parser
will open the file, and read the first integer in it, and treat that
integer as the estimate for the new pluck.  That way the status from
the last pluck will automatically become the estimate for the new
pluck.  Or the observer could set the estimate to a specific value
before running the new pluck.
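
The estimate lookup described above could be sketched roughly like this.
Again, this is an assumption about how it might be done, not the real parser
code, and read_previous_estimate is a hypothetical name.  It returns zero
("no estimate") when the file is missing, empty, or doesn't start with an
integer.

```python
def read_previous_estimate(path):
    # Read the first line left behind by the previous pluck, if any.
    try:
        with open(path) as fp:
            tokens = fp.readline().split()
    except (IOError, OSError):
        return 0  # no status file yet, so no estimate
    # The first integer is the number collected last time; reuse it
    # as the estimate for the new pluck.
    try:
        return int(tokens[0])
    except (IndexError, ValueError):
        return 0  # empty or malformed line
```

Since only the first integer is read, an observer that wants a specific
estimate would presumably just write that value as the first integer in the
file before starting the pluck.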

Will that work for everyone?

Bill
