On Sun, Oct 14, 2001 at 10:50:57PM -0700, David A. Desrosiers wrote:
[snip]
>       In any case, there's a missing option here (and always has been
> missing); --stayondomain. With --stayondomain=slashdot.org, for example,
> images.slashdot.org, www.slashdot.org, banjo.slashdot.org, and slashdot.org
> can be maintained, and you can "package up" the content so that it never
> leaves this domain. I could spider it to a maxdepth of 100, and be assured
> that it would never get out of hand and go offsite (yes, the file would be
> large, but it would be very self-contained).
[snip]
>       --stayondomain: Will never leave the network you specify, so that
>                      www.foo.com, images.foo.com, and foo.com will all be
>                      assumed to be included in the same "pluck". Content
>                      from all "member domains" will be included.


Ummm... I've had some mostly-completed code that does pretty much
that sitting on my PC for a while now. I got to the stage of testing
it and thinking about the finer points but then was distracted by
an increase in my workload that still hasn't let up much. Sorry...

A diff -ru output is below, in case you find it of any use. The diff
command compares these two directories (note that my version is the
first argument, so my additions show up as "-" lines):
/usr/lib/python1.5/site-packages/PyPlucker/           # my version
/usr/lib/python1.5/site-packages/PyPlucker-1.1-pure/  # unmodified code

The changes have been made on what's probably an old version now:
Spider.py   $Id: Spider.py,v 1.31 2001/02/08 22:20:37 janssen Exp $

I'd be happy to make the same changes to the newest version if you
want me to, but I wouldn't be able to start for at least three weeks
(a massive project deadline is coming up). You might already have
written better code yourself though; this was my first attempt
at Python.

It has been tested a bit, but could probably do with some more. I've
been using my modified version every day for quite a while now with
no problems, but I haven't actually used the new stayondomain option
much at all.

The code also includes stayoffdomain and stayoffhost options which
I found useful for downloading the slashdot home page and the
non-slashdot articles without the slashdot comment pages.
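In case it helps to see the logic at a glance, the new checks in the
diff below boil down to something like this (written here as a
standalone function purely for illustration; the real code keeps these
as attributes on the spider and checks them one after the other):

```python
def should_follow(link_host, link_domain,
                  stay_on_host=None, stay_off_host=None,
                  stay_on_domain=None, stay_off_domain=None):
    # Each option, when set, can veto a link (return 0 = ignore it).
    if stay_on_host is not None and link_host != stay_on_host:
        return 0  # link is on a different host
    if stay_off_host is not None and link_host == stay_off_host:
        return 0  # link is on the excluded host
    if stay_on_domain is not None and link_domain != stay_on_domain:
        return 0  # link is in a different domain
    if stay_off_domain is not None and link_domain == stay_off_domain:
        return 0  # link is in the excluded domain
    return 1
```

So with stayoffdomain on the slashdot home page, the home page itself
is still fetched (these attributes only take effect *after* the link
carrying them has been taken), but links back into the slashdot.org
domain -- such as the comment pages -- are skipped.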

There's also a small, unrelated hack to allow date/time information
in the filename and database name; I've left that in the diff output
to avoid messing up the line numbering.
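The hack is just a strftime pass over the configured names, so any
strftime escape works. For example (the configured value here is made
up, not from my config):

```python
import time

# db_file as it might appear in the config file.
db_file = "slashdot-%Y%m%d"

# The hack expands the escapes using the current local time,
# exactly as in the diff below.
db_file = time.strftime(db_file, time.localtime(time.time()))
# e.g. db_file is now something like "slashdot-20011015"
```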

In my code, the domain name is assumed to be the host
name without its first element.  e.g., for the host name
www.healthywaterways.env.qld.gov.au, the domain name is
healthywaterways.env.qld.gov.au. I did it that way instead of letting
the domain name be the last two parts of the host name, because that
obviously wouldn't work for hosts like
www.healthywaterways.env.qld.gov.au (the last two parts would give
gov.au, and you really DON'T want to download all the Australian
government web pages at once...).

The only exception to this is if the host name contains only two
parts, in which case the domain name will be the same as the host
name.  e.g., if the host name is specified as either www.cnet.com
or cnet.com, then the domain name will be cnet.com.
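In other words, the rule is this (restating get_domain_from_host from
the Url.py diff below; my original uses the old string module rather
than string methods):

```python
def get_domain_from_host(host):
    # Drop the first element of the host name, unless the host has
    # fewer than three elements, in which case the domain is the
    # host itself.
    host_elements = host.split(".")
    if len(host_elements) < 3:
        return host
    return ".".join(host_elements[1:])

# get_domain_from_host("www.healthywaterways.env.qld.gov.au")
#     -> "healthywaterways.env.qld.gov.au"
# get_domain_from_host("cnet.com") -> "cnet.com"
```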

This is probably okay as a first approximation for the domain for
most web sites, but it would be nice to be able to give the user
some control over this. I can think of a few ways of doing this;
the user could:

        1. specify the domain name explicitly (he could then
        choose, for example, healthywaterways.env.qld.gov.au or
        env.qld.gov.au);

        2. specify how many elements should be chopped off the
        front of the host to make the domain (e.g., chop off
        1 element to get healthywaterways.env.qld.gov.au or 2
        elements to get env.qld.gov.au);

        3. specify how many elements counting back from the end
        of the host are used to make the domain (e.g., 3 elements
        to get qld.gov.au, 4 elements to get env.qld.gov.au);

        4. any of the above.
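Options 2 and 3 would be easy enough to implement; something like
this, say (the function names and parameters here are just
hypothetical -- nothing like this is in the diff):

```python
def domain_by_chopping_front(host, chop):
    # Option 2: remove `chop` elements from the front of the host.
    return ".".join(host.split(".")[chop:])

def domain_by_keeping_back(host, keep):
    # Option 3: keep only the last `keep` elements of the host.
    return ".".join(host.split(".")[-keep:])

# domain_by_chopping_front("www.healthywaterways.env.qld.gov.au", 2)
#     -> "env.qld.gov.au"
# domain_by_keeping_back("www.healthywaterways.env.qld.gov.au", 3)
#     -> "qld.gov.au"
```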

I'm not at all sure which of those options should be coded,
or what would be the least confusing way to present them in the
config files and command line parameters. I could come up with a
system for specifying them myself, but I have little experience
with designing software for mass use and so my system might not be
intuitive to anyone but myself. :)

Anyway, below my sig is the diff output.
Don't hesitate to ask me if you have any questions.

Alys

--
Alice Harris
Internet Services, CITEC, Brisbane, Australia
+61 7 322 22578
[EMAIL PROTECTED], [EMAIL PROTECTED]



diff -ru /usr/lib/python1.5/site-packages/PyPlucker/Spider.py /usr/lib/python1.5/site-packages/PyPlucker-1.1-pure/Spider.py
--- /usr/lib/python1.5/site-packages/PyPlucker/Spider.py        Fri May 25 10:31:25 2001
+++ /usr/lib/python1.5/site-packages/PyPlucker-1.1-pure/Spider.py       Thu May 31 09:23:59 2001
@@ -1,19 +1,10 @@
 #!/usr/bin/env python
-# TODO: check what happens if OFF and ON are used together for both HOST and DOMAIN
-# TODO: check combinations of domain/host off/on - what makes sense?
 
 """
 Spider.py   $Id: Spider.py,v 1.31 2001/02/08 22:20:37 janssen Exp $
 
 Recursivly gets documents and thus collects a document set.
 
-Modified by Alice Harris <[EMAIL PROTECTED]>:
- - 2001/05/08 - add STAY[ON/OFF]DOMAIN options
- - 2001/04/24 - allow filename and database name (db_file and db_name)
-               to contain date/time information using strftime
- - 2001/04/06 - add STAYOFFHOST option
-
-
 
 Copyright 1999, 2000 by Holger Duerer <[EMAIL PROTECTED]>
 
@@ -66,9 +57,6 @@
         self._max_depth = None
         self._new_max_depth = None
         self._stay_on_host = None
-        self._stay_off_host = None
-        self._stay_on_domain = None
-        self._stay_off_domain = None
         self._stay_below = None
         self._maxwidth = None
         self._maxheight = None
@@ -86,12 +74,6 @@
         res = res + " BPP=%d" % self._bpp
         if self._stay_on_host:
             res = res + " STAYONHOST"
-        if self._stay_off_host:
-            res = res + " STAYOFFHOST"
-        if self._stay_on_domain:
-            res = res + " STAYONDOMAIN"
-        if self._stay_off_domain:
-            res = res + " STAYOFFDOMAIN"
         if self._stay_below:
             res = res + (" STAYBELOW=\"%s\"" % self._stay_below)
         res = res + " " + repr (self._dict)
@@ -145,24 +127,6 @@
             self.set_stay_on_host (self._url.get_host ())
 
 
-        # STAYOFFHOST processing
-        # This attribute is only evaluated *after* the link has been taken
-        if after_taken and dict.has_key ('stayoffhost'):
-            self.set_stay_off_host (self._url.get_host ())
-
-
-        # STAYONDOMAIN processing
-        # This attribute is only evaluated *after* the link has been taken
-        if after_taken and dict.has_key ('stayondomain'):
-            self.set_stay_on_domain (self._url.get_domain ())
-
-
-        # STAYOFFDOMAIN processing
-        # This attribute is only evaluated *after* the link has been taken
-        if after_taken and dict.has_key ('stayoffdomain'):
-            self.set_stay_off_domain (self._url.get_domain ())
-
-
         # STAYBELOW processing
         # This attribute is only evaluated *after* the link has been taken
         if after_taken and dict.has_key ('staybelow'):
@@ -192,12 +156,6 @@
             new.set_max_depth (self._max_depth)
         if self._stay_on_host is not None:
             new.set_stay_on_host (self._stay_on_host)
-        if self._stay_off_host is not None:
-            new.set_stay_off_host (self._stay_off_host)
-        if self._stay_on_domain is not None:
-            new.set_stay_on_domain (self._stay_on_domain)
-        if self._stay_off_domain is not None:
-            new.set_stay_off_domain (self._stay_off_domain)
         if self._stay_below is not None:
             new.set_stay_below (self._stay_below)
         if self._maxwidth is not None:
@@ -272,18 +230,6 @@
         self._stay_on_host = host
 
 
-    def set_stay_off_host (self, host):
-        self._stay_off_host = host
-
-
-    def set_stay_on_domain (self, domain):
-        self._stay_on_domain = domain
-
-
-    def set_stay_off_domain (self, domain):
-        self._stay_off_domain = domain
-
-
     def set_stay_below (self, urlpart):
         self._stay_below = urlpart
 
@@ -307,22 +253,6 @@
         if self._stay_on_host is not None:
             if self._stay_on_host != self._url.get_host():
                 # Got to another host
-                # link is on different host, so ignore it
-                return 0
-
-        if self._stay_off_host is not None:
-            if self._stay_off_host == self._url.get_host():
-                # link is on same host, so ignore it
-                return 0
-
-        if self._stay_on_domain is not None:
-            if self._stay_on_domain != self._url.get_domain():
-                # link is on different domain, so ignore it
-                return 0
-
-        if self._stay_off_domain is not None:
-            if self._stay_off_domain == self._url.get_domain():
-                # link is on same domain, so ignore it
                 return 0
 
         if self._stay_below is not None:
@@ -391,12 +321,6 @@
                       'bpp': "%d" % bpp}
         if config.get_bool ('home_stayonhost', 0):
             attributes['stayonhost'] = 1
-        if config.get_bool ('home_stayoffhost', 0):
-            attributes['stayoffhost'] = 1
-        if config.get_bool ('home_stayondomain', 0):
-            attributes['stayondomain'] = 1
-        if config.get_bool ('home_stayoffdomain', 0):
-            attributes['stayoffdomain'] = 1
         tmp = config.get_string ('home_staybelow')
         if tmp is not None:
             attributes['staybelow'] = tmp
@@ -745,14 +669,10 @@
        if not (os.path.exists(pluckerdir) and os.path.isdir(pluckerdir)):
           sys.stderr.write("Error:  Plucker directory does not exist:  " + cachedir + "\n")
            return 1
-        import time
         dbfile = config.get_string ('db_file')
-        dbfile = time.strftime(dbfile,time.localtime(time.time()))
         filename = os.path.join (pluckerdir, dbfile+".pdb")
         db_name = config.get_string ('db_name')
-        if db_name:
-            db_name = time.strftime(db_name,time.localtime(time.time()))
-        else: 
+        if not db_name:
             db_name = os.path.basename (dbfile) # use basename in case only file name is given
 
     spider = Spider (retriever.retrieve,
@@ -872,12 +792,7 @@
         print "    --category=<category-name>:"
         print "                   Put <category-name> in the database as the default"
        print "                   viewer category for the database."
-        print "    --stayonhost:  Do not follow URLs that are external to the host"
-        print "    --stayoffhost: Do not follow URLs that are on the host"
-        print "    --stayondomain:"
-       print "                   Do not follow URLs that are external to the domain"
-        print "    --stayoffdomain:"
-       print "                   Do not follow URLs that are on the domain"
+        print "    --stayonhost:  Do not follow external URLs"
         print "    --staybelow=<url-prefix>:"
        print "                   Automatically exclude any URL that doesn't begin with <url-prefix>."
        print "    --maxheight=<n>:"
@@ -910,9 +825,6 @@
         zlib_compression = None
         no_url_info = None
         stayonhost = None
-        stayoffhost = None
-        stayondomain = None
-        stayoffdomain = None
        staybelow = None
         category = None
        maxwidth = None
@@ -927,8 +839,7 @@
                                       "maxdepth=", "db-name=",
                                       "extra-section=", "verbosity=", 
                                       "zlib-compression", "doc-compression", 
-                                      "no-urlinfo", "stayonhost", "stayoffhost", 
-                                      "stayondomain", "stayoffdomain", "staybelow=", "category=",
+                                      "no-urlinfo", "stayonhost", "staybelow=", "category=",
                                       "maxheight=", "maxwidth=", "alt-maxheight=", "alt-maxwidth=",
                                      "compression=", "home-url=", "update-cache"])
         if args:
@@ -979,12 +890,6 @@
                 no_url_info = 'true'
             elif opt == "--stayonhost":
                 stayonhost = 'true'
-            elif opt == "--stayoffhost":
-                stayoffhost = 'true'
-            elif opt == "--stayondomain":
-                stayondomain = 'true'
-            elif opt == "--stayoffdomain":
-                stayoffdomain = 'true'
             elif opt == "--staybelow":
                 staybelow = arg
             elif opt == "--category":
@@ -1068,12 +973,6 @@
         config.set ('no_url_info', no_url_info)
     if stayonhost:
         config.set ('home_stayonhost', stayonhost)
-    if stayoffhost:
-        config.set ('home_stayoffhost', stayoffhost)
-    if stayondomain:
-        config.set ('home_stayondomain', stayondomain)
-    if stayoffdomain:
-        config.set ('home_stayoffdomain', stayoffdomain)
     if staybelow:
         config.set ('home_staybelow', staybelow)
     if maxheight is not None:
diff -ru /usr/lib/python1.5/site-packages/PyPlucker/Url.py /usr/lib/python1.5/site-packages/PyPlucker-1.1-pure/Url.py
--- /usr/lib/python1.5/site-packages/PyPlucker/Url.py   Fri May 25 18:07:14 2001
+++ /usr/lib/python1.5/site-packages/PyPlucker-1.1-pure/Url.py  Thu May 31 09:23:59 2001
@@ -1,7 +1,4 @@
 #!/usr/bin/env python
-# TODO: make sure that all calls to this are still going to work
-# TODO: note: can't change urlparse, urlunparse - standard modules
-# TODO: expand tabs to spaces and check
 
 """
 Url.py   $Id: Url.py,v 1.12 2000/09/25 18:47:39 nordstrom Exp $
@@ -34,7 +31,6 @@
             # Simple copy constructor: make it more efficient
             self._protocol = url._protocol
             self._host = url._host
-            self._domain = url._domain
             self._path = url._path
             self._params = url._params
             self._query = url._query
@@ -54,7 +50,6 @@
             self._params = params
             self._query = query
             self._fragment = fragment
-            self._domain = self.get_domain_from_host ()
 
 
     def as_string (self, with_fragment):
@@ -77,40 +72,11 @@
     def __repr__ (self):
         return "URL (%s)" % repr (self.as_string (with_fragment=1))
 
-    #def get_domain_from_host (self):
-        #import re
-        ## TODO: when splitting up host names, take care of ports: 'www.cwi.nl:80',
-        #match = re.match( r'^\w+\.(.+)$', self._host )
-        #if match:
-            #domain = match.group(1)
-        #else:
-            #domain = ""
-        #return domain
-
-    def get_domain_from_host (self):
-               # TODO: make the minimum length (3) and the starting index (1) user-specified options
-               host_elements = string.split (self._host, ".")
-               #print self._host, "    ", 
-               #print host_elements, len (host_elements)
-               if ( len (host_elements) < 3 ):
-                       #print "SAME"
-                       # e.g., the host (and domain) are slashdot.org
-                       domain = self._host
-               else:
-                       #print "DIFFERENT"
-                       # e.g., the host is www.slashdot.org; the domain is slashdot.org
-                       domain = string.join ( host_elements[1:] , ".")
-               #print "   ", domain
-               return domain
-
     def get_protocol (self):
         return self._protocol
             
     def get_host (self):
         return self._host
-            
-    def get_domain (self):
-        return self._domain
             
     def get_path (self):
         return self._path
