Okay, I've finally figured it out with some tips from David Resnick in the
Sitescooper Forum. So, for anyone else out there, like me, who has been
trying to figure it out, has tried and given up, or is new and wants a
step-by-step guide, here it is:

        First of all, there are a number of ways to run Sitescooper and Plucker. I
will deal only with the way I am currently doing it, because it is easy to
follow and makes logical sense once you see what is going on.

1. I have modified the context-menu-convert.bat file that is included with
Plucker 1.6 and renamed it to ss.bat (ss for Sitescooper). The only
modification I made was to the last command in the file. It was this:

"%PH%\parser\python\vm\python.exe"
"%PH%\parser\python\PyPlucker\Spider.py" -H file:%1 -f %1 -s CONTEXT_DEFAULT

        It is now this:

START "C:\Insert\Path\To\Sitescooper\Here\sitescooper.pl" -mplucker -nodates

        The -mplucker switch is what causes Sitescooper to output its processed
pages (websites) into multipage Plucker format (just like the real web pages
they are taken from). I also use the -nodates switch to prevent the file
names from including the date. The Plucker reader already provides this
information for me.

2. I have installed "cygwin 1.5.5-1 with utilities" for Windows. This is for
the diff tool.

3. In the "How to do diffs" section of sitescooper.cf I have included this
command:

Diff: C:\cygwin\bin\diff.exe

        This calls Cygwin's diff tool, which is recommended in the Sitescooper
documentation, instead of the default one.

4. In the Makedoc, iSilo, ImageViewer section of sitescooper.cf I have added
the following command:

Plucker: "C:\Program Files\Plucker\parser\python\vm\python" "C:\Program
Files\Plucker\parser\python\pyplucker\spider.py"

        This calls Python to process the sites specified in a later step.
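        For reference, here are the two sitescooper.cf additions from steps 3 and 4
together. They live in different sections of the file, and the paths are from
my machine; adjust them to match your own Cygwin and Plucker installs (the
#-comment lines here are just labels, in the style sitescooper.cf itself uses):

# "How to do diffs" section
Diff: C:\cygwin\bin\diff.exe

# "Makedoc, iSilo, ImageViewer" section
Plucker: "C:\Program Files\Plucker\parser\python\vm\python" "C:\Program Files\Plucker\parser\python\pyplucker\spider.py"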

5. I create my site files in the site_samples directory. An example is:

URL: http://www.cnn.com/mostpopular/
Name: CNN Most Popular
Levels: 2

StoryURL: http://www.cnn.com/2003/.*\.(html|htm)

StoryStart: endclickprintexclude
StoryEnd: endclickprintinclude

        This is just a text file with a name like cnn.site. It tells Sitescooper to
go to the web page (URL:) http://www.cnn.com/mostpopular/, name the file it
creates (Name:) "CNN Most Popular", and scoop 2 levels (Levels:) of the site
(the front page and one link deep from that page). It only includes links
from the homepage (StoryURL:) that reside in the http://www.cnn.com/2003/
directory, where the filenames contain any number of characters (.*),
followed by a literal dot (\.), and end in either html or htm ((html|htm)).
So, for instance, http://www.cnn.com/2003/thebigstory.htm and
http://www.cnn.com/2003/thebigstory.html would both be scooped if the
homepage linked to them, but http://www.cnn.com/2002/thebigstory.htm and
http://www.cnn.com/2002/thebigstory.html would not be plucked, since they
reside in the /2002 directory. In addition, StoryStart: and StoryEnd: both
refer to markers in the actual HTML code of the story pages. In the case of
CNN, I have found that endclickprintexclude and endclickprintinclude work
nicely; this removes the junk on the story pages.
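        Sitescooper treats StoryURL: as a Perl-style regular expression. A quick
way to sanity-check a pattern before a scoop is to try it against some URLs in
Python, whose regex syntax is close enough for this case. This is just an
illustration, not part of the setup:

import re

# The StoryURL pattern from the cnn.site file above, verbatim.
pattern = re.compile(r"http://www.cnn.com/2003/.*\.(html|htm)")

urls = [
    "http://www.cnn.com/2003/thebigstory.htm",   # /2003/, .htm  -> True
    "http://www.cnn.com/2003/thebigstory.html",  # /2003/, .html -> True
    "http://www.cnn.com/2002/thebigstory.htm",   # /2002/        -> False
]

for url in urls:
    # match() anchors at the start of the string, like Sitescooper's check
    print(url, "->", bool(pattern.match(url)))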

6. In the site_choices.txt file I have added references, modeled after the
other references in this file, to my site files and put an X in the
brackets. An example is:

    [X] CNN Top Stories
      URL: http://www.cnn.com/mostpopular/
      Filename: [samples]/cnn.site
      (CNN Most Popular Stories)

7. Finally, I created a shortcut to ss.bat and have Windows run it on a
schedule, unattended, so I never have to do it manually. When I sync each
day, all my sites are loaded onto my Palm and are accessible whenever I have
time to browse through them.
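        Any Windows scheduler will do for this. For example, on Windows 2000/XP the
built-in at command can run the batch file every weekday; the time and path
below are placeholders for your own:

at 06:00 /every:M,T,W,Th,F "C:\Insert\Path\To\ss.bat"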

        Well, that's about it. I hope I haven't forgotten anything. It's entirely
possible that I am not doing everything as efficiently as possible either,
but I'm still learning and at least I have it working very well now. I sure
hope others will find this post useful.

        Thank you to those who offered ideas to help me solve my problems getting
this combination working. Your time is much appreciated.

-Joseph

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
