Re: Exclude/include links

Alexander R. Pruss Fri, 30 May 2003 05:49:36 -0700

On Thu, 29 May 2003, David A. Desrosiers wrote:
>       Bookmarks solves these two problems, or seems to, based on these
> explanations. Maybe I don't understand the larger goal here. Alexander?


Bookmarks probably solves the problem.  It does require an extra step to
go to the desired page, which is a bit of a nuisance, but that's all.

The point was to allow more careful control over what is collected.  With
multiple root pages for spidering, one could get that.  But one can also
get that by generating an HTML page that has links to the multiple
"root" pages, and then having bookmarks.  It does require an extra step,
but that can be done.
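
As a minimal sketch of that workaround: the small Python helper below
writes one HTML page whose only content is a list of links to the desired
"root" pages, which could then be handed to the spider as its single
starting page.  The function name build_index_page and the example URLs
are made up for illustration; nothing here is part of Plucker itself.

```python
from html import escape


def build_index_page(roots, title="Plucker roots"):
    """Return an HTML page with one link per (url, label) pair.

    `roots` is a list of (url, label) tuples; both names are
    hypothetical and only serve this illustration.
    """
    items = "\n".join(
        # Escape both the URL and the label so '&' and quotes stay valid HTML.
        f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for url, label in roots
    )
    return (
        "<html><head><title>" + escape(title) + "</title></head>\n"
        "<body><ul>\n" + items + "\n</ul></body></html>\n"
    )


if __name__ == "__main__":
    page = build_index_page([
        ("http://example.org/texts/a.html", "Text A"),
        ("http://example.org/texts/c.html", "Text C"),
    ])
    print(page)
```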

I suppose the thing is that I use Plucker for rather different purposes
from what most people do.  I use it to keep large collections of scholarly
texts.  (E.g., I just plucked the complete works of John Henry Newman,
~18mb compressed.)  Getting just the right subset of texts from a large
website might be difficult.  One can do some stuff with
inclusion/exclusion lists, but occasionally one might want more precise
control.  The ideal would be to input a list of URLs that must be included
in the collection, with a separate depth for each item on the list, on the
assumption that all of these items are interlinked.  For instance, a site
might have a home page which links to five different pages, A, B, C, D
and E.  One could then include the home page, two levels of links starting
with A, three levels starting with C, just page B itself and no links from
it, and exclude D and E.  Right now, such precise control is not
available.  I am not sure yet whether I need it for my purposes or not, so
this is entirely theoretical.  For now, creating an HTML page of links
and using bookmarks will do.
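
To make the idea concrete, here is a rough sketch of how a per-URL depth
list could be turned into a set of pages to collect.  It is not how the
Plucker spider works today; it runs over an in-memory link graph rather
than the web, and the names plan_collection, links, roots and excluded
are all invented for the example.  It also assumes that the page to be
included on its own, with no links followed, is B.

```python
from collections import deque


def plan_collection(links, roots, excluded=()):
    """Return the set of pages to collect.

    links    -- dict mapping a page to the pages it links to (hypothetical)
    roots    -- dict mapping a root page to its own maximum depth
                (0 means "this page only, follow no links")
    excluded -- pages that must never be collected
    """
    excluded = set(excluded)
    best = {}  # page -> largest remaining depth with which it was reached
    queue = deque(
        (page, depth) for page, depth in roots.items() if page not in excluded
    )
    while queue:
        page, remaining = queue.popleft()
        # Skip if this page was already reached with at least as much depth left.
        if best.get(page, -1) >= remaining:
            continue
        best[page] = remaining
        if remaining == 0:
            continue
        for target in links.get(page, ()):
            if target not in excluded:
                queue.append((target, remaining - 1))
    return set(best)


if __name__ == "__main__":
    links = {
        "home": ["A", "B", "C", "D", "E"],
        "A": ["A1"], "A1": ["A2"], "A2": ["A3"],
        "B": ["B1"],
        "C": ["C1"], "C1": ["C2"], "C2": ["C3"], "C3": ["C4"],
    }
    roots = {"home": 0, "A": 2, "C": 3, "B": 0}
    print(sorted(plan_collection(links, roots, excluded={"D", "E"})))
```

Each root carries its own depth budget, and a page reachable from several
roots keeps the largest budget that reaches it, so overlapping roots do
not cut one another short.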

However, I definitely would like more control over the order in which
things are put into the pdb--adding them in the order fetched (with
fragments correctly handled) would be OK--and I definitely would like the
spider to
check over all the files fetched to see if it can resolve any other links
"for free" (i.e., without fetching more data from the web).  The latter
feature would be rather nice, since it would allow one to control
precisely what order pages are put into the pdb.
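
A sketch of what such a "free" resolution pass might look like, using
only the Python standard library: after fetching, every link in every
fetched page is checked against the set of pages already on hand, with
fragments stripped before the comparison.  The function
resolve_free_links and the shape of its input (a dict from URL to HTML
text) are assumptions made for this example, not existing spider code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class _LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def resolve_free_links(fetched):
    """Split links into those answerable from already-fetched pages and the rest.

    fetched -- dict mapping an absolute URL to its HTML text (hypothetical)
    Returns (resolved, unresolved), each a set of (source_url, target_url).
    """
    resolved, unresolved = set(), set()
    for source, html_text in fetched.items():
        collector = _LinkCollector()
        collector.feed(html_text)
        for href in collector.hrefs:
            # Make the link absolute and drop any #fragment before matching.
            target, _fragment = urldefrag(urljoin(source, href))
            pair = (source, target)
            if target in fetched:
                resolved.add(pair)
            else:
                unresolved.add(pair)
    return resolved, unresolved


if __name__ == "__main__":
    pages = {
        "http://example.org/a.html": '<a href="b.html#sec2">B</a> <a href="c.html">C</a>',
        "http://example.org/b.html": "<p>no links here</p>",
    }
    done, missing = resolve_free_links(pages)
    print(sorted(done))
    print(sorted(missing))
```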

Alex

--
Dr. Alexander R. Pruss  || e-mail: [EMAIL PROTECTED]
Philosophy Department   || online papers and home page:
Georgetown University   ||  www.georgetown.edu/faculty/ap85
Washington, DC 20057    ||
U.S.A.                  ||
-----------------------------------------------------------------------------
   "Philosophiam discimus non ut tantum sciamus, sed ut boni efficiamur."
       - Paul of Worczyn (1424)

_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev