> It works fine (although it took about 8 hours to parse and compile!!)
I just parsed it, using the url you pasted in a follow-up message,
and it only took me 52.5 minutes to grab all 1,250 links and parse them into
the single pdb. I was on a dual PIII/600 on a fairly slow DSL connection. 8
hours sounds excessive. Are you sure you weren't parsing lots of
redundant or offsite links?
I used the following syntax:
$ time plucker-build \
-H "http://www.godrules.net/library/SAmerican/treasury/treasury.htm" \
--staybelow="http://www.godrules.net/library/SAmerican/treasury/" \
--maxdepth=2 --zlib-compression -f GodRules
real 52m31.555s
user 35m20.210s
sys 0m10.900s
The resulting PDB was:
-rw-r--r-- 1 hacker users 5137483 2002-12-11 07:01 GodRules.pdb
I see the problem you're seeing, and there definitely are <pre>
tags in the source. Look closely at the following url, just after the
closing </h1> after the "TSK - GENESIS 1" part:
http://www.godrules.net/library/SAmerican/treasury/treasurygen1.htm
Looks like:
<b><h1><i>TSK - GENESIS 1</i></h1><pre>
^ ding!
..and at the end of the page:
<!-- God Rules.NET--></pre>
^ ding!
So you can remove those with two quick perl one-liners, but the
output may not be what you expect:
perl -pi.orig -e 's,<pre>,,g' *htm
perl -pi.orig -e 's,</pre>,,g' *htm
Good luck!
d.