> It works fine (although it took about 8 hours to parse and compile!!)

        I just parsed it, using the url you pasted in a follow-up message,
and it only took me 52.5 minutes to grab all 1,250 links and parse them into
the single pdb. I was on a dual PIII/600 on a fairly slow DSL connection. 8
hours sounds extremely excessive. Are you sure you weren't parsing lots of
redundant or offsite links?

        I used the following syntax:

$ time plucker-build \
-H "http://www.godrules.net/library/SAmerican/treasury/treasury.htm" \
--staybelow="http://www.godrules.net/library/SAmerican/treasury/"    \
--maxdepth=2 --zlib-compression -f GodRules

real    52m31.555s
user    35m20.210s
sys     0m10.900s

        The resulting PDB was:

-rw-r--r--    1 hacker     users      5137483 2002-12-11 07:01 GodRules.pdb

        I see the problem you're seeing, and there definitely are <pre>
tags in the source. Look closely at the following url, just after the
closing </h1> after the "TSK - GENESIS 1" part:

http://www.godrules.net/library/SAmerican/treasury/treasurygen1.htm

        Looks like:

<b><h1><i>TSK - GENESIS 1</i></h1><pre>
                                  ^ ding!

        ..and at the end of the page:

<!-- God Rules.NET--></pre>
                     ^ ding!
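
        To see how many pages are affected before changing anything, a
quick grep over the saved files helps. This is only a sketch: the scratch
directory and the two sample pages are made up for illustration, and it
assumes the real pages are saved locally as *htm files.

```shell
#!/bin/sh
# Hypothetical scratch directory with two sample pages, for illustration only.
pages=$(mktemp -d)
printf '<h1>TSK - GENESIS 1</h1><pre>\ntext\n</pre>\n' > "$pages/a.htm"
printf '<h1>Index</h1>\n<p>no preformatted text here</p>\n' > "$pages/b.htm"

# -l prints only the names of files that match; -i also catches <PRE>.
with_pre=$(cd "$pages" && grep -il '<pre>' *htm)
echo "$with_pre"
```

        Run in the directory holding the real pages, the grep line alone
is enough.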

        So you can remove those with two quick perl one-liners, but the
output may not be what you expect:

        perl -pi.orig -e 's,<pre>,,g' *htm
        perl -pi.orig -e 's,</pre>,,g' *htm
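
        One caveat with running two passes: each pass writes a .orig
backup, so the second pass replaces the backup of the untouched file with
the output of the first pass. A minimal sketch of a single-pass variant
follows; the scratch directory and sample.htm are hypothetical, and the
pattern assumes the tags are lowercase and carry no attributes, as in the
page quoted above.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample file, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/sample.htm" <<'EOF'
<b><h1><i>TSK - GENESIS 1</i></h1><pre>
some text
<!-- God Rules.NET--></pre>
EOF

# One pass: "</?pre>" matches both <pre> and </pre>, so sample.htm.orig
# remains a copy of the untouched input.
( cd "$workdir" && perl -pi.orig -e 's,</?pre>,,g' *htm )

cat "$workdir/sample.htm"
```

        The same caveat about the output applies: without <pre>, the
text inside is no longer treated as preformatted, so its line breaks and
spacing are collapsed when rendered.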

        Good luck!


d.

_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
