> I am looking for a way to speed up the generation of the pdb's.
> Right now I am generating a bunch of HTML files, which are then
> converted to the Plucker pdb format with plucker-build. This
> process takes about 30-40 minutes on my machine (AMD XP 2500, 512MB
> ram, Mandrake Linux 10.0).
How many thousands of HTML files are you converting? 10,000?
50,000? More? plucker-build (or cplucker, the C++ parser) should never
take that long, unless you have an _enormous_ number of files, or your
Python VM is running out of memory for some reason.
The Python distiller has a few deficiencies in this area,
specifically when parsing large numbers of files. I'd look into using
the C++ parser from CVS if you need a bit more speed.
I just checked some of my largest fetches this week, and a
single website with 233 external links took 2 minutes 24 seconds to
fetch and convert completely. The machine that .pdb was generated on
was a paltry 2.1GHz machine with 1GB of RAM in it (it's a test box).
Based on your results, if my machine was parsing content for 30-40
minutes, that would be well over 6,000 separate links in total.
That's a LOT of links for a Plucker .pdb.
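For what it's worth, here is the back-of-the-envelope arithmetic as a
small Python sketch. It assumes throughput scales linearly with the
number of links, which is a simplification. It also uses my
fetch-plus-convert timing, so it should be read as a floor: parsing
files that are already on local disk skips the network fetch, and the
per-link rate for parsing alone would presumably be higher.

```python
def links_per_minute(links, minutes, seconds):
    """Observed throughput for one complete fetch-and-convert run."""
    return links / (minutes + seconds / 60.0)

# 233 links in 2 minutes 24 seconds, from the run described above.
rate = links_per_minute(233, 2, 24)

def estimate_links(total_minutes, rate_per_minute=rate):
    """Linear extrapolation: links handled in total_minutes at that rate."""
    return int(total_minutes * rate_per_minute)

low, high = estimate_links(30), estimate_links(40)
print("%d to %d links in 30-40 minutes (fetch time included)" % (low, high))
```

With fetch time included this lands at roughly 2,900 to 3,900 links;
take the network out of the picture and the count only goes up.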
I should also note that I parse and convert Wikipedia and some
other VERY large texts on a regular basis for testing/benchmarking,
but those processes take several days at a time and probably are not
good examples to compare to your source material. Most of the medium
to large documents I create take under 15 minutes at the most, when
using the Python distiller.
> Do you think that it might be a good idea to generate the pdb's
> directly (instead of generating HTML files that have to be
> processed by plucker-build)?
If you can deal with the HTML'ized objects in memory as a
stream, you can certainly do that. The Plucker document format is very
well documented:
http://cvs.plkr.org/index.cgi/*checkout*/docs/DBFormat.html?rev=HEAD
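To give a feel for the effort involved, here is a rough Python sketch
of just the outer Palm database container (the 78-byte header plus the
record list). It does NOT build valid Plucker records: the payloads are
treated as opaque bytes that you would have to encode yourself per
DBFormat.html. The function name build_pdb is hypothetical, and the
'Data'/'Plkr' type and creator codes are my recollection, so verify
them against that document.

```python
import struct
import time

# Palm timestamps count seconds from 1904-01-01, not 1970-01-01.
PALM_EPOCH_OFFSET = 2082844800

def build_pdb(name, records, db_type=b"Data", creator=b"Plkr"):
    """Pack already-encoded record payloads into a Palm database image.

    `records` is a list of bytes objects. Encoding each one as a real
    Plucker record is out of scope for this sketch.
    """
    now = int(time.time()) + PALM_EPOCH_OFFSET
    header = struct.pack(
        ">32sHHIIIIII4s4sIIH",
        name.encode("ascii")[:31],  # struct pads the name to 32 bytes
        0,                          # attributes
        1,                          # version
        now, now, 0,                # created, modified, last backup
        0, 0, 0,                    # modification no., appInfo, sortInfo
        db_type, creator,
        0,                          # unique ID seed
        0,                          # next record list ID
        len(records),
    )
    # Record data starts after the header, the 8-byte record entries,
    # and the customary 2 bytes of padding.
    offset = len(header) + 8 * len(records) + 2
    entries = []
    for uid, payload in enumerate(records):
        # Each entry: 4-byte offset, 1-byte attributes, 3-byte unique ID.
        entries.append(struct.pack(">IB", offset, 0) + uid.to_bytes(3, "big"))
        offset += len(payload)
    return header + b"".join(entries) + b"\x00\x00" + b"".join(records)

image = build_pdb("Example", [b"first record", b"second record"])
```

The container is the easy part. The real work is in producing the
index record and the compressed text and image records that the viewer
expects, which is what the distiller already does for you.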
> I am asking this question because I am not familiar with the plucker
> pdb format, and I'm wondering whether the effort involved in
> teaching my programs to generate pdb's is worth it.
jSyncManager reportedly has some Java classes that can write
Plucker documents, or at least the author _planned_ to write some. I'm
not sure how far he got with that, whether they're complete, or whether
they even work. You can see the classes here:
http://www.jsyncmanager.org/javadoc/v32/index.html
> Is it possible to avoid the parsing of the html files? Maybe it's
> possible to feed the contents directly to plucker-build ...
In its current implementation, plucker-build (which is just a
symlink to Spider.py) cannot be fed a stream of data directly. You
might try using psyco to gain some speed in the distiller if your
machine truly can't handle parsing your source files.
http://psyco.sourceforge.net/
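Turning psyco on is usually only a couple of lines. The sketch below
is guarded so it does nothing on a machine without psyco installed.
The helper name enable_jit is hypothetical, and where you call it
(I would try near the top of Spider.py, before the distiller starts
work) is an assumption; I have not measured what it buys you with
Plucker.

```python
# Try to load psyco; carry on without it if it is not installed.
try:
    import psyco
except ImportError:
    psyco = None

def enable_jit():
    """Return True if psyco was found and switched on, False otherwise."""
    if psyco is None:
        return False
    psyco.full()  # ask psyco to compile every function it can handle
    return True

if __name__ == "__main__":
    print("psyco enabled: %s" % enable_jit())
```

If memory is already tight, be aware that psyco trades RAM for speed,
so check that it actually helps before leaving it on.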
David A. Desrosiers
http://gnu-designs.com