> I am looking for a way to speed up the generation of the pdb's.
> Right now I am generating a bunch of HTML files, which are then
> converted to the Plucker pdb format with plucker-build.  This
> process takes about 30-40 minutes on my machine (AMD XP 2500, 512MB
> RAM, Mandrake Linux 10.0).

        How many thousands of HTML files are you converting? 10,000?
50,000? More? plucker-build (or cplucker, the C++ parser) should never
take that long, unless you have an _enormous_ number of files, or your
Python VM is running out of memory for some reason.

        The Python distiller has a few deficiencies in this area,
specifically when parsing large numbers of files. I'd look into using
the C++ parser from CVS if you need a bit more speed.

        I just checked some of my largest fetches this week, and a
single website with 233 external links took 2 minutes 24 seconds to
fetch and convert completely. The machine that .pdb was generated on
was a paltry 2.1GHz machine with 1GB of RAM in it (it's a test box).
Based on your results, if my machine were parsing content for 30-40
minutes, that would be well over 6,000 separate links in total.

        That's a LOT of links for a Plucker .pdb. 
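
        A rough cross-check of that kind of estimate, using only the
figures quoted above (233 links in 2 minutes 24 seconds). This is a
sketch that assumes the per-link cost stays constant, which is a
simplification: the quoted timing includes network fetching, so for a
conversion-only run the result is best read as a lower bound.

```python
# Figures quoted above: 233 links fetched and converted in 2 min 24 s.
LINKS = 233
SECONDS = 2 * 60 + 24


def links_in(minutes):
    """Links handled in `minutes` at the same average rate (assumed constant)."""
    return minutes * 60 * LINKS // SECONDS


low, high = links_in(30), links_in(40)
print(f"{SECONDS / LINKS:.2f} s/link -> {low} to {high} links in 30-40 minutes")
```

        Straight-line scaling gives roughly 2,900 to 3,900 links. With
no network fetch in the loop the per-link cost drops, so the count
for a 30-40 minute conversion-only run would be higher still.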

        I should also note that I parse and convert Wikipedia and some 
other VERY large texts on a regular basis for testing/benchmarking, 
but those processes take several days at a time and probably are not 
good examples to compare to your source material. Most of the medium 
to large documents I create take under 15 minutes at the most, when 
using the Python distiller.

> Do you think that it might be a good idea to generate the pdb's
> directly (instead of generating HTML files that have to be
> processed by plucker-build)?

        If you can deal with the HTML'ized objects in memory as a 
stream, you can certainly do that. The Plucker document format is very 
well documented: 

        http://cvs.plkr.org/index.cgi/*checkout*/docs/DBFormat.html?rev=HEAD
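
        To give a feel for the effort involved, here is a minimal
sketch of the outer container only: the generic Palm database (.pdb)
header and record index that any such file starts with. It does not
build Plucker records themselves; their layout is what DBFormat.html
describes, and the payloads below are placeholders. The function name
`build_pdb` is hypothetical, and the "Data"/"Plkr" type and creator
codes are an assumption that should be checked against that document.

```python
import struct
import time

# Palm timestamps count seconds from 1904-01-01 rather than 1970-01-01.
PALM_EPOCH_OFFSET = 2082844800


def build_pdb(name, records, db_type=b"Data", creator=b"Plkr"):
    """Wrap raw record payloads (bytes) in a generic Palm database container.

    The type/creator defaults are assumed values for Plucker documents.
    """
    now = (int(time.time()) + PALM_EPOCH_OFFSET) & 0xFFFFFFFF
    header = struct.pack(
        ">32sHHIIIIII4s4sIIH",
        name.encode("latin-1")[:31],  # struct NUL-pads this to 32 bytes
        0, 0,            # attributes, version
        now, now, 0,     # creation, modification, last-backup times
        0, 0, 0,         # modification number, appInfo and sortInfo offsets
        db_type, creator,
        0, 0,            # unique ID seed, next record list ID
        len(records),
    )
    # Each index entry is 8 bytes: data offset, attribute byte, 3-byte ID.
    offset = len(header) + 8 * len(records) + 2
    index = b""
    for uid, payload in enumerate(records):
        index += struct.pack(">IB", offset, 0) + uid.to_bytes(3, "big")
        offset += len(payload)
    # Two padding bytes conventionally separate the index from the data.
    return header + index + b"\x00\x00" + b"".join(records)


if __name__ == "__main__":
    data = build_pdb("Example", [b"placeholder-1", b"placeholder-2"])
    print(len(data), "bytes")
```

        The container is the easy part. Most of the work in writing
documents directly would be producing correct record contents, which
is what the distiller currently does for you.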

> I am asking this question because I am not familiar with the Plucker
> pdb format, and I'm wondering whether the effort involved in
> teaching my programs to generate pdb's is worth it.

        jSyncManager reportedly has some Java classes that can write
Plucker documents (or at least the author _planned_ to write some; I'm
not sure how far he got with that, or whether they're complete or even
work). You can see the classes here:

        http://www.jsyncmanager.org/javadoc/v32/index.html

> Is it possible to avoid the parsing of the HTML files? Maybe it's
> possible to feed the contents directly to plucker-build ...

        In its current implementation, plucker-build (which is just a
symlink to Spider.py) cannot be fed a stream of data directly. You
might try using Psyco to gain some speed in the distiller if your
machine truly can't handle parsing your source files.

        http://psyco.sourceforge.net/
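
        Enabling Psyco normally takes only an import and one call
early in the program's entry point. A minimal sketch, assuming Psyco
is installed for the interpreter that runs the distiller; the helper
name `enable_psyco` is hypothetical, and where exactly to call it in
Spider.py is an assumption, not something the Plucker sources
prescribe. The import is guarded because Psyco only targets 32-bit
x86 builds of Python 2 and will be missing elsewhere.

```python
def enable_psyco():
    """Turn on Psyco for the whole program if it is available.

    Returns True when Psyco was enabled, False when it is not installed.
    """
    try:
        import psyco
    except ImportError:
        return False
    psyco.full()  # ask Psyco to compile every function it can handle
    return True


if __name__ == "__main__":
    print("psyco enabled:", enable_psyco())
```

        If the guard returns False, the program simply runs at normal
interpreter speed, so the change is safe to leave in place.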


David A. Desrosiers
[EMAIL PROTECTED]
http://gnu-designs.com
_______________________________________________
plucker-list mailing list
plucker-list@rubberchicken.org
http://lists.rubberchicken.org/mailman/listinfo/plucker-list
