> I am looking for a way to speed up the generation of the pdb's.
> Right now I am generating a bunch of html files, which are then
> converted to the Plucker pdb format with plucker-build. This
> process takes about 30-40 minutes on my machine (AMD XP 2500, 512MB
> ram, Mandrake Linux 10.0).

How many thousands of HTML files are you converting? 10,000? 50,000?
More? plucker-build (or cplucker, the C++ parser) should never take
that long, unless you have an _enormous_ number of files, or your
Python VM is running out of memory for some reason.

The Python distiller has a few deficiencies in this area, specifically
when it has to parse a large number of files. I'd look into using the
C++ parser from CVS if you need a bit more speed.

I just checked some of my largest fetches this week, and a single
website with 233 external links took 2 minutes 24 seconds to fetch and
convert completely. The machine that .pdb was generated on was a paltry
2.1GHz machine with 1GB of RAM in it (it's a test box).

Based on your results, if my machine were parsing content for 30-40
minutes, that would be well over 6,000 separate links in total. That's
a LOT of links for a Plucker .pdb.

I should also note that I parse and convert Wikipedia and some other
VERY large texts on a regular basis for testing/benchmarking, but those
processes take several days at a time and are probably not good
examples to compare to your source material. Most of the medium to
large documents I create take under 15 minutes at most when using the
Python distiller.

> Do you think that it might be a good idea to generate the pdb's
> directly (instead of generating html files that have to be
> processed by plucker-build)?

If you can deal with the HTML'ized objects in memory as a stream, you
can certainly do that. The Plucker document format is very well
documented:

http://cvs.plkr.org/index.cgi/*checkout*/docs/DBFormat.html?rev=HEAD

> I am asking this question because I am not familiar with the plucker
> pdb format, and I'm wondering whether the effort involved in
> teaching my programs to generate pdb's is worth it.

jSyncManager reportedly has some Java classes that can write Plucker
documents (or at least the author _planned_ to write some; I'm not sure
how far he got with that, whether they're complete, or if they even
work). You can see the classes here:

http://www.jsyncmanager.org/javadoc/v32/index.html

> Is it possible to avoid the parsing of the html files? Maybe it's
> possible to feed the contents directly to plucker-build ...

In its current implementation, plucker-build (which is just a symlink
to Spider.py) cannot be fed a stream of data directly.

You might try using psyco to gain some speed in the distiller if your
machine truly can't handle parsing your source files.

http://psyco.sourceforge.net/

David A. Desrosiers
http://gnu-designs.com
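
As a rough illustration of the psyco suggestion above, here is a minimal sketch of the usual way to switch psyco on while still running normally when it is not installed. psyco.full() is psyco's own call. Everything else here (the enable_psyco helper, the strip_tags stand-in for parsing work) is hypothetical and not part of Plucker; where you would call it in your own scripts or in the distiller is an assumption you would need to check against the source.

```python
import re


def enable_psyco():
    """Hypothetical helper: try to switch on psyco's JIT.

    Returns True if psyco was found and enabled, False otherwise.
    """
    try:
        # Assumption: psyco is installed for this interpreter. It was
        # only ever built for some Python 2 interpreters on x86.
        import psyco
    except ImportError:
        return False  # carry on at normal interpreter speed
    psyco.full()  # ask psyco to compile every function it can handle
    return True


TAG = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Hypothetical stand-in for CPU-bound parsing work."""
    return TAG.sub("", html)


if __name__ == "__main__":
    print("psyco enabled:", enable_psyco())
    print(strip_tags("<p>Hello <b>Plucker</b></p>"))
```

Calling the helper once, before the heavy parsing starts, is enough. If psyco is missing the script behaves exactly as before, so it is a low-risk thing to try before rewriting anything to emit .pdb files directly.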