Re: Wikipedia and Plucker?

David A. Desrosiers Wed, 21 Apr 2004 11:05:03 -0700

> Does anyone know how Plucker would work for something like this?  Has
> anyone tried it?


        After converting the whole shebang to HTML, I tried the existing
Python and JPluck parsers against it. Both eventually ran out of memory and
died. The Python one was significantly worse at parsing the HTML. It seemed
to require exponentially more memory for each page after the first 80,000
it reached, and eventually drove the box to an astronomical load.

        With JPluck, I allocated 2 GB of RAM to the JVM, and it still ran
out of memory, eventually dying with random out-of-memory messages and
messages that a page size of 29000 was too large to parse.

        I'll see if I can make some modifications to the Python and Java
distillers and try again. I may also run this through my Perl distiller as
a stress test. Once converted to HTML with some very basic scripting, the
whole thing comes to about 1.8 GB of HTML. I could probably trim a good 40%
of that by removing the non-visible elements that Plucker would ignore
anyway (<font face>, etc.).
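
        A minimal sketch of that trimming step, in Python since that is
one of the distillers discussed. The regex, the function names, and the
one-file-at-a-time approach are assumptions for illustration only; none of
this is part of Plucker or JPluck, and it only handles <font> tags, not the
full set of elements Plucker ignores.

```python
import re

# Matches opening and closing <font> tags, with any attributes,
# case-insensitively. Assumption: no attribute value contains a literal ">".
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def strip_font_tags(html: str) -> str:
    """Remove <font ...> and </font> tags but keep the text between them."""
    return FONT_TAG.sub("", html)


def strip_file(src_path: str, dst_path: str) -> None:
    """Clean a single HTML file (hypothetical helper).

    Working on one page at a time keeps memory use tied to the size of
    the largest page rather than to the size of the whole dump.
    """
    with open(src_path, encoding="utf-8", errors="replace") as src:
        cleaned = strip_font_tags(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)
```

        Calling strip_font_tags('<FONT face="Arial">Hi</font> there')
returns 'Hi there'. A real HTML parser would be safer than a regex for
markup this messy, but a pass like this is cheap enough to run over the
whole set before handing it to a distiller.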

        I'll get back to this after I return from Jamaica next week.

d.
