> c.) Parallelizing the spider
I think this is an excellent idea. I've worked out a state diagram
for a better retriever, and it looks fairly easy to implement.
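
For illustration only (this is not the state diagram mentioned above): a minimal sketch of how a parallel retriever might be structured, assuming modern Python 3 and its standard concurrent.futures module. The names crawl, fetch, and extract_links are hypothetical and are not Plucker functions. The caller supplies the actual download and link-extraction logic.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def crawl(start_urls, fetch, extract_links, max_workers=4, max_pages=100):
    """Fetch pages in parallel, following links up to max_pages URLs.

    fetch(url) returns page content; extract_links(url, content) returns
    an iterable of URLs. Both are hypothetical caller-supplied callables.
    Returns (results, failures): dicts keyed by URL.
    """
    seen = set()      # URLs already scheduled, so nothing is fetched twice
    pending = {}      # in-flight future -> URL
    results = {}      # URL -> content
    failures = {}     # URL -> exception raised by fetch

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        def schedule(url):
            # Only called from the main thread, so no locking is needed.
            if url not in seen and len(seen) < max_pages:
                seen.add(url)
                pending[pool.submit(fetch, url)] = url

        for url in start_urls:
            schedule(url)

        while pending:
            # Block until at least one download has finished.
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                url = pending.pop(future)
                try:
                    content = future.result()
                except Exception as exc:
                    failures[url] = exc
                    continue
                results[url] = content
                for link in extract_links(url, content):
                    schedule(link)

    return results, failures
```

The worker threads only run fetch. All bookkeeping stays in the main thread, which keeps the state transitions (scheduled, in flight, done or failed) easy to follow.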
Here's my list of parser TO-DOs (in priority order, but not
necessarily implementation order :-):
1) Stylesheet (CSS) support in HTML pages.
2) XHTML/OEBPS support -- basically, XML support.
3) Pure Java version of the parser -- all that's needed is some code
for JIU to generate Palm image format files, which I don't feel
like writing, but which I'd be happy to describe to any interested volunteer.
4) An improved text format record type that will
a) support seamless merging of text records into large pages
b) support searching without first decompressing the text record
5) An improved image format that will support arbitrarily large
images, captions, etc.
6) Better retriever code. Two issues here: (a) we could move to
Python 2.*, which already contains better retriever code that we
could just use, and (b) any better retriever code should really be
donated to the Python project as part of the Python standard library,
instead of being released under the GPL with Plucker.
7) Support for the OBJECT tag in HTML/XHTML -- requires extensive
restructuring of the parser control flow to allow recursion.
Bill