Gilles Detillieux <[EMAIL PROTECTED]> writes:
> What sort of modifications did you need? htdig 3.1.5 already does handle
> text/plain files, using the htdig/Plaintext.cc parser.
Sorry, I didn't explain that well. I had to handle local files with no
extension. http://www.htdig.org/mail/2000/10/0169.html is the message
that made me realize I needed to get files recognized. Drove me batty
for a while last night trying to figure that out :)
Basically I made Document.cc assume that any file whose name is a number
(as is the case for every file in my mail spool) is a text/plain file.
http://helium.tucc.uab.edu/~sprout/htdig-nnml.patch is my small ugly
patch.
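
For illustration only, the rule amounts to something like the Python
sketch below. This is not the patch itself, which changes Document.cc in
C++; the function name and the fallback type are made up for the example.

```python
import os
import re

# Hypothetical helper: mirrors the "numeric file name means text/plain" rule.
_ALL_DIGITS = re.compile(r"[0-9]+")

def guess_content_type(path, default="application/octet-stream"):
    """Return text/plain for files whose base name consists only of digits."""
    name = os.path.basename(path)
    if _ALL_DIGITS.fullmatch(name):
        return "text/plain"
    # Anything else keeps whatever type the caller would otherwise assume.
    return default
```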
> htdig can handle large numbers of files (36000 isn't too many), but
> it does seem to run into problems with memory usage when they're all
> specified all at once in the start_url. You might want to try putting
> the URLs for these pages as hrefs in an HTML file, and give this HTML
> file as a start_url. If it still has problems with this, try breaking
> it up into several smaller files (e.g. 60 files of 600 URLs). It should
> be easy enough to write a script to automate this process.
OK, I will try this route.
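
A rough sketch of the kind of script suggested above, in Python. It
assumes the message URLs are already available as a list; the function
name, the output directory and the links-NNN.html naming are all
invented for the example. The 600-per-page default follows the figure
in the quoted suggestion.

```python
import html
from pathlib import Path

def write_link_pages(urls, out_dir, per_page=600):
    """Write HTML pages of plain hrefs, at most per_page links per page.

    Returns the list of page paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pages = []
    for start in range(0, len(urls), per_page):
        chunk = urls[start:start + per_page]
        page = out / f"links-{start // per_page:03d}.html"
        # One anchor per URL; escape so odd characters cannot break the markup.
        links = "\n".join(
            f'<a href="{html.escape(u, quote=True)}">{html.escape(u)}</a><br>'
            for u in chunk
        )
        page.write_text(
            f"<html><body>\n{links}\n</body></html>\n", encoding="utf-8"
        )
        pages.append(page)
    return pages
```

With 36000 URLs and the default page size this writes 60 pages, and
those pages (rather than the individual messages) would then be given
as the start_url.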
> Yes, attachments could very well pose problems. If these were in HTML
> files, you could use noindex_start and noindex_end to remove some sections,
> but with text/plain you may be out of luck unless you can patch the plain
> text parser to somehow exclude these.
My idea, if it is sane, is to write a parser script to handle
text/plain in this special case (with a dedicated htdig.cfg, as I have
now).
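
As a sketch of just the filtering half of such a script, the Python
below keeps the text/plain body parts of a message and drops anything
that looks like an attachment. It assumes the spool files are
MIME-formatted messages and uses the standard email module; the
function name is made up. It does not produce the output htdig expects
from an external parser, and it would not catch attachments pasted
inline (uuencoded, for example).

```python
from email import message_from_string

def body_text_without_attachments(raw_message):
    """Return the text/plain body parts of a message, skipping attachments."""
    msg = message_from_string(raw_message)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container only; its children are visited by walk()
        if part.get_content_type() != "text/plain":
            continue  # binary or HTML parts are not indexed
        if part.get_filename():
            continue  # a file name marks the part as an attachment
        payload = part.get_payload(decode=True)  # undoes base64/quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to a decoding that cannot fail.
            parts.append(payload.decode("latin-1"))
    return "\n".join(parts)
```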
--
Chris Green <[EMAIL PROTECTED]>
Laugh and the world laughs with you, snore and you sleep alone.
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html