Gilles Detillieux <[EMAIL PROTECTED]> writes:

> What sort of modifications did you need?  htdig 3.1.5 already does handle
> text/plain files, using the htdig/Plaintext.cc parser.

Sorry, I didn't explain that well.  I had to handle local files with no
extension. http://www.htdig.org/mail/2000/10/0169.html is the message
that made me realize I needed to get files recognized.  Drove me batty
for a while last night trying to figure that out :)

Basically I made Document.cc assume that any file whose name is a number
(as is the case for every file in my mail spool) is a text/plain file.

http://helium.tucc.uab.edu/~sprout/htdig-nnml.patch is my small ugly
patch.  
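
For illustration only, the rule can be sketched in a few lines of Python.
This is not the patch itself (which touches Document.cc); the function name
guess_content_type is made up, and the sketch assumes "named a digit" means
the base name consists entirely of ASCII digits.

```python
import os

def guess_content_type(path, default=None):
    """Return "text/plain" for files whose base name is all digits.

    Hypothetical helper illustrating the rule; not htdig code.
    """
    name = os.path.basename(path)
    # The isascii() guard matters because str.isdigit() also accepts
    # non-ASCII digit characters such as superscripts.
    if name.isascii() and name.isdigit():
        return "text/plain"
    # Anything else falls through to whatever detection already exists.
    return default
```

A spool file such as .../htdig/1234 would be treated as text/plain, while
index.html or 12a would be left to the normal extension-based lookup.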

> htdig can handle large numbers of files (36000 isn't too many), but
> it does seem to run into problems with memory usage when they're all
> specified at once in the start_url.  You might want to try putting
> the URLs for these pages as hrefs in an HTML file, and give this HTML
> file as a start_url.  If it still has problems with this, try breaking
> it up into several smaller files (e.g. 60 files of 600 URLs).  It should
> be easy enough to write a script to automate this process.

OK, I will try this route.
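
A script along the lines suggested above might look like the following
Python sketch. The function name write_index_pages, the output file naming
and the page titles are all assumptions made for the example; only the idea
of 600 hrefs per HTML file comes from the suggestion.

```python
import html
import os

def write_index_pages(urls, out_dir, per_page=600, prefix="index"):
    """Write HTML pages of plain links, per_page URLs each.

    Returns the list of page paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    pages = []
    for start in range(0, len(urls), per_page):
        chunk = urls[start:start + per_page]
        number = len(pages)
        path = os.path.join(out_dir, "%s%03d.html" % (prefix, number))
        with open(path, "w", encoding="utf-8") as f:
            f.write("<html><head><title>Mail index %d</title></head><body>\n"
                    % number)
            for url in chunk:
                # Escape the URL for both the attribute and the link text.
                f.write('<a href="%s">%s</a><br>\n'
                        % (html.escape(url, quote=True), html.escape(url)))
            f.write("</body></html>\n")
        pages.append(path)
    return pages
```

With 36000 URLs and the default of 600 per page this writes 60 files, which
could then be served and listed in start_url in place of the individual
message URLs.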

> Yes, attachments could very well pose problems.  If these were in HTML
> files, you could use noindex_start and noindex_end to remove some sections,
> but with text/plain you may be out of luck unless you can patch the plain
> text parser to somehow exclude these.

My idea, if it is sane, is to write a parser script to handle
text/plain in this special case (with a dedicated htdig.cfg, as I have
now).
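
As a rough sketch of the filtering step such a script would need, the
Python below keeps the Subject and the text/plain body parts of a message
and drops attachments. It uses only the standard email package; the function
name indexable_text is made up, and how the result would be handed back to
htdig is deliberately left out.

```python
import email
from email import policy

def indexable_text(raw_bytes):
    """Return the Subject plus text/plain body parts, skipping attachments.

    Hypothetical helper; it only shows the filtering, not the htdig side.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    pieces = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_type() != "text/plain":
            continue  # skip HTML, images, binaries, ...
        if part.get_content_disposition() == "attachment":
            continue  # skip text files sent as attachments
        # Note: an unknown charset would raise here; real code would
        # need a fallback.
        pieces.append(part.get_content())
    return "\n".join(p for p in pieces if p)
```

Messages without any MIME structure pass through unchanged apart from the
headers being reduced to the Subject line.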
-- 
Chris Green <[EMAIL PROTECTED]>
Laugh and the world laughs with you, snore and you sleep alone.

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
