> About three weeks ago I experimented with muffin to filter out javascript
> from the search indexes (it did not work that well).  Here is my experience:
> 
> It is known that htdig does not parse javascript properly.  The search
> summaries display the ugly javascript code instead of the document's true
> summary.  It has been suggested that muffin be used to filter out the
> javascript junk. 
> 
> I have installed muffin (http://muffin.doit.org) and it does filter out the
> javascript, but from what I have found, muffin is more of a personal proxy
> and does not work under a high load.  Muffin tends to return incorrect info
> such as the wrong URL, or the wrong page data when many requests are made
> to it. 
> 
> To get around this, I added a sleep statement in the document retriever
> loop (Retriever.cc) so it would be forced to wait 1 second between
> requests.  Although very slow, the muffin/htdig combo worked until I
> indexed a large site.
> 
> After indexing 40,000+ documents, muffin gave up.  Muffin tends to eat
> memory as it goes along and then just stops responding.
> 
> Basically, the muffin/htdig combo does not really work that well.  I was
> wondering if anybody knows of a better way to filter out javascript.  If
> so, could this be incorporated into htdig?
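
For readers who want to picture the one-second pause described in the
quoted message, here is a minimal sketch of the idea. It is written in
Python purely for illustration and is not htdig's code. The name
throttled and its parameters are hypothetical. Unlike a fixed sleep, it
only waits for whatever remains of the interval since the previous item.

```python
import time


def throttled(items, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Yield items so that consecutive yields are at least min_interval apart.

    Hypothetical helper: clock and sleep are injectable so the pacing
    can be checked without actually waiting.
    """
    last = None
    for item in items:
        if last is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

A crawler loop could then iterate with something like
"for url in throttled(urls): ..." so that the proxy never sees two
requests less than a second apart.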


Andrew forwarded me this message a while back so I guess I should
respond.  With regard to Muffin eating up a lot of memory, did you
try running Muffin without the GUI?  You can do this at startup with
the -nw option.

I'm not sure why Muffin would return incorrect results.  Is there
any way you can reproduce this?

A new version of Muffin will be released next week sometime.  This
version does have at least one HTML parsing fix.
----------------------------------------------------------------------
To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the body of the message.
