Hello,
I am studying at Queen's University, Canada, and am currently working on grid computing. My work involves building an email spam filter that can use the resources of the grid.
I am interested in knowing more about how ranking and clustering of results are done in Nutch, and whether you currently use and deploy any solution for this. I have a functional grid-based lexical analyzer that performs Bayesian, chi-square, and latent semantic analysis. One thing I am now looking into for email analysis: if an incoming email is checked against the database of the user's emails, can it be inferred whether the incoming email belongs to that category or not? So while looking into the code, I was wondering: if I feed the user's emails in as text files, would it be possible to profile the user, and could that profile then be used to check whether a message is spam?
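To make the profiling idea concrete, here is a rough sketch of what I have in mind: build a term-frequency profile from the user's emails and score an incoming message by cosine similarity against it. This is only an illustration of the question, not anything taken from Nutch; the class and method names (UserProfile, addEmail, similarity) are my own, and the tokenization is deliberately naive (no stemming, no stop words).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical user profile: summed term counts over the user's emails. */
final class UserProfile {
    private final Map<String, Integer> termCounts = new HashMap<>();

    /** Splits text on non-alphanumeric characters and counts each token. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Adds one of the user's own emails to the profile. */
    void addEmail(String text) {
        termFrequencies(text).forEach(
            (term, count) -> termCounts.merge(term, count, Integer::sum));
    }

    /** Cosine similarity between the incoming email and the profile (0.0 to 1.0). */
    double similarity(String incoming) {
        Map<String, Integer> tf = termFrequencies(incoming);
        double dot = 0.0;
        double normIncoming = 0.0;
        double normProfile = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            dot += (double) e.getValue() * termCounts.getOrDefault(e.getKey(), 0);
            normIncoming += (double) e.getValue() * e.getValue();
        }
        for (int count : termCounts.values()) {
            normProfile += (double) count * count;
        }
        if (normIncoming == 0.0 || normProfile == 0.0) {
            return 0.0; // empty message or empty profile: nothing to compare
        }
        return dot / (Math.sqrt(normIncoming) * Math.sqrt(normProfile));
    }
}
```

A low similarity would not by itself mean spam; I would expect to use the score as one more input alongside the other filters.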
Could you please provide me with details of the Nutch system architecture? (I would really appreciate it; after 30 cups of tea, and still in no mood to give up, I seek your help.)
Thanks,


I found this (\nutch-0.4\src\java\net\nutch\quality), but I could not find any documentation that would help me visualize it all.
/*********************************************
* This finds a ranking of all known pages that
* minimizes the Kendall Tau distance between the
* full-ranking and each component ranking.
*
* @author Mike Cafarella
*********************************************/
public class MarkovRankSolver {   <<<<--------- How does this class work?
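For reference, this is how I understand the Kendall tau distance mentioned in that comment: the number of item pairs that two rankings place in opposite order. The sketch below is my own and is not code from Nutch; the class and method names are made up, and it says nothing about how MarkovRankSolver actually finds the minimizing ranking.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Nutch. */
final class KendallTau {
    /**
     * Counts the pairs of items that rankings a and b order differently.
     * Both lists must contain the same distinct items, best-ranked first.
     */
    static int distance(List<String> a, List<String> b) {
        Map<String, Integer> posInB = new HashMap<>();
        for (int i = 0; i < b.size(); i++) {
            posInB.put(b.get(i), i);
        }
        boolean sameItems = a.size() == b.size()
            && posInB.size() == b.size()
            && new HashSet<>(a).size() == a.size()
            && posInB.keySet().containsAll(a);
        if (!sameItems) {
            throw new IllegalArgumentException("rankings must hold the same distinct items");
        }
        int discordant = 0;
        for (int i = 0; i < a.size(); i++) {
            for (int j = i + 1; j < a.size(); j++) {
                // a ranks item i ahead of item j; count the pair if b disagrees
                if (posInB.get(a.get(i)) > posInB.get(a.get(j))) {
                    discordant++;
                }
            }
        }
        return discordant;
    }
}
```

Two identical rankings give a distance of 0, and a fully reversed ranking of n items gives n(n-1)/2. As I read the comment, the solver looks for a full ranking that keeps this distance small against every component ranking.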





__________________________________________________

Filter flow pipeline of the email filtering process that I have in mind.
-----------------------------------------------------------------
This is computationally intensive, but the use of a grid makes it a viable option.
It is supposed to work at the enterprise gateway (MTA) level, using a mix of all the approaches to filter the email. The basic SMTP server is going to be JAMES (Java Apache Mail Enterprise Server). The grid toolkit is Globus v3 or Jgrid. IMAP server > Cyrus, and webmail client > JWMA.


email > white/blacklist of sender
> MD5 hash of the message field checked against a list of MD5 hashes of known spam
> preprocessing of the email to remove HTML tags, followed by Porter's stemming algorithm
> Bayesian filter
> Chi-square filter
> Latent semantic analysis
Also, in the end I want to add some sort of semantic content analysis for content
management of email and automatic sorting of emails into appropriate folders for the users, not just by headers but using some sort of AI (the reason I am mailing you).
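To show how I picture the stages in the flow above being chained, here is a rough sketch covering only the first two steps (sender lists and the MD5 check). Every name in it (Verdict, Email, FilterStage, SenderListStage, Md5Stage, FilterPipeline) is hypothetical and of my own choosing; none of it comes from JAMES, Globus, or Nutch. It also assumes the hash is taken over the message body as UTF-8 text. The statistical filters would plug in as further stages.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Set;

enum Verdict { HAM, SPAM, UNSURE }

final class Email {
    final String sender;
    final String body;
    Email(String sender, String body) { this.sender = sender; this.body = body; }
}

/** One step of the pipeline; returns UNSURE to pass the email on. */
interface FilterStage {
    Verdict check(Email email);
}

final class SenderListStage implements FilterStage {
    private final Set<String> whitelist;
    private final Set<String> blacklist;
    SenderListStage(Set<String> whitelist, Set<String> blacklist) {
        this.whitelist = whitelist;
        this.blacklist = blacklist;
    }
    public Verdict check(Email email) {
        if (whitelist.contains(email.sender)) return Verdict.HAM;
        if (blacklist.contains(email.sender)) return Verdict.SPAM;
        return Verdict.UNSURE;
    }
}

final class Md5Stage implements FilterStage {
    private final Set<String> knownSpamHashes; // lowercase hex MD5 strings
    Md5Stage(Set<String> knownSpamHashes) { this.knownSpamHashes = knownSpamHashes; }

    /** Lowercase hex MD5 of the text, encoded as UTF-8. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("MD5 not available", ex);
        }
    }

    public Verdict check(Email email) {
        return knownSpamHashes.contains(md5Hex(email.body)) ? Verdict.SPAM : Verdict.UNSURE;
    }
}

final class FilterPipeline {
    private final List<FilterStage> stages;
    FilterPipeline(List<FilterStage> stages) { this.stages = stages; }

    /** Runs the stages in order and stops at the first decisive verdict. */
    Verdict classify(Email email) {
        for (FilterStage stage : stages) {
            Verdict verdict = stage.check(email);
            if (verdict != Verdict.UNSURE) return verdict;
        }
        return Verdict.UNSURE;
    }
}
```

Putting the cheap checks first means the expensive stages (the ones that would go out to the grid) only see mail that the earlier stages could not decide.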
________________________________________________________________________
________________________________________________________________________
I look forward to your suggestions and help.


thanks

Satmeet Singh soin
MCSE NT4.0 & Win2k,CCNA 2.0,OCP 9iDBA
School of Computing
Queen's University, Goodwin Hall
Kingston, Ontario K7L 3N6,Canada.
mobile :613-583-7646




_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers