Hi Lewis, I'm looking at creating a Nutch plugin to determine whether a document is an article on religion and, if so, which religion it's primarily talking about. The plugin would then add an annotation called 'religion' to the document, holding the primary category of the religion. Examples: Atheism, Buddhism, Christian, Hindu, Jewish, Muslim, or Unknown (if it can't be determined). No annotation will be added if it's not an article on religion. Next, there would be a second annotation for the sub-category of the religion. For example, under Christian would be Catholic or Protestant. Then possibly a third annotation for the denomination. Examples of denominations: 'Baptist Bible Churches' or 'Christian Methodist Episcopal Church' (I have a list of 147 denominations). I'm not familiar with religious breakdowns, so I don't know if this is the appropriate way to categorize them.
****** Design: I created a Java class for religion that implements the IndexingFilter interface. I first determine whether the document is an article on religion by counting the number of occurrences of certain key words in it. For example, if 'God' appears more than 10 times, it's an article on religion. If it mentions 'Christian' more than a certain number of times and more often than other religions, the primary category would be 'Christian'. The first match in the denomination search would be assumed to be the denomination. I'm also using a language-detection plugin (http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to determine the language of the document so I can search for words in the document's native language. I don't know if this is the best approach to solving this issue.
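To make the counting step concrete, here is a rough sketch of the keyword logic on its own, outside of Nutch. It is not the actual plugin code. The class name ReligionClassifier, the keyword patterns, and the category threshold of 3 are all assumptions made up for illustration; only the "'God' more than 10 times" rule comes from the description above. It handles English keywords only and leaves out the sub-category and denomination steps.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keyword-count classification only, no Nutch dependencies.
final class ReligionClassifier {
    // From the description: more than 10 mentions of 'God' means "article on religion".
    static final int RELIGION_THRESHOLD = 10;
    // Assumed value: a category must be mentioned more than this many times.
    static final int CATEGORY_THRESHOLD = 3;

    private static final Pattern GENERIC = word("god");

    // Illustrative English keyword patterns per primary category (assumptions).
    private static final Map<String, Pattern> CATEGORIES = new LinkedHashMap<>();
    static {
        CATEGORIES.put("Atheism", word("atheis\\w*"));
        CATEGORIES.put("Buddhism", word("buddh\\w*"));
        CATEGORIES.put("Christian", word("christian\\w*"));
        CATEGORIES.put("Hindu", word("hindu\\w*"));
        CATEGORIES.put("Jewish", word("jewish|judaism"));
        CATEGORIES.put("Muslim", word("muslim\\w*|islam\\w*"));
    }

    // Builds a case-insensitive, whole-word pattern.
    private static Pattern word(String regex) {
        return Pattern.compile("\\b(?:" + regex + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Counts non-overlapping matches of the pattern in the text.
    static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Returns empty if the text is not about religion; otherwise the primary
    // category, or "Unknown" when no category clearly dominates.
    static Optional<String> classify(String text) {
        if (count(GENERIC, text) <= RELIGION_THRESHOLD) {
            return Optional.empty();
        }
        String best = "Unknown";
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, Pattern> e : CATEGORIES.entrySet()) {
            int c = count(e.getValue(), text);
            if (c > bestCount) {
                best = e.getKey();
                bestCount = c;
                tie = false;
            } else if (c == bestCount && c > 0) {
                tie = true;
            }
        }
        if (bestCount <= CATEGORY_THRESHOLD || tie) {
            return Optional.of("Unknown");
        }
        return Optional.of(best);
    }
}
```

In the plugin, the idea would be to call something like classify() on the parsed text from inside the indexing filter and add the 'religion' field only when the result is present. Whole-word matching is used so that words like 'goddess' don't count toward 'God'.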

