Hi Lewis,

I'm looking at creating a Nutch plugin to determine whether a document is an
article on religion and, if so, which religion it is primarily about. The
plugin would add an annotation called 'religion' to the document holding the
primary category. Examples: Atheism, Buddhism, Christian, Hindu, Jewish,
Muslim, or Unknown (if it can't be determined). No annotation will be added
if it's not an article on religion. Next, a second annotation would hold the
sub-category of the religion. For example, under Christian would be Catholic
or Protestant. Then possibly a third annotation for the denomination.
Examples of denominations: 'Baptist Bible Churches' or 'Christian Methodist
Episcopal Church' (I have a list of 147 denominations). I'm not familiar with
religious breakdowns, so I don't know if this is the appropriate way to
categorize them.
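
To make the three annotation levels concrete, here is a minimal sketch of how
they could be held and turned into field/value pairs. It is not Nutch code:
the class name ReligionAnnotation and the field names 'religion_subcategory'
and 'religion_denomination' are placeholders I made up for illustration; only
'religion' comes from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical value holder for the three annotation levels; not a Nutch class. */
final class ReligionAnnotation {
    // Field name taken from the description above.
    static final String RELIGION = "religion";
    // Placeholder field names for the second and third annotations (assumptions).
    static final String SUB_CATEGORY = "religion_subcategory";
    static final String DENOMINATION = "religion_denomination";

    private final String religion;     // null means "not an article on religion"
    private final String subCategory;  // null means "not determined"
    private final String denomination; // null means "not determined"

    ReligionAnnotation(String religion, String subCategory, String denomination) {
        this.religion = religion;
        this.subCategory = subCategory;
        this.denomination = denomination;
    }

    /** Returns the annotations to add; empty if the document is not about religion. */
    Map<String, String> toFields() {
        Map<String, String> fields = new LinkedHashMap<>();
        if (religion == null) {
            return fields; // no annotation at all for non-religious documents
        }
        fields.put(RELIGION, religion);
        if (subCategory != null) {
            fields.put(SUB_CATEGORY, subCategory);
        }
        if (denomination != null) {
            fields.put(DENOMINATION, denomination);
        }
        return fields;
    }
}
```

For example, new ReligionAnnotation("Christian", "Protestant", null).toFields()
would yield two entries, while a document that is not about religion yields
an empty map.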

****** 
Design:

I created a Java class on religion that implements the IndexingFilter
interface. I first determine whether the document is an article on religion
by counting the occurrences of certain keywords in it. For example, if 'God'
appears more than 10 times, it's an article on religion. If it mentions
'Christian' more than a certain number of times and more often than other
religions, the primary category would be 'Christian'. The first match in the
denomination search would be assumed to be the denomination. I'm also using a
language-detection plugin
(http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to
determine the language of the document so I can search for words in the
document's native language. I don't know if this is the best approach to
solving this issue.
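
As a rough sketch of the counting logic described above (plain Java, with no
Nutch or language-detection dependencies): the class name
ReligionKeywordClassifier, the CATEGORY_THRESHOLD value, and the English
keyword list are assumptions for illustration, and "first match" for
denominations is assumed to mean first in list order. Only the "more than 10
occurrences of 'God'" rule comes from the description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the keyword-count heuristic; not Nutch code. */
final class ReligionKeywordClassifier {
    // From the description: more than 10 mentions of 'God' means "about religion".
    static final int RELIGION_THRESHOLD = 10;
    // Assumed value for the unspecified "certain number of times".
    static final int CATEGORY_THRESHOLD = 3;

    // Illustrative English keywords; a real list would be chosen per language.
    static final Map<String, String> CATEGORY_KEYWORDS = new LinkedHashMap<>();
    static {
        CATEGORY_KEYWORDS.put("Atheism", "atheist");
        CATEGORY_KEYWORDS.put("Buddhism", "buddhist");
        CATEGORY_KEYWORDS.put("Christian", "christian");
        CATEGORY_KEYWORDS.put("Hindu", "hindu");
        CATEGORY_KEYWORDS.put("Jewish", "jewish");
        CATEGORY_KEYWORDS.put("Muslim", "muslim");
    }

    /** Counts case-insensitive whole-word matches (so 'godly' does not count as 'God'). */
    static int countWord(String text, String word) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** Returns null if not about religion, else a category name or "Unknown". */
    static String classify(String text) {
        if (countWord(text, "God") <= RELIGION_THRESHOLD) {
            return null;
        }
        String best = "Unknown";
        int bestCount = 0;
        boolean tie = false;
        for (Map.Entry<String, String> e : CATEGORY_KEYWORDS.entrySet()) {
            int c = countWord(text, e.getValue());
            if (c > bestCount) {
                best = e.getKey();
                bestCount = c;
                tie = false;
            } else if (c == bestCount && c > 0) {
                tie = true; // not mentioned "more often than other religions"
            }
        }
        return (tie || bestCount <= CATEGORY_THRESHOLD) ? "Unknown" : best;
    }

    /** Returns the first denomination (in list order) found in the text, or null. */
    static String firstDenomination(String text, List<String> denominations) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String d : denominations) {
            if (lower.contains(d.toLowerCase(Locale.ROOT))) {
                return d;
            }
        }
        return null;
    }
}
```

Whole-word matching means plurals such as 'Christians' are not counted here;
stemming or per-language keyword variants would be one way to handle that.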




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
