OK, so please let us know how you get on. Although you seem to have a clear idea of how you're going to progress with the issue, I would seriously consider taking on board Julien's comments and grabbing the code that he's made available for similar tasks.
All the best
Lewis

On Fri, Jun 29, 2012 at 7:19 PM, JAB <[email protected]> wrote:
> Hi Lewis;
>
> I'm looking at creating a Nutch plugin to determine if a document is an
> article on religion, and what religion it's primarily talking about. Then,
> adding an annotation called 'religion' to the document for the primary
> category of the religion. Examples: Atheism, Buddhism, Christian, Hindu,
> Jewish, Muslim, or Unknown (if it can't be determined). No annotation will
> be added if it's not an article on religion. Next, another annotation for
> the sub-category of the religion. For example, under Christian would be
> Catholic or Protestant. Then possibly a third annotation for the
> denomination. Examples of denomination: 'Baptist Bible Churches' or
> 'Christian Methodist Episcopal Church' (I have a list of 147
> denominations). I'm not familiar with religious breakdowns so I don't know
> if this is the appropriate way to categorize them.
>
> ******
> Design:
>
> I created a Java class on religion that extends the IndexingFilter class.
> I next determine if it's an article on religion. I do so by counting the
> number of occurrences of certain key words in the document. Example: if
> 'God' appears more than 10 times, it's an article on religion. If it
> mentions 'Christian' more than a certain number of times and more often
> than other religions, the sub-category would be 'Christian'. The first
> match on the denomination search would be assumed to be the denomination.
> I'm also using a language-detection plugin
> (http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to
> determine the language of the document so I can search for words in the
> document's native language. I don't know if this is the best approach to
> solving this issue.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.

--
Lewis
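
For reference, the keyword-counting step described in the quoted design could be sketched roughly as below. This is plain Java only, not a Nutch plugin and not JAB's actual code: the class name, method names, the single keyword chosen per category and the category threshold of 3 are all assumptions made for illustration. Only the "more than 10 occurrences of 'God'" rule and the category names come from the message. Language detection and the denomination lookup are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; it does not use any Nutch API.
class ReligionKeywordClassifier {

    // From the message: more than 10 occurrences of "God" marks an article on religion.
    static final int RELIGION_THRESHOLD = 10;

    // Assumed value for the "certain number of times" a category keyword must exceed.
    static final int CATEGORY_THRESHOLD = 3;

    // Category names are from the message; one keyword per category is an assumption.
    static final Map<String, String> CATEGORY_KEYWORDS = new LinkedHashMap<>();
    static {
        CATEGORY_KEYWORDS.put("Atheism", "atheist");
        CATEGORY_KEYWORDS.put("Buddhism", "Buddhist");
        CATEGORY_KEYWORDS.put("Christian", "Christian");
        CATEGORY_KEYWORDS.put("Hindu", "Hindu");
        CATEGORY_KEYWORDS.put("Jewish", "Jewish");
        CATEGORY_KEYWORDS.put("Muslim", "Muslim");
    }

    /** Counts case-insensitive, whole-word occurrences of word in text. */
    static int countOccurrences(String text, String word) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /**
     * Returns empty when the text is not treated as an article on religion
     * (so no annotation would be added); otherwise the category name,
     * or "Unknown" when no category clearly wins.
     */
    static Optional<String> classify(String text) {
        if (countOccurrences(text, "God") <= RELIGION_THRESHOLD) {
            return Optional.empty();
        }
        String best = "Unknown";
        int bestCount = 0;
        int runnerUp = 0;
        for (Map.Entry<String, String> entry : CATEGORY_KEYWORDS.entrySet()) {
            int count = countOccurrences(text, entry.getValue());
            if (count > bestCount) {
                runnerUp = bestCount;
                bestCount = count;
                best = entry.getKey();
            } else if (count > runnerUp) {
                runnerUp = count;
            }
        }
        // The winner must exceed the threshold and strictly beat every other category.
        boolean clearWinner = bestCount > CATEGORY_THRESHOLD && bestCount > runnerUp;
        return Optional.of(clearWinner ? best : "Unknown");
    }
}
```

Calling `ReligionKeywordClassifier.classify(documentText)` returns an empty `Optional` when no 'religion' annotation should be added. Whole-word matching means plurals such as "Christians" are not counted, so a real filter would probably want a keyword list per category and per language rather than a single word.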

