Before you get into POS tagging and the like, you should probably start with word-based indicators of authorship, such as type-token ratios, average word length, and the most frequent unigrams and bigrams. For those you only need the raw text, so no annotation or model training is necessary. Dimensionality reduction techniques like PCA are usually helpful in this context too, once you have a feature vector per text.
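In case it helps, here is a minimal sketch of how those word-based indicators could be computed from raw text. It is in Python rather than Java just to keep it short, and the function names (`tokenize`, `stylometric_features`) are made up for illustration. The tokenizer is a deliberately crude assumption (lowercase, keep runs of letters) and would likely need adjusting for Middle Irish orthography.

```python
import re
import unicodedata
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption, not a recommendation for Middle Irish)."""
    # Normalize to NFC so an accented vowel is one code point, not base + combining mark.
    text = unicodedata.normalize("NFC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", text)


def stylometric_features(text, top_n=5):
    """Return simple word-based authorship indicators for one text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    unigrams = Counter(tokens)
    # Pair each token with its successor to count adjacent word pairs.
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(unigrams) / len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "top_unigrams": unigrams.most_common(top_n),
        "top_bigrams": bigrams.most_common(top_n),
    }


if __name__ == "__main__":
    # Placeholder English sentence; substitute the contents of a corpus file.
    sample = "the cat sat on the mat and the cat slept"
    for name, value in stylometric_features(sample).items():
        print(name, value)
```

One caveat: the type-token ratio depends on text length, so it is only comparable between samples of roughly the same size. To go on to PCA, you would typically turn the relative frequencies of a fixed list of frequent words into one vector per text and feed those vectors to whatever PCA implementation you prefer.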
If you haven't already, you should check out Patrick Juola's page:

http://www.mathcs.duq.edu/~juola/

And especially his book on authorship attribution.

If you do want to build POS taggers, there are some useful instructions here:

http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page

Jason

On Mon, Sep 26, 2011 at 7:37 AM, Chris Yocum <cyo...@gmail.com> wrote:
> Hello Everyone,
>
> I am working with a student at my university on using NLP techniques in
> document categorisation in late Middle Irish. I am a coder and I know
> Java so that won't be a problem. We are building a corpus at the moment.
>
> We are working on a specific author and what we would like to do is see
> if a particular poem/text is his or not based on NLP. What I was
> thinking is we would need a few things:
>
> 1) a corpus of Middle Irish texts of the same general linguistic range
> (we are working on that at the moment). Is there any
> documentation/knowledge on how to create this (or is this just training
> the POS tagger)?
>
> 2) Train a model
>
> 3) pass that model to the document categoriser with the relevant model
> and what kinds of categories there are (his, not his, and unsure).
>
> A few other miscellaneous questions: will we need to put part of speech
> tags in the corpus to create the model?
>
> Thanks in advance!,
> Chris Yocum