Fantastic! Thanks for the info!

Chris
On 26/09/11 14:48, Jason Baldridge wrote:
> Before you do POS tagging and such, you should probably get set up with
> word-based indicators of authorship, like type-token ratios, average word
> length, frequent unigrams and bigrams and so on. Then you just need the
> text, so no annotation or model training is necessary. Usually
> dimensionality reduction techniques like PCA are good in this context too.
>
> If you haven't already, you should check out Patrick Juola's page:
>
> http://www.mathcs.duq.edu/~juola/
>
> And especially his book on authorship attribution.
>
> If you do want to build POS taggers, there are some useful instructions
> here:
>
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
>
> Jason
>
>
> On Mon, Sep 26, 2011 at 7:37 AM, Chris Yocum <cyo...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I am working with a student at my university on using NLP techniques in
>> document categorisation in late Middle Irish. I am a coder and I know
>> Java so that won't be a problem. We are building a corpus at the moment.
>>
>> We are working on a specific author and what we would like to do is see
>> if a particular poem/text is his or not based on NLP. What I was
>> thinking is we would need a few things:
>>
>> 1) a corpus of Middle Irish texts of the same general linguistic range
>> (we are working on that at the moment). Is there any
>> documentation/knowledge on how to create this (or is this just training
>> the POS tagger)?
>>
>> 2) Train a model
>>
>> 3) pass that model to the document categoriser with the relevant model
>> and what kinds of categories there are (his, not his, and unsure).
>>
>> A few other miscellaneous questions: will we need to put part of speech
>> tags in the corpus to create the model?
>>
>> Thanks in advance!,
>> Chris Yocum
>>
>>
>
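For reference, the word-based indicators Jason mentions (type-token ratio, average word length, frequent unigrams and bigrams) can be computed from raw text alone. The following is only a minimal sketch in Python using the standard library. The names `tokenize`, `stylometric_features` and `top_n` are hypothetical, and the tokenizer is a crude assumption (lowercase, split on runs of word characters) that would likely need adapting to Middle Irish orthography.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and keep runs of Unicode word
    # characters. Hyphens and apostrophes act as separators, which is a
    # simplification and not a rule designed for Middle Irish.
    return re.findall(r"\w+", text.lower())


def stylometric_features(text, top_n=5):
    # Compute a few annotation-free authorship indicators for one text.
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")

    unigrams = Counter(tokens)
    # Pair each token with its successor to count adjacent word pairs.
    bigrams = Counter(zip(tokens, tokens[1:]))

    return {
        # Distinct word forms divided by total words.
        "type_token_ratio": len(unigrams) / len(tokens),
        # Mean number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Most frequent single words and word pairs, with their counts.
        "top_unigrams": unigrams.most_common(top_n),
        "top_bigrams": bigrams.most_common(top_n),
    }
```

Calling `stylometric_features(poem_text)` on each text gives one feature dictionary per text, which could then be compared between the known author's works and the disputed ones. Type-token ratio tends to fall as a text gets longer, so it is probably safest to compare texts or samples of similar length.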