Fantastic! Thanks for the info!

Chris

On 26/09/11 14:48, Jason Baldridge wrote:
> Before you do POS tagging and such, you should probably get set up with
> word-based indicators of authorship, like type-token ratios, average word
> length, frequent unigrams and bigrams and so on. Then you just need the
> text, so no annotation or model training is necessary. Usually
> dimensionality reduction techniques like PCA are good in this context too.
> 
> If you haven't already, you should check out Patrick Juola's page:
> 
> http://www.mathcs.duq.edu/~juola/
> 
> And especially his book on authorship attribution.
> 
> If you do want to build POS taggers, there are some useful instructions
> here:
> 
> http://sourceforge.net/apps/mediawiki/opennlp/index.php?title=Main_Page
> 
> Jason
> 
> 
> On Mon, Sep 26, 2011 at 7:37 AM, Chris Yocum <cyo...@gmail.com> wrote:
> 
>> Hello Everyone,
>>
>> I am working with a student at my university on using NLP techniques in
>> document categorisation in late Middle Irish.  I am a coder and I know
>> Java so that won't be a problem.  We are building a corpus at the moment.
>>
>> We are working on a specific author and what we would like to do is see
>> if a particular poem/text is his or not based on NLP.  What I was
>> thinking is we would need a few things:
>>
>> 1) a corpus of Middle Irish texts of the same general linguistic range
>> (we are working on that at the moment).  Is there any
>> documentation/knowledge on how to create this (or is this just training
>> the POS tagger)?
>>
>> 2) Train a model
>>
>> 3) pass that model to the document categoriser, along with the
>> categories we want (his, not his, and unsure).
>>
>> A few other miscellaneous questions: will we need to put part of speech
>> tags in the corpus to create the model?
>>
>> Thanks in advance!
>> Chris Yocum
>>
>>
> 
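
For anyone who wants to try the word-based indicators suggested in the quoted reply (type-token ratio, average word length, frequent unigrams and bigrams), here is a minimal sketch in plain Java. It is only an illustration under some assumptions: the class and method names (StyleFeatures, tokenize, and so on) are made up for this example and are not part of OpenNLP, and the tokeniser simply splits on anything that is not a Unicode letter or combining mark, which is probably too crude for real Middle Irish editions (editorial brackets, hyphenated prefixes, etc.). No annotation or model training is involved; it needs only the raw text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper class for illustration; not part of OpenNLP.
class StyleFeatures {

    // Naive tokeniser (an assumption): lower-case the text and split on any
    // run of characters that are neither letters nor combining marks, so
    // accented vowels and dotted consonants stay inside their words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{M}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Type-token ratio: distinct word forms divided by total words.
    // It depends on text length, so compare samples of similar size.
    static double typeTokenRatio(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        return (double) new HashSet<>(tokens).size() / tokens.size();
    }

    // Mean word length in Unicode code points.
    static double averageWordLength(List<String> tokens) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        for (String t : tokens) {
            total += t.codePointCount(0, t.length());
        }
        return (double) total / tokens.size();
    }

    // Relative frequency of every word n-gram (n = 1 for unigrams,
    // n = 2 for bigrams). Keys are the n-gram's words joined by a space.
    static Map<String, Double> ngramFrequencies(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        int total = tokens.size() - n + 1;
        for (int i = 0; i < total; i++) {
            counts.merge(String.join(" ", tokens.subList(i, i + n)), 1, Integer::sum);
        }
        Map<String, Double> freqs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            freqs.put(e.getKey(), (double) e.getValue() / total);
        }
        return freqs;
    }
}
```

Each text then becomes a vector of numbers (the two ratios plus the relative frequencies of a fixed list of common unigrams and bigrams), and those vectors are what a dimensionality reduction technique such as PCA would take as input.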
