The second field of NewsgroupPost should be called bodyText, of course.
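
For reference, here is a minimal, self-contained sketch of the corrected class and of the kind of annotation scan the vectorizer performs. It is not the code from the branch: the @Feature and @Target annotations below are hypothetical stand-ins declared locally (the real @Feature also takes an encoder class such as TextValueEncoder), and AnnotationScanSketch and fieldsAnnotatedWith are names made up for this illustration.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the branch's @Feature annotation (no encoder attribute here).
@Retention(RetentionPolicy.RUNTIME)
@java.lang.annotation.Target(ElementType.FIELD)
@interface Feature {}

// Hypothetical stand-in for the branch's @Target annotation.
@Retention(RetentionPolicy.RUNTIME)
@java.lang.annotation.Target(ElementType.FIELD)
@interface Target {}

class NewsgroupPost {
    @Target
    private String newsgroup;  // the label to predict

    @Feature
    private String bodyText;   // corrected name; the original example repeated "newsgroup"
}

class AnnotationScanSketch {
    // Collects the names of the fields declared on 'type' that carry 'annotation'.
    static List<String> fieldsAnnotatedWith(Class<?> type,
                                            Class<? extends Annotation> annotation) {
        List<String> names = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            if (field.isAnnotationPresent(annotation)) {
                names.add(field.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("target:   " + fieldsAnnotatedWith(NewsgroupPost.class, Target.class));
        System.out.println("features: " + fieldsAnnotatedWith(NewsgroupPost.class, Feature.class));
    }
}
```

The annotations need RUNTIME retention, otherwise reflection cannot see them. With the original duplicate field name the class would not compile at all, which is why the rename matters.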

On Mon, Feb 3, 2014 at 10:52 PM, Frank Scholten <[email protected]> wrote:

> Hi all,
>
> I put together a utility which vectorizes plain old Java objects annotated
> with @Feature and @Target via Mahout's vector encoders.
>
> See my Github branch:
> https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer
>
> and the unit test:
> https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java
>
> Use it like this:
>
> class NewsgroupPost {
>
>   @Target
>   private String newsgroup;
>
>   @Feature(encoder = TextValueEncoder.class)
>   private String newsgroup;
>
>   // Getters & setters
>
> }
>
> AnnotationBasedVectorizer<NewsgroupPost> vectorizer = new
> AnnotationBasedVectorizer<NewsgroupPost>(new
> TypeReference<NewsgroupPost>(){});
>
> Here the vectorizer scans the NewsgroupPost's annotations. Then you can do
> this:
>
> NewsgroupPost post = ...
>
> Vector vector = vectorizer.vectorize(post);
> int target = vectorizer.getTarget(post);
> int numFeatures = vectorizer.getNumberOfFeatures();
>
> Note that vectorize() and getTarget() methods are genericly typed and due
> to the type token passed in the constructor we can enforce that only
> NewsgroupPosts are accepted.
>
> The vectorizer uses a Dictionary for encoding the target.
>
> Thoughts?
>
> Cheers,
>
> Frank
