William, that map looks good to me. In my current project I find this method convenient for getting back the probabilities over the model's categories as a Map. Let me know if there's anything wrong with it :)

  public Map<String, Double> categoriesAsMap(String text) {
    Map<String, Double> probDist = new HashMap<String, Double>();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      probDist.put(category, categorize[getIndex(category)]);
    }
    return probDist;
  }

Perhaps we should consider adding this method to abstract some details... just a thought.

On Thu, Apr 24, 2014 at 3:56 PM, William Colen <william.co...@gmail.com> wrote:

> What do you think of adding the following field to the DocumentSample?
>
>   Map<String, Object> extraInformation
>
> Also, we could add the following methods to the DocumentCategorizer
> interface:
>
>   public double[] categorize(String text[], Map<String, Object> extraInformation);
>   public double[] categorize(String documentText, Map<String, Object> extraInformation);
>
> Any opinion?
>
> Thank you,
> William
>
> 2014-04-17 10:39 GMT-03:00 Mark G <giaconiam...@gmail.com>:
>
> > Another general doccat thought I had is this. In my projects that use
> > Doccat, I created a class called SampleCollection, which simply wrapped a
> > List<DocumentSample> but also provided a method that returned the samples
> > as a DoccatModel (using a properly formatted ByteArrayInputStream of the
> > doccat training format of all the samples). This worked out well because I
> > stored all the samples in a database, and users could CRUD samples for
> > different categories. There was a map reduce job that at job startup read
> > the samples from the database into the SampleCollection, dynamically
> > generated the model, and then used the model to classify all the texts
> > across the cluster, so every MR job ran the latest and greatest model based
> > on current samples.
> > Not sure if we're interested in something like that,
> > but I see several questions on Stack Overflow asking about iterative model
> > building, and a SampleCollection that returns a Model has worked for me. I
> > also created a SampleCRUD interface that abstracts storage and retrieval of
> > the samples. I had a Postgres and an Accumulo impl for sample storage.
> > Just a thought. I know this can get very specific and complicated, but I
> > thought we may be able to find a middle ground by providing a framework and
> > some generic impls.
> > MG
> >
> > On Thu, Apr 17, 2014 at 8:28 AM, William Colen <william.co...@gmail.com> wrote:
> >
> > > Yes, I don't see how to represent the sentences and paragraphs.
> > >
> > > +1 for the generic Map as suggested by Mark. We already have such things in
> > > other sample classes, like NameSample and the POSSample.
> > >
> > > A use case: the 20news corpus is a collection of articles, and each article
> > > contains fields like "From", "Subject", "Organization". Mahout, which
> > > includes a formatter for this corpus, concatenates it all into the text
> > > field, but I think we could improve accuracy by handling this metadata in a
> > > separate feature generator.
> > >
> > > 2014-04-17 8:37 GMT-03:00 Tech mail <giaconiam...@gmail.com>:
> > >
> > > > I agree, this goes back to the concept of having a "document" model...
> > > > I know in the prod systems where I've used doccat, storing sentences and
> > > > paragraphs wouldn't make sense; people usually have their own domain
> > > > model for that.
> > > > I still feel like if we augment the DocumentSample object with a
> > > > generic Map it would be helpful in some cases and not constraining.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
> > > > >
> > > > >> On 04/15/2014 07:45 PM, William Colen wrote:
> > > > >> Hello,
> > > > >>
> > > > >> I've been working with the Doccat module and I am wondering if we could
> > > > >> improve its data structure for the 1.6.0 release.
> > > > >>
> > > > >> Today the DocumentSample has the following attributes:
> > > > >>
> > > > >> - String category
> > > > >> - List<String> text
> > > > >>
> > > > >> I would suggest adding an attribute to hold metadata, or additional
> > > > >> context information. What do you think?
> > > > >
> > > > > Right now the training format contains these two fields per line.
> > > > > Do you want to change the format as well?
> > > > >
> > > > >> Also, what do you think of including sentence and paragraph information? I
> > > > >> don't know if there is anything a feature generator can extract from it to
> > > > >> improve the classification.
> > > > >
> > > > > I guess we only want to do that if there is a use case for it. It will
> > > > > make the processing for the clients more complex, since they would then
> > > > > have to provide sentences and paragraphs compared to just a piece of text.
> > > > >
> > > > > Jörn
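
For anyone who wants to try the categoriesAsMap idea from the top of this message outside a categorizer class, here is a rough standalone sketch. It assumes the array returned by categorize() is index-aligned with getCategory(i), so the category names and probabilities can be passed in as two plain arrays. The class name CategoryProbabilities and the method toMap are made up for illustration and are not part of OpenNLP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not an OpenNLP class: pairs category names with scores.
class CategoryProbabilities {

    // Assumption: probs[i] is the probability of categories[i].
    static Map<String, Double> toMap(String[] categories, double[] probs) {
        if (categories.length != probs.length) {
            throw new IllegalArgumentException(
                "expected one probability per category, got "
                    + categories.length + " categories and "
                    + probs.length + " probabilities");
        }
        // LinkedHashMap keeps the categories in their original index order.
        Map<String, Double> probDist = new LinkedHashMap<>();
        for (int i = 0; i < categories.length; i++) {
            probDist.put(categories[i], probs[i]);
        }
        return probDist;
    }

    public static void main(String[] args) {
        // Made-up categories and scores, standing in for a real model's output.
        String[] categories = {"sports", "politics"};
        double[] probs = {0.8, 0.2};
        System.out.println(toMap(categories, probs));
    }
}
```

Using a LinkedHashMap rather than a HashMap is only a convenience here: iteration order then matches the category indices, which can make logs easier to compare against the raw array.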