I think the FeatureGenerator interface in doccat is also different from the ones in the other tools. Also, I don't think the TokenNameFinder interface has a probs() method, which has always made me use the ME impl directly.
Sent from my iPhone

> On Apr 24, 2014, at 6:54 PM, William Colen <william.co...@gmail.com> wrote:
>
> Yes, it looks nice. Maybe we should redo all the DocumentCategorizer
> interface. It is different from other tools, for example, we can't get the
> best category of one document with only one call, we need to use two
> methods.
>
> 2014-04-24 18:43 GMT-03:00 Mark G <ma...@apache.org>:
>
>> William, that map looks good to me.
>> In my current project I find this method convenient for getting back the
>> probs over the categories in the model as a Map....let me know if there's
>> anything wrong with it :)
>>
>> public Map<String, Double> categoriesAsMap(String text) {
>>   Map<String, Double> probDist = new HashMap<String, Double>();
>>
>>   double[] categorize = categorize(text);
>>   int catSize = getNumberOfCategories();
>>   for (int i = 0; i < catSize; i++) {
>>     String category = getCategory(i);
>>     probDist.put(category, categorize[getIndex(category)]);
>>   }
>>   return probDist;
>>
>> }
>>
>> perhaps we should consider adding this method to abstract some
>> details....just a thought
>>
>> On Thu, Apr 24, 2014 at 3:56 PM, William Colen <william.co...@gmail.com>
>> wrote:
>>
>>> What do you think of adding the following field to the DocumentSample?
>>>
>>> Map<String, Object> extraInformation
>>>
>>> Also, we could add the following methods to the DocumentCategorizer
>>> interface:
>>>
>>> public double[] categorize(String text[], Map<String, Object> extraInformation);
>>> public double[] categorize(String documentText, Map<String, Object> extraInformation);
>>>
>>> Any opinion?
>>>
>>> Thank you,
>>> William
>>>
>>> 2014-04-17 10:39 GMT-03:00 Mark G <giaconiam...@gmail.com>:
>>>
>>>> Another general doccat thought I had is this.
>>>> In my projects that use Doccat, I created a class called a
>>>> SampleCollection, which simply wrapped a List<DocumentSample> but then
>>>> provided a method that returned the samples as a DoccatModel (using a
>>>> properly formatted ByteArrayInputStream of the doccat training format of
>>>> all the samples). This worked out well because I stored all the samples
>>>> in a database, and users could CRUD samples for different categories.
>>>> There was a map reduce job that at job startup read in the samples from
>>>> the database into the SampleCollection, dynamically generated the model,
>>>> and then used the model to classify all the texts across the cluster; so
>>>> every MR job ran the latest and greatest model based on current samples.
>>>> Not sure if we're interested in something like that, but I see several
>>>> questions on Stack Overflow asking about iterative model building, and a
>>>> SampleCollection that returns a Model has worked for me. I also created a
>>>> SampleCRUD interface that abstracts storage and retrieval of the
>>>> samples.... I had a Postgres and Accumulo impl for sample storage.
>>>> Just a thought, I know this can get very specific and complicated, though
>>>> we may be able to find a middle ground by providing a framework and some
>>>> generic impls.
>>>> MG
>>>>
>>>> On Thu, Apr 17, 2014 at 8:28 AM, William Colen <william.co...@gmail.com>
>>>> wrote:
>>>>
>>>>> Yes, I don't see how to represent the sentences and paragraphs.
>>>>>
>>>>> +1 for the generic Map as suggested by Mark. We already have such things
>>>>> in other sample classes, like NameSample and the POSSample.
>>>>>
>>>>> A use case: the 20news corpus is a collection of articles, and each
>>>>> article contains fields like "From", "Subject", "Organization".
>>>>> Mahout, which includes a formatter for this corpus, concatenates it all
>>>>> into the text field, but I think we could improve accuracy by handling
>>>>> this metadata in a separate feature generator.
>>>>>
>>>>> 2014-04-17 8:37 GMT-03:00 Tech mail <giaconiam...@gmail.com>:
>>>>>
>>>>>> I agree, this goes back to the concept of having a "document" model...
>>>>>> I know in the prod systems where I've used doccat, storing sentences
>>>>>> and paragraphs wouldn't make sense, people usually have their own
>>>>>> domain model for that. I still feel like if we augment the
>>>>>> DocumentSample object with a generic Map it would be helpful in some
>>>>>> cases and not constraining.
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
>>>>>>
>>>>>>>> On 04/15/2014 07:45 PM, William Colen wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I've been working with the Doccat module and I am wondering if we
>>>>>>>> could improve its data structure for the 1.6.0 release.
>>>>>>>>
>>>>>>>> Today the DocumentSample has the following attributes:
>>>>>>>>
>>>>>>>> - String category
>>>>>>>> - List<String> text
>>>>>>>>
>>>>>>>> I would suggest adding an attribute to hold metadata, or additional
>>>>>>>> context information. What do you think?
>>>>>>>
>>>>>>> Right now the training format contains these two fields per line.
>>>>>>> Do you want to change the format as well?
>>>>>>>
>>>>>>>> Also, what do you think of including sentences and paragraph
>>>>>>>> information? I don't know if there is anything a feature generator
>>>>>>>> can extract from it to improve the classification.
>>>>>>>
>>>>>>> I guess we only want to do that if there is a use case for it. It will
>>>>>>> make the processing for the clients more complex, since they then
>>>>>>> would have to provide sentences and paragraphs compared to just
>>>>>>> a piece of text.
>>>>>>>
>>>>>>> Jörn
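
For anyone reading this thread later, here is a rough sketch of the two convenience methods discussed above: the category-to-probability map Mark posted, and a single-call "best category" lookup of the kind William says is missing. It is only an illustration under assumptions. SimpleCategorizer and CategorizerHelper are hypothetical names invented for this example, not OpenNLP types; the three interface methods just mirror the calls used in Mark's snippet (categorize, getNumberOfCategories, getCategory).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the part of the categorizer API used in the
// thread; this is NOT the real OpenNLP DocumentCategorizer interface.
interface SimpleCategorizer {
    double[] categorize(String text);   // one probability per category index
    int getNumberOfCategories();
    String getCategory(int index);
}

// Hypothetical helper class, named for this example only.
final class CategorizerHelper {
    private CategorizerHelper() {}

    // Returns category -> probability, iterating in model index order.
    static Map<String, Double> categoriesAsMap(SimpleCategorizer categorizer, String text) {
        double[] probs = categorizer.categorize(text);
        Map<String, Double> dist = new LinkedHashMap<>();
        for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
            dist.put(categorizer.getCategory(i), probs[i]);
        }
        return dist;
    }

    // Returns the highest-probability category in a single call.
    static String bestCategory(SimpleCategorizer categorizer, String text) {
        double[] probs = categorizer.categorize(text);
        if (probs.length == 0) {
            throw new IllegalStateException("categorizer returned no outcomes");
        }
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categorizer.getCategory(best);
    }
}
```

The map here is filled by index directly rather than going through getIndex(category) as in Mark's version; assuming getIndex is the inverse of getCategory, the result should be the same. A LinkedHashMap is used only so that iteration follows the model's category order.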