Re: DocumentSample in Doccat

Mark G Thu, 24 Apr 2014 14:45:12 -0700

William, that map looks good to me.
In my current project I find this method convenient for getting the
probabilities over the model's categories back as a Map. Let me know if
there's anything wrong with it :)


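// Needs java.util.HashMap and java.util.Map.
// Returns the probability of every category in the model, keyed by category name.
public Map<String, Double> categoriesAsMap(String text) {
  Map<String, Double> probDist = new HashMap<String, Double>();

  // One probability per category, indexed by the model's category index.
  double[] probs = categorize(text);
  int catSize = getNumberOfCategories();
  for (int i = 0; i < catSize; i++) {
    String category = getCategory(i);
    probDist.put(category, probs[getIndex(category)]);
  }
  return probDist;
}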

Perhaps we should consider adding this method to abstract away some of
the details. Just a thought.
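
For anyone who wants to try the idea outside the categorizer class, here is a minimal standalone sketch. CategorizerView and the fixedScores toy are hypothetical stand-ins introduced only for illustration; they mirror the method names used in the snippet above and are not OpenNLP types. The sketch assumes the array returned by categorize(text) is ordered the same way as getCategory(i), so it indexes with i directly instead of going through getIndex.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in exposing only the calls the helper needs.
interface CategorizerView {
    double[] categorize(String text);
    int getNumberOfCategories();
    String getCategory(int index);
}

final class CategoryMaps {
    private CategoryMaps() {}

    // Pairs each category name with its score, keeping the model's category order.
    // Assumes probs[i] belongs to getCategory(i).
    static Map<String, Double> categoriesAsMap(CategorizerView categorizer, String text) {
        double[] probs = categorizer.categorize(text);
        Map<String, Double> probDist = new LinkedHashMap<>();
        for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
            probDist.put(categorizer.getCategory(i), probs[i]);
        }
        return probDist;
    }

    // Toy categorizer that ignores the text and returns fixed scores (demo only).
    static CategorizerView fixedScores(final String[] categories, final double[] scores) {
        return new CategorizerView() {
            public double[] categorize(String text) { return scores.clone(); }
            public int getNumberOfCategories() { return categories.length; }
            public String getCategory(int index) { return categories[index]; }
        };
    }
}
```

A LinkedHashMap is used here so that iteration follows the model's category order; a plain HashMap works just as well when order does not matter.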





On Thu, Apr 24, 2014 at 3:56 PM, William Colen <william.co...@gmail.com> wrote:

> What do you think of adding the following field to the DocumentSample?
>
> Map<String, Object> extraInformation
>
>
> Also, we could add the following methods to the DocumentCategorizer
> interface:
>
> public double[] categorize(String[] text, Map<String, Object> extraInformation);
> public double[] categorize(String documentText, Map<String, Object> extraInformation);
>
> Any opinion?
>
> Thank you,
> William
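
To make the proposal above concrete, here is a rough sketch of a sample class carrying the extra map. DocumentSampleSketch is a hypothetical name and this is not the actual OpenNLP DocumentSample; the only parts taken from the thread are the two existing attributes (category and text) and the proposed Map<String, Object> extraInformation field. Making the map optional and defensively copied is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the proposal, not the real OpenNLP class.
final class DocumentSampleSketch {
    private final String category;
    private final List<String> text;
    private final Map<String, Object> extraInformation;

    // Existing two-attribute form: no extra information.
    DocumentSampleSketch(String category, String[] text) {
        this(category, text, null);
    }

    // Proposed form: category, tokens, and free-form metadata.
    DocumentSampleSketch(String category, String[] text,
                         Map<String, Object> extraInformation) {
        this.category = category;
        this.text = Collections.unmodifiableList(Arrays.asList(text.clone()));
        if (extraInformation == null) {
            this.extraInformation = Collections.emptyMap();
        } else {
            // Copy so later changes to the caller's map do not leak in.
            this.extraInformation =
                Collections.unmodifiableMap(new HashMap<String, Object>(extraInformation));
        }
    }

    String getCategory() { return category; }
    List<String> getText() { return text; }
    Map<String, Object> getExtraInformation() { return extraInformation; }
}
```

Keeping the two-argument constructor means existing callers would not need to change, which seems to be the spirit of "helpful in some cases and not constraining".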
>
>
> 2014-04-17 10:39 GMT-03:00 Mark G <giaconiam...@gmail.com>:
>
> > Another general doccat thought I had is this: in my projects that use
> > Doccat, I created a class called SampleCollection, which simply wrapped
> > a List<DocumentSample> but also provided a method that returned the
> > samples as a DoccatModel (using a properly formatted
> > ByteArrayInputStream of all the samples in the doccat training format).
> > This worked out well because I stored all the samples in a database,
> > and users could CRUD samples for different categories. A MapReduce job
> > read the samples from the database into the SampleCollection at job
> > startup, dynamically generated the model, and then used the model to
> > classify all the texts across the cluster, so every MR job ran the
> > latest and greatest model based on the current samples. Not sure if
> > we're interested in something like that, but I see several questions
> > on Stack Overflow asking about iterative model building, and a
> > SampleCollection that returns a model has worked for me. I also
> > created a SampleCRUD interface that abstracts storage and retrieval of
> > the samples; I had Postgres and Accumulo impls for sample storage.
> > Just a thought. I know this can get very specific and complicated, but
> > I thought we might be able to find a middle ground by providing a
> > framework and some generic impls.
> > MG
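
The first half of that idea (collect samples, then expose them as a training-format stream) can be sketched roughly as follows. SampleCollectionSketch and its methods are hypothetical names, not Mark's actual class. The sketch assumes the doccat training layout of one document per line, with the category first and the whitespace-separated tokens after it. The training call that would turn the stream into a DoccatModel is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a sample collection that renders itself as training data.
final class SampleCollectionSketch {

    // One training document: a category plus its tokens.
    private static final class Sample {
        final String category;
        final String[] tokens;

        Sample(String category, String[] tokens) {
            this.category = category;
            this.tokens = tokens.clone();
        }
    }

    private final List<Sample> samples = new ArrayList<>();

    void add(String category, String... tokens) {
        samples.add(new Sample(category, tokens));
    }

    // Renders every sample as "category token token ..." on its own line.
    String toTrainingText() {
        StringBuilder sb = new StringBuilder();
        for (Sample s : samples) {
            sb.append(s.category).append(' ')
              .append(String.join(" ", s.tokens)).append('\n');
        }
        return sb.toString();
    }

    // In-memory stream over the rendered text, ready to hand to a trainer.
    ByteArrayInputStream toTrainingStream() {
        return new ByteArrayInputStream(toTrainingText().getBytes(StandardCharsets.UTF_8));
    }
}
```

A storage abstraction such as the SampleCRUD interface mentioned above would sit in front of add(), loading rows from wherever the samples live before the stream is built.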
> >
> >
> > On Thu, Apr 17, 2014 at 8:28 AM, William Colen <william.co...@gmail.com> wrote:
> >
> > > Yes, I don't see how to represent the sentences and paragraphs.
> > >
> > > +1 for the generic Map as suggested by Mark. We already have such things
> > > in other sample classes, like NameSample and POSSample.
> > >
> > > A use case: the 20news corpus is a collection of articles, and each
> > > article contains fields like "From", "Subject", and "Organization".
> > > Mahout, which includes a formatter for this corpus, concatenates them
> > > all into the text field, but I think we could improve accuracy by
> > > handling this metadata in a separate feature generator.
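
As an illustration of the 20news use case, a metadata-only feature generator might look something like the sketch below. MetadataFeatureSketch is a hypothetical standalone class and does not implement any OpenNLP interface; the "field=token" feature naming and the lowercasing are assumptions chosen for the example. The point is only that header fields stay distinguishable from body text instead of being concatenated into it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: turns metadata fields into features tagged with the field name.
final class MetadataFeatureSketch {
    private MetadataFeatureSketch() {}

    // Emits one "field=token" feature per whitespace-separated token of each value.
    static List<String> extractFeatures(Map<String, Object> extraInformation) {
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, Object> entry : extraInformation.entrySet()) {
            String field = entry.getKey().toLowerCase(Locale.ROOT);
            String value = String.valueOf(entry.getValue()).trim();
            for (String token : value.split("\\s+")) {
                if (!token.isEmpty()) {
                    features.add(field + "=" + token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return features;
    }
}
```

With this naming, the word "hockey" in a Subject header and the same word in the article body would produce different features, which is what lets a model weight them separately.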
> > >
> > >
> > > 2014-04-17 8:37 GMT-03:00 Tech mail <giaconiam...@gmail.com>:
> > >
> > > > I agree, this goes back to the concept of having a "document" model...
> > > > I know in the prod systems where I've used doccat, storing sentences
> > > > and paragraphs wouldn't make sense; people usually have their own
> > > > domain model for that. I still feel that augmenting the DocumentSample
> > > > object with a generic Map would be helpful in some cases and not
> > > > constraining.
> > > >
> > > > Sent from my iPhone
> > > >
> > > > > On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
> > > > >
> > > > >> On 04/15/2014 07:45 PM, William Colen wrote:
> > > > >> Hello,
> > > > >>
> > > > >> I've been working with the Doccat module and I am wondering if we
> > > > >> could improve its data structure for the 1.6.0 release.
> > > > >>
> > > > >> Today the DocumentSample has the following attributes:
> > > > >>
> > > > >> - String category
> > > > >> - List<String> text
> > > > >>
> > > > >> I would suggest adding an attribute to hold metadata, or additional
> > > > >> context information. What do you think?
> > > > >
> > > > > Right now the training format contains these two fields per line.
> > > > > Do you want to change the format as well?
> > > > >
> > > > >> Also, what do you think of including sentence and paragraph
> > > > >> information? I don't know if there is anything a feature generator
> > > > >> can extract from it to improve the classification.
> > > > >
> > > > > I guess we only want to do that if there is a use case for it. It
> > > > > will make the processing more complex for the clients, since they
> > > > > would then have to provide sentences and paragraphs rather than just
> > > > > a piece of text.
> > > > >
> > > > > Jörn
> > > >
> > >
> >
>
