Re: DocumentSample in Doccat

2014-04-24 Thread William Colen
What do you think of adding the following field to the DocumentSample?

Map<String, Object> extraInformation


Also, we could add the following methods to the DocumentCategorizer
interface:

public double[] categorize(String text[],
    Map<String, Object> extraInformation);
public double[] categorize(String documentText,
    Map<String, Object> extraInformation);

Any opinion?

Thank you,
William
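
A minimal sketch of what the proposal above could look like, not the actual OpenNLP classes. The names ExtraInfoSample and ExtraInfoCategorizer are hypothetical, and the default method assumes Java 8 or later. It shows a sample that carries the category, the tokens, and an optional map, plus a text-only call that delegates with an empty map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the OpenNLP class: a sample that carries the
// category, the tokens, and an optional map of extra information.
final class ExtraInfoSample {
  private final String category;
  private final List<String> text;
  private final Map<String, Object> extraInformation;

  ExtraInfoSample(String category, String[] text,
      Map<String, Object> extraInformation) {
    this.category = category;
    this.text = Collections.unmodifiableList(Arrays.asList(text.clone()));
    // Treat a missing map as "no extra information" so callers never see null,
    // and copy it so later changes by the caller do not leak into the sample.
    this.extraInformation = extraInformation == null
        ? Collections.<String, Object>emptyMap()
        : Collections.unmodifiableMap(
            new HashMap<String, Object>(extraInformation));
  }

  String getCategory() { return category; }
  List<String> getText() { return text; }
  Map<String, Object> getExtraInformation() { return extraInformation; }
}

// Hypothetical interface: the proposed overload next to a text-only call.
interface ExtraInfoCategorizer {
  double[] categorize(String[] text, Map<String, Object> extraInformation);

  // The text-only call delegates with an empty map, so callers that have no
  // extra information do not need to change.
  default double[] categorize(String[] text) {
    return categorize(text, Collections.<String, Object>emptyMap());
  }
}
```

Copying the map in the constructor keeps the sample immutable, which seems to match how the other sample classes are meant to be used, but that is a design assumption rather than something decided in this thread.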




Re: DocumentSample in Doccat

2014-04-24 Thread William Colen
Yes, it looks nice. Maybe we should redo the whole DocumentCategorizer
interface. It differs from the other tools: for example, we can't get the
best category for a document with a single call; we need to use two
methods.
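
A rough sketch of the single-call convenience mentioned above. ScoringCategorizer is a hypothetical stand-in that exposes only categorize(String) and getCategory(int), two method names that appear elsewhere in this thread. BestCategory is a made-up helper. The sketch assumes the score array is indexed the same way as getCategory.

```java
// Hypothetical stand-in for the two calls a caller currently chains together.
interface ScoringCategorizer {
  double[] categorize(String documentText);
  String getCategory(int index);
}

final class BestCategory {
  private BestCategory() {}

  // One-call convenience: score the text, then look up the top category.
  // Assumes scores[i] belongs to getCategory(i).
  static String of(ScoringCategorizer categorizer, String documentText) {
    double[] scores = categorizer.categorize(documentText);
    return categorizer.getCategory(argMax(scores));
  }

  // Index of the largest score; ties resolve to the first maximum.
  static int argMax(double[] scores) {
    if (scores.length == 0) {
      throw new IllegalArgumentException("no scores");
    }
    int best = 0;
    for (int i = 1; i < scores.length; i++) {
      if (scores[i] > scores[best]) {
        best = i;
      }
    }
    return best;
  }
}
```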



2014-04-24 18:43 GMT-03:00 Mark G <ma...@apache.org>:

> William, that map looks good to me.
> In my current project I find this method convenient for getting back the
> probs over the categories in the model as a Map. Let me know if there's
> anything wrong with it :)
>
> public Map<String, Double> categoriesAsMap(String text) {
>   Map<String, Double> probDist = new HashMap<String, Double>();
>
>   double[] categorize = categorize(text);
>   int catSize = getNumberOfCategories();
>   for (int i = 0; i < catSize; i++) {
>     String category = getCategory(i);
>     probDist.put(category, categorize[getIndex(category)]);
>   }
>   return probDist;
> }
>
> Perhaps we should consider adding this method to abstract some
> details... just a thought.





Re: DocumentSample in Doccat

2014-04-24 Thread Mark G
William, here is another thought: we could include something like this to
return a map sorted in descending order, with the best score on top, so you
can call categoriesAsSortedMap(text).firstEntry() to get the best score
(which can be the same for more than one category, hence the Set as the
value).

  public NavigableMap<Double, Set<String>> categoriesAsSortedMap(String text) {
    // descendingMap() is a view of the TreeMap, so entries added through it
    // are iterated from the highest score to the lowest.
    NavigableMap<Double, Set<String>> descendingMap =
        new TreeMap<Double, Set<String>>().descendingMap();
    double[] categorize = categorize(text);
    int catSize = getNumberOfCategories();
    for (int i = 0; i < catSize; i++) {
      String category = getCategory(i);
      double score = categorize[getIndex(category)];
      if (descendingMap.containsKey(score)) {
        descendingMap.get(score).add(category);
      } else {
        Set<String> newset = new HashSet<String>();
        newset.add(category);
        descendingMap.put(score, newset);
      }
    }
    return descendingMap;
  }


On Thu, Apr 24, 2014 at 7:04 PM, Tech mail <giaconiam...@gmail.com> wrote:

> I think it might also be true that the FeatureGenerator interface in
> doccat is different from the others. Also, I don't think the
> TokenNameFinder interface has a probs() method, which has always made me
> use the ME implementation directly.
>
> Sent from my iPhone


Re: DocumentSample in Doccat

2014-04-17 Thread Jörn Kottmann

On 04/15/2014 07:45 PM, William Colen wrote:

> Hello,
>
> I've been working with the Doccat module and I am wondering if we could
> improve its data structure for the 1.6.0 release.
>
> Today the DocumentSample has the following attributes:
>
> - String category
> - List<String> text
>
> I would suggest adding an attribute to hold metadata, or additional
> context information. What do you think?

Right now the training format contains these two fields per line.
Do you want to change the format as well?

> Also, what do you think of including sentence and paragraph information? I
> don't know if there is anything a feature generator can extract from it to
> improve the classification.

I guess we only want to do that if there is a use case for it. It will
make the processing more complex for the clients, since they would then
have to provide sentences and paragraphs instead of just a piece of text.

Jörn


Re: DocumentSample in Doccat

2014-04-17 Thread Tech mail
I agree, this goes back to the concept of having a document model...
In the production systems where I've used doccat, storing sentences and
paragraphs wouldn't make sense; people usually have their own domain model
for that. I still feel that augmenting the DocumentSample object with a
generic Map would be helpful in some cases without being constraining.

Sent from my iPhone



Re: DocumentSample in Doccat

2014-04-17 Thread William Colen
Yes, I don't see how to represent the sentences and paragraphs.

+1 for the generic Map as suggested by Mark. We already have such things in
other sample classes, like NameSample and POSSample.

A use case: the 20news corpus is a collection of articles, and each article
contains fields like From, Subject, and Organization. Mahout, which
includes a formatter for this corpus, concatenates them all into the text
field, but I think we could improve accuracy by handling this metadata in a
separate feature generator.
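
One way the 20news use case above might look in code, as a sketch under stated assumptions. The interface MetadataAwareFeatureGenerator and the class HeaderFieldFeatureGenerator are hypothetical; the existing doccat feature generator interface takes only the text. The idea shown is that header fields produce their own prefixed features instead of being merged into the body tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: a feature generator that also sees the extra
// information attached to a sample.
interface MetadataAwareFeatureGenerator {
  Collection<String> extractFeatures(String[] text,
      Map<String, Object> extraInformation);
}

// Emits one "field=token" feature per token of the selected header fields,
// so header words stay separate from body words.
final class HeaderFieldFeatureGenerator implements MetadataAwareFeatureGenerator {
  private final List<String> fields;

  HeaderFieldFeatureGenerator(String... fields) {
    this.fields = Arrays.asList(fields.clone());
  }

  @Override
  public Collection<String> extractFeatures(String[] text,
      Map<String, Object> extraInformation) {
    List<String> features = new ArrayList<String>();
    for (String field : fields) {
      Object value = extraInformation.get(field);
      if (value == null) {
        continue; // field absent for this article
      }
      for (String token : value.toString().trim().split("\\s+")) {
        if (!token.isEmpty()) {
          features.add(field.toLowerCase(Locale.ROOT) + "="
              + token.toLowerCase(Locale.ROOT));
        }
      }
    }
    return features;
  }
}
```

Whether such features actually improve accuracy on 20news would have to be measured; the sketch only shows where the metadata could be handled.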





Re: DocumentSample in Doccat

2014-04-17 Thread Mark G
Another general doccat thought I had is this: in my projects that use
Doccat, I created a class called SampleCollection, which simply wraps a
List<DocumentSample> but also provides a method that returns the samples
as a DoccatModel (using a properly formatted ByteArrayInputStream of the
doccat training format of all the samples). This worked out well because I
stored all the samples in a database, and users could CRUD samples for
different categories. A MapReduce job read the samples from the database
into the SampleCollection at job startup, dynamically generated the model,
and then used the model to classify all the texts across the cluster, so
every MR job ran the latest and greatest model based on the current samples.

Not sure if we're interested in something like that, but I see several
questions on Stack Overflow asking about iterative model building, and a
SampleCollection that returns a model has worked for me. I also created a
SampleCRUD interface that abstracts storage and retrieval of the samples;
I had Postgres and Accumulo implementations for sample storage.

Just a thought. I know this can get very specific and complicated, but I
thought we might be able to find a middle ground by providing a framework
and some generic implementations.
MG
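
A minimal sketch of the SampleCollection idea described above, not Mark's actual code. SampleStore and InMemorySampleStore are hypothetical stand-ins for the SampleCRUD interface and its database-backed implementations. The line layout (category, a space, then the text) is an assumption about the training format, and the training call that would consume the stream is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical storage abstraction in the spirit of the SampleCRUD idea.
interface SampleStore {
  void save(String category, String text);
  List<String[]> findAll(); // each entry is {category, text}
}

// In-memory stand-in for a database-backed store.
final class InMemorySampleStore implements SampleStore {
  private final List<String[]> rows = new ArrayList<String[]>();

  public void save(String category, String text) {
    rows.add(new String[] {category, text});
  }

  public List<String[]> findAll() {
    return new ArrayList<String[]>(rows);
  }
}

final class SampleCollection {
  private final SampleStore store;

  SampleCollection(SampleStore store) {
    this.store = store;
  }

  // Renders every stored sample as one "category text" line.
  String toTrainingText() {
    StringBuilder sb = new StringBuilder();
    for (String[] row : store.findAll()) {
      // Collapse whitespace runs so each sample stays on a single line.
      String text = row[1].replaceAll("\\s+", " ").trim();
      sb.append(row[0]).append(' ').append(text).append('\n');
    }
    return sb.toString();
  }

  // Stream that a trainer could read at job startup to build a fresh model.
  InputStream toTrainingStream() {
    return new ByteArrayInputStream(
        toTrainingText().getBytes(StandardCharsets.UTF_8));
  }
}
```

Rebuilding the stream from the store on every call means each job trains on whatever samples exist at that moment, which is the behavior described above.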

