Re: End of line whitespaces in Eclipse
I think we should do it.

2014-04-22 8:50 GMT-03:00 Jörn Kottmann kottm...@gmail.com:

We should maybe remove all these whitespaces at the end of lines once, and maybe repeat that process for every release. Nowadays there are tools that can diff files while ignoring whitespace-only changes.

Any opinions?

Jörn

On Thu, 2014-04-10 at 19:58 -0300, William Colen wrote:

When I save a .java file in Eclipse, it removes the end-of-line whitespaces. I am using http://opennlp.apache.org/code-formatter/OpenNLP-Eclipse-Formatter.xml

This causes lots of changes in files where I actually needed to change only one line. Does anybody know how I can avoid it?

Thank you,
William
Re: DocumentSample in Doccat
What do you think of adding the following field to the DocumentSample?

    Map<String, Object> extraInformation

Also, we could add the following methods to the DocumentCategorizer interface:

    public double[] categorize(String text[], Map<String, Object> extraInformation);
    public double[] categorize(String documentText, Map<String, Object> extraInformation);

Any opinion?

Thank you,
William

2014-04-17 10:39 GMT-03:00 Mark G giaconiam...@gmail.com:

Another general doccat thought I had is this. In my projects that use Doccat, I created a class called SampleCollection, which simply wrapped a List<DocumentSample> but then provided a method that returned the samples as a DoccatModel (using a properly formatted ByteArrayInputStream of the doccat training format of all the samples). This worked out well because I stored all the samples in a database, and users could CRUD samples for different categories. There was a map reduce job that at job startup read in the samples from the database into the SampleCollection, dynamically generated the model, and then used the model to classify all the texts across the cluster, so every MR job ran the latest and greatest model based on current samples.

Not sure if we're interested in something like that, but I see several questions on Stack Overflow asking about iterative model building, and a SampleCollection that returns a Model has worked for me. I also created a SampleCRUD interface that abstracts storage and retrieval of the samples; I had Postgres and Accumulo impls for sample storage.

Just a thought. I know this can get very specific and complicated, though we may be able to find a middle ground by providing a framework and some generic impls.

MG

On Thu, Apr 17, 2014 at 8:28 AM, William Colen william.co...@gmail.com wrote:

Yes, I don't see how to represent the sentences and paragraphs.

+1 for the generic Map as suggested by Mark. We already have such things in other sample classes, like NameSample and POSSample.

A use case: the 20news corpus is a collection of articles, and each article contains fields like From, Subject, Organization. Mahout, which includes a formatter for this corpus, concatenates them all into the text field, but I think we could improve accuracy by handling this metadata in a separate feature generator.

2014-04-17 8:37 GMT-03:00 Tech mail giaconiam...@gmail.com:

I agree, this goes back to the concept of having a document model... I know in the prod systems where I've used doccat, storing sentences and paragraphs wouldn't make sense; people usually have their own domain model for that. I still feel that if we augment the DocumentSample object with a generic Map, it would be helpful in some cases and not constraining.

On Apr 17, 2014, at 6:35 AM, Jörn Kottmann kottm...@gmail.com wrote:

On 04/15/2014 07:45 PM, William Colen wrote:

Hello,

I've been working with the Doccat module and I am wondering if we could improve its data structure for the 1.6.0 release.

Today the DocumentSample has the following attributes:

- String category
- List<String> text

I would suggest adding an attribute to hold metadata, or additional context information. What do you think?

Right now the training format contains these two fields per line. Do you want to change the format as well?

Also, what do you think of including sentence and paragraph information? I don't know if there is anything a feature generator can extract from it to improve the classification.

I guess we only want to do that if there is a use case for it. It will make the processing more complex for the clients, since they would then have to provide sentences and paragraphs instead of just a piece of text.

Jörn
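The proposal in this thread (a generic metadata Map on the sample, consumed by a separate feature generator) could be sketched roughly as below. This is only an illustration under assumptions: `SketchDocumentSample`, `MetadataFeatureGenerator`, and the `meta:key=value` feature naming are hypothetical and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the proposed DocumentSample; not the OpenNLP class.
final class SketchDocumentSample {
    private final String category;
    private final List<String> text;
    private final Map<String, Object> extraInformation;

    SketchDocumentSample(String category, List<String> text,
                         Map<String, Object> extraInformation) {
        this.category = category;
        this.text = Collections.unmodifiableList(new ArrayList<String>(text));
        // Treat a missing map as "no metadata" so callers never see null.
        Map<String, Object> copy = new HashMap<String, Object>();
        if (extraInformation != null) {
            copy.putAll(extraInformation);
        }
        this.extraInformation = Collections.unmodifiableMap(copy);
    }

    String getCategory() { return category; }
    List<String> getText() { return text; }
    Map<String, Object> getExtraInformation() { return extraInformation; }
}

// Hypothetical feature generator that only looks at the metadata map.
final class MetadataFeatureGenerator {
    static List<String> extractFeatures(Map<String, Object> extraInformation) {
        List<String> features = new ArrayList<String>();
        // Iterate in key order so the feature list is deterministic.
        for (Map.Entry<String, Object> e
                : new TreeMap<String, Object>(extraInformation).entrySet()) {
            features.add("meta:" + e.getKey() + "=" + e.getValue());
        }
        return features;
    }
}
```

For the 20news use case, fields such as From and Subject would go into the map instead of being concatenated into the text, so existing clients that only pass text are not affected.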
Re: DocumentSample in Doccat
Yes, it looks nice. Maybe we should redo the whole DocumentCategorizer interface. It is different from the other tools; for example, we can't get the best category of a document with only one call, we need to use two methods.

2014-04-24 18:43 GMT-03:00 Mark G ma...@apache.org:

William, that map looks good to me. In my current project I find this method convenient for getting back the probs over the categories in the model as a Map. Let me know if there's anything wrong with it :)

    public Map<String, Double> categoriesAsMap(String text) {
      Map<String, Double> probDist = new HashMap<String, Double>();
      double[] categorize = categorize(text);
      int catSize = getNumberOfCategories();
      for (int i = 0; i < catSize; i++) {
        String category = getCategory(i);
        probDist.put(category, categorize[getIndex(category)]);
      }
      return probDist;
    }

Perhaps we should consider adding this method to abstract some details. Just a thought.
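The "best category in one call" idea raised in this message could look something like the sketch below. It deliberately works on plain arrays rather than on a model, and `CategoryScores`, `asMap`, and `bestCategory` are hypothetical names, not the DocumentCategorizer API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes scores[i] is the score of categories[i].
final class CategoryScores {

    // Pairs each category with its score, keeping the input order.
    static Map<String, Double> asMap(String[] categories, double[] scores) {
        requireSameLength(categories, scores);
        Map<String, Double> probDist = new LinkedHashMap<String, Double>();
        for (int i = 0; i < categories.length; i++) {
            probDist.put(categories[i], scores[i]);
        }
        return probDist;
    }

    // Returns the highest-scoring category; the first one wins on ties.
    static String bestCategory(String[] categories, double[] scores) {
        requireSameLength(categories, scores);
        if (categories.length == 0) {
            throw new IllegalArgumentException("no categories");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    private static void requireSameLength(String[] categories, double[] scores) {
        if (categories.length != scores.length) {
            throw new IllegalArgumentException("categories and scores differ in length");
        }
    }
}
```

A categorizer could expose a method like `bestCategory` directly, so that callers do not need to fetch the score array first and then look up the winning category.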
Re: DocumentSample in Doccat
William, here is another thought: we could include something like this to return a map sorted descending with the best score on top, so you can call categoriesAsSortedMap().firstEntry() to get the best score (which can be the same for more than one category, hence the Set as value).

    public NavigableMap<Double, Set<String>> categoriesAsSortedMap(String text) {
      NavigableMap<Double, Set<String>> descendingMap =
          new TreeMap<Double, Set<String>>().descendingMap();
      double[] categorize = categorize(text);
      int catSize = getNumberOfCategories();
      for (int i = 0; i < catSize; i++) {
        String category = getCategory(i);
        double score = categorize[getIndex(category)];
        if (descendingMap.containsKey(score)) {
          descendingMap.get(score).add(category);
        } else {
          Set<String> newset = new HashSet<String>();
          newset.add(category);
          descendingMap.put(score, newset);
        }
      }
      return descendingMap;
    }

On Thu, Apr 24, 2014 at 7:04 PM, Tech mail giaconiam...@gmail.com wrote:

I think it might also be true that the FeatureGenerator interface in doccat is different from the others. Also, I don't think the TokenNameFinder interface has a probs() method, which has always made me use the ME impl directly.
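To show how tied scores behave in a descending score map like the one proposed in this message, here is a small standalone sketch. It works on plain arrays instead of a model, and `SortedScores` and `asSortedMap` are hypothetical names; it uses a reverse-order comparator rather than `descendingMap()`, which yields the same iteration order.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper; assumes scores[i] is the score of categories[i].
final class SortedScores {

    // Groups categories by score, with the highest score as the first entry.
    static NavigableMap<Double, Set<String>> asSortedMap(String[] categories,
                                                         double[] scores) {
        if (categories.length != scores.length) {
            throw new IllegalArgumentException("categories and scores differ in length");
        }
        NavigableMap<Double, Set<String>> byScore =
            new TreeMap<Double, Set<String>>(Collections.<Double>reverseOrder());
        for (int i = 0; i < categories.length; i++) {
            Set<String> tied = byScore.get(scores[i]);
            if (tied == null) {
                // First category seen with this score.
                tied = new TreeSet<String>();
                byScore.put(scores[i], tied);
            }
            tied.add(categories[i]);
        }
        return byScore;
    }
}
```

With two categories sharing the top score, `firstEntry()` returns a single entry whose value holds both of them, so the caller has to decide how to break the tie.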