I agree; this goes back to the concept of having a "document" model. In the production systems where I've used doccat, storing sentences and paragraphs wouldn't have made sense, since people usually have their own domain model for that. I still think that augmenting the DocumentSample object with a generic Map would be helpful in some cases without being constraining.
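To make the Map idea concrete, here is a rough sketch of what I have in mind. It is not the existing DocumentSample class: the class name DocumentSampleSketch, the field name extraInformation, and the Map<String, Object> value type are all assumptions for illustration. The only parts taken from the thread are the two current attributes, String category and List<String> text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not the actual OpenNLP DocumentSample.
class DocumentSampleSketch {

    private final String category;                      // existing attribute
    private final List<String> text;                    // existing attribute
    private final Map<String, Object> extraInformation; // proposed, name assumed

    // Keeps the two-argument form so callers that have no metadata are unaffected.
    DocumentSampleSketch(String category, List<String> text) {
        this(category, text, null);
    }

    DocumentSampleSketch(String category, List<String> text,
                         Map<String, Object> extraInformation) {
        if (category == null || text == null) {
            throw new IllegalArgumentException("category and text must not be null");
        }
        this.category = category;
        // Defensive copies so the sample stays immutable after construction.
        this.text = Collections.unmodifiableList(new ArrayList<String>(text));
        this.extraInformation = (extraInformation == null)
                ? Collections.<String, Object>emptyMap()
                : Collections.unmodifiableMap(new HashMap<String, Object>(extraInformation));
    }

    String getCategory() {
        return category;
    }

    List<String> getText() {
        return text;
    }

    // Never null; empty when the caller supplied no metadata.
    Map<String, Object> getExtraInformation() {
        return extraInformation;
    }

    public static void main(String[] args) {
        Map<String, Object> meta = new HashMap<String, Object>();
        meta.put("source", "newswire"); // example key, purely illustrative

        DocumentSampleSketch sample = new DocumentSampleSketch(
                "sports", Arrays.asList("the", "match", "ended"), meta);

        System.out.println(sample.getCategory() + " " + sample.getExtraInformation());
    }
}
```

Because the map is optional and defaults to empty, clients that only pass a category and text would not need to change, and a feature generator could look up whatever keys a given application chooses to provide.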
Sent from my iPhone

> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
>
>> On 04/15/2014 07:45 PM, William Colen wrote:
>> Hello,
>>
>> I've been working with the Doccat module and I am wondering if we could
>> improve its data structure for the 1.6.0 release.
>>
>> Today the DocumentSample has the following attributes:
>>
>> - String category
>> - List<String> text
>>
>> I would suggest adding an attribute to hold metadata, or additional
>> contexts information. What do you think?
>
> Right now the training format contains these two fields per line.
> Do you want to change the format as well?
>
>> Also, what do you think of including sentences and paragraph information? I
>> don't know if there is anything a feature generator can extract from it to
>> improve the classification.
>
> I guess we only want to do that if there is a use case for it. It will make
> the processing for the clients more complex, since they then would have to
> provide sentences and paragraphs compared to just a piece of text.
>
> Jörn