I agree, this goes back to the concept of having a "document" model...
In the production systems where I've used Doccat, storing sentences and
paragraphs wouldn't make sense, since people usually have their own domain
model for that. I still feel that augmenting the DocumentSample object with a
generic Map would be helpful in some cases without being constraining.
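
To make the idea concrete, here is a rough sketch of a sample that carries the two existing attributes plus an optional generic map. The class name DocumentSampleSketch, the extraInformation field, and the accessor names are all hypothetical and only for illustration; this is not the current OpenNLP class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for discussion, NOT the actual OpenNLP DocumentSample.
final class DocumentSampleSketch {

    private final String category;
    private final List<String> text;
    // Assumed addition: free-form metadata that clients may fill in or ignore.
    private final Map<String, Object> extraInformation;

    // Two-argument constructor keeps the existing (category, text) usage working.
    DocumentSampleSketch(String category, List<String> text) {
        this(category, text, null);
    }

    DocumentSampleSketch(String category, List<String> text,
                         Map<String, Object> extraInformation) {
        if (category == null || text == null) {
            throw new IllegalArgumentException("category and text must not be null");
        }
        this.category = category;
        // Defensive copies so later changes by the caller do not affect the sample.
        this.text = Collections.unmodifiableList(new ArrayList<String>(text));
        if (extraInformation == null) {
            this.extraInformation = Collections.<String, Object>emptyMap();
        } else {
            this.extraInformation = Collections.unmodifiableMap(
                    new HashMap<String, Object>(extraInformation));
        }
    }

    String getCategory() {
        return category;
    }

    List<String> getText() {
        return text;
    }

    // Never null; empty when the client supplied no metadata.
    Map<String, Object> getExtraInformation() {
        return extraInformation;
    }
}
```

A client that has nothing extra to say would keep calling the two-argument constructor, while one with its own domain model could pass something like a "source" or "paragraphCount" entry and read it back in a custom feature generator.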


> On Apr 17, 2014, at 6:35 AM, Jörn Kottmann <kottm...@gmail.com> wrote:
> 
>> On 04/15/2014 07:45 PM, William Colen wrote:
>> Hello,
>> 
>> I've been working with the Doccat module and I am wondering if we could
>> improve its data structure for the 1.6.0 release.
>> 
>> Today the DocumentSample has the following attributes:
>> 
>> - String category
>> - List<String> text
>> 
>> I would suggest adding an attribute to hold metadata, or additional
>> context information. What do you think?
> 
> Right now the training format contains these two fields per line.
> Do you want to change the format as well?
> 
>> Also, what do you think of including sentences and paragraph information? I
>> don't know if there is anything a feature generator can extract from it to
>> improve the classification.
> 
> I guess we only want to do that if there is a use case for it. It will make
> the processing for the clients more complex, since they then would have to
> provide sentences and paragraphs compared to just a piece of text.
> 
> Jörn
