If I were you, I'd keep HTML digestion separate from sentence bounding.
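
To make that concrete, here is a minimal sketch of the two-stage idea in Java. It is only an illustration under assumptions: the class and method names are made up for this mail, the regex tag stripping is deliberately naive, and java.text.BreakIterator stands in for whatever trained sentence detector you end up using. The point is only that the second stage never sees markup.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any library.
final class HtmlThenSentences {

    // Stage 1: HTML digestion. Naive regex stripping, for illustration only;
    // a real HTML parser copes with scripts, comments and entities properly.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ")  // replace each tag with a space
                   .replaceAll("\\s+", " ")     // collapse runs of whitespace
                   .trim();
    }

    // Stage 2: sentence bounding on plain text. BreakIterator is a stand-in;
    // a trained sentence detector would slot into the same place.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }
}
```

You would call HtmlThenSentences.splitSentences(HtmlThenSentences.stripTags(html)) and hand each resulting sentence to the tokenizer. Keeping the stages apart means either one can be swapped out without retraining the other.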

On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <[email protected]> wrote:
> Hi,
>
> Am I right in saying that I will also need to create and train my own HTML
> sentence detector in order to parse the HTML into chunks that can be
> tokenised?
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
>
>
> On 17 December 2010 15:10, Jörn Kottmann <[email protected]> wrote:
>
>> On 12/17/10 2:19 PM, James Kosin wrote:
>>
>>> I have the following questions that I would appreciate an answer for:
>>> >
>>> >  1. Can I have the different name finding tags in the same data?
>>>
>>
>> Yes, but that means you train one model which detects each of these
>> name types. You should test both approaches: multiple name types in one
>> model, and a separate model for each name type. You can use the built-in
>> evaluation to validate your results.
>>
>>> >  2. Does the <START:address> <END> make sense over multiple lines or
>>> >  should I break this up further?
>>>
>> No, that is not possible. Names spanning multiple sentences (a line is a
>> sentence) are not supported.
>>
>>
>>> >  3. I want to use 200 or 300 different examples, do I need to create
>>> >  separate files for each example or can I merge them all into 1 and if it
>>> >  is only 1, do I need to mark up the start and end of a file?
>>>
>> If you want to use the command line training tool, they must all be in one
>> file. If you use the API, it's up to you to merge these different sources
>> into one name sample stream.
>>
>> Jörn
>>
>
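
On the quoted points about one sentence per line and merging everything into a single file, a rough Java sketch of a pre-check might look like the following. It assumes the <START:type> ... <END> markup shown in the thread; the class name, method names and the "person" type in the usage note are hypothetical, and the code does not call the training API at all. It only concatenates sources and rejects any line where a name span is left open, since a span that crosses a line would cross a sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class NameSampleMerger {

    // Matches an opening tag such as <START:address> (or bare <START>) or the closing <END>.
    private static final Pattern TAG = Pattern.compile("<START(?::[^>\\s]+)?>|<END>");

    // True if every opening tag on the line is closed by an <END> on the same line.
    static boolean isBalanced(String line) {
        boolean open = false;
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            boolean isStart = m.group().startsWith("<START");
            if (isStart == open) {
                return false; // nested START, or END without a preceding START
            }
            open = isStart;
        }
        return !open; // a span still open at end of line would cross a sentence
    }

    // Concatenates several sources (each a list of lines, one sentence per line)
    // into one list, failing fast on any line with an unclosed or stray tag.
    static List<String> merge(List<List<String>> sources) {
        List<String> merged = new ArrayList<>();
        for (List<String> source : sources) {
            for (String line : source) {
                if (!isBalanced(line)) {
                    throw new IllegalArgumentException("Unbalanced name span: " + line);
                }
                merged.add(line);
            }
        }
        return merged;
    }
}
```

For example, isBalanced("<START:person> Paul Cowan <END> wrote to the list .") is true, while a line that opens <START:address> and never closes it is rejected. The merged list can then be written out as the single training file, or wrapped in whatever stream you pass to the API.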
