If I were you, I'd keep HTML digestion separate from sentence boundary detection: strip the markup first, then run a sentence detector on the resulting plain text.
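
To make the separation concrete, here is a rough sketch in Python using only the standard library. It is just an illustration of the two-stage idea, not anything OpenNLP ships: the names TextExtractor, html_to_text, split_sentences and html_to_sentences are made up for this example, and the regex splitter is only a placeholder for a real (trained) sentence detector.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Stage 1 helper (hypothetical name): collect visible text only."""

    SKIP = {"script", "style"}          # contents are never visible text
    BLOCK = {"p", "div", "br", "li", "tr",
             "h1", "h2", "h3", "h4", "h5", "h6"}  # treated as hard breaks

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Stage 1: HTML digestion. Knows nothing about sentences."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def split_sentences(text):
    """Stage 2: sentence bounding. Knows nothing about HTML.

    Naive placeholder: splits after '.', '!' or '?' followed by whitespace.
    A trained sentence detector would be swapped in here.
    """
    sentences = []
    for block in text.split("\n"):
        block = " ".join(block.split())  # normalise whitespace
        if block:
            sentences.extend(re.split(r"(?<=[.!?])\s+", block))
    return sentences


def html_to_sentences(html):
    """Glue: one sentence per list item, ready to be tokenised."""
    return split_sentences(html_to_text(html))
```

Because the second stage only ever sees plain text, you can replace the placeholder splitter with a stock sentence model and never need to train one on HTML.
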

On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <[email protected]> wrote:
> Hi,
>
> Am I right in saying that, I will also need to create and train my own HTML
> sentence detector in order to parse the HTML into chunks that can be
> tokenised?
>
> Cheers
>
> Paul Cowan
>
> Cutting-Edge Solutions (Scotland)
>
> http://thesoftwaresimpleton.blogspot.com/
>
> On 17 December 2010 15:10, Jörn Kottmann <[email protected]> wrote:
>
>> On 12/17/10 2:19 PM, James Kosin wrote:
>>
>>> I have the following questions that I would appreciate an answer for:
>>>
>>> 1. Can I have the different name finding tags in the same data?
>>
>> Yes, but that means you train a model which can detect each of these
>> names. You should test both, multiple name types in one model,
>> and separate models for each name type. You can use the built
>> in evaluation to validate your results.
>>
>>> 2. Does the <START:address> <END> make sense over multiple lines or
>>> should I break this up further?
>>
>> No not possible, names spanning multiple sentences (a line is a sentence),
>> is not supported.
>>
>>> 3. I want to use 200 or 300 different examples, do I need to create
>>> separate files for each example or can I merge them all into 1 and if
>>> it is only 1, do I need to mark up the start and end of a file?
>>
>> If you want to use the command line training tool they must be all in one
>> file, if you use the API
>> its up to you to merge these different sources into one name sample stream.
>>
>> Jörn
