Re: Training multiple models

Paul Cowan Wed, 05 Jan 2011 12:33:25 -0800

Hi,

I have created a sample file which for now is only marked up with
<START:Organization>..<END> markers, and I have the following test, which is
passing.  Java is not a language I have spent an awful lot of time on, so
forgive any ignorance on my part:


    @Test
    public void testHtmlOrganizationFind() throws Exception{
        InputStream in = getClass().getClassLoader().getResourceAsStream(
        "opennlp/tools/namefind/htmlbasic.train");

        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(in))
        );

        TokenNameFinderModel nameFinderModel = NameFinderME.train("en",
                "organization", sampleStream,
                Collections.<String, Object>emptyMap(), 70, 1);

        assertNotNull(nameFinderModel);
    }

At the moment, I am preprocessing the htmlbasic.train file by stripping out
all the newline characters so that it is just one line.

I would be grateful if anyone could help me with the following questions:

1.  Is the "type" argument passed into the NameFinderME.train method the type
of the model, which in my case is organization (<START:organization>)?  If so,
would I need to call train for each tag I mark up the text with? I want to
use <START:location> and others, for example.

2.  How do I feed multiple files into the training?  Somebody said I could
use the <HTML> tags as document delimiters. Or is another way to merge all
the documents into one file, with the documents delimited by the newline
character?  I cannot find a test which shows how to do this.
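
For the second option in question 2, a minimal sketch of merging several
training files into one sequence of lines, using only the standard library.
The class name TrainingFileMerger and the method mergeLines are hypothetical,
and the blank line inserted between files is an assumption about how document
boundaries are marked in the name finder training format, so it is worth
checking against the OpenNLP documentation before relying on it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
public class TrainingFileMerger {

    /**
     * Reads every file line by line and returns all lines in order.
     * A blank line is placed between consecutive files (assumed here to
     * act as a document delimiter). Reading per file also avoids gluing
     * the last line of one file onto the first line of the next when a
     * file has no trailing newline.
     */
    public static List<String> mergeLines(List<Path> files) throws IOException {
        List<String> merged = new ArrayList<>();
        for (Path file : files) {
            if (!merged.isEmpty()) {
                merged.add(""); // assumed document boundary
            }
            merged.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        return merged;
    }
}
```

The merged lines could then be joined with "\n" and wrapped in a
java.io.StringReader, which would take the place of the InputStreamReader
passed to PlainTextByLineStream in the test above.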

Thanks

Paul

Cheers

Paul Cowan

Cutting-Edge Solutions (Scotland)

http://thesoftwaresimpleton.blogspot.com/



On 23 December 2010 16:42, Benson Margulies <[email protected]> wrote:

> If I were you, I'd keep HTML digestion separate from sentence bounding.
>
>
> On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <[email protected]> wrote:
> > Hi,
> >
> > Am I right in saying that I will also need to create and train my own
> > HTML sentence detector in order to parse the HTML into chunks that can be
> > tokenised?
> >
> > Cheers
> >
> > Paul Cowan
> >
> > Cutting-Edge Solutions (Scotland)
> >
> > http://thesoftwaresimpleton.blogspot.com/
> >
> >
> >
> > On 17 December 2010 15:10, Jörn Kottmann <[email protected]> wrote:
> >
> >> On 12/17/10 2:19 PM, James Kosin wrote:
> >>
> >>> I have the following questions that I would appreciate an answer for:
> >>> >
> >>> >  1. Can I have the different name finding tags in the same data?
> >>>
> >>
> >> Yes, but that means you train a model which can detect each of these
> >> names. You should test both, multiple name types in one model,
> >> and separate models for each name type. You can use the built
> >> in evaluation to validate your results.
> >>
> >>> >  2. Does the <START:address> <END> make sense over multiple lines or
> >>> >  should I break this up further?
> >>>
> >> No, not possible: names spanning multiple sentences (a line is a
> >> sentence) are not supported.
> >>
> >>
> >>> >  3. I want to use 200 or 300 different examples, do I need to create
> >>> >  separate files for each example or can I merge them all into 1 and if
> >>> >  it is only 1, do I need to mark up the start and end of a file?
> >>>
> >> If you want to use the command line training tool they must all be in
> >> one file; if you use the API it's up to you to merge these different
> >> sources into one name sample stream.
> >>
> >> Jörn
> >>
> >
>
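
Since a name cannot span more than one line (a line being a sentence), it may
help to check a training file for broken spans before training. The following
is a rough sketch only: SpanMarkupChecker and findBrokenLines are hypothetical
names, and it assumes that the <START:type> and <END> markers are separated
from the surrounding words by whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
public class SpanMarkupChecker {

    /**
     * Returns the 1-based numbers of lines whose markers do not pair up
     * within the line: a <START:...> left open at the end of the line,
     * an <END> with no preceding <START:...>, or a nested <START:...>.
     */
    public static List<Integer> findBrokenLines(List<String> lines) {
        List<Integer> broken = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            boolean open = false; // currently inside a span?
            boolean ok = true;
            for (String token : lines.get(i).trim().split("\\s+")) {
                if (token.startsWith("<START") && token.endsWith(">")) {
                    if (open) {
                        ok = false; // span opened inside another span
                    }
                    open = true;
                } else if (token.equals("<END>")) {
                    if (!open) {
                        ok = false; // close without an open on this line
                    }
                    open = false;
                }
            }
            if (open || !ok) {
                broken.add(i + 1);
            }
        }
        return broken;
    }
}
```

An address split over two lines would be reported on both lines: the first
because its span is never closed, the second because its <END> has no opening
marker on the same line.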
