Hi,
I have created a sample file which, for now, is only marked up with
<START:Organization>..<END> markers, and I have the following test, which is
passing. Java is not a language I have spent an awful lot of time on, so
forgive any ignorance on my part:
    @Test
    public void testHtmlOrganizationFind() throws Exception {
        InputStream in = getClass().getClassLoader().getResourceAsStream(
                "opennlp/tools/namefind/htmlbasic.train");

        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
                new PlainTextByLineStream(new InputStreamReader(in)));

        TokenNameFinderModel nameFinderModel = NameFinderME.train("en",
                "organization", sampleStream,
                Collections.<String, Object>emptyMap(), 70, 1);

        assertNotNull(nameFinderModel);
    }
At the moment, I am preprocessing the htmlbasic.train file by stripping out
all the newline characters so that the whole file is a single line.
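A minimal sketch of this kind of preprocessing, using only the Java standard library. It assumes the file is small enough to read into memory, that it is UTF-8 encoded, and that replacing each line break with a single space (rather than deleting it outright) is acceptable so that adjacent words do not run together. The class and method names are hypothetical and are not part of OpenNLP:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TrainingFileFlattener {

    /** Replaces each run of line breaks (plus surrounding whitespace) with one space. */
    static String flatten(String text) {
        return text.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
    }

    /** Reads a marked-up file (assumed UTF-8) and returns its content as a single line. */
    static String flattenFile(Path file) throws IOException {
        String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return flatten(content);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample content, written to a temporary file for the demo.
        Path tmp = Files.createTempFile("htmlbasic", ".train");
        String sample = "<START:organization> Acme\nLtd <END> is based\r\nin Glasgow .\n";
        Files.write(tmp, sample.getBytes(StandardCharsets.UTF_8));

        System.out.println(flattenFile(tmp));
        Files.delete(tmp);
    }
}
```

Substituting a space for each line break keeps the tokens on either side separate, which matters if the markup is whitespace-tokenised.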
I would be grateful if anyone could help me with the following questions:
1. Is the "type" argument passed into the NameFinderME.train method the type
of the model, which in my case is organization (<START:organization>)? If so,
would I need to call train once for each tag I mark up the text with? I want
to use <START:location> and others, for example.
2. How do I feed multiple files into the training? Somebody said I could
use the <HTML> tags as document delimiters. Or is the other way to merge all
the documents into one file, with the documents delimited by the newline
character? I cannot find a test which shows how to do this.
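To make the second option in question 2 concrete, here is a rough sketch of merging several marked-up documents into one file with one document per line, using only the Java standard library. The class and method names are hypothetical, UTF-8 encoding is assumed, and whether one document per line is the right layout for the trainer is itself an assumption: the earlier reply quoted below says a line is treated as a sentence, so each merged document would be seen as one sentence.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrainingFileMerger {

    /** Collapses each document onto one line and joins them, one document per line. */
    static String mergeDocuments(List<String> documents) {
        StringBuilder merged = new StringBuilder();
        for (String document : documents) {
            // Replace internal line breaks with a single space.
            String oneLine = document.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim();
            if (!oneLine.isEmpty()) {
                merged.append(oneLine).append('\n');
            }
        }
        return merged.toString();
    }

    /** Reads every source file (assumed UTF-8) and writes the merged result to target. */
    static void mergeFiles(List<Path> sources, Path target) throws IOException {
        List<String> documents = new ArrayList<>();
        for (Path source : sources) {
            documents.add(new String(Files.readAllBytes(source), StandardCharsets.UTF_8));
        }
        Files.write(target, mergeDocuments(documents).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Two hypothetical marked-up documents, each spanning several lines.
        List<String> documents = Arrays.asList(
                "<START:organization> Acme Ltd <END>\nis based in Glasgow .",
                "Contact <START:organization> Example Corp <END>\r\nfor details .");
        System.out.print(mergeDocuments(documents));
    }
}
```

The merged file could then be read through the same PlainTextByLineStream / NameSampleDataStream chain used in the test above, in place of the single-document file.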
Thanks
Paul
Cheers
Paul Cowan
Cutting-Edge Solutions (Scotland)
http://thesoftwaresimpleton.blogspot.com/
On 23 December 2010 16:42, Benson Margulies <[email protected]> wrote:
> If I were you, I'd keep HTML digestion separate from sentence bounding.
>
>
> On Thu, Dec 23, 2010 at 11:31 AM, Paul Cowan <[email protected]> wrote:
> > Hi,
> >
> > Am I right in saying that, I will also need to create and train my own
> > HTML sentence detector in order to parse the HTML into chunks that can be
> > tokenised?
> >
> > Cheers
> >
> > Paul Cowan
> >
> > Cutting-Edge Solutions (Scotland)
> >
> > http://thesoftwaresimpleton.blogspot.com/
> >
> >
> >
> > On 17 December 2010 15:10, Jörn Kottmann <[email protected]> wrote:
> >
> >> On 12/17/10 2:19 PM, James Kosin wrote:
> >>
> >>> I have the following questions that I would appreciate an answer for:
> >>> >
> >>> > 1. Can I have the different name finding tags in the same data?
> >>>
> >>
> >> Yes, but that means you train a model which can detect each of these
> >> names. You should test both, multiple name types in one model,
> >> and separate models for each name type. You can use the built
> >> in evaluation to validate your results.
> >>
> >>> > 2. Does the <START:address> <END> make sense over multiple lines or
> >>> > should I break this up further?
> >>>
> >> No not possible, names spanning multiple sentences (a line is a
> >> sentence), is not supported.
> >>
> >>
> >>> > 3. I want to use 200 or 300 different examples, do I need to create
> >>> > separate files for each example or can I merge them all into 1 and if
> >>> > it is only 1, do I need to mark up the start and end of a file?
> >>>
> >> If you want to use the command line training tool they must be all in
> >> one file, if you use the API its up to you to merge these different
> >> sources into one name sample stream.
> >>
> >> Jörn
> >>
> >
>