On 5/3/11 1:24 PM, Muhammad Dhito wrote:
Hi,
I have been working on OpenNLP recently for my final project. I'm
trying to adapt OpenNLP for Indonesian language processing, but I'm
only adapting four components: sentence detector, tokenizer,
part-of-speech tagger, and chunker.
Is it enough if I just provide Indonesian models so that I can use
OpenNLP to process Indonesian text?
It would of course be nice if you made the models available to others.
We might not be able to redistribute them here, but maybe you can just
put them somewhere.
Which corpus do you train on? If it is publicly available, it would be
nice to add support for parsing it directly to OpenNLP, as we have
already done for a couple of corpora. Your contribution here would be
very welcome.
Should I make some changes in
OpenNLP's source code according to Indonesian grammar by adding some
language-specific features?
Maybe you will get better results with language-specific features. We
should support that, and we have already taken first steps to make it
easier, e.g. the language is stored inside our models.
Please feel free to propose new features that are specific to
Indonesian; we will see how they could be integrated.
Thanks,
Jörn