On 03/22/2013 01:05 PM, William Colen wrote:
We could do it with the Leipzig corpus or CONLL. We can prepare the corpus
by detokenizing it and creating documents from it.

If it is OK to do it with another language, the AD corpus has paragraph and
text annotations, as well as the original sentences (not tokenized).
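
The detokenization step could look roughly like this with the
DictionaryDetokenizer; just a sketch, assuming a recent opennlp-tools
release, with "latin-detokenizer.xml" as a placeholder dictionary file and
made-up input tokens:

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class DetokenizeSketch {

    public static void main(String[] args) throws Exception {
        // "latin-detokenizer.xml" is a placeholder for a detokenizer
        // dictionary describing how punctuation attaches to neighboring tokens.
        try (InputStream dictIn = new FileInputStream("latin-detokenizer.xml")) {
            Detokenizer detokenizer =
                    new DictionaryDetokenizer(new DetokenizationDictionary(dictIn));

            // Tokens as they would come from one line of a tokenized corpus.
            String[] tokens = {"This", "is", "a", "tokenized", "sentence", "."};

            // Rebuild the surface string; null means no split marker is inserted.
            String text = detokenizer.detokenize(tokens, null);
            System.out.println(text);
        }
    }
}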

For English we should be able to use some of the CONLL data, and yes, we
should definitely test with other languages too. Leipzig might be suitable
for sentence detector training, but not for tokenizer training, since the
data is not tokenized as far as I know.
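
For the sentence detector side, training on sentence-per-line data could be
done roughly like this; just a sketch, assuming a recent opennlp-tools API,
with "leipzig-sentences.txt" as a placeholder file (one sentence per line,
empty lines between documents) and "por" only an example language code:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import opennlp.tools.sentdetect.SentenceDetectorFactory;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceSample;
import opennlp.tools.sentdetect.SentenceSampleStream;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentDetectTrainSketch {

    public static void main(String[] args) throws Exception {
        // "leipzig-sentences.txt" is a placeholder: one sentence per line,
        // empty lines separating documents, UTF-8 encoded.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("leipzig-sentences.txt")),
                StandardCharsets.UTF_8);

        ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);

        // "por" is only an example language code.
        SentenceModel model = SentenceDetectorME.train("por", samples,
                new SentenceDetectorFactory("por", true, null, null),
                TrainingParameters.defaultParams());

        samples.close();

        OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("sent.bin"));
        model.serialize(modelOut);
        modelOut.close();
    }
}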

+1 to use AD and CONLL for testing the tokenizer and sentence detector.
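
For the evaluation itself, something along these lines should work; again
just a sketch, where "en-token.bin" and "conll-token.eval" are placeholders,
and the eval file is expected in OpenNLP's token sample format (one sentence
per line, tokens separated by whitespace or <SPLIT> markers), e.g. produced
from the CONLL data:

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerEvaluator;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TokenizerEvalSketch {

    public static void main(String[] args) throws Exception {
        // "en-token.bin" is a placeholder for an existing tokenizer model.
        TokenizerModel model = new TokenizerModel(new File("en-token.bin"));

        // "conll-token.eval": one sentence per line, tokens separated by
        // whitespace, <SPLIT> where two tokens are not separated by a space.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("conll-token.eval")),
                StandardCharsets.UTF_8);

        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        TokenizerEvaluator evaluator = new TokenizerEvaluator(new TokenizerME(model));
        evaluator.evaluate(samples);
        samples.close();

        // Precision/recall/F1 of the predicted token spans against the gold spans.
        System.out.println(evaluator.getFMeasure());
    }
}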

Jörn
