On 03/22/2013 01:05 PM, William Colen wrote:
We could do it with the Leipzig corpus or CONLL. We can prepare the corpus
by detokenizing it and creating documents from it.
If it is OK to do it with another language, the AD corpus has paragraph and
text annotations, as well as the original, untokenized sentences.
For English we should be able to use some of the CONLL data, and yes, we
should definitely test with other languages too. Leipzig might be suited for
sentence detector training, but not for tokenizer training, since the data
is not tokenized as far as I know.
+1 to using AD and CONLL for testing the tokenizer and sentence detector.
Jörn
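
For reference, a minimal sketch of the detokenization step mentioned above,
assuming a CONLL-style one-token-per-line layout with blank lines between
sentences. The punctuation rule and the file names are illustrative only;
for real data a dictionary-based detokenizer such as OpenNLP's
DictionaryDetokenizer would be the more robust choice.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Naive sketch: turn CONLL-style token-per-line data back into plain text
// sentences so it can serve as sentence detector / tokenizer test input.
public class NaiveDetokenizer {

    // Tokens that attach to the preceding token without a space
    // (illustrative rule, not a complete detokenization policy).
    private static final String NO_SPACE_BEFORE = ".,;:!?)]}";

    static String detokenize(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (sb.length() > 0 && !NO_SPACE_BEFORE.contains(token)) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file: first whitespace-separated column is the
        // token, blank lines separate sentences (typical CONLL layout).
        List<String> lines = Files.readAllLines(
                Paths.get("conll-tokens.txt"), StandardCharsets.UTF_8);

        List<String> sentence = new ArrayList<>();
        List<String> document = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                if (!sentence.isEmpty()) {
                    document.add(detokenize(sentence));
                    sentence.clear();
                }
            } else {
                sentence.add(line.split("\\s+")[0]);
            }
        }
        if (!sentence.isEmpty()) {
            document.add(detokenize(sentence));
        }

        // Write one detokenized sentence per line as a pseudo "document".
        Files.write(Paths.get("detokenized-doc.txt"), document,
                StandardCharsets.UTF_8);
    }
}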