2011/1/24 Olivier Grisel <[email protected]>:
>
> Also you should try the approach described by Ted Dunning in this comment:
>
> http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html#comment-6a010536291c30970b0147e1c7e65b970b
>
> In a nutshell: use the trained model to annotate the original corpus
> to add some missing annotations and retrain a second model on the
> resulting corpus. This model should have a better recall performance.
> The precision might degrade a bit though. Be sure to manually check
> the quality of the corpus to check that this did not introduce too
> many wrong annotations.
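
For reference, the corpus-merging step of the two-pass approach quoted above could look roughly like the sketch below. It assumes the pass-1 model output uses the same `<START:type> ... <END>` markup and the same tokenization as the original corpus; the function names (`parse`, `render`, `merge`) are made up for illustration and are not part of OpenNLP. Model annotations are only added where they do not overlap an existing annotation.

```python
import re

START = re.compile(r"<START:([^>]+)>")
END = "<END>"


def parse(line):
    """Split an annotated line into plain tokens and (start, end, type) spans."""
    tokens, spans = [], []
    open_start, open_type = None, None
    for piece in line.split():
        match = START.fullmatch(piece)
        if match:
            open_start, open_type = len(tokens), match.group(1)
        elif piece == END:
            if open_start is not None:
                spans.append((open_start, len(tokens), open_type))
            open_start = None
        else:
            tokens.append(piece)
    return tokens, spans


def render(tokens, spans):
    """Write tokens and non-overlapping spans back as an annotated line."""
    starts = {start: kind for start, _, kind in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append(END)
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
    if len(tokens) in ends:
        out.append(END)
    return " ".join(out)


def merge(original, predicted):
    """Keep the original spans, add predicted spans that overlap none of them."""
    tokens, gold = parse(original)
    predicted_tokens, predicted_spans = parse(predicted)
    if tokens != predicted_tokens:
        raise ValueError("tokenization differs between corpus and model output")
    merged = list(gold)
    for start, end, kind in predicted_spans:
        if all(end <= s or start >= e for s, e, _ in merged):
            merged.append((start, end, kind))
    return render(tokens, sorted(merged))


original = "He moved from <START:location> Paris <END> to New York ."
predicted = ("He moved from <START:location> Paris <END> to "
             "<START:location> New York <END> .")
print(merge(original, predicted))
```

The second model would then be trained on the merged lines. Note that any wrong span predicted by the first model ends up in the training data of the second, hence the need for the manual quality check.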

FYI, I just tried that and the model trained on the two-pass dataset gives
output that looks significantly worse (I use a unix diff to compare the
outputs):

~/data/en/opennlp_location> diff -u test_annotated_pass1 test_annotated_pass2 | head
--- test_annotated_pass1 2011-01-24 17:27:13.823941001 +0100
+++ test_annotated_pass2 2011-01-24 17:21:27.953941002 +0100
@@ -1,113 +1,113 @@
-The first European pre-commercial network was at the <START:location> Isle of <END> Man by Manx Telecom , the operator then owned by British Telecom , and the first commercial network in Europe was opened for business by Telenor in December 2001 with no commercial handsets and thus no paying customers .
-<START:location> North Korea <END> has had a 3G network since 2008 , which is called Koryolink , a joint venture between Egyptian company Orascom Telecom Holding and the state-owned Korea Post and Telecommunications Corporation (KPTC ) is <START:location> North <END> Korea 's only 3G Mobile operator , and one of only two mobile companies in the country .
+The first European pre-commercial network was at the <START:location> Isle <END> of Man by Manx Telecom , the operator then owned by British Telecom , and the first commercial network in Europe was opened for business by Telenor in December 2001 with no commercial handsets and thus no paying customers .
+<START:location> North <END> Korea has had a 3G network since 2008 , which is called Koryolink , a joint venture between Egyptian company Orascom Telecom Holding and the state-owned Korea Post and Telecommunications Corporation (KPTC ) is <START:location> North <END> Korea 's only 3G Mobile operator , and one of only two mobile companies in the country .
 This sound is very common in place names in Wales because it occurs in the word Llan , meaning " Church building of Saint ..." , for example , <START:location> Llanelli <END> , where the ll appears twice , or the invented place-name Llanfairpwllgwyngyllgogerychwyrndrobwyllllantysiliogogogoch , where the ll appears five times .
-It describes the exploits of a discharged U.S. Navy sailor named Benny Profane , his reconnection in <START:location> New York <END> with a group of pseudo-bohemian artists and hangers-on known as the Whole Sick Crew , and the quest of an aging traveller named Herbert Stencil to identify and locate the mysterious entity he knows only as "V . "
-In Danish and Norwegian books , a distinction is made between foreign and local words ; thus , for example , the <START:location> German <END> city Aachen would be listed under A , but the Danish city " Aabenraa " would be listed after Z.

In the lines shown, the second pass cuts multi-word names even shorter than
the first one did (e.g. the first "North Korea" is now annotated as "North"
only).

So the two remaining promising ways to improve upon the current situation are:

- improve the feature extraction with gazetteers
- manually annotate a subset of the corpus (this needs some code to make it
  fast for a human annotator to fill in the missing locations).

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
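
PS: as a rough illustration of the gazetteer idea above, here is a sketch of a longest-match lookup that tags the tokens covered by a known place name. The tiny gazetteer, the `B-GAZ` / `I-GAZ` / `O` tag names and the `gazetteer_tags` function are all hypothetical; this is not OpenNLP's feature generator API, and feeding such tags to the name finder as extra features is not shown.

```python
def gazetteer_tags(tokens, gazetteer, max_len=4):
    """Tag each token B-GAZ / I-GAZ / O using longest-match gazetteer lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "Isle of Man" wins over "Isle".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in gazetteer:
                tags[i] = "B-GAZ"
                tags[i + 1:i + n] = ["I-GAZ"] * (n - 1)
                i += n
                break
        else:
            # No gazetteer entry starts at this token.
            i += 1
    return tags


# Hypothetical gazetteer: each entry is a tuple of tokens.
gazetteer = {("Isle", "of", "Man"), ("North", "Korea"), ("New", "York")}

tokens = "network was at the Isle of Man by Manx Telecom".split()
for token, tag in zip(tokens, gazetteer_tags(tokens, gazetteer)):
    print(token, tag)
```

A feature like this tells the model that "Isle", "of" and "Man" belong to one known name, which is exactly the kind of multi-word location that gets truncated in the diff above.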
