2011/1/24 Olivier Grisel <[email protected]>:
>
> Also you should try the approach described by Ted Dunning in this comment:
>
>  http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html#comment-6a010536291c30970b0147e1c7e65b970b
>
> In a nutshell: use the trained model to annotate the original corpus
> to add some missing annotations and retrain a second model on the
> resulting corpus. This model should have a better recall performance.
> The precision might degrade a bit though. Be sure to manually check
> the quality of the corpus to check that this did not introduce too
> many wrong annotations.

FYI, I just tried that and the model trained on the two-pass dataset
gives an output that looks significantly worse (I use a unix diff to
compare the outputs):

~/data/en/opennlp_location> diff -u test_annotated_pass1 test_annotated_pass2 | head
--- test_annotated_pass1        2011-01-24 17:27:13.823941001 +0100
+++ test_annotated_pass2        2011-01-24 17:21:27.953941002 +0100
@@ -1,113 +1,113 @@
-The first European pre-commercial network was at the <START:location>
Isle of <END> Man by Manx Telecom , the operator then owned by British
Telecom , and the first commercial network in Europe was opened for
business by Telenor in December 2001 with no commercial handsets and
thus no paying customers .
-<START:location> North Korea <END> has had a 3G network since 2008 ,
which is called Koryolink , a joint venture between Egyptian company
Orascom Telecom Holding and the state-owned Korea Post and
Telecommunications Corporation (KPTC ) is <START:location> North <END>
Korea 's only 3G Mobile operator , and one of only two mobile
companies in the country .
+The first European pre-commercial network was at the <START:location>
Isle <END> of Man by Manx Telecom , the operator then owned by British
Telecom , and the first commercial network in Europe was opened for
business by Telenor in December 2001 with no commercial handsets and
thus no paying customers .
+<START:location> North <END> Korea has had a 3G network since 2008 ,
which is called Koryolink , a joint venture between Egyptian company
Orascom Telecom Holding and the state-owned Korea Post and
Telecommunications Corporation (KPTC ) is <START:location> North <END>
Korea 's only 3G Mobile operator , and one of only two mobile
companies in the country .
 This sound is very common in place names in Wales because it occurs
in the word Llan , meaning " Church building of Saint ..." , for
example , <START:location> Llanelli <END> , where the ll appears twice
, or the invented place-name
Llanfairpwllgwyngyllgogerychwyrndrobwyllllantysiliogogogoch , where
the ll appears five times .
-It describes the exploits of a discharged U.S. Navy sailor named
Benny Profane , his reconnection in <START:location> New York <END>
with a group of pseudo-bohemian artists and hangers-on known as the
Whole Sick Crew , and the quest of an aging traveller named Herbert
Stencil to identify and locate the mysterious entity he knows only as
"V . "
-In Danish and Norwegian books , a distinction is made between foreign
and local words ; thus , for example , the <START:location> German
<END> city Aachen would be listed under A , but the Danish city "
Aabenraa " would be listed after Z.
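
A raw diff of whole sentences makes it hard to see which spans actually
moved. As a rough sketch (not something I ran for the results above), the
annotated spans could be extracted and compared sentence by sentence. It
assumes one sentence per line and whitespace-separated <START:type> / <END>
tags as in the output above; SPAN_RE, extract_spans and compare_lines are
made-up names for illustration:

```python
import re

# Matches one annotation of the form: <START:type> some tokens <END>
SPAN_RE = re.compile(r"<START:(\w+)>\s+(.*?)\s+<END>")


def extract_spans(line):
    """Return the (type, text) pairs annotated in one sentence."""
    return [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(line)]


def compare_lines(line1, line2):
    """Return the spans found only in line1 and the spans found only in line2."""
    spans1, spans2 = extract_spans(line1), extract_spans(line2)
    only1 = [s for s in spans1 if s not in spans2]
    only2 = [s for s in spans2 if s not in spans1]
    return only1, only2


pass1 = "<START:location> North Korea <END> has had a 3G network since 2008 ."
pass2 = "<START:location> North <END> Korea has had a 3G network since 2008 ."
print(compare_lines(pass1, pass2))
```

Running this over the two files line by line would list exactly which
location spans were lost or truncated by the second pass.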

So the two remaining promising ways to improve upon the current
situation are:
 - improve the feature extraction with gazetteers
 - manually annotate a subset of the corpus (need some code to make it
fast for a human annotator to fill in the missing locations).
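
One possible shape for that helper code, as a minimal sketch only: pre-fill
candidate annotations from a gazetteer so that the annotator just has to
accept or reject them. It assumes the gazetteer is a plain Python set of
space-separated location names and that sentences are whitespace-tokenized
as in the output above; suggest_locations and max_len are hypothetical names,
not part of OpenNLP:

```python
def suggest_locations(sentence, gazetteer, max_len=4):
    """Wrap unannotated gazetteer matches in <START:location> ... <END>.

    Existing annotations are left untouched; the result is only a
    suggestion that still has to be checked by a human.
    """
    tokens = sentence.split()
    out = []
    i = 0
    inside = False  # True while between a <START:...> tag and its <END>
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("<START:"):
            inside = True
        elif tok == "<END>":
            inside = False
        elif not inside:
            # Try the longest gazetteer entry starting at this token first
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if candidate in gazetteer:
                    out.extend(["<START:location>", candidate, "<END>"])
                    i += n
                    break
            else:
                # No gazetteer entry starts here: copy the token as-is
                out.append(tok)
                i += 1
            continue
        out.append(tok)
        i += 1
    return " ".join(out)


gazetteer = {"Aachen", "New York"}
sentence = "the <START:location> German <END> city Aachen would be listed under A"
print(suggest_locations(sentence, gazetteer))
```

The longest-match-first lookup is there to avoid the kind of truncated spans
seen in the diff (e.g. "North" instead of "North Korea"), provided the full
name is in the gazetteer.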

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
