I think it has potential. Here is a better description of what it does.
The user story would be something like this: "As a user, I need to be able
to rapidly create and refine models for extracting Named Entities from my
particular data, so that I can constantly improve the results of my pipeline."
The processing flow of the tool is this:
1. The user supplies a set of sentences by implementing the SentenceProvider
   interface.
2. The user supplies a validation layer by implementing the EntityValidator
   interface.
3. The user supplies a location to write the annotated sentences via the
   AnnotatedSentenceWriter interface.
4. The user supplies a list of seed entities via the KnownEntityProvider
   interface.
5. The user passes these interfaces, along with a number of iterations, into
   the SemiSupervisedModelBuilder interface impl.
I wrote a prototype implementation of each (it's rough at this point); sorry
for the extremely long post.
Here are the interfaces:
public interface SemiSupervisedModelBuilder {
    void build(SentenceProvider sentenceProvider, KnownEntityProvider knownQuantityProvider,
            EntityValidator badEntityProvider, AnnotatedSentenceWriter annSentenceWriter,
            Integer iterations);
}

public interface SentenceProvider {
    Set<String> getSentences();
}

public interface KnownEntityProvider {
    Set<String> getKnownEntities();
    void addKnownEntity(String unambiguousEntity);
    String getKnownEntitiesType();
}

public interface EntityValidator {
    Set<String> getBlacklist();
    Boolean isValidEntity(String token);
    Boolean isValidEntity(String token, double prob);
    Boolean isValidEntity(String token, Span namedEntity, String[] words,
            String[] posWhiteList, String[] pos);
}

public interface AnnotatedSentenceWriter {
    void write(List<String> annotatedSentences);
    void setFilePath(String path);
    String getFilePath();
}
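To show how a user would plug in their own data source, here is a hypothetical file-backed SentenceProvider (this is my illustration, not part of the prototype; the MySQL impl referenced below is not shown). The interface is repeated inside the class only so the sketch compiles standalone.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: a SentenceProvider that reads one sentence per line
// from a text file. In the real project this would simply implement the
// SentenceProvider interface declared above.
public class FileSentenceProvider {

    // Repeated here so the sketch is self-contained.
    public interface SentenceProvider {
        Set<String> getSentences();
    }

    public static SentenceProvider fromFile(final String path) {
        return new SentenceProvider() {
            @Override
            public Set<String> getSentences() {
                // LinkedHashSet keeps corpus order while deduplicating
                Set<String> sentences = new LinkedHashSet<>();
                try {
                    for (String line : Files.readAllLines(Paths.get(path),
                            StandardCharsets.UTF_8)) {
                        if (!line.trim().isEmpty()) {
                            sentences.add(line.trim());
                        }
                    }
                } catch (IOException e) {
                    throw new RuntimeException("could not read " + path, e);
                }
                return sentences;
            }
        };
    }
}
```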
///////////// here is the impl that controls the flow
public class SemiSupervisedModelBuilderImpl implements SemiSupervisedModelBuilder {

    public static void main(String[] args) {
        SemiSupervisedModelBuilder builder = new SemiSupervisedModelBuilderImpl();
        SentenceProvider sp = new MySQLSentenceProviderImpl();
        EntityValidator kbe = new GenericEntityValidatorImpl();
        KnownEntityProvider kqp = new GenericKnownEntityProvider();
        AnnotatedSentenceWriter asw = new GenericAnnotatedSentenceWriter();
        builder.build(sp, kqp, kbe, asw, 2);
    }

    TokenizerModel tm;
    TokenizerME wordBreaker;
    TokenNameFinderModel nerModel;
    NameFinderME nameFinder;
    @Override
    public void build(SentenceProvider sentenceProvider, KnownEntityProvider knownQuantityProvider,
            EntityValidator knownBadEntityProvider, AnnotatedSentenceWriter annSentenceWriter,
            Integer enrichmentIterations) {
        Set<String> sentences = sentenceProvider.getSentences();
        List<String> annotatedSentences = new ArrayList<>();
        try {
            for (int iters = 0; iters < enrichmentIterations; iters++) {
                int counter1 = 0;
                System.out.println("-----------------iteration : " + iters);
                // annotate every sentence that contains a known entity
                for (String sentence : sentences) {
                    counter1++;
                    if (counter1 % 1000 == 0) {
                        System.out.println("sentence " + counter1 + " of iter " + iters);
                    }
                    for (String known : knownQuantityProvider.getKnownEntities()) {
                        if (sentence.contains(known)) {
                            String annSent = sentence.replace(known, " <START:"
                                    + knownQuantityProvider.getKnownEntitiesType() + "> "
                                    + known.trim() + " <END> ");
                            if (!annotatedSentences.contains(annSent)) {
                                annotatedSentences.add(annSent);
                            }
                        }
                    }
                }
                System.out.println("writing " + annotatedSentences.size() + " annotations");
                annSentenceWriter.write(annotatedSentences);
                buildmodel(annSentenceWriter.getFilePath());

                String modelPath = "c:\\temp\\opennlpmodels\\";
                InputStream stream = new FileInputStream(new File(modelPath + "en-token.zip"));
                tm = new TokenizerModel(stream);
                wordBreaker = new TokenizerME(tm);
                // load the model we just made
                nerModel = new TokenNameFinderModel(new FileInputStream(
                        new File(modelPath + "en-ner-person.train.model")));
                nameFinder = new NameFinderME(nerModel);
                int counter = 0;
                // run the fresh model over the corpus to discover new entities
                for (String sentence : sentences) {
                    counter++;
                    if (counter % 1000 == 0) {
                        System.out.println("sentence " + counter + " of iter " + iters);
                        nameFinder.clearAdaptiveData();
                    }
                    String[] stringTokens = wordBreaker.tokenize(sentence);
                    Span[] spans = nameFinder.find(stringTokens);
                    double[] probs = nameFinder.probs();
                    if (spans.length > 0) {
                        for (String token : Span.spansToStrings(spans, stringTokens)) {
                            // if (!knownQuantityProvider.getKnownEntities().contains(token)) {
                            // keep only tokens the validator accepts
                            if (knownBadEntityProvider.isValidEntity(token)) {
                                knownQuantityProvider.addKnownEntity(token);
                                String annSent = sentence.replace(token, " <START:"
                                        + knownQuantityProvider.getKnownEntitiesType() + "> "
                                        + token.trim() + " <END> ");
                                annotatedSentences.add(annSent);
                                System.out.println("NER: " + token);
                            } else {
                                System.out.println("BAD ENTITY: " + token);
                            }
                        }
                    }
                }
                // this set grows and can be validated via the blacklist and other means
                for (String a : knownQuantityProvider.getKnownEntities()) {
                    System.out.println("knowns: " + a);
                }
            }
            annSentenceWriter.write(annotatedSentences);
            System.err.println("BUILDING FINAL MODEL");
            buildmodel(annSentenceWriter.getFilePath());
            System.err.println("FINAL MODEL COMPLETE");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    public void buildmodel(String path) throws Exception {
        System.out.println("reading training data...");
        Charset charset = Charset.forName("UTF-8");
        // read the training file we were handed, rather than a hardcoded path
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(path), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        System.out.println("\tgenerating model...");
        TokenNameFinderModel model = NameFinderME.train("en", "person", sampleStream, null);
        sampleStream.close();
        OutputStream modelOut = new BufferedOutputStream(
                new FileOutputStream(new File(path + ".model")));
        model.serialize(modelOut);
        modelOut.close();
    }
}
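To make the annotation step concrete, here is a minimal standalone sketch of the replace-based tagging used in build() above, so the OpenNLP name finder training format is visible in isolation:

```java
// Standalone sketch of the annotation step: every occurrence of a known
// entity in a sentence is wrapped in OpenNLP's <START:type> ... <END>
// training tags via String.replace, exactly as build() does.
public class AnnotationDemo {

    static String annotate(String sentence, String entity, String type) {
        return sentence.replace(entity,
                " <START:" + type + "> " + entity.trim() + " <END> ");
    }

    public static void main(String[] args) {
        // prints: I saw  <START:person> Barack Obama <END>  today.
        System.out.println(annotate("I saw Barack Obama today.",
                "Barack Obama", "person"));
    }
}
```

Note the doubled spaces around the tags; the training-data reader tolerates them, and the writer trims each line before writing.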
Here is the KnownEntityProvider impl. This is essentially the starting point
for iteratively creating the model:
public class GenericKnownEntityProvider implements KnownEntityProvider {

    Set<String> ret = new HashSet<>();

    @Override
    public Set<String> getKnownEntities() {
        if (ret.isEmpty()) {
            ret.add("Barack Obama");
            ret.add("Mitt Romney");
            ret.add("John Doe");
            ret.add("Bill Gates");
            ret.add("Nguyen Tan Dung");
            ret.add("Hassanal Bolkiah");
            ret.add("Bashar al-Assad");
            ret.add("Faysal Khabbaz Hamou");
            ret.add("Dr Talwar");
        }
        return ret;
    }

    @Override
    public String getKnownEntitiesType() {
        return "person";
    }

    @Override
    public void addKnownEntity(String unambiguousEntity) {
        ret.add(unambiguousEntity);
    }
}
Here is my simple entity validator. The badentities set is what users can
add to in order to iteratively improve the resulting model. The user can
validate any way they want; I hoped the overloads make that obvious.
public class GenericEntityValidatorImpl implements EntityValidator {

    private Set<String> badentities = new HashSet<>();
    private final double MIN_SCORE_FOR_TRAINING = 0.95d;

    @Override
    public Set<String> getBlacklist() {
        badentities.addAll(Arrays.asList(
                ".", "-", ",", ";", "the", "that", "several", "model", "our",
                "are", "in", "at", "is", "for", "during", "south", "from",
                "recounts", "wissenschaftliches", "if", "security",
                "denouncing", "writes", "but", "operation", "adds", "Above",
                "RIP", "on", "no", "agrees", "year", "you", "red", "added",
                "hello", "around", "has", "turn", "surrounding", "\" No",
                "aug.", "or", "quips", "september", "[mr", "diseases", "when",
                "bbc", ":\"", "dr", "baby", "route", "'", "\"", "a", "her",
                "two", ":", "one"));
        return badentities;
    }
    @Override
    public Boolean isValidEntity(String token) {
        if (badentities.isEmpty()) {
            getBlacklist();
        }
        String[] tokens = token.toLowerCase().split(" ");
        if (tokens.length >= 2) {
            for (String t : tokens) {
                if (badentities.contains(t.trim())) {
                    System.out.println("bad token : " + token);
                    return false;
                }
            }
        } else {
            // single-token candidates are rejected outright
            System.out.println("bad token : " + token);
            return false;
        }
        Pattern p = Pattern.compile("[^a-z ]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        if (p.matcher(token).find()) {
            System.out.println("hit on [^a-z ] : " + token);
            // allow hyphenated names like "al-Assad"
            if (!token.toLowerCase().matches(".*[a-z]\\-[a-z].*")) {
                System.out.println("bad token : " + token);
                return false;
            } else {
                System.out.println("false pos : " + token);
            }
        }
        if (badentities.contains(token.toLowerCase())) {
            System.out.println("bad token : " + token);
            return false;
        }
        return true;
    }
    @Override
    public Boolean isValidEntity(String token, double prob) {
        if (prob < MIN_SCORE_FOR_TRAINING) {
            return false;
        }
        return isValidEntity(token);
    }
    @Override
    public Boolean isValidEntity(String token, Span namedEntity, String[] words,
            String[] posWhiteList, String[] pos) {
        if (!isValidEntity(token)) {
            return false;
        }
        // every token in the span must carry a whitelisted POS tag
        for (int i = namedEntity.getStart(); i < namedEntity.getEnd(); i++) {
            boolean whitelisted = false;
            for (String ps : posWhiteList) {
                if (ps.equals(pos[i])) {
                    whitelisted = true;
                    break;
                }
            }
            if (!whitelisted) {
                return false;
            }
        }
        return true;
    }
}
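For example, the probability overload lets the caller gate training on the name finder's confidence. A simplified stand-in (my sketch, not the impl above; it keeps only the threshold and the two-token rule):

```java
// Standalone sketch of how a caller might use the probability overload:
// only tokens the model is confident about, and that pass a minimal
// string check, are fed back into the next training round. The 0.95
// threshold mirrors MIN_SCORE_FOR_TRAINING above.
public class ValidatorDemo {

    static final double MIN_SCORE_FOR_TRAINING = 0.95d;

    // Simplified stand-in for GenericEntityValidatorImpl: requires a
    // probability above the threshold and at least two tokens.
    static boolean isValidEntity(String token, double prob) {
        if (prob < MIN_SCORE_FOR_TRAINING) {
            return false;
        }
        return token.trim().split(" ").length >= 2;
    }

    public static void main(String[] args) {
        System.out.println(isValidEntity("Barack Obama", 0.97)); // true
        System.out.println(isValidEntity("Barack Obama", 0.60)); // false: low confidence
        System.out.println(isValidEntity("Obama", 0.99));        // false: single token
    }
}
```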
The AnnotatedSentenceWriter dictates where to write the output sentences.
This is great if someone is doing this in a distributed way (like in
Hadoop); it could write out to HBase or HDFS or somewhere the data could
be crowdsourced or whatever...
public class GenericAnnotatedSentenceWriter implements AnnotatedSentenceWriter {

    private String path = "c:\\temp\\opennlpmodels\\en-ner-person.train";

    @Override
    public void write(List<String> sentences) {
        try {
            // overwrite the training file on each call
            FileWriter writer = new FileWriter(this.getFilePath(), false);
            for (String s : sentences) {
                writer.write(s.trim() + "\n");
            }
            writer.close();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    @Override
    public void setFilePath(String path) {
        this.path = path;
    }

    @Override
    public String getFilePath() {
        return path;
    }
}
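To illustrate that the destination really is pluggable, here is a hypothetical in-memory variant (my sketch; a Hadoop impl could push to HDFS or HBase through the same three methods). The interface methods are reproduced directly so the sketch compiles standalone:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an AnnotatedSentenceWriter-style collector that
// buffers annotated sentences in memory instead of a file, e.g. for
// tests or for handing off to a distributed store.
public class InMemoryAnnotatedSentenceWriter {

    private final List<String> buffer = new ArrayList<>();
    private String path = "(in-memory)";

    // Mirrors AnnotatedSentenceWriter.write(List<String>)
    public void write(List<String> annotatedSentences) {
        for (String s : annotatedSentences) {
            buffer.add(s.trim());
        }
    }

    public void setFilePath(String path) {
        this.path = path;
    }

    public String getFilePath() {
        return path;
    }

    public List<String> getBuffer() {
        return buffer;
    }
}
```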
If you made it this far down the email, please let me know what you think.
I believe it has potential.
Thanks,
MG
On Thu, Oct 3, 2013 at 4:02 AM, Jörn Kottmann <[email protected]> wrote:
> On 10/02/2013 02:06 AM, Mark G wrote:
>
>> I've been using OpenNLP for a few years and I find the best results occur
>> when the models are generated using samples of the data they will be run
>> against, one of the reasons I like the Maxent approach. I am not sure
>> attempting to provide models will bear much fruit other than users will no
>> longer be afraid of the licensing issues associated with using them in
>> commercial systems. I do strongly think we should provide a modelbuilding
>> framework (that calls the training api) and a default impl.
>> Coincidentally....I have been building a framework and impl over the last
>> few months that creates models based on seeding an iterative process with
>> known entities and iterating through a set of supplied sentences to
>> recursively create annotations, write them, create a maxentmodel, load the
>> model, create more annotations based on the results (there is a validation
>> object involved), and so on.... With this method I was able to create an
>> NER model for people's names against a 200K sentence corpus that returns
>> acceptable results just by starting with a list of five highly unambiguous
>> names. I will propose the framework in more detail in the coming days and
>> supply my impl if everyone is interested.
>> As for the initial question, I would like to see OpenNLP provide a
>> framework for rapidly/semi-automatically building models out of user data,
>> and also performing entity resolution across documents, in order to assign
>> a probability to whether the "Bob" in one document is the same as "Bob" in
>> another.
>>
>>
> Sounds very interesting. The sentence-wise training data which is produced
> this way could
> also be combined with existing training data, or just be used to bootstrap
> a model to more
> efficiently label data with a document-level annotation tool.
>
> Another aspect is that this tool might be good at detecting mistakes in
> existing training data.
>
> Jörn
>
>
>