What about this: http://nlp.stanford.edu/links/statnlp.html
After reading this page: http://www.natlang.com/nlp-datasets-download
I've found these:

http://pascallin.ecs.soton.ac.uk/Challenges/RTE/Datasets/
http://dblp.uni-trier.de/db/
http://www.cs.umass.edu/~mccallum/code-data.html

This is training data from the GENIA version 3.02 corpus:

- Training Data <http://natlang.com/sites/default/files/Genia4ERtaskV2.tar.gz> (Genia4ERtaskV2.tar.gz - 2,242 KiB)
- Evaluation Data <http://natlang.com/sites/default/files/Genia4EReval.tar.gz> (Genia4EReval.tar.gz - 840 KiB)

Some more:

http://nlp.stanford.edu/links/statnlp.html
http://www.natlang.com/natlang/

We can contact the universities and ask them whether we can use their data sets.

On Wed, Jun 8, 2011 at 2:42 AM, James Kosin <[email protected]> wrote:

> Hi Eldad,
>
> Sorry for the late response....
>
> 1) Yes, I also have similar success and failure with the NameFinder.
> Hopefully, we can come up with better training data. The training data
> is simple for the NameFinder... basically, the NameFinder expects that
> the document has already been parsed with the Sentence Detector and the
> Tokenizer; though it isn't 100% required if you are training your own
> applications.
>
> Say you wanted to use the "Hi James," below. Although it is not a complete
> sentence, you would put it on its own line, with the tokenizer
> actually producing the result of "Hi James ," ... notice the space
> between the James and the ','. The NameFinder expects the data
> tokenized as follows "Hi <START:person> James <END> ," ... notice the
> <START> and <END> tags for the sentence or partial in this case. The
> older models used just <START> and <END> without the qualifier
> specifying the type of tag.
>
> We've also found that if you put a "Mr" or "Mrs" prefix before the name,
> it seems to recognize the names more easily.
> Most of the training has been done on news articles and not everyday text.
>
> Jorn just started a project that the group has been discussing over many
> runs that involves collecting and parsing openly free data for the
> corpus. https://cwiki.apache.org/OPENNLP/opennlp-annotations.html
> Please feel free to join the discussion and help with the tasks. We are
> trying to provide open training sets to help with the issues of
> customizing and other issues related to using the copyrighted material
> for the models.
>
> James
>
>
> On 6/5/2011 6:52 AM, Eldad Yamin wrote:
> > Hi James,
> >
> > Thank you for your great response!
> >
> > 1. I already used the command (as described in the documentation) and got
> > some nice results.
> >
> > The only problem that I've found is with the NameFinder: it didn't
> > recognize different names.
> >
> > Can you please explain how I can use the trainer to "make" it recognize
> > more names (people, places, etc.)?
> >
> > 2. Linked documents, in other words related articles, for example (in
> > GATE):
> >
> > http://gate.ac.uk/biz/customers.html
> >
> > Read the first paragraph under "media".
> >
> > 3. In addition, I have access to lots of texts/books that are written in
> > Hebrew. How can I use them to train the NameFinder (I will contribute it
> > back)?
> >
> > And again, thank you very much!
> >
> > On Sun, Jun 5, 2011 at 2:04 AM, James Kosin <[email protected]> wrote:
> >
> >> Eldad,
> >>
> >> It is possible.
> >> 1) This is easy enough with the current architecture and models.
> >> Basically, you have to pass in the document or paragraphs and parse into
> >> sentences using the SentenceDetector, which detects the sentences in the
> >> paragraph and returns a String array of sentences. Next the output from
> >> the sentence detector needs to be put through the Tokenizer, which takes
> >> the sentences and tokenizes into smaller parts. Usually words, but it
> >> also moves punctuation away from the words as well.
> >> This is done for each sentence and returns a string list of tokens.
> >> From here you have the raw data needed for most of the other models.
> >> From your description, you will want to use the NameFinder and the
> >> supporting models to tag the people, locations, and organizations and
> >> the like.
> >>
> >> 2) Not sure what you mean by link documents to others....
> >>
> >> 3) We don't yet support all languages at the moment. Mostly because
> >> training and test data need to be collected over many months and parsed
> >> to be trained. Many groups have already done some work; unfortunately,
> >> most is copyrighted and difficult for everyone to get in some cases.
> >>
> >> This should get you started.
> >> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html
> >>
> >> Download the release here... Don't forget the models toward the bottom.
> >> http://incubator.apache.org/opennlp/download.cgi
> >>
> >> Let us know if you need anything else.
> >>
> >> James
> >>
> >>
> >> On 6/4/2011 12:30 PM, Eldad Yamin wrote:
> >>> Hello everyone,
> >>> After researching NLP I have found OpenNLP to be one of the most
> >>> promising solutions at the moment.
> >>> However, I'm still looking for instructions on how to make OpenNLP fit
> >>> my needs.
> >>>
> >>> I need OpenNLP to:
> >>> 1. take a sentence/paragraph as input and return IE, annotation, named
> >>> entities (people, locations, organizations) and (numbers, dates, etc.).
> >>> 2. link documents to others.
> >>> 3. support multiple languages.
> >>>
> >>> Please advise,
> >>> Eldad.
> >>>
> >>
> >
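
For anyone preparing their own NameFinder training data, here is a rough sketch of the line format James describes: one sentence per line, tokens separated by spaces, names wrapped in <START:type> ... <END>. It is plain Python and does not use OpenNLP. The helper names (separate_punctuation, annotate) and the (start, end, type) span layout are assumptions made up for this illustration, and the regex is only a naive stand-in for the real Tokenizer.

```python
import re


def separate_punctuation(text):
    """Naive stand-in for a tokenizer (NOT OpenNLP's Tokenizer):
    returns words and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def annotate(tokens, spans):
    """Build one training line from tokens and name spans.

    spans is a list of (start, end, type) tuples over token indices,
    start inclusive and end exclusive (a layout assumed for this sketch).
    """
    starts = {start: kind for start, _end, kind in spans}
    ends = {end for _start, end, _kind in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append("<START:%s>" % starts[i])
        out.append(token)
    if len(tokens) in ends:
        # A name that runs to the end of the sentence still needs closing.
        out.append("<END>")
    return " ".join(out)


tokens = separate_punctuation("Hi James,")
line = annotate(tokens, [(1, 2, "person")])
print(line)  # Hi <START:person> James <END> ,
```

The printed line matches the "Hi <START:person> James <END> ," example above. Real text (apostrophes, abbreviations, Hebrew) would need the actual Sentence Detector and Tokenizer rather than this regex.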
