The OANC is great, and I'm glad to hear that training those models on it works well. It would certainly be a good idea to look at the MASK subset and see how things work out with that. (Also, just to get a sense of the size of it.)
If you are up for it, would you be interested in adding code and data to train on OANC for the OpenNLP-Models git repo?

https://github.com/utcompling/OpenNLP-Models

-Jason

On Wed, Feb 22, 2012 at 3:22 AM, Katrin Tomanek <[email protected]> wrote:

> Hi,
>
> as for corpora we could be using to train freely available models for opennlp on: I have tested OANC (the "open" section of the american national corpus, http://www.americannationalcorpus.org/OANC).
>
> Although the OANC is automatically tagged, I obtain quite OK results (sentence splitting, tokenization, POS Tagging, NP Chunking).
>
> So, maybe we can provide models trained on OANC for download?
>
> Moreover: Nancy Ide just told me, there was a subset of the OANC (called MASK) which was manually validated and is also freely available. This might be even better.
>
> What do you think?
>
> Best,
> Katrin
>
> On 02/20/2012 11:01 PM, Jason Baldridge wrote:
>
>> Might be some things we should look at wrt to our goals of creating annotated resources. -j
>>
>> ---------- Forwarded message ----------
>> From: ELRA ELDA Information <[email protected]>
>> Date: Wed, Feb 15, 2012 at 7:36 AM
>> Subject: LREC 2012 Workshop on Language Resource Merging - Extended Deadline to Feb. 22, 2012
>> To:
>>
>> Call for Papers
>> LREC 2012 Workshop on: Language Resource Merging
>> 22 May 2012 – Afternoon Session
>>
>> EXTENDED Submission deadline: 22 FEBRUARY
>>
>> CONTEXT
>> The availability of adequate language resources has been a well-known bottleneck for most high-level language technology applications, e.g. Machine Translation, parsing, and Information Extraction, for at least 15 years, and the impact of the bottleneck is becoming all the more apparent with the availability of higher computational power and massive storage, since modern language technologies are capable of using far more resources than the community produces.
>> The present landscape is characterized by the existence of numerous scattered resources, many of which have differing levels of coverage, types of information and granularity. Taken singularly, existing resources do not have sufficient coverage, quality or richness for robust large-scale applications, and yet they contain valuable information (Monachini et al. 2004 and 2006; Soria et al. 2006; Molinero, Sagot and Nicolas 2009; Necsulescu et al. 2011). Differing technology or application requirements, ignorance of the existence of certain resources, and difficulties in accessing and using them, has led to the proliferation of multiple, unconnected resources that, if merged, could constitute a much richer repository of information augmenting either coverage or granularity, or both, and consequently multiplying the number of potential language technology applications. Merging, combining and/or compiling larger resources from existing ones thus appears to be a promising direction to take.
>> The re-use and merging of existing resources is not altogether unknown. For example, WordNet (Fellbaum, 1998) has been successfully reused in a variety of applications. But this is the exception rather than the rule; in fact, merging, and enhancing existing resources is uncommon, probably because it is by no means a trivial task given the profound differences in formats, formalisms, metadata, and linguistic assumptions.
>> The language resource landscape is on the brink of a large change, however. With the proliferation of accessible metadata catalogues, and resource repositories (such as the new META-SHARE (http://www.meta-net.eu/meta-share) infrastructure), a potentially large number of existing resources will be more easily located, accessed and downloaded.
>> Also, with the advent of distributed platforms for the automatic production of language resources, such as PANACEA (http://www.panacea-lr.eu/), new language resources and linguistic information capable of being integrated into those resources will be produced more easily and at a lower cost. Thus, it is likely that researchers and application developers will seek out resources already available before developing new, costly ones, and will require methods for merging/combining various resources and adapting them to their specific needs.
>> Up to the present day, most resource merging has been done manually, with only a small number of attempts reported in the literature towards (semi-)automatic merging of resources (Crouch & King 2005; Pustejovsky et al. 2005; Molinero, Sagot and Nicolas 2009; Necsulescu et al. 2011). In order to take a further step towards the scenario depicted above, in which resource merging and enhancing is a reliable and accessible first step for researchers and application developers, experience and best practices must be shared and discussed, as this will help the whole community avoid any waste of time and resources.
>>
>> AIMS OF THE WORKSHOP
>> This half-day workshop is meant to be part of a series of meetings constituting an ongoing forum for sharing and evaluating the results of different methods and systems for the automatic production of language resources (the first one was the LREC 2010 Workshop on Methods for the Automatic Production of Language Resources and their Evaluation Methods).
>> The main focus of this workshop is on (semi-)automatic means of merging language resources, such as lexicons, corpora and grammars.
>> Merging makes it possible to re-use, adapt, and enhance existing resources, alongside new, automatically created ones, with the goal of reducing the manual intervention required in language resource production, and thus ultimately production costs.
>>
>> WORKSHOP TOPICS
>> The topics of the workshop are related to best practices, methods, techniques and experimental results regarding the merging of various types of language resources, such as lexicons and corpora, especially in support of language technology applications. In particular, new methods for automatic merging with a view towards reducing human intervention will be most welcome.
>> Topics for submission include, but are not limited to:
>> Experiments on (semi-)automatic merging of automatically produced resources
>> Experiments on the merging of two or more existing resources containing the same or different levels of linguistic information
>> Studies or experiments on merging resources at different levels of granularity (corpora, lexicons, grammars)
>> Studies or experiments on unifying, mapping or converting encoding formats
>> Comparison between different resources and mapping algorithms to provide desired merging
>> Use of linguistic information from different sources in high-level language applications
>> Use of new, merged language resources in language technology applications
>>
>> WORKSHOP WEBSITE:
>> http://panacea-lr.eu/en/news/project/2011/12/19/lrec-2012-merging-lr-workshop/
>>
>> SUBMISSIONS
>> Interested participants must submit a preliminary paper of about 4-6 pages including references (between 2000-2500 words).
>> For the submission please use the online form on START LREC Conference Manager at:
>> https://www.softconf.com/lrec2012/MergingLR2012/
>> When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of your research.
>> For further information on this new initiative, please refer to
>> http://www.lrec-conf.org/lrec2012/?LRE-Map-2012
>> Papers will be peer-reviewed by the workshop Program Committee.
>>
>> IMPORTANT DATES
>> Deadline for paper submission: 22 February 2012 (23:59 CET +1) **EXTENDED**
>> Notification of acceptance: 15 March 2012
>> Submission of camera-ready version of papers: 31 March 2012
>> Workshop date: 22 May 2012 – Afternoon Session
>>
>> ORGANIZING COMMITTEE
>> Núria Bel, UPF, Barcelona, Spain
>> Maria Gavrilidou, ILSP-“Athena”, Athens, Greece
>> Monica Monachini, CNR-ILC, Pisa, Italy
>> Valeria Quochi, CNR-ILC, Pisa, Italy
>> Laura Rimell, University of Cambridge, UK
>>
>> Contacts
>> lrec12_workshop_merging@ilc.cnr.it
>>
>> PROGRAMME COMMITTEE:
>> Victoria Arranz, ELDA, Paris, France
>> Paul Buitelaar, National University of Ireland, Galway, Ireland
>> Nicoletta Calzolari, CNR-ILC, Pisa, Italy
>> Olivier Hamon, ELDA, Paris, France
>> Aleš Horák, Masaryk University, Brno, Czech Republic
>> Nancy Ide, Vassar College, Mass. USA
>> Bernardo Magnini, FBK, Trento, Italy
>> Paola Monachesi, Utrecht University, Utrecht, The Netherlands
>> Jan Odijk, Utrecht University, Utrecht, The Netherlands
>> Muntsa Padró, IULA, Barcelona, Spain
>> Karel Pala, Masaryk University, Brno, Czech Republic
>> Thierry Poibeau, University of Cambridge, UK and CNRS, Paris, France
>> Benoît Sagot, INRIA, Paris, France
>> Kiril Simov, Bulgarian Academy of Sciences, Sofia, Bulgaria
>> Claudia Soria, CNR-ILC, Pisa, Italy
>> Maurizio Tesconi, CNR-IIT, Pisa
>>
>
> --
> Dr. Katrin Tomanek
> Averbis GmbH
> Tennenbacher Strasse 11
> D-79106 Freiburg
>
> Phone: +49 (0) 761 - 203 97696
> Fax: +49 (0) 761 - 203 97694
> E-Mail: [email protected]
>
> Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
> Registered office: Freiburg i. Br.
> Registered at: AG Freiburg i. Br., HRB 701080
>

--
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
