Hi all,

I can definitely do this. However, I would wait until I have the validated OANC data (MASK) from Nancy Ide; some code changes might be necessary for this.

Best,
Katrin

On 02/24/2012 04:01 AM, Jason Baldridge wrote:
+1

On Wed, Feb 22, 2012 at 4:21 PM, Joern Kottmann <[email protected]> wrote:

The code we could add to the formats package and ship with OpenNLP directly. +1 to do that.

Jörn

On Wed, Feb 22, 2012 at 10:53 PM, Jason Baldridge <[email protected]> wrote:

The OANC is great, and I'm glad to hear that training those models on it works well. It would certainly be a good idea to look at the MASK subset and see how things work out with that. (Also, just to get a sense of the size of it.)

If you are up for it, would you be interested in adding code and data to train on OANC for the OpenNLP-Models git repo?

https://github.com/utcompling/OpenNLP-Models

-Jason

On Wed, Feb 22, 2012 at 3:22 AM, Katrin Tomanek <[email protected]> wrote:

Hi,

as for corpora we could use to train freely available models for OpenNLP: I have tested OANC (the "open" section of the American National Corpus, http://www.americannationalcorpus.org/OANC). Although the OANC is automatically tagged, I obtain quite OK results (sentence splitting, tokenization, POS tagging, NP chunking). So, maybe we can provide models trained on OANC for download?

Moreover: Nancy Ide just told me that there is a subset of the OANC (called MASK) which was manually validated and is also freely available. This might be even better.

What do you think?

Best,
Katrin

On 02/20/2012 11:01 PM, Jason Baldridge wrote:

Might be some things we should look at wrt our goals of creating annotated resources.

-j

---------- Forwarded message ----------
From: ELRA ELDA Information <[email protected]>
Date: Wed, Feb 15, 2012 at 7:36 AM
Subject: LREC 2012 Workshop on Language Resource Merging - Extended Deadline to Feb.
22, 2012
To:

Call for Papers

LREC 2012 Workshop on: Language Resource Merging
22 May 2012 – Afternoon Session

EXTENDED Submission deadline: 22 FEBRUARY

CONTEXT

The availability of adequate language resources has been a well-known bottleneck for most high-level language technology applications, e.g. Machine Translation, parsing, and Information Extraction, for at least 15 years, and the impact of the bottleneck is becoming all the more apparent with the availability of higher computational power and massive storage, since modern language technologies are capable of using far more resources than the community produces. The present landscape is characterized by the existence of numerous scattered resources, many of which have differing levels of coverage, types of information and granularity. Taken singularly, existing resources do not have sufficient coverage, quality or richness for robust large-scale applications, and yet they contain valuable information (Monachini et al. 2004 and 2006; Soria et al. 2006; Molinero, Sagot and Nicolas 2009; Necsulescu et al. 2011). Differing technology or application requirements, ignorance of the existence of certain resources, and difficulties in accessing and using them, has led to the proliferation of multiple, unconnected resources that, if merged, could constitute a much richer repository of information augmenting either coverage or granularity, or both, and consequently multiplying the number of potential language technology applications. Merging, combining and/or compiling larger resources from existing ones thus appears to be a promising direction to take.

The re-use and merging of existing resources is not altogether unknown. For example, WordNet (Fellbaum, 1998) has been successfully reused in a variety of applications.
But this is the exception rather than the rule; in fact, merging, and enhancing existing resources is uncommon, probably because it is by no means a trivial task given the profound differences in formats, formalisms, metadata, and linguistic assumptions.

The language resource landscape is on the brink of a large change, however. With the proliferation of accessible metadata catalogues, and resource repositories (such as the new META-SHARE (http://www.meta-net.eu/meta-share) infrastructure), a potentially large number of existing resources will be more easily located, accessed and downloaded. Also, with the advent of distributed platforms for the automatic production of language resources, such as PANACEA (http://www.panacea-lr.eu/), new language resources and linguistic information capable of being integrated into those resources will be produced more easily and at a lower cost. Thus, it is likely that researchers and application developers will seek out resources already available before developing new, costly ones, and will require methods for merging/combining various resources and adapting them to their specific needs.

Up to the present day, most resource merging has been done manually, with only a small number of attempts reported in the literature towards (semi-)automatic merging of resources (Crouch & King 2005; Pustejovsky et al. 2005; Molinero, Sagot and Nicolas 2009; Necsulescu et al. 2011). In order to take a further step towards the scenario depicted above, in which resource merging and enhancing is a reliable and accessible first step for researchers and application developers, experience and best practices must be shared and discussed, as this will help the whole community avoid any waste of time and resources.
AIMS OF THE WORKSHOP

This half-day workshop is meant to be part of a series of meetings constituting an ongoing forum for sharing and evaluating the results of different methods and systems for the automatic production of language resources (the first one was the LREC 2010 Workshop on Methods for the Automatic Production of Language Resources and their Evaluation Methods). The main focus of this workshop is on (semi-)automatic means of merging language resources, such as lexicons, corpora and grammars. Merging makes it possible to re-use, adapt, and enhance existing resources, alongside new, automatically created ones, with the goal of reducing the manual intervention required in language resource production, and thus ultimately production costs.

WORKSHOP TOPICS

The topics of the workshop are related to best practices, methods, techniques and experimental results regarding the merging of various types of language resources, such as lexicons and corpora, especially in support of language technology applications. In particular, new methods for automatic merging with a view towards reducing human intervention will be most welcome.
Topics for submission include, but are not limited to:

- Experiments on (semi-)automatic merging of automatically produced resources
- Experiments on the merging of two or more existing resources containing the same or different levels of linguistic information
- Studies or experiments on merging resources at different levels of granularity (corpora, lexicons, grammars)
- Studies or experiments on unifying, mapping or converting encoding formats
- Comparison between different resources and mapping algorithms to provide desired merging
- Use of linguistic information from different sources in high-level language applications
- Use of new, merged language resources in language technology applications

WORKSHOP WEBSITE: http://panacea-lr.eu/en/news/project/2011/12/19/lrec-2012-merging-lr-workshop/

SUBMISSIONS

Interested participants must submit a preliminary paper of about 4-6 pages including references (between 2000-2500 words). For the submission please use the online form on START LREC Conference Manager at: https://www.softconf.com/lrec2012/MergingLR2012/

When submitting a paper from the START page, authors will be asked to provide essential information about resources (in a broad sense, i.e. also technologies, standards, evaluation kits, etc.) that have been used for the work described in the paper or are a new result of your research.
For further information on this new initiative, please refer to http://www.lrec-conf.org/lrec2012/?LRE-Map-2012

Papers will be peer-reviewed by the workshop Program Committee.

IMPORTANT DATES

Deadline for paper submission: 22 February 2012 (23:59 CET +1) **EXTENDED**
Notification of acceptance: 15 March 2012
Submission of camera-ready version of papers: 31 March 2012
Workshop date: 22 May 2012 – Afternoon Session

ORGANIZING COMMITTEE

Núria Bel, UPF, Barcelona, Spain
Maria Gavrilidou, ILSP-“Athena”, Athens, Greece
Monica Monachini, CNR-ILC, Pisa, Italy
Valeria Quochi, CNR-ILC, Pisa, Italy
Laura Rimell, University of Cambridge, UK

Contacts
lrec12_workshop_merging@ilc.cnr.it

PROGRAMME COMMITTEE:

Victoria Arranz, ELDA, Paris, France
Paul Buitelaar, National University of Ireland, Galway, Ireland
Nicoletta Calzolari, CNR-ILC, Pisa, Italy
Olivier Hamon, ELDA, Paris, France
Aleš Horák, Masaryk University, Brno, Czech Republic
Nancy Ide, Vassar College, Mass. USA
Bernardo Magnini, FBK, Trento, Italy
Paola Monachesi, Utrecht University, Utrecht, The Netherlands
Jan Odijk, Utrecht University, Utrecht, The Netherlands
Muntsa Padró, IULA, Barcelona, Spain
Karel Pala, Masaryk University, Brno, Czech Republic
Thierry Poibeau, University of Cambridge, UK and CNRS, Paris, France
Benoît Sagot, INRIA, Paris, France
Kiril Simov, Bulgarian Academy of Sciences, Sofia, Bulgaria
Claudia Soria, CNR-ILC, Pisa, Italy
Maurizio Tesconi, CNR-IIT, Pisa

--
Dr. Katrin Tomanek
Averbis GmbH
Tennenbacher Strasse 11
D-79106 Freiburg
Phone: +49 (0) 761 - 203 97696
Fax: +49 (0) 761 - 203 97694
E-Mail: [email protected]
Managing Directors: Dr. med. Philipp Daumke, Dr. Kornél Markó
Registered office: Freiburg i. Br.
Commercial register: Amtsgericht Freiburg i. Br., HRB 701080

--
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
