The OANC is great, and I'm glad to hear that training those models on it
works well. It would certainly be a good idea to look at the MASC subset and
see how things work out with that. (Also, just to get a sense of the size
of it.)

If you are up for it, would you be interested in adding code and data to
train on OANC for the OpenNLP-Models git repo?

https://github.com/utcompling/OpenNLP-Models

-Jason

On Wed, Feb 22, 2012 at 3:22 AM, Katrin Tomanek
<[email protected]> wrote:

> Hi,
>
> As for corpora we could use to train freely available models for
> OpenNLP: I have tested the OANC (the "open" section of the American
> National Corpus, http://www.americannationalcorpus.org/OANC).
>
> Although the OANC is automatically tagged, I obtain quite OK results
> (sentence splitting, tokenization, POS Tagging, NP Chunking).
>
> So, maybe we can provide models trained on OANC for download?
>
> Moreover, Nancy Ide just told me that there is a subset of the OANC
> (called MASC) that was manually validated and is also freely available.
> This might be even better.
>
> What do you think?
>
> Best,
> Katrin
>
>
>
> On 02/20/2012 11:01 PM, Jason Baldridge wrote:
>
>> There might be some things here we should look at wrt our goals of
>> creating annotated resources. -j
>>
>> ---------- Forwarded message ----------
>> From: ELRA ELDA Information <[email protected]>
>> Date: Wed, Feb 15, 2012 at 7:36 AM
>> Subject: LREC 2012 Workshop on Language Resource Merging - Extended
>> Deadline to Feb. 22, 2012
>> To:
>>
>>
>> Call for Papers
>> LREC 2012 Workshop on: Language Resource Merging
>> 22 May 2012 – Afternoon Session
>>
>> EXTENDED Submission deadline: 22 FEBRUARY
>>
>> CONTEXT
>> The availability of adequate language resources has been a well-known
>> bottleneck for most high-level language technology applications, e.g.
>> Machine Translation, parsing, and Information Extraction, for at least 15
>> years, and the impact of the bottleneck is becoming all the more apparent
>> with the availability of higher computational power and massive storage,
>> since modern language technologies are capable of using far more resources
>> than the community produces. The present landscape is characterized by the
>> existence of numerous scattered resources, many of which have differing
>> levels of coverage, types of information and granularity. Taken
>> singularly, existing resources do not have sufficient coverage, quality or
>> richness for robust large-scale applications, and yet they contain
>> valuable information
>> (Monachini et al. 2004 and 2006; Soria et al. 2006; Molinero, Sagot and
>> Nicolas 2009; Necsulescu et al. 2011). Differing technology or application
>> requirements, ignorance of the existence of certain resources, and
>> difficulties in accessing and using them, have led to the proliferation of
>> multiple, unconnected resources that, if merged, could constitute a much
>> richer repository of information augmenting either coverage or
>> granularity, or both, and consequently multiplying the number of potential
>> language technology applications. Merging, combining and/or compiling
>> larger resources from existing ones thus appears to be a promising
>> direction to take.
>> The re-use and merging of existing resources is not altogether unknown.
>> For example, WordNet (Fellbaum, 1998) has been successfully reused in a
>> variety of applications. But this is the exception rather than the rule;
>> in fact, merging and enhancing existing resources is uncommon, probably
>> because it is by no means a trivial task given the profound differences
>> in formats, formalisms, metadata, and linguistic assumptions.
>> The language resource landscape is on the brink of a large change, however.
>> With the proliferation of accessible metadata catalogues and resource
>> repositories (such as the new META-SHARE
>> (http://www.meta-net.eu/meta-share) infrastructure), a potentially
>> large number of existing resources will be more easily located, accessed
>> and downloaded. Also, with the advent of distributed platforms for the
>> automatic production of language resources, such as PANACEA (
>> http://www.panacea-lr.eu/), new language resources and linguistic
>> information capable of being integrated into those resources will be
>> produced more easily and at a lower cost. Thus, it is likely that
>> researchers and application developers will seek out resources already
>> available before developing new, costly ones, and will require methods for
>> merging/combining various resources and adapting them to their specific
>> needs.
>> Up to the present day, most resource merging has been done manually, with
>> only a small number of attempts reported in the literature towards
>> (semi-)automatic merging of resources (Crouch & King 2005; Pustejovsky et
>> al. 2005; Molinero, Sagot and Nicolas 2009; Necsulescu et al. 2011). In
>> order to take a further step towards the scenario depicted above, in which
>> resource merging and enhancing is a reliable and accessible first step for
>> researchers and application developers, experience and best practices must
>> be shared and discussed, as this will help the whole community avoid any
>> waste of time and resources.
>>
>> AIMS OF THE WORKSHOP
>> This half-day workshop is meant to be part of a series of meetings
>> constituting an ongoing forum for sharing and evaluating the results of
>> different methods and systems for the automatic production of language
>> resources (the first one was the LREC 2010 Workshop on Methods for the
>> Automatic Production of Language Resources and their Evaluation Methods).
>> The main focus of this workshop is on (semi-)automatic means of merging
>> language resources, such as lexicons, corpora and grammars. Merging makes
>> it possible to re-use, adapt, and enhance existing resources, alongside
>> new, automatically created ones, with the goal of reducing the manual
>> intervention required in language resource production, and thus ultimately
>> production costs.
>>
>> WORKSHOP TOPICS
>> The topics of the workshop are related to best practices, methods,
>> techniques and experimental results regarding the merging of various types
>> of language resources, such as lexicons and corpora, especially in support
>> of language technology applications. In particular, new methods for
>> automatic merging with a view towards reducing human intervention will be
>> most welcome.
>> Topics for submission include, but are not limited to:
>> - Experiments on (semi-)automatic merging of automatically produced
>>   resources
>> - Experiments on the merging of two or more existing resources containing
>>   the same or different levels of linguistic information
>> - Studies or experiments on merging resources at different levels of
>>   granularity (corpora, lexicons, grammars)
>> - Studies or experiments on unifying, mapping or converting encoding formats
>> - Comparison between different resources and mapping algorithms to provide
>>   desired merging
>> - Use of linguistic information from different sources in high-level
>>   language applications
>> - Use of new, merged language resources in language technology applications
>>
>> WORKSHOP WEBSITE:
>> http://panacea-lr.eu/en/news/project/2011/12/19/lrec-2012-merging-lr-workshop/
>>
>> SUBMISSIONS
>> Interested participants must submit a preliminary paper of about 4-6 pages
>> including references (between 2000-2500 words). For the submission please
>> use the online form on START LREC Conference Manager at:
>> https://www.softconf.com/lrec2012/MergingLR2012/
>> When submitting a paper from the START page, authors will be asked to
>> provide essential information about resources (in a broad sense, i.e. also
>> technologies, standards, evaluation kits, etc.) that have been used for
>> the work described in the paper or are a new result of your research.
>> For further information on this new initiative, please refer to
>> http://www.lrec-conf.org/lrec2012/?LRE-Map-2012
>> Papers will be peer-reviewed by the workshop Program Committee.
>>
>> IMPORTANT DATES
>> Deadline for paper submission: 22 February 2012 (23:59 CET +1) **EXTENDED**
>> Notification of acceptance: 15 March 2012
>> Submission of camera-ready version of papers: 31 March 2012
>> Workshop date: 22 May 2012 – Afternoon Session
>>
>> ORGANIZING COMMITTEE
>> Núria Bel, UPF, Barcelona, Spain
>> Maria Gavrilidou, ILSP-“Athena”, Athens, Greece
>> Monica Monachini, CNR-ILC, Pisa, Italy
>> Valeria Quochi, CNR-ILC, Pisa, Italy
>> Laura Rimell, University of Cambridge, UK
>>
>> Contacts
>> lrec12_workshop_merging@ilc.cnr.it
>>
>> PROGRAMME COMMITTEE:
>> Victoria Arranz, ELDA, Paris, France
>> Paul Buitelaar, National University of Ireland, Galway, Ireland
>> Nicoletta Calzolari, CNR-ILC, Pisa, Italy
>> Olivier Hamon, ELDA, Paris, France
>> Aleš Horák, Masaryk University, Brno, Czech Republic
>> Nancy Ide, Vassar College, Mass. USA
>> Bernardo Magnini, FBK, Trento, Italy
>> Paola Monachesi, Utrecht University, Utrecht, The Netherlands
>> Jan Odijk, Utrecht University, Utrecht, The Netherlands
>> Muntsa Padró, IULA, Barcelona, Spain
>> Karel Pala, Masaryk University, Brno, Czech Republic
>> Thierry Poibeau, University of Cambridge, UK and CNRS, Paris, France
>> Benoît Sagot, INRIA, Paris, France
>> Kiril Simov, Bulgarian Academy of Sciences, Sofia, Bulgaria
>> Claudia Soria, CNR-ILC, Pisa, Italy
>> Maurizio Tesconi, CNR-IIT, Pisa
>>
>>
>>
>>
>>
>
> --
> Dr. Katrin Tomanek
> Averbis GmbH
> Tennenbacher Strasse 11
> D-79106 Freiburg
>
> Fon: +49 (0) 761 - 203 97696
> Fax: +49 (0) 761 - 203 97694
> E-Mail: [email protected]
>
> Geschäftsführer: Dr. med. Philipp Daumke, Dr. Kornél Markó
> Sitz der Gesellschaft: Freiburg i. Br.
> AG Freiburg i. Br., HRB 701080
>



-- 
Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
