Dear colleagues,

Do you want to automate the discovery of people, place names and events within a large corpus of unstructured documents or metadata (e.g. a description field)? Then you might want to use the Named-Entity Recognition (NER) extension for OpenRefine, developed by Multimedia Lab (ELIS, Ghent University / iMinds) and MaSTIC (Université Libre de Bruxelles).
On http://freeyourmetadata.org/named-entity-extraction/, you will find all the information you need to start experimenting with NER on your own. The extension was developed in the context of a research paper entitled "Named-Entity Recognition: A Gateway Drug for Cultural Heritage Collections to the Linked Data Cloud?". A preprint of this paper is available at http://freeyourmetadata.org/publications/named-entity-recognition.pdf. The paper also aims to foster a discussion within the Digital Library community regarding the quality of concepts described in knowledge bases (e.g. Freebase versus DBpedia) and the current struggle between schemes (e.g. schema.org versus the Open Graph protocol).

We will be presenting our work in North and Latin America in March (Boston), April (New York and Philadelphia), May (Quito) and June (New York and Montreal). If you are located in one of those cities or areas and are interested in collaborating or hosting a workshop on this topic, don't hesitate to get in touch.

Kind regards,

Seth van Hooland
Président du Master en Sciences et Technologies de l'Information et de la Communication (MaSTIC)
Université Libre de Bruxelles
Av. F.D. Roosevelt, 50 CP 123 | 1050 Bruxelles
http://homepages.ulb.ac.be/~svhoolan/
http://twitter.com/#!/sethvanhooland
http://mastic.ulb.ac.be
0032 2 650 4765
Office: DC11.102