Re: [BlueObelisk-discuss] Workshop Logistics
Dear all,

Following on from Peter's email, it may be worth mentioning our joint project CheTA (Chemistry using Text Annotations, http://www.nactem.ac.uk/cheta/). A demo can be found here: http://www.nactem.ac.uk/software/cheta/

Workflows can be created using U-Compare (http://u-compare.org/), which contains a repository of NLP-based text mining technology: http://u-compare.org/components/index.html

I hope this is of interest to some of you.

Best wishes,
Sophia

On 8 Dec 2010, at 10:34, Peter Murray-Rust wrote:
Re: [BlueObelisk-discuss] Workshop Logistics
I'd like to present two of our projects, Quixote (http://quixote.wikispot.org/Front_Page) and GreenChainReaction (http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction), both of which are aimed at creating semantically enriched data objects in physical science. (I think there are important and valuable technical differences between how physical scientists think about data and semantics and how - say - bio/medical scientists do.)

Both are bottom-up projects in that they involve web-based contributors without an overarching coordinating body. They are open science (all the work is completely available on the Net as soon as it is published). They also build their semantics bottom-up - i.e. they look to see what discourse is used in the domain and try to formalize it. There are probably about 30 people involved (and there will be more by January 17th), so it doesn't make sense to give an author list - but the projects themselves will of course list contributors.

These projects are disruptive technology in the same sense that Wikipedia or Wikileaks are disruptive. (Clay Shirky was lamenting on UK TV on 2010-12-07 that the reaction to WL was via extra-legal methods.) I don't want to re-enter my polemics, but it is factually correct that the established organizations in physical science (most publishers, most learned societies, some universities, some funders) are indifferent or antagonistic. If BTPDF ignores this then its results can only be cosmetic.

I believe it is factually true to say that text-mining is currently crippled by the lack of access to freely available and Open scientific content, and that this must be redressed. I have tried to engage with 3-4 major (closed) publishers of chemistry over 5 years and the only thing I have achieved is a small corpus for testing purposes under CC-NC from one. One hasn't bothered to reply. Therefore chemistry will either remain a semantic desert or there will be a bottom-up revolution. So far I seem to be the only one addressing item 4 (IPR).
On the more positive side, we will succeed in our bottom-up projects to create semantics and ontologies for chemical objects and discourse.

In GreenChainReaction (GCR) we analysed ca. 10,000 patents from the EPO and carried out semantically based text mining at a medium depth level (i.e. entity recognition, phrase recognition and default tree-banking). This showed that a deeper level of NLP gives much better precision than textual entity recognition alone (which is often too imprecise to be useful). We shall be re-running this exercise using USPTO patents to create about 200,000-500,000 reactions in complete semantic form, and shall present the results at BTPDF. This will - we believe - have advantages over the current commercial extraction of chemistry into reaction databases. Unfortunately, publishers forbid us to apply the technology to research articles and publish the results. So GCR builds up a resource of all objects published in chemical reactions, and this should allow us to create a complete discourse ontology of reactions. (BTW anyone interested in text-mining will be welcome to take part.) GCR is an after-the-fact markup, although the technology could - in principle - be used in the authoring process. It's a question of communal will, not technology.

Quixote represents semantics-at-source and marks up the output of computational chemistry calculations. It's common to publish articles which just describe calculations, though it's also common to find them as support for experimental work. Almost invariably the detailed results are never published, though it's trivial to do so and space is not a problem. The reason for this is purely cultural and commercial. Most calculations are carried out by closed-source, for-money programs and there is an implicit policy of non-interoperability at the syntactic, semantic and ontological levels. The companies compete at least partially through lock-in and inertia, which means there is no incentive to create an ontology.
Quixote believes that there *is* an underlying stable ontology and that by using the common programs and exposing their results in semantic form (Chemical Markup Language) we will be able to create a core ontological abstraction. This is not as ambitious as it seems - the equations and fundamental physics are universal and have been stable for about 80 years or more. By creating this ontology it will be possible to add annotation at the time data are emitted from the calculation. It means that all calculations (we guess about 100 million per year or more) will be available to the whole community as Open data. And again, anyone can join in.

These projects tick boxes 1.1, 1.2, 2.2, 2.3 and 2.4. They also show in great detail two enthusiastic communities working on Use Cases (box 3).

Please let me know if this needs editing and, if not, add it to the workshop papers under "Bottom-up semantics and ontologies".

Peter Murray-Rust, members of The Quixote Project, members of The GreenChainReaction

--
Peter