Re: [BlueObelisk-discuss] Workshop Logistics

2010-12-09 Thread Sophia Ananiadou
Dear all,

Following from Peter's email, it may be worth mentioning our joint project 
CheTA (Chemistry using Text Annotations, http://www.nactem.ac.uk/cheta/).

A demo can be found here: http://www.nactem.ac.uk/software/cheta/
Workflows can be created using U-Compare (http://u-compare.org/), which
contains a repository of NLP-based text mining components
(http://u-compare.org/components/index.html).

I hope this is of interest to some of you.

Best wishes,

Sophia



Re: [BlueObelisk-discuss] Workshop Logistics

2010-12-08 Thread Peter Murray-Rust
I'd like to present two of our projects, Quixote
(http://quixote.wikispot.org/Front_Page) and GreenChainReaction
(http://scienceonlinelondon.wikidot.com/topics:green-chain-reaction), both
of which are aimed at creating semantically enriched data objects in
physical science. (I think there are important and valuable technical
differences between how physical scientists think about data and semantics
and how - say - bio/medical scientists do).

Both are bottom-up projects in that they involve web-based contributors
without an overarching coordinating body. They are open science (all the
work is completely available on the Net as soon as it is published). They
also build their semantics bottom-up - i.e. look to see what discourse
is used in the domain and try to formalize this. There are probably about 30
people involved (and there will be more by January 17th) so it doesn't make
sense to give an author list - but the projects themselves will of course
list contributors.

These projects are disruptive technology in the same sense that Wikipedia or
Wikileaks are disruptive. (Clay Shirky was lamenting on UK TV 2010-12-07
that the reaction to WL was via extra-legal methods). I don't want to
re-enter my polemics but it is factually correct that the established
organizations in physical science (most publishers, most learned socs, some
univs, some funders) are indifferent or antagonistic. If BTPDF ignores this
then its results can only be cosmetic. I believe that it's factually true to
say that text-mining is currently crippled by the lack of access to freely
available and Open scientific content, and that this must be redressed. I
have tried to
engage with 3-4 major (closed) publishers of chemistry over 5 years and the
only thing I have achieved is a small corpus for testing purposes under CC-NC
from one. One hasn't bothered to reply. Therefore chemistry will either
remain a semantic desert or there will be a bottom-up revolution.

So far I seem to be the only one addressing item 4 (IPR).

On the more positive side we will succeed in our bottom-up projects to
create semantics and ontologies for chemical objects and discourse. In
GreenChainReaction we analysed ca 10,000 patents from the EPO and carried
out semantically based text mining at a medium depth level (i.e. entity
recognition, phrase recognition and default tree-banking). This showed that
a deeper level of NLP gives much better precision than textual entity
recognition alone (which is often too imprecise to be useful). We shall be
re-running this exercise and presenting the results at BTPDF, where we shall
be using USPTO patents to create about 200,000-500,000 reactions in complete
semantic form. This will - we believe - have advantages over the current
commercial extraction of chemistry into reaction databases. Unfortunately
publishers forbid us to apply the technology to research articles and
publish the results. So GCR builds up a resource of all objects published in
chemical reactions and this should allow us to create a complete discourse
ontology of reactions. (BTW anyone interested in text-mining will be welcome
to take part).
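
To make the precision point concrete, here is a toy sketch in Python. It is
not the GreenChainReaction code: the sentence, the name list and both function
names are invented for illustration, and a regular expression stands in for
real phrase recognition. It only shows why requiring phrase-level context (a
name followed by a quantity) rejects hits that a bare dictionary lookup
accepts.

```python
import re

# Invented example sentence; "Lead author" and "gold standard" are not chemicals.
TEXT = ("Lead author J. Smith reports that lead (5 g) was added to "
        "acetone (20 ml); the gold standard procedure was followed.")

# Hypothetical dictionary of chemical names.
NAMES = ["lead", "acetone", "gold"]


def bare_entity_matches(text):
    """Dictionary lookup: every occurrence of a known name counts as a hit."""
    pattern = re.compile(r"\b(" + "|".join(NAMES) + r")\b", re.IGNORECASE)
    return [m.group(1).lower() for m in pattern.finditer(text)]


def phrase_matches(text):
    """Accept a name only when a quantity phrase follows, e.g. 'lead (5 g)'."""
    pattern = re.compile(
        r"\b(" + "|".join(NAMES) + r")\s*\((\d+(?:\.\d+)?)\s*(g|mg|ml|mmol)\)",
        re.IGNORECASE,
    )
    return [(m.group(1).lower(), float(m.group(2)), m.group(3))
            for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(bare_entity_matches(TEXT))  # includes the two false positives
    print(phrase_matches(TEXT))       # only names used with an amount
```

On this sentence the bare lookup returns four hits, two of them spurious,
while the phrase-level pattern returns only the two substances that were
actually used with an amount.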

GCR is an after-the-fact markup although the technology could - in principle
- be used in the authoring process. It's a question of communal will, not
technology.

Quixote represents semantics-at-source and marks up the output of
computational chemistry calculations. It's common to publish articles
which just describe calculations, though it's also common to find them as
support for experimental work. Almost invariably the detailed results are
never published though it's trivial to do so and the space is not a problem.


The reason for this problem is purely cultural and commercial. Most
calculations are carried out by closed source for-money programs and there
is an implicit policy of non-interoperability at the syntactic, semantic and
ontological levels. The companies compete at least partially through lock-in
and inertia, which means there is no incentive to create an ontology.

Quixote believes that there *is* an underlying stable ontology and that by
using the common programs, and exposing their results in semantic form
(Chemical Markup Language) we will be able to create a core ontological
abstraction. This is not as ambitious as it seems - the equations and
fundamental physics are universal and stable for about 80 years or more. By
creating this ontology it will be possible to add annotation at the time
data are emitted from the calculation. It means that all calculations (we
guess about 100 million per year or more) will be available to the whole
community as Open data. And again anyone can join in.
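
As a rough illustration of what "exposing results in semantic form" could
look like, here is a minimal Python sketch that wraps a calculation result in
a CML-style XML fragment. It is not the Quixote converter. The function name,
the "example:" dictionary prefixes and the numbers in the usage lines are
placeholders invented here; the element and attribute names (molecule,
atomArray, atom, property, scalar, dictRef) follow CML as I understand it and
should be checked against the schema.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def calculation_to_cml(mol_id, atoms, energy_hartree):
    """Return a minimal CML-style string for one calculated molecule.

    atoms is a list of (element_symbol, x, y, z) tuples.
    """
    ET.register_namespace("", CML_NS)  # serialize CML as the default namespace
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": mol_id})

    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for i, (element, x, y, z) in enumerate(atoms, start=1):
        ET.SubElement(atom_array, f"{{{CML_NS}}}atom", {
            "id": f"a{i}",
            "elementType": element,
            "x3": f"{x:.4f}",
            "y3": f"{y:.4f}",
            "z3": f"{z:.4f}",
        })

    # dictRef ties the number to a dictionary entry; "example:" is a
    # placeholder prefix, not a real Quixote dictionary.
    prop = ET.SubElement(molecule, f"{{{CML_NS}}}property",
                         {"dictRef": "example:totalEnergy"})
    scalar = ET.SubElement(prop, f"{{{CML_NS}}}scalar",
                           {"dataType": "xsd:double",
                            "units": "example:hartree"})
    scalar.text = repr(energy_hartree)

    return ET.tostring(molecule, encoding="unicode")


if __name__ == "__main__":
    # Placeholder geometry and energy, for illustration only.
    water = [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)]
    print(calculation_to_cml("water", water, -76.0))
```

The point of the dictRef attribute is that the annotation travels with the
number when it is emitted, so a reader of the file does not have to guess
what quantity or unit it represents.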

These projects tick boxes 1.1, 1.2, 2.2, 2.3 and 2.4. They also show in great
detail two enthusiastic communities working on Use Cases (box 3).

Please let me know if this needs editing and if not add it to the workshop
papers under

Bottom-up semantics and ontologies

Peter Murray-Rust,
members of The Quixote Project
members of The GreenChainReaction



-- 
Peter