This sounds great and shouldn't be too difficult. We have parsed a number of pages for chemical content, and I know that Henry has too. So we could put together a collaborative list of sites that we can scrape for chemistry. The results would all be in CML with appropriate metadata. This would solve most of the remaining technical problems, leaving only the legal ones.

P.


At 10:34 02/02/2004 +0100, E.L. Willighagen wrote:

On Monday 02 February 2004 10:17, Peter Murray-Rust wrote:
> At 22:23 01/02/2004 +0100, Egon Willighagen wrote:
> >dadml://pdb/?1CRN
> >
> >or dadml://any/pdbid?1CRN
> >
> >The second will try any mirror that can return information based on the
> >pdbid...
>
> Presumably someone enters these mirrors and keeps their addresses and
> templates up to date.

Yes. The nice thing about the DADML system is that the maintenance can be
done by website developers, much like the real domain name server system...

> Is there a cascade - if mirror 1 fails does mirror2
> get called? And what is returned - the actual file?
>
> If so we have something like:
>
> User -> PDBCode -> server
> server -> munged URL (format1)-> mirror1 -> success/error
> success -> PDB file -> user
> failure
> server -> munged URL (format2)-> mirror2 -> success/error
> and so on
>
> is this the model?

Yes, more or less. An HTTP 404 is easily detected, but the system can also
detect things like a returned webpage which states that no information is
available...
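
A minimal sketch of the cascade discussed above, assuming each mirror is described by a URL template with an `{id}` placeholder. The function names, the placeholder syntax, and the "not found" phrases are all hypothetical and only illustrate the idea; this is not how DADML itself stores or queries its mirrors. The `fetch` callable is passed in so the logic can be tried without network access.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical phrases a mirror might print instead of returning a 404.
NOT_FOUND_MARKERS = ("no information available", "entry not found")


def looks_empty(body: str) -> bool:
    """Detect a 'soft' failure: a page that only says nothing was found."""
    lowered = body.lower()
    return any(marker in lowered for marker in NOT_FOUND_MARKERS)


def resolve(identifier: str,
            templates: Iterable[str],
            fetch: Callable[[str], Tuple[int, str]]) -> Optional[str]:
    """Try each mirror template in order; return the first usable body."""
    for template in templates:
        url = template.replace("{id}", identifier)
        try:
            status, body = fetch(url)
        except OSError:
            continue  # mirror unreachable, fall through to the next one
        if status == 200 and not looks_empty(body):
            return body
    return None  # every mirror failed
```

A mirror counts as failed if it cannot be reached, answers with a non-200 status, or returns a page that merely reports that nothing is available; only then is the next template tried.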

> >The DADML system also supports retrieving information in other formats, not
> >just chemical/x-pdb or chemical/x-cml, but also text/html etc.
> >I'm not sure if we want to be able to do that sort of thing too, so for
> > now it only supports reading chemical formats...
>
> The attraction of chemical/x-* is that the information contained within
> each is (relatively?!) consistent and structured. For an arbitrary web site
> producing HTML the structure could be anything and a separate parser has to
> be written for each. (For example we have written parsers for 2 of the main
> sites offering small molecule information and they are obviously completely
> different.)

DADML does not deal with interpretation of the returned format... the
cdk.internet.dadml.DADMLReader does a bit... it can read molecules from
chemical/x-mdl-mol and chemical/x-cml and others... actually, it completely
disregards the MIME system, and just uses the cdk.io.ReaderFactory and looks
at the contents of the stream...
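
To illustrate what "looking at the contents of the stream" can mean, here is a rough sketch of content-based format detection. It is not the logic of cdk.io.ReaderFactory; the function name and the heuristics (an XML document with a `molecule` or `cml` element, a V2000/V3000 stamp on the fourth line of a molfile, well-known PDB record names) are assumptions chosen for the example.

```python
def sniff_format(text: str) -> str:
    """Guess a chemical format from the content alone, ignoring the MIME type."""
    stripped = text.lstrip()
    if stripped.startswith("<"):
        # XML-like: call it CML only if a molecule or cml element shows up.
        head = stripped[:2000].lower()
        if "<molecule" in head or "<cml" in head:
            return "cml"
        return "unknown"
    lines = text.splitlines()
    # MDL molfile: three header lines, then a counts line with a version stamp.
    if len(lines) >= 4 and lines[3].rstrip().endswith(("V2000", "V3000")):
        return "mdl-mol"
    # PDB: fixed-column records that start with well-known record names.
    pdb_records = ("HEADER", "ATOM  ", "HETATM", "CRYST1")
    if any(line.startswith(pdb_records) for line in lines):
        return "pdb"
    return "unknown"
```

An HTML page falls through to "unknown" here, which matches the point that arbitrary text/html needs a site-specific parser rather than a generic reader.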

> Moreover the structure of the pages changes regularly. For
> example the *text/html* on the RCSB site will be completely different from
> that on the EBI site even though the actual PDB file is presumably the same
> or closely related. It is the consistency of chemical/x-* that makes it
> useful for machines to parse.

So far DADML has only been used to read clear chemical formats and to display
HTML as is, without any interpretation step... It would be very nice to
have a web service at WWMM that accepts a URL or DADML URI
(dadml://nist-html/cas/50-00-0) and converts the HTML into a CML stream...


Something like: dadml://wwmm-nist-bridge/cas/50-00-0
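
The URIs in this thread come in two shapes: dadml://pdb/?1CRN and dadml://any/pdbid?1CRN put the identifier after a `?`, while dadml://nist-html/cas/50-00-0 puts it in the path. A bridge service would first have to split such a URI into its parts. The sketch below does that with Python's standard `urllib.parse`; the field names `database`, `index` and `value` are labels chosen for this example, not terms from a DADML specification.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class DadmlRef(NamedTuple):
    database: str  # e.g. "pdb", "nist-html", or "any"
    index: str     # e.g. "pdbid" or "cas"; empty if not given
    value: str     # the identifier to look up


def parse_dadml(uri: str) -> DadmlRef:
    """Split a dadml:// URI into database, index type and identifier."""
    parts = urlsplit(uri)
    if parts.scheme != "dadml":
        raise ValueError(f"not a dadml URI: {uri!r}")
    path = parts.path.strip("/")
    if parts.query:
        # Shape 1: dadml://database/index?value (index may be empty)
        return DadmlRef(parts.netloc, path, parts.query)
    # Shape 2: dadml://database/index/value
    index, _, value = path.partition("/")
    return DadmlRef(parts.netloc, index, value)
```

With the three URIs from this thread, the results would be ("pdb", "", "1CRN"), ("any", "pdbid", "1CRN") and ("nist-html", "cas", "50-00-0").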

Egon

--
[EMAIL PROTECTED]
PhD on Molecular Representation in Chemometrics
Nijmegen University
http://www.cac.sci.kun.nl/people/egonw/
GPG: 1024D/D6336BA6


Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069



_______________________________________________
Jmol-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/jmol-developers
