Reply,
I would argue against a new file format for carrying SMILES for several
reasons:
* it will have implicit semantics (e.g. this line is a title because it
starts in a special way). This is unobvious to anyone who is unfamiliar with
the convention
* it will be confused with existing *.smi files
* there is no simple way of determining the type of the file. There is no
magic in magic numbers.
* the semantics of the comment are undefined. Is it a title? A processor
directive? A warning to humans? An id? It will inevitably become
overloaded.
* everyone has to write new parsers.

The world now expects XML and RDF, not new legacy formats. CML can hold all
the information you want. Let's assume the Blue-Obelisk defines a CML
convention for SMILES:
<cml convention="bo:smiles" xmlns:bo="http://www.blueobelisk.org"
     xmlns="http://www.xml-cml.org/schema">
  <formula id="mol1" inline="Cc1ccccc1NC(=O)"/>
  <formula id="mol42" title="benzene" inline="c1ccccc1"/>
</cml>

CML provides a rich set of tools for annotation (ids, labels, titles, names,
etc.) which are widely used and understood. There is no danger of not
locating the title or anything else. Any XML parser will easily process this
and it is no more verbose when compressed than the ASCII.
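
To illustrate the point about generic XML parsers, here is a minimal sketch
in Python using only the standard library. The bo:smiles convention is the
hypothetical one assumed above, and the function name read_smiles is made up
for this example; it is not part of any CML toolkit.

```python
import xml.etree.ElementTree as ET

# Namespace of the example document; bo:smiles itself is only an assumed convention.
CML_NS = "http://www.xml-cml.org/schema"

DOC = """<cml convention="bo:smiles" xmlns:bo="http://www.blueobelisk.org"
     xmlns="http://www.xml-cml.org/schema">
  <formula id="mol1" inline="Cc1ccccc1NC(=O)"/>
  <formula id="mol42" title="benzene" inline="c1ccccc1"/>
</cml>"""


def read_smiles(xml_text):
    """Return (id, title, smiles) for every formula element; title may be None."""
    root = ET.fromstring(xml_text)
    records = []
    for formula in root.iter("{%s}formula" % CML_NS):
        records.append(
            (formula.get("id"), formula.get("title"), formula.get("inline"))
        )
    return records


records = read_smiles(DOC)
for record in records:
    print(record)
```

The title is read from a named attribute, so there is no guessing about
which part of a line is the SMILES and which is annotation.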

The convention attribute asserts that the file adopts the community
microformats used by the Blue-Obelisk SMILES community and could, for
example, require a formula to have a title, or whatever.
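
A rule of that kind is easy to check mechanically. The following is only a
sketch in Python: the rule "every formula must have a title" is the
illustrative one suggested above, not an agreed Blue-Obelisk requirement, and
the name untitled_formula_ids is invented for the example.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def untitled_formula_ids(xml_text, convention="bo:smiles"):
    """List ids of formula elements lacking a title (hypothetical rule).

    Raises ValueError if the root element does not declare the expected
    convention, since the rule would then not apply.
    """
    root = ET.fromstring(xml_text)
    if root.get("convention") != convention:
        raise ValueError("document does not declare convention %r" % convention)
    return [
        formula.get("id")
        for formula in root.iter("{%s}formula" % CML_NS)
        if not formula.get("title")
    ]


EXAMPLE = """<cml convention="bo:smiles" xmlns:bo="http://www.blueobelisk.org"
     xmlns="http://www.xml-cml.org/schema">
  <formula id="mol1" inline="Cc1ccccc1NC(=O)"/>
  <formula id="mol42" title="benzene" inline="c1ccccc1"/>
</cml>"""

missing = untitled_formula_ids(EXAMPLE)
print(missing)
```

Applied to the example document, this would flag mol1, which has no title.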

Please let's move away from yet another molecular file format.

P.


On Fri, Jun 26, 2009 at 10:09 PM, Craig James <[email protected]> wrote:

> Answering a number of comments from both BlueObelisk and OpenBabel forums,
> regarding the proposal to formally define how to do comments to SMILES
> files.
>
> To summarize my current opinion based on recent feedback:
>
>  1. A "#" character (not ';' or space) as the first character on a line
>     is treated as a comment.
>
>  2. Users should be cautioned that this is a new standard, and many
>     parsers won't accept comments.  Parsers should accept them, but
>     SMILES writers should avoid them, until the standard is widely
>     accepted.
>
>  3. If comments are included, the first line of the file should be
>     a file-type identifier: '#\#SMILES_1.0'.
>
> Now to answer specific comments...
>
> Peter Murray-Rust wrote:
> > Please, Please don't use whitespace. It is so easy to lose or to
> > generate by mistake.
>
> Peter and several others pointed this out.  I agree, a space is a poor
> choice.
>
> Peter Murray-Rust wrote:
> > It's generally a good idea NOT to use a character out of the language
> > syntax for a comment. both hash and / are SMILES characters. There are a
> > few others which I think are unused.
>
> Actually, no.  If you include reaction SMILES and SMARTS (which should also
> use the same comment syntax), then the only unused character seems to be
> '|', the vertical-bar or "pipe" character.  That seems like a poor choice
> for comments because of its importance in Unix/Linux shell programming.
>
> It seems to me that any parser with half a brain should be able to figure
> this out.  It's not much of a trick to distinguish '#' at the start of a
> line from a legitimate triple-bond symbol.
>
> Greg Landrum wrote:
> > My two cents:
> > I'd really like to see a distinction between SMILES -- a
> > non-whitespace containing piece of text describing a molecule -- and a
> > SMILES file -- which is, I guess, a bunch of SMILES, possibly with
> > additional data, combined into one file.
>
> Actually the OpenSMILES specification does distinguish the two.  See
> "SMILES Files":
>
>    http://opensmiles.org/spec/open-smiles-4-output.html#4.5
>
> Greg Landrum wrote:
> > If the goal is to get multiple molecules, with extra information, into
> > one file I'd rather see an OpenTDT standard... TDT is an
> > under-utilized (outside of Daylight) format that is quite useful.
>
> I like TDTs too, but I think they've been outdated by XML.  XML is more
> verbose, but there's lots of great libraries to create and parse XML, and
> databases support it directly.  TDTs were ahead of their time.
>
> Daniel Leidert wrote:
> > ... If you formally allow comments in
> > (Open)SMILES files, then please add a requirement to start the file with
> > a comment line containing a file-type identifier (like e.g. CIF 1.1
> > does) or we have just another format, which may start with a hundred or
> > thousand lines long comment...
>
> Good idea, see #3 in the summary at the top.
>
> Andrew Dalke wrote:
> > In general though, this proposal is incompatible with most existing
> > SMILES parsers.
>
> Another good observation, see #2 in the summary at the top.
>
> Thanks to everyone, and I invite further comments.
>
> Craig
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Blueobelisk-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/blueobelisk-discuss
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069