On Apr 29, 2015, at 9:19 PM, Dimitri Maziuk wrote:
> There is a difference between ACM members writing network protocols and
> "domain" people writing junk.

I think that you are saying that the MDL connection table
file formats are junk. I do not disagree. But it's something
we have to deal with so my personal views matter little.

The MDL file formats are definitely not network protocols,
but as you brought up Postel's Robustness Principle I
thought you were suggesting that the principle applies
more broadly than just network protocols.

And for what it's worth, I used to be an ACM member.


>> Yes, I agree with this. What constitutes "forbidden"?
> 
> Simply put, the ones that lexer will match as "not values".

Certainly. My question is, what are the lexer rules? This
is not so simple.

Do they allow NUL? Do they allow "$$$$"? Is the goal to
handle the SD format exactly as specified? Or to be useful
for preventing likely interoperability problems?

>> If there is an error, does the writer generate a partial record,
> 
> My interpretation of "conservative" is wipe out the file then crash and
> burn. With a useful error message.

If the output is to a stream then there is no file to wipe.

If the downstream pipe consumer only processes the connection table,
and does so as soon as it reads the "M  END", then a partial record
from upstream may be enough for the downstream code, which expects to
receive valid data, to emit output for the incomplete record.

Thus, the only way to get what you want is to validate all of
the fields before emitting any data. However, that does require
more performance overhead and is more complex to write. There
is also the "worse is better"/"New Jersey style" principle.
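
A minimal sketch of "validate everything, then emit", in plain Python
and not tied to any toolkit's API. The names SDDataError,
check_data_item and write_data_items are hypothetical, and the rules
(no NUL, no blank line, no line starting with "$$$$", at most 200
characters per line) are one assumed reading of the spec, not a
definitive one:

```python
import io


class SDDataError(ValueError):
    """Raised when a data item cannot be written safely (hypothetical name)."""


def check_data_item(tag, value):
    # Assumed rules only; the documentation is ambiguous on several of these.
    if any(c in tag for c in "<>\n\r\0"):
        raise SDDataError(f"bad character in tag {tag!r}")
    if "\0" in value or "\r" in value:
        raise SDDataError(f"bad character in value of {tag!r}")
    lines = value.split("\n") if value else []
    for line in lines:
        if line == "":
            raise SDDataError(f"blank line would end {tag!r} early")
        if line.startswith("$$$$"):
            raise SDDataError(f"record separator inside {tag!r}")
        if len(line) > 200:
            raise SDDataError(f"line longer than 200 characters in {tag!r}")


def write_data_items(out, items):
    # First pass: validate every item, so a failure leaves `out` untouched.
    for tag, value in items:
        check_data_item(tag, value)
    # Second pass: nothing can fail validation now, so emit the items.
    for tag, value in items:
        out.write(f">  <{tag}>\n{value}\n\n")


out = io.StringIO()
write_data_items(out, [("NAME", "aspirin"), ("MW", "180.16")])
```

The two passes are where the extra overhead and complexity come from:
every field is examined once before any byte is written, which is the
only way a stream consumer never sees a partial record.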


> If you define your lexical tokens properly, no problem. The problem is
> when lexer can't decide what's what.

Well, yes. A well-defined grammar is one of the recommendations
for the "patched" version of the Robustness Principle.

The problem is two-fold:
  1) there is no unambiguous language definition for the
      SDF grammar (I've tried!), and
  2) the documentation contains ambiguities on how to handle
      certain circumstances.

For example, 1) can the 'S SKP' field be used to skip the 'M  END'?
Different Symyx tools give different answers. 2) Are the numeric
fields all right-aligned? There was a problem where RDKit expected
one alignment and another tool generated the other. RDKit now
expects either.
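
To make the alignment point concrete, here is a small sketch of a
reader that accepts either alignment in a fixed-width count field. It
is illustrative only and is not RDKit's actual code; parse_count_field
is a hypothetical name, and the 3-column width is an assumption based
on the V2000 counts line:

```python
def parse_count_field(line, start, width=3):
    # Slice out the fixed-width field, then ignore padding on either side,
    # so both "  6" (right-aligned) and "6  " (left-aligned) are accepted.
    text = line[start:start + width].strip()
    if not text.isdigit():
        raise ValueError(f"not a count field: {line[start:start + width]!r}")
    return int(text)


right_aligned = "  6  5"
left_aligned = "6  5  "
atoms = parse_count_field(right_aligned, 0), parse_count_field(left_aligned, 0)
bonds = parse_count_field(right_aligned, 3), parse_count_field(left_aligned, 3)
```

Both input lines give 6 atoms and 5 bonds. A strict reader would have
to pick one alignment and reject the other.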

Or, the spec says of the title line:

   This line is unformatted, but like all other lines in a molfile
   cannot extend beyond column 80

while as you saw earlier, it also says:

   A [Data] value can extend over multiple lines containing
   up to 200 characters each.

Which is normative?

I go back to the question of, what is the goal? Is it to
prevent RDKit from being used to create ill-formatted SD files?
If so, then there are many things to review. For example, the
spec says:

     This line must not contain any of the reserved tags that
     identify any of the other CTAB file types such as $MDL
     (RGfile), $$$$ (SDfile record separator), $RXN (rxnfile),
     or $RDFILE (RDfile headers).

While RDKit allows arbitrary names in the title. (And I'm not
even sure if the spec allows "$$$$12345" or "$MDL3" or not.)
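
A sketch of what such a title check might look like, showing how the
answer depends on the reading of "contain". The function name
title_problems and the strict flag are hypothetical. strict=True
treats any occurrence of a reserved tag as an error, while
strict=False only rejects a title that is exactly a reserved tag. Which
one the spec intends is the open question:

```python
RESERVED_TAGS = ("$MDL", "$$$$", "$RXN", "$RDFILE")


def title_problems(title, strict=True):
    """Return a list of problems found in a molfile title line."""
    problems = []
    if "\n" in title or "\r" in title:
        problems.append("title contains a newline")
    if len(title) > 80:
        problems.append("title extends beyond column 80")
    for tag in RESERVED_TAGS:
        if strict:
            hit = tag in title           # "contains" read literally
        else:
            hit = title.strip() == tag   # only the bare tag is rejected
        if hit:
            problems.append(f"title contains reserved tag {tag}")
    return problems


strict_result = title_problems("$$$$12345", strict=True)
lenient_result = title_problems("$$$$12345", strict=False)
```

Under the strict reading "$$$$12345" is rejected, and under the
lenient one it is accepted.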

Your points are all valid, but I don't see how they are applicable
given the circumstances.

What RDKit, Open Babel, OEChem, and others do is to follow
the New Jersey style, and place a higher burden on API users,
instead of spending rather a lot of development time to
implement some complex and rarely needed validation logic,
for a format that wasn't designed as an exchange file format
and doesn't contain the mechanisms needed to be able to
follow the Robustness Principle.

I'm not convinced that they were wrong to do so.

Cheers,


                                Andrew
                                da...@dalkescientific.com

P.S.
  "XML in this example ... is written by a ball street wanker."

This slur is both gratuitous and wrong. The example XML
was written by Tim Bray, who is not a Wall Street Banker,
and the second example concerns EFTPOS messages. I do not
wish to participate in discussions with remarks of this sort.

