Re: [ccp4bb] Code to handle the syntax of (mm)CIF data correctly.

Peter Keller Thu, 19 Sep 2013 08:38:43 -0700

Hi Marcin,

On Thu, 2013-09-19 at 14:51 +0100, Marcin Wojdyr wrote:
> Hi Peter,
> 
> On Thu, Sep 19, 2013 at 10:28:22AM +0100, Peter Keller wrote:
> 
> > > http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#bnf
> ...
> > 
> > This grammar seems to be based on the 1994 J. Chem. Inf. Comp. Sci one,
> > which has some serious errors. I would strongly discourage anyone from
> > trying to translate it into input for any kind of parser generator. I
> > suggest that you use International Tables vol. G instead (chapter 2.1 or
> > section 2.2.7). It is unfortunate that the later, correct, grammar is
> 
> I don't have these tables,


Are you sure? I would be surprised if you didn't have them available
through your library, either as hard copies, or through an on-line
subscription at the DOI links I gave in my article. International Tables
are pretty fundamental to CCP4's domain of MX, as well as several
others, after all. Perhaps you could have a word with the library staff?

> but could you be more specific what's incorrect
> in the version from the IUCR website?

This is ancient (mid-to-late 1990s) history for me: I would need to
track down some old e-mail correspondence and hunt through it, and I
don't have the time at the moment. I do remember a problem with the way
that quoted strings were defined, but that, and the other errors that I
spotted then, may have been fixed. However, giving it a quick look, I can
see for example the following problem:

<LoopBody> : <Value> { <WhiteSpace> <Value> }* 

For this to work, the '*' must be a "greedy" quantifier, i.e. match
every { <WhiteSpace> <Value> } until it hits something that is not
{ <WhiteSpace> <Value> }. In this production though:

<SingleQuotedString> <WhiteSpace>: <single_quote> {<AnyPrintChar>}* 
<single_quote> <WhiteSpace>

the '*' has to be a "lazy" quantifier, i.e. match <AnyPrintChar> only as
far as the next <single_quote> <WhiteSpace> . Bear in mind that
<AnyPrintChar> includes both <single_quote> and two of the characters
that are also included in <WhiteSpace>.
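
To make the greedy/lazy distinction concrete for the <LoopBody>
production, here is a minimal sketch using Python regular expressions.
The patterns are stand-ins of my own and not part of the CIF grammar:
<Value> is approximated by \S+ and <WhiteSpace> by \s+, and the sample
loop body is made up.

```python
import re

# Assumed stand-ins: \S+ for <Value>, \s+ for <WhiteSpace>.
# The only difference between the two patterns is how '*' is read.
GREEDY_LOOP_BODY = re.compile(r"\S+(?:\s+\S+)*")   # '*' as a greedy quantifier
LAZY_LOOP_BODY = re.compile(r"\S+(?:\s+\S+)*?")    # '*' as a lazy quantifier

body = "1.0 2.0 3.0"  # hypothetical loop body with three values

greedy = GREEDY_LOOP_BODY.match(body).group(0)
lazy = LAZY_LOOP_BODY.match(body).group(0)

print(repr(greedy))  # the whole loop body
print(repr(lazy))    # only the first value
```

With the greedy reading the whole body is consumed; with the lazy
reading the match stops after the first value, because nothing in the
production forces it to go further.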

Differences like this can be expressed in human-written code as long as
the coder is aware of them. A grammar that is intended to be used to
match data or generate a parser requires a more rigorous definition. Any
parser/lexer that uses a greedy quantifier for * would match a line of
data like this:

   val1    'val "2"'    "val '3'"    'val '4''    val5

as just three tokens:

  val1
  'val "2"'    "val '3'"    'val '4''
  val5

rather than as five tokens. OTOH, using a lazy quantifier for * would
only match the first data value in a loop, and then throw a syntax error
for every loop body (except the trivial case which genuinely has only
one data name in the header and data value in the body).
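
The effect on the data line above can be reproduced with a small
sketch in Python. This is only an illustration under assumptions of my
own: the helper make_lexer is hypothetical, '.' stands in for
<AnyPrintChar>, a lookahead for whitespace or end of line stands in for
the trailing <WhiteSpace>, and \S+ stands in for an unquoted value.

```python
import re

LINE = r"""val1    'val "2"'    "val '3'"    'val '4''    val5"""


def make_lexer(star):
    """Build a toy lexer; star is '*' (greedy) or '*?' (lazy)."""
    # A quoted string ends at a matching quote followed by whitespace
    # or by the end of the line.
    single_quoted = "'." + star + r"'(?=\s|$)"
    double_quoted = '".' + star + r'"(?=\s|$)'
    unquoted = r"\S+"
    return re.compile(single_quoted + "|" + double_quoted + "|" + unquoted)


greedy_tokens = make_lexer("*").findall(LINE)
lazy_tokens = make_lexer("*?").findall(LINE)

print(len(greedy_tokens))
for token in greedy_tokens:
    print("  " + token)

print(len(lazy_tokens))
for token in lazy_tokens:
    print("  " + token)
```

The greedy lexer runs from the first single quote to the last one that
is followed by whitespace, giving three tokens; the lazy one stops at
the first closing quote followed by whitespace and gives five.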

> 
> I just googled cif lexers and the two ones I looked into also refer
> to the same URL that I used:
> cctbx: http://cci.lbl.gov/cctbx_sources/ucif/cif.g
> JMol: 
> http://caagt.ugent.be/CaGe/jmol/org/jmol/adapter/smarter/CifReader.RidiculousFileFormatTokenizer.html
> 
> If there are discrepancies between IUCR website and IT vol.G, it would
> be worth listing them.

It is not a matter of discrepancies: they are rather different, and if
you are active in this area, you really need to see the IT ones as well.

Regards,
Peter.

-- 
Peter Keller                                     Tel.: +44 (0)1223 353033
Global Phasing Ltd.,                             Fax.: +44 (0)1223 366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom
