Re: [ccp4bb] Code to handle the syntax of (mm)CIF data correctly.

yayahjb Wed, 18 Sep 2013 07:59:55 -0700

Dear Colleagues,

  There are two major issues that tend to trip up CIF programmers:

1. Dealing with the order independence of CIF. Unlike PDB format, tags in CIF can validly be presented in any order. This means you cannot simply scan a CIF for a tag you want and start processing from that point forward as you do with a PDB file. In general, to read a CIF properly, you need to read all of it into memory before you can do anything with it. A common mistake is to assume that just because many CIFs have been written with tags in a given order, the next CIF you encounter will also have the tags in that order.

2. Doing the lexical scan (the tokenizing) correctly. CIF uses a context-sensitive grammar, so lexers based on simple BNF tend to make mistakes, and most reliable CIF lexers are hand-written rather than being generated from a grammar. The advice to use a pre-written and tested lexer is sensible.
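
To make the two points above concrete, here is a minimal hand-written sketch in Python. It is an illustration only, not a replacement for a tested CIF library: the function names `tokenize` and `read_items` are hypothetical, `loop_` constructs are deliberately left out, and many details of the full CIF syntax are ignored. It shows one context-sensitive rule (a closing quote only ends a quoted string when whitespace follows it) and reads every item into a dictionary before anything is looked up, so the order of tags in the file does not matter.

```python
def tokenize(text):
    """Yield (kind, value) tokens from a CIF fragment; a simplified sketch."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        at_line_start = i == 0 or text[i - 1] == "\n"
        if ch.isspace():
            i += 1
        elif ch == "#":
            # A '#' at the start of a token begins a comment that runs to end of line.
            while i < n and text[i] != "\n":
                i += 1
        elif ch == ";" and at_line_start:
            # Multi-line text field: ends at the next ';' in column 1.
            end = text.find("\n;", i)
            if end == -1:
                raise ValueError("unterminated text field")
            yield ("value", text[i + 1:end])
            i = end + 2
        elif ch in "'\"":
            # The closing quote only counts if whitespace (or end of input) follows it.
            j = i + 1
            while True:
                j = text.find(ch, j)
                if j == -1 or "\n" in text[i:j]:
                    raise ValueError("unterminated quoted string")
                if j + 1 == n or text[j + 1].isspace():
                    break
                j += 1
            yield ("value", text[i + 1:j])
            i = j + 1
        else:
            # Bare word: runs to the next whitespace character.
            j = i
            while j < n and not text[j].isspace():
                j += 1
            word = text[i:j]
            lower = word.lower()
            if word.startswith("_"):
                kind = "tag"
            elif lower.startswith(("data_", "save_")) or lower in ("loop_", "stop_", "global_"):
                kind = "keyword"
            else:
                kind = "value"
            yield (kind, word)
            i = j


def read_items(text):
    """Collect simple tag/value pairs into a dict (loops are not handled here)."""
    items, pending_tag = {}, None
    for kind, value in tokenize(text):
        if kind == "keyword":
            if value.lower() == "loop_":
                raise NotImplementedError("loop_ is outside this sketch")
            pending_tag = None
        elif kind == "tag":
            pending_tag = value.lower()  # CIF tag names are case-insensitive
        elif pending_tag is not None:
            items[pending_tag] = value
            pending_tag = None
    return items


# The same two items in a different order give the same result.
cif_a = "data_demo\n_cell.length_a 50.0\n_entry.title 'Bragg's law'\n"
cif_b = "data_demo\n_entry.title 'Bragg's law'\n_cell.length_a 50.0\n"
items_a = read_items(cif_a)
items_b = read_items(cif_b)
```

Looking up `items_a["_entry.title"]` after the whole fragment has been read works regardless of where the tag appeared, and the apostrophe inside the quoted title does not end the string because it is not followed by whitespace.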

The bottom line is that, while it is relatively easy to write a valid CIF, reading CIFs reliably can be a very challenging programming task, because you need to write code that will handle the very complex general case, rather than just specific examples. Fortunately there are software packages to help you do this.

  Herbert J. Bernstein

On 9/18/13 10:41 AM, Peter Keller wrote:

Hi Phil,
I agree that the issue that you raise (about the need to define the data items and categories properly) is an important one that needs proper consideration. However, your mail could be read to suggest that correct parsing of CIF-format data is a secondary issue that doesn't deserve the same attention from developers.
I hope that this isn't quite what you meant.... There are already mutually-incompatible CIF dialects out there that have been created by developers coding to their own understanding and interpretations of the CIF/STAR format. I am sure that you would not want to be the creator of yet another one :-) Correct tokenising is a necessary (but not sufficient) condition for preventing the problem getting worse.
In practice, the code and applications that I have seen, and the discussions about this that I have had, all suggest that developers find it more difficult to write code that tokenises CIF/STAR-format data correctly than code that handles other text formats that they have to deal with in this field. My experience suggests that this is an important practical issue with real-world ramifications, and it is worthwhile devoting some effort to it.
Regards,
Peter.

On Wed, 18 Sep 2013, Phil Evans wrote:
Date: Wed, 18 Sep 2013 13:38:07 +0100
From: Phil Evans <[email protected]>
To: [email protected]
Subject: Re: [ccp4bb] Code to handle the syntax of (mm)CIF data correctly.
As a novice looking at mmCIF from a developer's point of view, for reflection data, the complication is not so much tokenising (parsing), but what items to write or to expect to read. For example, as far as I can see, an observed intensity may be encoded in a reflection loop (merged or unmerged) as any one of the following, and there seem to be similar choices for other items:-
_refln_intensity_meas
_refln.F_squared_meas
_refln.pdbx_I_plus, _refln.pdbx_I_minus

_diffrn_refln.counts_net
_diffrn_refln.intensity_net
If I'm writing a file, which should I use, and if I'm reading one, which ones should I expect? And is there a distinction between merged and unmerged data?
confused (easily)
Phil
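
On the reading side, one defensive approach is to check which of the candidate items a given reflection loop actually contains, rather than assuming a single one. The sketch below is only an illustration: the function name `intensity_tags_present` is hypothetical, the tag list is copied from the message above as written, and it does not answer which item a writer should prefer or how merged and unmerged data should be distinguished.

```python
# Candidate intensity items, as listed in the message above.
INTENSITY_TAGS = [
    "_refln_intensity_meas",
    "_refln.F_squared_meas",
    "_refln.pdbx_I_plus",
    "_refln.pdbx_I_minus",
    "_diffrn_refln.counts_net",
    "_diffrn_refln.intensity_net",
]


def intensity_tags_present(column_names):
    """Return the candidate intensity tags found among a loop's column names.

    Comparison is case-insensitive, since CIF tag names are case-insensitive.
    The result keeps the order of INTENSITY_TAGS, not the order in the file.
    """
    present = {name.lower() for name in column_names}
    return [tag for tag in INTENSITY_TAGS if tag.lower() in present]


# Hypothetical column names from a reflection loop.
columns = ["_refln.index_h", "_refln.index_k", "_refln.index_l",
           "_refln.F_squared_meas", "_REFLN.PDBX_I_PLUS"]
found = intensity_tags_present(columns)
```

A reader can then decide what to do when `found` is empty or contains more than one item, instead of silently missing the data.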
On 17 Sep 2013, at 15:30, Peter Keller <[email protected]> wrote:
Dear all,
At Global Phasing, we have seen that there are still issues with the way that different applications deal with mmCIF-format data, and this continues to cause problems for users. I believe that part of the reason for this is that the underlying syntax (the STAR format) is not universally understood, and that a common and complete understanding of the full STAR syntax amongst programmers who deal with the format will help with some of the existing problems.
I wrote some code for low-level handling of the STAR format a while ago that I have been meaning to release for over a year. Garry Battle's announcement on 23 August about the mmCIF/PDBx workshop at the EBI has prompted me into action: I have written a short article that discusses some examples of the issues that we have encountered, and made my code available for download. The references in the article are given primarily as web links: more conventional citations can usually be found in the pages that I link to. This code has not been used in any released products, but it has had some internal use at Global Phasing. There is an MX bias in the article's discussion, but the issues are not restricted to MX.
As I explain in the article, the handling of the input data is based on an enormous regular expression that matches STAR data, with only a little logic in the code itself. The regular expression should be usable with a variety of other languages, not only in Java (which I have used in this case). The code, or the regular expression on its own, may be freely used in other projects: see the included licencing for details, but basically you should: (i) give credit for using it, and (ii) if you choose to modify the regular expression, state that you have done so in that credit.
The article, which contains links to a tar file containing the code, and the documentation, is here:
<http://www.globalphasing.com/startools/>
Hoping that others will find this useful and/or help to resolve or clarify outstanding questions,
Peter.

--
Peter Keller Tel.: +44 (0)1223353033
Global Phasing Ltd., Fax.: +44 (0)1223366889
Sheraton House,
Castle Park,
Cambridge CB3 0AX
United Kingdom
