Re: [ccp4bb] Code to handle the syntax of (mm)CIF data correctly.

Eugene Krissinel Thu, 19 Sep 2013 04:12:49 -0700

Completely agree. Perhaps there could be a distinction between working and 
deposition formats. They are for different purposes and, therefore, could 
differ from each other -- Eugene


On 19 Sep 2013, at 10:45, Phil Evans wrote:

> Do you really want to read the whole of a long reflection loop into memory 
> rather than parsing it one line at a time (which should be possible once you 
> have worked out what is in the file)? That would end up with storing the 
> reflection list twice, the memory copy of the input file and the internal 
> representation for the program. I do get complaints from people trying to run 
> e.g. Pointless with large datasets on 32-bit machines, crashing because it 
> runs out of memory
> 
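To illustrate the line-at-a-time reading suggested above, here is a minimal Python sketch. The function name `iter_loop_rows` and the `prefix` argument are hypothetical, and the sketch assumes (unlike CIF in general) that every value is a bare whitespace-separated token and that each row sits on one line. It is not how Pointless or any existing package reads CIF.

```python
from typing import Dict, Iterable, Iterator, List


def iter_loop_rows(lines: Iterable[str], prefix: str) -> Iterator[Dict[str, str]]:
    """Yield rows of the first loop_ whose first tag starts with `prefix`.

    Assumption: bare whitespace-separated values, one row per line,
    no quoted strings and no multi-line text fields.
    """
    tags: List[str] = []
    state = "search"  # "search" -> "header" -> "rows"
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if state == "rows":
            # A new tag, loop or data block ends the loop being read.
            if line.startswith("_") or line.lower().startswith(("loop_", "data_")):
                return
            yield dict(zip(tags, line.split()))
        elif line.lower() == "loop_":
            tags, state = [], "header"
        elif state == "header":
            if line.startswith("_"):
                tags.append(line.split()[0])
            elif tags and tags[0].startswith(prefix):
                state = "rows"
                yield dict(zip(tags, line.split()))
            else:
                state = "search"  # some other loop: skip its rows


demo = """data_demo
_cell.length_a 10.0
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.intensity_meas
1 0 0 100.5
0 1 0 200.0
# a comment between rows
0 0 1 50.25
_other.tag x
"""

# Only one row is held in memory at a time.
total = sum(float(row["_refln.intensity_meas"])
            for row in iter_loop_rows(demo.splitlines(), "_refln."))
```

Because the function is a generator, it can be fed an open file object directly, so the text of the reflection loop is never stored alongside the program's internal representation.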
> If you imagine something corresponding to the XDS INTEGRATE.HKL file with 120 
> characters/reflection, then a dataset with 10^7 reflections (not outrageously 
> large these days) occupies 1.2e9 bytes, over 1GB, which seems a lot to add 
> gratuitously to memory demands even on today's computers 
> 
> Of course (in my opinion) a working format (as opposed to an archive format) 
> should be binary for size, accuracy (FP dynamic range) and speed. 
> A quick comparison (using Pointless)
> 
> Read 5.3e6 reflections from a formatted XDS INTEGRATE.HKL file, 608MB, 15 secs
> Read equivalent binary MTZ file, 262MB, 2.6 secs
> 
> Phil
> 
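The figures quoted above can be checked with a few lines of Python. This is only a worked version of the arithmetic; it assumes that "MB" means 10^6 bytes, which the message does not state.

```python
# Memory estimate: 120 characters per reflection, 10^7 reflections.
chars_per_reflection = 120
n_reflections = 10**7
text_bytes = chars_per_reflection * n_reflections  # 1.2e9 bytes
text_gib = text_bytes / 2**30                      # about 1.12 GiB

# Pointless comparison: 5.3e6 reflections read from each file.
n_read = 5.3e6
text_mb, text_secs = 608, 15.0   # formatted XDS INTEGRATE.HKL
mtz_mb, mtz_secs = 262, 2.6      # equivalent binary MTZ

# Assumption: 1 MB = 10^6 bytes.
text_bytes_per_refl = text_mb * 1e6 / n_read  # about 115 bytes
mtz_bytes_per_refl = mtz_mb * 1e6 / n_read    # about 49 bytes
size_ratio = text_mb / mtz_mb                 # about 2.3 times smaller
speedup = text_secs / mtz_secs                # about 5.8 times faster
```

On these numbers the binary file is a little under half the size and reads nearly six times faster.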
> On 18 Sep 2013, at 15:58, yayahjb <[email protected]> wrote:
> 
>> Dear Colleagues,
>> 
>> There are two major issues that tend to trip up CIF programmers:
>> 
>>  1.  Dealing with the order independence of CIF.  Unlike PDB format, tags in
>> CIF can validly be presented in any order.  This means you cannot simply scan
>> a CIF for a tag you want and start processing from that point forward as you
>> do with a PDB file.  In general to read a CIF properly, you need to read all
>> of it into memory before you can do anything with it.  A common mistake is to
>> assume that just because many CIFs have been written with tags in a given
>> order, the next CIF you encounter will also have the tags in that order.
>> 
>> 2.  Doing the lexical scan (the tokenizing) correctly.  CIF uses a context
>> sensitive grammar, so lexers based on simple BNF tend to make mistakes, and
>> most reliable CIF lexers are hand-written rather than being generated from a
>> grammar.  The advice to use a pre-written and tested lexer is sensible.
>> 
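Both points above can be illustrated with a small hand-written Python sketch. The names `tokenize` and `read_items` are hypothetical, and the code covers only a simplified subset of the CIF syntax (no save frames, no error reporting), so it is no substitute for a tested lexer. It shows two context-dependent rules: a quote closes a string only when followed by whitespace, and `;` opens a text field only in column 1. Items are collected into a dictionary before any lookup, so tag order does not matter.

```python
from typing import Dict, Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, value); kind is "word" or "string"


def tokenize(text: str) -> Iterator[Token]:
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # ';' in column 1 opens a text field that runs to the next
            # line starting with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            yield ("string", "\n".join(body))
            line = lines[i][1:] if i < len(lines) else ""
        pos = 0
        while pos < len(line):
            ch = line[pos]
            if ch.isspace():
                pos += 1
            elif ch == "#":
                break  # a comment, but only when '#' starts a token
            elif ch in "'\"":
                # The quote closes only before whitespace or end of line.
                end = pos + 1
                while end < len(line) and not (
                    line[end] == ch
                    and (end + 1 == len(line) or line[end + 1].isspace())
                ):
                    end += 1
                yield ("string", line[pos + 1:end])
                pos = end + 1
            else:
                end = pos
                while end < len(line) and not line[end].isspace():
                    end += 1
                yield ("word", line[pos:end])
                pos = end
        i += 1


def _is_tag(tok: Token) -> bool:
    return tok[0] == "word" and tok[1].startswith("_")


def _is_keyword(tok: Token) -> bool:
    low = tok[1].lower()
    return tok[0] == "word" and (
        low.startswith(("data_", "save_")) or low in ("loop_", "global_", "stop_"))


def read_items(text: str) -> Dict[str, List[str]]:
    """Map each lower-cased tag to its list of values (one value if unlooped)."""
    toks = list(tokenize(text))
    items: Dict[str, List[str]] = {}
    i = 0
    while i < len(toks):
        if toks[i] == ("word", toks[i][1]) and toks[i][1].lower() == "loop_":
            i += 1
            tags: List[str] = []
            while i < len(toks) and _is_tag(toks[i]):
                tags.append(toks[i][1].lower())
                i += 1
            values: List[str] = []
            while i < len(toks) and not (_is_tag(toks[i]) or _is_keyword(toks[i])):
                values.append(toks[i][1])
                i += 1
            for n, tag in enumerate(tags):
                items[tag] = values[n::len(tags)]
        elif _is_tag(toks[i]) and i + 1 < len(toks):
            items[toks[i][1].lower()] = [toks[i + 1][1]]
            i += 2
        else:
            i += 1  # data_ headers and anything unrecognised are skipped
    return items


cif_a = """data_x
_cell.length_a 10.5
_struct.title 'Bob's enzyme'   # apostrophe inside the value
loop_
_refln.index_h
_refln.F_meas
1 12.5
2 7.25
"""

cif_b = """data_x
loop_
_refln.F_meas
_refln.index_h
12.5 1
7.25 2
_struct.title
;Bob's enzyme
;
_cell.length_a 10.5
"""

same = read_items(cif_a) == read_items(cif_b)
```

The two inputs differ in tag order, loop column order and quoting style, yet yield the same items, which is what a reader that scans for tags in a fixed order would get wrong.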
>> The bottom line is that, while it is relatively easy to write a valid CIF,
>> reading CIFs reliably can be a very challenging programming task, because
>> you need to write code that will handle the very complex general case,
>> rather than just specific examples.  Fortunately there are software packages
>> to help you do this.
>> 
>> Herbert J. Bernstein
>> 
>> On 9/18/13 10:41 AM, Peter Keller wrote:
>>> Hi Phil,
>>> 
>>> I agree that the issue that you raise (about the need to define the data 
>>> items and categories properly) is an important one that needs proper 
>>> consideration. However, your mail could be read to suggest that correct 
>>> parsing of CIF-format data is a secondary issue that doesn't deserve the 
>>> same attention from developers.
>>> 
>>> I hope that this isn't quite what you meant....  There are already 
>>> mutually-incompatible CIF dialects out there that have been created by 
>>> developers coding to their own understanding and interpretations of the 
>>> CIF/STAR format. I am sure that you would not want to be the creator of yet 
>>> another one :-) Correct tokenising is a necessary (but not sufficient) 
>>> condition for preventing the problem getting worse.
>>> 
>>> In practice, the code and applications that I have seen, and the 
>>> discussions about this that I have had, all suggest that developers find it 
>>> more difficult to write code that tokenises CIF/STAR-format data correctly 
>>> than code that handles other text formats that they have to deal with in 
>>> this field. My experience suggests that this is an important practical 
>>> issue with real-world ramifications, and it is worthwhile devoting some 
>>> effort to it.
>>> 
>>> Regards,
>>> Peter.
>>> 
>>> On Wed, 18 Sep 2013, Phil Evans wrote:
>>> 
>>>> Date: Wed, 18 Sep 2013 13:38:07 +0100
>>>> From: Phil Evans <[email protected]>
>>>> To: [email protected]
>>>> Subject: Re: [ccp4bb] Code to handle the syntax of (mm)CIF data correctly.
>>>> 
>>>> As a novice looking at mmCIF from a developer's point of view, for 
>>>> reflection data, the complication is not so much tokenising (parsing), but 
>>>> what items to write or to expect to read. For example as far as I can see 
>>>> an observed intensity may be encoded in a reflection loop (merged or 
>>>> unmerged) as any one of the following, and there seem to be similar 
>>>> choices for other items:-
>>>> 
>>>> 
>>>> _refln_intensity_meas
>>>> _refln.F_squared_meas
>>>> _refln.pdbx_I_plus, _refln.pdbx_I_minus
>>>> 
>>>> _diffrn_refln.counts_net
>>>> _diffrn_refln.intensity_net
>>>> 
>>>> If I'm writing a file, which should I use, and if I'm reading one which 
>>>> ones should I expect? And is there a distinction between merged and 
>>>> unmerged data?
>>>> 
>>>> confused (easily)
>>>> Phil
>>>> 
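On the reading side, one way to cope with the alternatives listed above is to check a loop's tags against all of them. The Python sketch below does only that; the names `INTENSITY_TAGS` and `find_intensity_tags` are hypothetical, the tag list is copied verbatim from the message, and the sketch does not answer which tag should be written or how merged and unmerged data are distinguished.

```python
from typing import Iterable, List

# Candidate tags for an observed intensity, as listed in the message above.
INTENSITY_TAGS = [
    "_refln_intensity_meas",
    "_refln.F_squared_meas",
    "_refln.pdbx_I_plus",
    "_refln.pdbx_I_minus",
    "_diffrn_refln.counts_net",
    "_diffrn_refln.intensity_net",
]


def find_intensity_tags(loop_tags: Iterable[str]) -> List[str]:
    """Return the candidate intensity tags present in a loop, ignoring case."""
    present = {tag.lower() for tag in loop_tags}
    return [tag for tag in INTENSITY_TAGS if tag.lower() in present]


found = find_intensity_tags(
    ["_refln.index_h", "_refln.index_k", "_refln.index_l",
     "_refln.pdbx_I_plus", "_refln.pdbx_I_minus"])
```

An empty result tells the caller that none of the known encodings is present, which is a safer signal than silently assuming one particular tag.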
>>>> 
>>>> 
>>>> On 17 Sep 2013, at 15:30, Peter Keller <[email protected]> wrote:
>>>> 
>>>>> Dear all,
>>>>> 
>>>>> At Global Phasing, we have seen that there are still issues with the way 
>>>>> that different applications deal with mmCIF-format data, and this 
>>>>> continues to cause problems for users. I believe that part of the reason 
>>>>> for this is that the underlying syntax (the STAR format) is not 
>>>>> universally understood, and that a common and complete understanding of 
>>>>> the full STAR syntax amongst programmers who deal with the format will 
>>>>> help with some of the existing problems.
>>>>> 
>>>>> I wrote some code for low-level handling of the STAR format a while ago 
>>>>> that I have been meaning to release for over a year. Garry Battle's 
>>>>> announcement on 23 August about the mmCIF/PDBx workshop at the EBI has 
>>>>> prompted me into action: I have written a short article that discusses 
>>>>> some examples of the issues that we have encountered, and made my code 
>>>>> available for download. The references in the article are given primarily 
>>>>> as web links: more conventional citations can usually be found in the 
>>>>> pages that I link to. This code has not been used in any released 
>>>>> products, but it has had some internal use at Global Phasing. There is an 
>>>>> MX bias in the article's discussion, but the issues are not restricted to 
>>>>> MX.
>>>>> 
>>>>> As I explain in the article, the handling of the input data is based on 
>>>>> an enormous regular expression that matches STAR data, with only a 
>>>>> little logic in the code itself. The regular expression should be usable 
>>>>> with a variety of other languages, not only in Java (which I have used in 
>>>>> this case). The code, or the regular expression on its own, may be freely 
>>>>> used in other projects: see the included licensing for details, but 
>>>>> basically you should: (i) give credit for using it, and (ii) if you 
>>>>> choose to modify the regular expression, state that you have done so in 
>>>>> that credit.
>>>>> 
>>>>> The article, which contains links to a tar file containing the code, and 
>>>>> the documentation, is here:
>>>>> 
>>>>> <http://www.globalphasing.com/startools/>
>>>>> 
>>>>> Hoping that others will find this useful and/or help to resolve or 
>>>>> clarify outstanding questions,
>>>>> 
>>>>> Peter.
>>>>> 
>>>>> -- 
>>>>> Peter Keller                                     Tel.: +44 (0)1223 353033
>>>>> Global Phasing Ltd.,                             Fax.: +44 (0)1223 366889
>>>>> Sheraton House,
>>>>> Castle Park,
>>>>> Cambridge CB3 0AX
>>>>> United Kingdom
>>>> 
>>> 

