Completely agree. Perhaps there could be a distinction between working and deposition formats. They are for different purposes and, therefore, could differ from each other.

-- Eugene

On 19 Sep 2013, at 10:45, Phil Evans wrote:

> Do you really want to read the whole of a long reflection loop into memory
> rather than parsing it one line at a time (which should be possible once you
> have worked out what is in the file)? That would end up with storing the
> reflection list twice, the memory copy of the input file and the internal
> representation for the program. I do get complaints from people trying to run
> e.g. Pointless with large datasets on 32-bit machines, crashing because it
> runs out of memory.
>
> If you imagine something corresponding to the XDS INTEGRATE.HKL file with 120
> characters/reflection, then a dataset with 10^7 reflections (not outrageously
> large these days) occupies 1.2e9 bytes, over 1GB, which seems a lot to add
> gratuitously to memory demands even on today's computers.
>
> Of course (in my opinion) a working format (as opposed to an archive format)
> should be binary for size, accuracy (FP dynamic range) and speed.
> A quick comparison (using Pointless):
>
> Read 5.3e6 reflections from a formatted XDS INTEGRATE.HKL file, 608MB, 15 secs
> Read equivalent binary MTZ file, 262MB, 2.6 secs
>
> Phil
>
> On 18 Sep 2013, at 15:58, yayahjb <[email protected]> wrote:
>
>> Dear Colleagues,
>>
>> There are two major issues that tend to trip up CIF programmers:
>>
>> 1. Dealing with the order independence of CIF. Unlike PDB format, tags in
>> CIF can validly be presented in any order. This means you cannot simply
>> scan a CIF for a tag you want and start processing from that point forward
>> as you do with a PDB file. In general, to read a CIF properly, you need to
>> read all of it into memory before you can do anything with it. A common
>> mistake is to assume that just because many CIFs have been written with
>> tags in a given order, the next CIF you encounter will also have the tags
>> in that order.
>>
>> 2. Doing the lexical scan (the tokenizing) correctly.
>> CIF uses a context-sensitive grammar, so lexers based on simple BNF tend
>> to make mistakes, and most reliable CIF lexers are hand-written rather
>> than being generated from a grammar. The advice to use a pre-written and
>> tested lexer is sensible.
>>
>> The bottom line is that, while it is relatively easy to write a valid CIF,
>> reading CIFs reliably can be a very challenging programming task, because
>> you need to write code that will handle the very complex general case,
>> rather than just specific examples. Fortunately there are software
>> packages to help you do this.
>>
>> Herbert J. Bernstein
>>
>> On 9/18/13 10:41 AM, Peter Keller wrote:
>>> Hi Phil,
>>>
>>> I agree that the issue that you raise (about the need to define the data
>>> items and categories properly) is an important one that needs proper
>>> consideration. However, your mail could be read to suggest that correct
>>> parsing of CIF-format data is a secondary issue that doesn't deserve the
>>> same attention from developers.
>>>
>>> I hope that this isn't quite what you meant.... There are already
>>> mutually-incompatible CIF dialects out there that have been created by
>>> developers coding to their own understanding and interpretations of the
>>> CIF/STAR format. I am sure that you would not want to be the creator of yet
>>> another one :-) Correct tokenising is a necessary (but not sufficient)
>>> condition for preventing the problem getting worse.
>>>
>>> In practice, the code and applications that I have seen, and the
>>> discussions about this that I have had, all suggest that developers find it
>>> more difficult to write code that tokenises CIF/STAR-format data correctly
>>> than code that handles other text formats that they have to deal with in
>>> this field. My experience suggests that this is an important practical
>>> issue with real-world ramifications, and it is worthwhile devoting some
>>> effort to it.
>>>
>>> Regards,
>>> Peter.
>>>
>>> On Wed, 18 Sep 2013, Phil Evans wrote:
>>>
>>>> Date: Wed, 18 Sep 2013 13:38:07 +0100
>>>> From: Phil Evans <[email protected]>
>>>> To: [email protected]
>>>> Subject: Re: [ccp4bb] Code to handle the syntax of (mm)CIF data correctly.
>>>>
>>>> As a novice looking at mmCIF from a developer's point of view, for
>>>> reflection data, the complication is not so much tokenising (parsing), but
>>>> what items to write or to expect to read. For example, as far as I can see
>>>> an observed intensity may be encoded in a reflection loop (merged or
>>>> unmerged) as any one of the following, and there seem to be similar
>>>> choices for other items:
>>>>
>>>> _refln_intensity_meas
>>>> _refln.F_squared_meas
>>>> _refln.pdbx_I_plus, _refln.pdbx_I_minus
>>>>
>>>> _diffrn_refln.counts_net
>>>> _diffrn_refln.intensity_net
>>>>
>>>> If I'm writing a file, which should I use, and if I'm reading one, which
>>>> ones should I expect? And is there a distinction between merged and
>>>> unmerged data?
>>>>
>>>> confused (easily)
>>>> Phil
>>>>
>>>> On 17 Sep 2013, at 15:30, Peter Keller <[email protected]> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> At Global Phasing, we have seen that there are still issues with the way
>>>>> that different applications deal with mmCIF-format data, and this
>>>>> continues to cause problems for users. I believe that part of the reason
>>>>> for this is that the underlying syntax (the STAR format) is not
>>>>> universally understood, and that a common and complete understanding of
>>>>> the full STAR syntax amongst programmers who deal with the format will
>>>>> help with some of the existing problems.
>>>>>
>>>>> I wrote some code for low-level handling of the STAR format a while ago
>>>>> that I have been meaning to release for over a year.
>>>>> Garry Battle's announcement on 23 August about the mmCIF/PDBx workshop
>>>>> at the EBI has prompted me into action: I have written a short article
>>>>> that discusses some examples of the issues that we have encountered, and
>>>>> made my code available for download. The references in the article are
>>>>> given primarily as web links: more conventional citations can usually be
>>>>> found in the pages that I link to. This code has not been used in any
>>>>> released products, but it has had some internal use at Global Phasing.
>>>>> There is an MX bias in the article's discussion, but the issues are not
>>>>> restricted to MX.
>>>>>
>>>>> As I explain in the article, the handling of the input data is based on
>>>>> an enormous regular expression that matches STAR data, with only a
>>>>> little logic in the code itself. The regular expression should be usable
>>>>> with a variety of other languages, not only in Java (which I have used in
>>>>> this case). The code, or the regular expression on its own, may be freely
>>>>> used in other projects: see the included licencing for details, but
>>>>> basically you should: (i) give credit for using it, and (ii) if you
>>>>> choose to modify the regular expression, state that you have done so in
>>>>> that credit.
>>>>>
>>>>> The article, which contains links to a tar file containing the code, and
>>>>> the documentation, is here:
>>>>>
>>>>> <http://www.globalphasing.com/startools/>
>>>>>
>>>>> Hoping that others will find this useful and/or help to resolve or
>>>>> clarify outstanding questions,
>>>>>
>>>>> Peter.
>>>>>
>>>>> --
>>>>> Peter Keller                Tel.: +44 (0)1223 353033
>>>>> Global Phasing Ltd.,        Fax.: +44 (0)1223 366889
>>>>> Sheraton House,
>>>>> Castle Park,
>>>>> Cambridge CB3 0AX
>>>>> United Kingdom
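
Editorial illustration (not part of the original thread): Phil Evans's points about memory cost and about parsing a long reflection loop one line at a time can be sketched in Python roughly as follows. This is a minimal sketch under loud assumptions, not a CIF parser: it assumes one loop row per line, plain whitespace-separated values, and no quoted strings or multi-line text fields, which is exactly the kind of shortcut the thread warns does not hold for CIF in general. The function names `estimate_text_bytes` and `iter_loop_rows` and the sample data are hypothetical; only the tag `_refln.F_squared_meas` and the figures 120 characters/reflection and 10^7 reflections come from the messages above.

```python
import io
from typing import Dict, Iterator, List, TextIO


def estimate_text_bytes(n_reflections: int, chars_per_reflection: int) -> int:
    """Bytes needed to hold the formatted text in memory (1 byte per character)."""
    return n_reflections * chars_per_reflection


def iter_loop_rows(stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield rows of the first loop_ in the stream, one line at a time.

    ASSUMPTION (hypothetical simplification): one row per line, values are
    unquoted whitespace-separated tokens. Real CIF/STAR needs a proper lexer.
    """
    tags: List[str] = []
    state = "seek"  # "seek" -> "header" -> "rows"
    for line in stream:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # skip blank lines and comments
        if state == "seek":
            if text.lower() == "loop_":
                state = "header"
            continue
        if state == "header":
            if text.startswith("_"):
                tags.append(text)  # collect the loop's tag names
                continue
            state = "rows"  # first non-tag line starts the data rows
        # state == "rows": a new tag, loop or data block ends this loop
        if text.startswith("_") or text.lower().startswith(("loop_", "data_")):
            return
        values = text.split()
        if len(values) != len(tags):
            raise ValueError(f"expected {len(tags)} values, got {len(values)}")
        yield dict(zip(tags, values))


# Hypothetical sample data, for illustration only.
SAMPLE = """\
data_example
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.F_squared_meas
1 0 0 123.4
0 1 0 56.7
"""

# 120 characters/reflection and 10^7 reflections, as in Phil's message.
print(estimate_text_bytes(10**7, 120))  # 1200000000

for row in iter_loop_rows(io.StringIO(SAMPLE)):
    print(row["_refln.F_squared_meas"])
```

Because `iter_loop_rows` is a generator, only the current row is held in memory, so the text copy of the reflection list is never stored alongside the program's internal representation. It works here only because a loop's tag names always precede its rows; it does nothing to address Herbert Bernstein's point that tags elsewhere in the file may appear in any order.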
