On Sun, Nov 20, 2016 at 11:57 AM, Boone, Paul <paulbo...@pitt.edu> wrote:
> Can you explain more how BSON fits in here? If CJSON were supposed to be a
> file format for internal interchange, then I have nothing against storing
> coords as one array and then processing them when you read it in or out.

BSON is used by MongoDB, and it is one of the things that motivated the
development of CJSON: it was useful to be able to move to and from BSON
with a fairly compact format that also matched our in-memory layout.
Frankly, I feel this is all a matter of interpretation. One array is
convenient in a number of settings, and its meaning is quite clear. Tuples
within arrays can also be convenient (in C++ you can have the best of both
worlds by casting flat arrays to Eigen fixed-size vectors, for example).

CJSON was originally developed out of a need for a simple format for
getting data in and out of Avogadro 2. We wanted something efficient, based
on existing parsers, that could easily be extended. It has been developed
in an ad hoc way to satisfy that need, and in the last year or two we have
been looking at standardizing the format for wider use. It has always
embedded a key-value pair to allow for breaking changes, with the ability
to retain code that continues to work with legacy data.

>
> For the python interface though, we were going to use CJSON as a public
> interchange format, and for a public interface, I’d be adamant about
> sticking to the principles of (1) readability (i.e. the format making sense
> to somebody just reading the text format) and (2) explicitness (i.e. the
> structure of the file should represent the underlying data, without needing
> to interpret it in any way). Otherwise, we’re letting our internal
> implementation determine the structure of the format, when we want the
> semantics of the underlying data to determine the format.

I understand what you are saying. I think it is just as readable whether
stored as tuples or a flat array, and I don't see any ambiguity. I guess
this is why others have begun using it as an interchange format in some
places; I have used it in C++, Python and JavaScript without issue.

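To make the point about the two layouts concrete, here is a rough Python
sketch of moving between a flat coordinate array and 3-tuples. The function
names and the sample coordinates are made up for illustration; they are not
part of CJSON or of any Avogadro API.

```python
# Hypothetical helpers for illustration only; not part of any Avogadro API.

def to_tuples(flat):
    """Group a flat [x0, y0, z0, x1, y1, z1, ...] list into [[x, y, z], ...]."""
    if len(flat) % 3 != 0:
        raise ValueError("flat coordinate list length must be a multiple of 3")
    return [flat[i:i + 3] for i in range(0, len(flat), 3)]


def to_flat(tuples):
    """Flatten [[x, y, z], ...] back into a single list."""
    return [value for xyz in tuples for value in xyz]


# Made-up coordinates for two atoms.
flat_coords = [0.0, 0.0, 0.0, 0.0, 0.0, 1.1]
nested = to_tuples(flat_coords)

# The round trip gives back the original flat array.
assert to_flat(nested) == flat_coords
```

Either direction is a single pass over the data, so a reader or writer can
present whichever layout suits the calling code.
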
>
> I specifically wouldn’t worry about space considerations of the sub-arrays,
> but I don’t ever worry about space for JSON since it just wasn’t intended
> for that. Right now I think the biggest CJSON file we’re testing with for
> the python interface is about 117k, which I don’t think of as large. But I
> have no insight into how this format is being used elsewhere… Are you using
> it for really large structures?

There are a few test cases at a few million; in the case of bonds and BSON,
your proposed format would nearly double storage for little gain from my
perspective.

>
> So the fundamental question for me that I’m sure you all can provide some
> insight on is:
>
> - are there really two formats here: (1) a format designed for internal use,
> ease of importing / exporting straight to/from avogadro internal structures,
> and optimized for minimal size, likely using BSON and (2) a public format
> designed for readability and semantic explicitness?
> - or, is the public format sufficient for both purposes?

I would say that for most uses the existing format is useful for public
interchange, though you are adamant it is not for your use case. Some
existing projects make use of the current layout in multiple languages, and
I would like to continue supporting them, at least in the short term. I
think semantic explicitness would likely be better achieved in the JSON-LD
based format mentioned below.

You obviously have strong feelings about what you want the Python
interchange format to look like. In the short term a good direction might
be to simply develop that: do as you wish, develop the API you want and the
public interchange format you want, and don't worry about the CJSON format
(or use it as a starting point). I think we can live with some duplication.
This seems like a very focused effort on Python - Avogadro exchange at this
point, and you would like to take advantage of language features shared
between JSON and Python, which seems fair enough.

>
> You also mention some additional changes you were thinking of making. Can
> you tell me about those?
>

There is an NWChem JSON format we have been developing as part of a
collaborative project with Berkeley Lab. It explores another layout that
uses more objects and pushes further into electronic structure. There was
also a workshop at the EPA, and we have been looking at developing a
JSON-LD format that actually has some semantic meaning. There is also the
GSoC project to get electronic structure from cclib into Avogadro via an
extended CJSON representation, for which I am fixing up a few remaining
warnings.

Hopefully this makes my thinking more transparent to you. The Python
integration is not the only user of this exchange format. When I started
working on this it was clear there were a number of ways to represent these
concepts in JSON, including the use of tuples inside arrays. The
transformations are also relatively simple, so I chose simple
representations that made reasonable sense. I am quite adamant that I would
like to retain this layout for some existing work/demos with data
stored/processed in this order.

I forgot to send this earlier, but the compromise of an optional 3dtuples
key, as Geoff suggested, would support both use cases well with minimal
code duplication. I would love to explore convenience conventions that make
things easier for certain use cases, such as an optional symbols array that
passes the atomic symbols in addition to the numbers, for cases when that
is convenient and the Python/JavaScript code doesn't want to carry around
the atomic number to symbol map. It can be very convenient to use the
format as an API, but I am not sure it is generally the best path.

------------------------------------------------------------------------------
_______________________________________________
Avogadro-devel mailing list
Avogadro-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/avogadro-devel