Re: [Avogadro-devel] CJSON format proposal

Marcus D. Hanwell Sun, 20 Nov 2016 17:00:07 -0800

On Sun, Nov 20, 2016 at 11:57 AM, Boone, Paul <paulbo...@pitt.edu> wrote:
> Can you explain more how BSON fits in here? If CJSON were supposed to be a
> file format for internal interchange, then I have nothing against storing
> coords as one array and then processing them when you read it in or out.


BSON is used by MongoDB, and it is one of the things that motivated
the development of CJSON: it was useful to be able to move to/from
BSON with a fairly compact format that also matched our in-memory
layout. Frankly, I feel like this is all a matter of interpretation.
One array is convenient in a number of settings, and the meaning is
quite clear; tuples within arrays can also be convenient (in C++ you
can have the best of both worlds when using flat arrays cast to Eigen
fixed-size vectors, for example).
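
To illustrate how small the transformation between the two layouts is,
here is a minimal Python sketch. The function names are made up for
this example and are not part of Avogadro or any CJSON library; it
only assumes that coordinates are stored as x, y, z in atom order.

```python
def flat_to_tuples(flat):
    """Group a flat [x0, y0, z0, x1, y1, z1, ...] list into (x, y, z) tuples."""
    if len(flat) % 3 != 0:
        raise ValueError("flat coordinate array length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def tuples_to_flat(tuples):
    """Flatten a list of (x, y, z) tuples back into a single list."""
    return [component for xyz in tuples for component in xyz]


# Two atoms, stored flat and then viewed as tuples.
flat = [0.0, 0.0, 0.0, 0.0, 0.0, 1.1]
tuples = flat_to_tuples(flat)
```

Either form can be recovered from the other without loss, so a reader
or writer can keep whichever layout suits its own data structures.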

CJSON was quite clearly originally developed out of a need for a
simple format for getting data in and out of Avogadro 2. We wanted
something efficient, based on existing parsers, that could easily be
extended. It has been developed in an ad-hoc way to satisfy that need;
in the last year or two we have been looking at standardizing the
format for wider use. It has always embedded a key-value pair to allow
for breaking changes, with the ability to retain code to continue
working with legacy data.
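
As a rough sketch of how such a key can let a reader keep working with
legacy data, the Python below dispatches on a version value. The key
name "chemical json", the atoms/coords/3d path used for version 0, and
the whole version 1 layout are assumptions made for illustration; the
version 1 layout in particular is invented and is not a proposal.

```python
import json

VERSION_KEY = "chemical json"  # assumed name of the embedded version key


def _read_v0(doc):
    # Assumed current layout: one flat [x0, y0, z0, x1, ...] array.
    return list(doc["atoms"]["coords"]["3d"])


def _read_v1(doc):
    # Hypothetical future layout, invented for this example only:
    # a list of [x, y, z] triples under atoms/positions.
    flat = []
    for xyz in doc["atoms"]["positions"]:
        flat.extend(xyz)
    return flat


# One reader per known version; old readers are retained for legacy files.
READERS = {0: _read_v0, 1: _read_v1}


def read_coords(text):
    """Return flat coordinates from a JSON string, whatever its version."""
    doc = json.loads(text)
    version = doc.get(VERSION_KEY)
    if version not in READERS:
        raise ValueError(f"unsupported CJSON version: {version!r}")
    return READERS[version](doc)


legacy = '{"chemical json": 0, "atoms": {"coords": {"3d": [0, 0, 0, 0, 0, 1]}}}'
newer = '{"chemical json": 1, "atoms": {"positions": [[0, 0, 0], [0, 0, 1]]}}'
```

A breaking change then only needs a new entry in the table, and files
written before the change still load through the older reader.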
>
> For the python interface though, we were going to use CJSON as a public
> interchange format, and for a public interface, I’d be adamant about
> sticking to the principles of (1) readability (i.e. the format making sense
> to somebody just reading the text format) and (2) explicitness (i.e. the
> structure of the file should represent the underlying data, without needing
> to interpret it in any way). Otherwise, we’re letting our internal
> implementation determine the structure of the format, when we want the
> semantics of the underlying data to determine the format.

I understand what you are saying. I think it is just as readable
whether stored as tuples or a flat array, and don't see any ambiguity.
I guess this is why others have begun using this as an interchange
format in some places. I have used it in C++, Python and JavaScript
without issue.
>
> I specifically wouldn’t worry about space considerations of the sub-arrays,
> but I don’t ever worry about space for JSON since it just wasn’t intended
> for that. Right now I think the biggest CJSON file we’re testing with for
> the python interface is about 117k, which I don’t think of as large. But I
> have no insight into how this format is being used elsewhere… Are you using
> it for really large structures?

There are a few test cases at a few million. In the case of bonds and
BSON, your proposed format would nearly double storage for little gain
from my perspective.
>
> So the fundamental question for me that I’m sure you all can provide some
> insight on is:
>
> - are there really two formats here: (1) a format designed for internal use,
> ease of importing / exporting straight to/from avogadro internal structures,
> and optimized for minimal size, likely using BSON and (2) a public format
> designed for readability and semantic explicitness?
> - or, is the public format sufficient for both purposes?

I would say that for most uses the existing format is useful for
public interchange; you are adamant it is not for your use case. There
are some existing projects that are making use of the current layout
in multiple languages, and I would like to continue supporting them,
at least in the short term. I think semantic explicitness would likely
be better achieved in the JSON-LD based format mentioned below.

You obviously have strong feelings on what you want the Python
interchange format to look like, and in the short term a good
direction might be to simply develop that: do as you wish, develop the
API you want and the public interchange format you want, and do not
worry about the CJSON format (or use it as a starting point). I think
we can live with some duplication. This seems like a very focused
effort on Python - Avogadro exchange at this point, and you would like
to take advantage of language features shared between JSON and Python,
which seems fair enough.
>
> You also mention some additional changes you were thinking of making. Can
> you tell me about those?
>
There is an NWChem JSON format we have been developing as part of a
collaborative project with Berkeley Lab. That has explored another
layout that uses more objects, and pushes further into electronic
structure. There was also a workshop at the EPA, and we have been
looking at developing a JSON-LD format that actually has some semantic
meaning. There is also the GSoC project, where I am fixing up a few
remaining warnings, to get electronic structure from cclib into
Avogadro via an extended CJSON representation.

Hopefully this makes my thinking more transparent to you. The Python
integration is not the only user of this exchange format. When I
started working on this it was clear there were a number of ways to
represent these concepts in JSON, including the use of tuples inside
arrays. The transformations are also relatively simple, so I chose
simple representations that made reasonable sense. I am quite adamant
that I would like to retain this layout for some existing work/demos
with data stored/processed in this order.

I forgot to send this earlier, but the compromise of an optional
3dtuples key, as Geoff suggested, would support both use cases well
with minimal code duplication. I would love to explore the use of
convenience conventions that make things easier for certain use cases,
such as an optional symbols array that passes the atomic symbols in
addition to the numbers, for cases when it is convenient and the
Python/JavaScript code doesn't want to carry around the atomic number
to symbol map. It can be very convenient to use the format as an API,
but I am not sure it is generally the best path.
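
A minimal Python sketch of a reader that honours both optional
conveniences might look like the following. The exact placement of the
3dtuples and symbols keys, and the atoms/coords/3d and
atoms/elements/number paths, are assumptions made for the example; the
symbol table is deliberately partial.

```python
# Partial atomic-number-to-symbol map, enough for the example below.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}


def read_atoms(atoms):
    """Return a list of (symbol, (x, y, z)) pairs from an 'atoms' object."""
    coords = atoms["coords"]
    if "3dtuples" in coords:
        # Optional convenience key: coordinates already grouped per atom.
        positions = [tuple(p) for p in coords["3dtuples"]]
    else:
        # Flat layout: regroup into (x, y, z) triples.
        flat = coords["3d"]
        positions = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]

    elements = atoms["elements"]
    symbols = elements.get("symbols")
    if symbols is None:
        # No optional symbols array, so fall back to the local map.
        symbols = [SYMBOLS[n] for n in elements["number"]]
    return list(zip(symbols, positions))


flat_only = {
    "coords": {"3d": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]},
    "elements": {"number": [8, 1]},
}
with_extras = {
    "coords": {"3dtuples": [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]},
    "elements": {"number": [8, 1], "symbols": ["O", "H"]},
}
```

Both inputs yield the same result, so a consumer that only needs the
convenience keys never has to carry the symbol map or regroup
coordinates, while existing flat-array data keeps working.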
