Starting with JSON seems like a reasonable choice.

Mike

Marvin Humphrey <[email protected]> wrote:
> Greets,
>
> Lucy indexes will contain significant metadata, which should be written in a
> human-readable format for easy spelunking and debugging. There are probably
> four main contenders for choice of encoding: JSON, YAML, XML, and a custom
> format.
>
> If we go with a custom format, IMO it should be an extension of JSON. Our
> needs will not be limited to simple key-value pairs, and designing our own
> full-featured data-description language would be foolish. Let's try to avoid
> custom formats until we decide that there's no other choice.
>
> XML and YAML are certainly sophisticated enough to handle our data needs.
> However, they both require large, heavyweight parsers, and I think we should
> try to avoid imposing such a dependency on future Lucy C users.
>
> Furthermore, XML is less well-matched to the scalar-list-mapping data
> structures common to the dynamic languages that Lucy targets than either YAML
> or JSON.
>
> YAML offers the advantage of extensible data types. That's become more
> appealing as I've tried to figure out how to serialize entire schemas in JSON,
> including Analyzer and Similarity specifications. However, the YAML spec is
> very large. If we decide that we need YAML's features, I think we ought to
> try to limit ourselves to a subset of the spec.
>
> Still, it would be for the best if we could avoid that kind of complexity, and
> go with the simplest human-readable option that supports scalar-list-mapping
> data structures: JSON.
>
> This excerpt from the YAML 1.2 draft spec points a way forward:
>
>     http://yaml.org/spec/1.2
>
>     1.4. Relation to JSON
>
>     Both JSON and YAML aim to be human readable data interchange formats.
>     However, JSON and YAML have different priorities. JSON’s foremost design
>     goal is simplicity and universality. Thus, JSON is trivial to generate and
>     parse, at the cost of reduced human readability. It also uses a lowest
>     common denominator information model, ensuring any JSON data can be easily
>     processed by every modern programming environment.
>
>     In contrast, YAML’s foremost design goals are human readability and
>     support for serializing arbitrary native data structures. Thus, YAML
>     allows for extremely readable files, but is more complex to generate and
>     parse. In addition, YAML ventures beyond the lowest common denominator
>     data types, requiring more complex processing when crossing between
>     different programming environments.
>
>     YAML can therefore be viewed as a natural superset of JSON, offering
>     improved human readability and a more complete information model. This is
>     also the case in practice; every JSON file is also a valid YAML file. This
>     makes it easy to migrate from JSON to YAML if/when the additional features
>     are required.
>
>     It may be useful to define a intermediate format between YAML and JSON.
>     Such a format would be trivial to parse (but not very human readable),
>     like JSON. At the same time, it would allow for serializing arbitrary
>     native data structures, like YAML. Such a format might also serve as
>     YAML’s "canonical format".
>
>     Defining such a "YSON" format (YSON is a Serialized Object Notation) can
>     be done either by enhancing the JSON specification or by restricting the
>     YAML specification. Such a definition is beyond the scope of this
>     specification.
>
> (Note that YAML version 1.2 is not well supported yet; most parsers support
> 1.0 or 1.1.)
>
> I'm sure we can hammer all the data we need into JSON; it's just a matter of
> at what point it becomes so inelegant that wandering outside the JSON spec
> into YAML becomes the best solution. That's not a threshold we should cross
> lightly, so for now I advocate that we try to work within JSON's constraints.
>
> Marvin Humphrey
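
To make the "hammer it into JSON" idea concrete, here is a rough sketch of how typed objects such as Analyzer and Similarity specifications might be squeezed into plain scalar-list-mapping data. It is only an illustration under stated assumptions: the "_class" key, the class names, their fields, and the registry are all hypothetical and are not Lucy's actual format or API. Python is used purely for brevity.

```python
import json

# Hypothetical convention: a typed object becomes a plain mapping whose
# "_class" key names its type. This is an assumption, not Lucy's format.
CLASS_KEY = "_class"


class Analyzer:
    """Stand-in for an analyzer spec (hypothetical fields)."""

    def __init__(self, language):
        self.language = language

    def to_mapping(self):
        return {CLASS_KEY: "Analyzer", "language": self.language}


class Similarity:
    """Stand-in for a similarity spec (hypothetical fields)."""

    def __init__(self, length_norm):
        self.length_norm = length_norm

    def to_mapping(self):
        return {CLASS_KEY: "Similarity", "length_norm": self.length_norm}


# Maps the "_class" tag back to a constructor when loading.
REGISTRY = {"Analyzer": Analyzer, "Similarity": Similarity}


def dump(obj):
    """Reduce objects to scalars, lists, and mappings that JSON can hold."""
    if hasattr(obj, "to_mapping"):
        return dump(obj.to_mapping())
    if isinstance(obj, dict):
        return {key: dump(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [dump(value) for value in obj]
    return obj


def load(node):
    """Rebuild tagged mappings into objects; leave everything else as-is."""
    if isinstance(node, list):
        return [load(value) for value in node]
    if isinstance(node, dict):
        fields = {k: load(v) for k, v in node.items() if k != CLASS_KEY}
        if CLASS_KEY in node:
            return REGISTRY[node[CLASS_KEY]](**fields)
        return fields
    return node


def schema_to_json(schema):
    """Serialize to indented, key-sorted JSON for easy spelunking."""
    return json.dumps(dump(schema), indent=2, sort_keys=True)


def schema_from_json(text):
    return load(json.loads(text))
```

The output stays within the JSON spec, so any JSON parser can read it; the cost is that the type tag is a convention layered on top rather than something the format itself understands, which is roughly the inelegance the quoted message anticipates.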
