Yes, definitely; sorry for not making that clearer. As part of this process we should draw up a documentation page explaining how a third-party user should interpret the version numbers, and how we will document experimental features. For example, we might add an experimental new logical type and decide after a few minor versions that we need to change its memory representation.
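Such a page might boil down to a contract like the following. This is a purely hypothetical sketch (none of these names exist in the Arrow libraries) of how a third-party user would interpret version numbers under the semver contract discussed below, with experimental features flagged separately in the docs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the semver interpretation such a page would document:
// same major version => format-compatible; a higher minor version may add
// features (e.g. experimental types) but must not break existing readers.
public final class ArrowVersionPolicy {
    private static final Pattern SEMVER = Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)");

    /** Returns true if data written by writerVersion should be readable by a
     *  library at readerVersion, under the proposed contract. */
    public static boolean formatCompatible(String writerVersion, String readerVersion) {
        Matcher w = SEMVER.matcher(writerVersion);
        Matcher r = SEMVER.matcher(readerVersion);
        if (!w.matches() || !r.matches()) {
            throw new IllegalArgumentException("not a MAJOR.MINOR.PATCH version");
        }
        // Only the major number participates in the format contract.
        return Integer.parseInt(w.group(1)) == Integer.parseInt(r.group(1));
    }
}
```

For instance, formatCompatible("1.3.0", "1.1.2") would be true, while formatCompatible("2.0.0", "1.9.0") would be false.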
On Wed, Jul 26, 2017 at 3:03 PM, Julian Hyde <jh...@apache.org> wrote:
> It sounds as if you agree with me: it is very important that we clearly state
> which bits of Arrow are fixed and which bits are not.
>
>> On Jul 26, 2017, at 11:56 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Given the nature of the Arrow project, where any number of different
>> implementations will be in flux at any given time, claiming any sort
>> of API stability at the code level across the whole project seems
>> impossible any time soon.
>>
>> The important commitment of a 1.0 release is that the metadata and
>> memory format will not change (without a change in the major version
>> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API", in a sense, is the
>> memory format and serialized metadata representation. That is, the
>> files in
>>
>> https://github.com/apache/arrow/tree/master/format
>>
>> Having this kind of stability is really important so that systems that
>> know how to parse or emit Arrow 1.x data, but aren't necessarily using
>> the libraries provided by the project, can have some assurance that we
>> aren't going to break the Flatbuffers or the arrangement of bytes in a
>> record batch on the wire. If that makes sense.
>>
>> - Wes
>>
>> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jh...@apache.org> wrote:
>>> 1.0 is a Big Deal because, under semantic versioning, there is a
>>> commitment not to change public APIs. If it weren't for that, 1.0
>>> would have vague marketing connotations of robustness, adoption, etc.,
>>> but otherwise be no different from any other release.
>>>
>>> So, if API and data format lifecycle and compatibility is the goal here,
>>> would it be useful to introduce explicit flags on API maturity? Call out
>>> which APIs are public, and therefore bound by the semantic versioning
>>> contract. This will also give Arrow some room to add experimental
>>> features after 1.0, and avoid calcification.
>>>
>>> Julian
>>>
>>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
>>>> integration-testing the remaining data types. We are so close to
>>>> having everything tested and stable; we should push to complete these
>>>> as soon as possible (save for Map, which has only just been added to
>>>> the metadata).
>>>>
>>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>> I agree those things would be nice to have. Hardening the memory
>>>>> format details probably would not take longer than a month or so if
>>>>> we were to focus in on it.
>>>>>
>>>>> Formalizing REST / RPC or IPC seems like it will be more work, or
>>>>> will require a design period and then an initial implementation. I
>>>>> think having the streaming format implementations is a good start,
>>>>> but the streams are a bit monolithic -- e.g. in REST you might want
>>>>> to request metadata only, or only record batches given a known
>>>>> schema. We should create a proposal document (Google Docs?) for the
>>>>> community to comment on, where we can iterate on requirements.
>>>>>
>>>>> Separately, I'm interested in embedding Arrow streams in other
>>>>> transport layers, like GRPC. The recent refactoring in C++ to make
>>>>> the streams less monolithic was intended to help with that; a rough
>>>>> sketch of the kind of interface this points toward is below.
>>>>>
>>>>> - Wes
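A rough sketch of what such a non-monolithic, transport-agnostic surface could look like. All names here are hypothetical, not an existing Arrow API; the point is only that schema and record batches are requested independently, so the same interface could back a REST endpoint or a GRPC service:

```java
import java.nio.ByteBuffer;
import java.util.Iterator;

// Hypothetical sketch only: a dataset service where metadata and record
// batches can be fetched separately, rather than as one monolithic stream.
// None of these names exist in the Arrow libraries.
interface ArrowDatasetService {

    /** The serialized Schema message only -- for clients that want
     *  just the metadata. */
    ByteBuffer getSchema(String datasetId);

    /** Record batches for a schema the client already knows, delivered
     *  one encapsulated IPC message at a time. */
    Iterator<ByteBuffer> getRecordBatches(String datasetId);
}
```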
>>>>>
>>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <jacq...@apache.org>
>>>>> wrote:
>>>>>> Top things on my list:
>>>>>>
>>>>>> - Formalize Arrow RPC and/or REST
>>>>>> - Some reference transformation algorithms
>>>>>> - Prototype IPC
>>>>>>
>>>>>> On Mon, Jul 24, 2017 at 9:47 AM, Wes McKinney <wesmck...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hi folks,
>>>>>>>
>>>>>>> In recent discussions, since the Arrow memory format and metadata
>>>>>>> have become reasonably stabilized, and we're more likely to add new
>>>>>>> data types than to change existing ones, we may consider making a
>>>>>>> 1.0.0 release to declare to the rest of the open source world that
>>>>>>> "Arrow is open for business" and can be relied upon in production
>>>>>>> applications (with some reasonable tolerance for library API
>>>>>>> changes from major release to major release). I hope we can all
>>>>>>> agree that forward and backward compatibility in the zero-copy wire
>>>>>>> format and metadata is the most essential thing.
>>>>>>>
>>>>>>> To that end, I'd like to collect ideas for what needs to be
>>>>>>> accomplished in the project before we'd be comfortable making a
>>>>>>> 1.0.0 release. I think it would be a good show of project stability
>>>>>>> / production-readiness to do this (with the caveat that APIs will
>>>>>>> continue to evolve).
>>>>>>>
>>>>>>> The main things on my end are hardening the memory format and
>>>>>>> integration tests for the remaining data types:
>>>>>>>
>>>>>>> - Decimals
>>>>>>>   - Lingering issues with 128-bit decimals
>>>>>>>   - Need integration tests
>>>>>>> - Fixed size list
>>>>>>>   - Java has implemented this, but not C++. Need integration tests
>>>>>>> - Union
>>>>>>>   - There are two kinds of unions; Java only implements one. Need
>>>>>>>     integration tests
>>>>>>>
>>>>>>> Of these, Decimals need the most work, since the memory format
>>>>>>> needs to be specified. On Unions, we may decide not to implement
>>>>>>> the dense variant and focus on integration-testing the sparse
>>>>>>> variant. I don't think this is going to be too much work, but it
>>>>>>> needs to get sorted out so we don't have incomplete or under-tested
>>>>>>> parts of the specification.
>>>>>>>
>>>>>>> There are some other things being discussed, like a Map logical
>>>>>>> type, but that (at least as currently proposed) won't require any
>>>>>>> disruptive modifications to the metadata.
>>>>>>>
>>>>>>> As far as the metadata and memory format go, we would use the
>>>>>>> open/closed principle to guide our efforts
>>>>>>> (https://en.wikipedia.org/wiki/Open/closed_principle). For example,
>>>>>>> it would be possible to add compression or encoding at the field
>>>>>>> level without disrupting earlier versions of the software that lack
>>>>>>> these features.
>>>>>>>
>>>>>>> In the event that we do need to change the metadata or memory
>>>>>>> format in the future (which would probably be an extreme
>>>>>>> circumstance), we have the option of increasing the
>>>>>>> MetadataVersion, which is one of the first tags accompanying Arrow
>>>>>>> messages
>>>>>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
>>>>>>> So if you encounter a message that you do not support, you can
>>>>>>> raise an appropriate exception; a minimal sketch of such a check
>>>>>>> appears below.
>>>>>>>
>>>>>>> There are some other things that would be nice to prototype or
>>>>>>> specify, like a REST protocol for exposing Arrow datasets in a
>>>>>>> client-server model (sending Arrow record batches via REST HTTP
>>>>>>> calls).
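A minimal sketch of such a version check, assuming the Flatbuffers-generated Java classes produced from format/Message.fbs and format/Schema.fbs (the exact generated names and the newest supported version are assumptions here):

```java
import java.nio.ByteBuffer;

import org.apache.arrow.flatbuf.Message;
import org.apache.arrow.flatbuf.MetadataVersion;

// Sketch: reject Arrow messages whose MetadataVersion is newer than what
// this reader understands, instead of misinterpreting the bytes.
public final class MetadataVersionCheck {
    // Assumed: V3 is the newest metadata version this reader supports.
    private static final short MAX_SUPPORTED = MetadataVersion.V3;

    public static Message readMessage(ByteBuffer serializedMessage) {
        Message message = Message.getRootAsMessage(serializedMessage);
        if (message.version() > MAX_SUPPORTED) {
            throw new UnsupportedOperationException(
                "Unsupported Arrow metadata version: " + message.version());
        }
        return message;
    }
}
```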
>>>>>>>
>>>>>>> Anything else that would need to happen to move to a 1.x mainline
>>>>>>> for development? One idea would be that if we need to make any
>>>>>>> breaking changes, we would leap from 1.x to 2.0.0 and put the 1.x
>>>>>>> branches into maintenance mode.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wes