Yes, definitely; sorry for not making that clearer. As part of this process we should draw up a documentation page explaining how a third-party user should interpret the version numbers, and how we will document experimental features. For example, we might add an experimental new logical type and decide after a few minor versions that we need to change its memory representation.
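Such a page might boil down to a contract like the following. This is a purely hypothetical sketch (none of these names exist in the Arrow libraries) of how a third-party user would interpret version numbers under the semver contract discussed below, with experimental features flagged separately in the docs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the semver interpretation such a page would document:
// same major version => format-compatible; a higher minor version may add
// features (e.g. experimental types) but must not break existing readers.
public final class ArrowVersionPolicy {
    private static final Pattern SEMVER = Pattern.compile("(\\d+)\\.(\\d+)\\.(\\d+)");

    /** Returns true if data written by writerVersion should be readable by a
     *  library at readerVersion, under the proposed contract. */
    public static boolean formatCompatible(String writerVersion, String readerVersion) {
        Matcher w = SEMVER.matcher(writerVersion);
        Matcher r = SEMVER.matcher(readerVersion);
        if (!w.matches() || !r.matches()) {
            throw new IllegalArgumentException("not a MAJOR.MINOR.PATCH version");
        }
        // Only the major number participates in the format contract.
        return Integer.parseInt(w.group(1)) == Integer.parseInt(r.group(1));
    }
}
```

For instance, formatCompatible("1.3.0", "1.1.2") would be true, while formatCompatible("2.0.0", "1.9.0") would be false.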
On Wed, Jul 26, 2017 at 3:03 PM, Julian Hyde <jh...@apache.org> wrote:
> It sounds as if you agree with me: it is very important that we clearly state
> which bits of Arrow are fixed and which bits are not.
>
>> On Jul 26, 2017, at 11:56 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Given the nature of the Arrow project, where any number of different
>> implementations will be in flux at any given time, claiming any sort
>> of API stability at the code level across the whole project seems
>> impossible any time soon.
>>
>> The important commitment of a 1.0 release is that the metadata and
>> memory format will not change (without a change in the major version
>> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API", in a sense, is the
>> memory format and serialized metadata representation. That is, the
>> files in
>>
>> https://github.com/apache/arrow/tree/master/format
>>
>> Having this kind of stability is really important so that systems that
>> know how to parse or emit Arrow 1.x data, but aren't necessarily using
>> the libraries provided by the project, can have some assurance that we
>> aren't going to break the Flatbuffers or the arrangement of bytes in a
>> record batch on the wire. If that makes sense.
>>
>> - Wes
>>
>> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jh...@apache.org> wrote:
>>> 1.0 is a Big Deal because, under semantic versioning, there is a
>>> commitment not to change public APIs. If it weren't for that, 1.0
>>> would have vague marketing connotations of robustness, adoption, etc.,
>>> but otherwise be no different from any other release.
>>>
>>> So, if API and data format lifecycle and compatibility is the goal here,
>>> would it be useful to introduce explicit flags on API maturity? Call out
>>> which APIs are public, and therefore bound by the semantic versioning
>>> contract. This will also give Arrow some room to add experimental
>>> features after 1.0, and avoid calcification.
>>>
>>> Julian
>>>
>>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
>>>> integration-testing the remaining data types. We are so close to
>>>> having everything tested and stable; we should push to complete these
>>>> as soon as possible (save for Map, which has only just been added to
>>>> the metadata).
>>>>
>>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>> I agree those things would be nice to have. Hardening the memory
>>>>> format details probably would not take longer than a month or so if
>>>>> we were to focus in on it.
>>>>>
>>>>> Formalizing REST / RPC or IPC seems like it will be more work, or
>>>>> will require a design period and then an initial implementation. I
>>>>> think having the streaming format implementations is a good start,
>>>>> but the streams are a bit monolithic -- e.g. in REST you might want
>>>>> to request metadata only, or only record batches given a known
>>>>> schema. We should create a proposal document (Google Docs?) for the
>>>>> community to comment on, where we can iterate on requirements.
>>>>>
>>>>> Separately, I'm interested in embedding Arrow streams in other
>>>>> transport layers, like GRPC. The recent refactoring in C++ to make
>>>>> the streams less monolithic was intended to help with that; a rough
>>>>> sketch of the kind of interface this points toward is below.
>>>>>
>>>>> - Wes
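A rough sketch of what such a non-monolithic, transport-agnostic surface could look like. All names here are hypothetical, not an existing Arrow API; the point is only that schema and record batches are requested independently, so the same interface could back a REST endpoint or a GRPC service:

```java
import java.nio.ByteBuffer;
import java.util.Iterator;

// Hypothetical sketch only: a dataset service where metadata and record
// batches can be fetched separately, rather than as one monolithic stream.
// None of these names exist in the Arrow libraries.
interface ArrowDatasetService {

    /** The serialized Schema message only -- for clients that want
     *  just the metadata. */
    ByteBuffer getSchema(String datasetId);

    /** Record batches for a schema the client already knows, delivered
     *  one encapsulated IPC message at a time. */
    Iterator<ByteBuffer> getRecordBatches(String datasetId);
}
```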
>>>>>
>>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <jacq...@apache.org>
>>>>> wrote:
>>>>>> Top things on my list:
>>>>>>
>>>>>> - Formalize Arrow RPC and/or REST
>>>>>> - Some reference transformation algorithms
>>>>>> - Prototype IPC
>>>>>>
>>>>>> On Mon, Jul 24, 2017 at 9:47 AM, Wes McKinney <wesmck...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> hi folks,
>>>>>>>
>>>>>>> In recent discussions, since the Arrow memory format and metadata
>>>>>>> have become reasonably stabilized, and we're more likely to add new
>>>>>>> data types than to change existing ones, we may consider making a
>>>>>>> 1.0.0 release to declare to the rest of the open source world that
>>>>>>> "Arrow is open for business" and can be relied upon in production
>>>>>>> applications (with some reasonable tolerance for library API
>>>>>>> changes from major release to major release). I hope we can all
>>>>>>> agree that forward and backward compatibility in the zero-copy wire
>>>>>>> format and metadata is the most essential thing.
>>>>>>>
>>>>>>> To that end, I'd like to collect ideas for what needs to be
>>>>>>> accomplished in the project before we'd be comfortable making a
>>>>>>> 1.0.0 release. I think it would be a good show of project stability
>>>>>>> / production-readiness to do this (with the caveat that APIs will
>>>>>>> continue to evolve).
>>>>>>>
>>>>>>> The main things on my end are hardening the memory format and
>>>>>>> integration tests for the remaining data types:
>>>>>>>
>>>>>>> - Decimals
>>>>>>>   - Lingering issues with 128-bit decimals
>>>>>>>   - Need integration tests
>>>>>>> - Fixed size list
>>>>>>>   - Java has implemented this, but not C++. Need integration tests
>>>>>>> - Union
>>>>>>>   - There are two kinds of unions; Java only implements one. Need
>>>>>>>     integration tests
>>>>>>>
>>>>>>> Of these, Decimals need the most work, since the memory format
>>>>>>> needs to be specified. On Unions, we may decide not to implement
>>>>>>> the dense variant and focus on integration-testing the sparse
>>>>>>> variant. I don't think this is going to be too much work, but it
>>>>>>> needs to get sorted out so we don't have incomplete or under-tested
>>>>>>> parts of the specification.
>>>>>>>
>>>>>>> There are some other things being discussed, like a Map logical
>>>>>>> type, but that (at least as currently proposed) won't require any
>>>>>>> disruptive modifications to the metadata.
>>>>>>>
>>>>>>> As far as the metadata and memory format go, we would use the
>>>>>>> open/closed principle to guide our efforts
>>>>>>> (https://en.wikipedia.org/wiki/Open/closed_principle). For example,
>>>>>>> it would be possible to add compression or encoding at the field
>>>>>>> level without disrupting earlier versions of the software that lack
>>>>>>> these features.
>>>>>>>
>>>>>>> In the event that we do need to change the metadata or memory
>>>>>>> format in the future (which would probably be an extreme
>>>>>>> circumstance), we have the option of increasing the
>>>>>>> MetadataVersion, which is one of the first tags accompanying Arrow
>>>>>>> messages
>>>>>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
>>>>>>> So if you encounter a message that you do not support, you can
>>>>>>> raise an appropriate exception; a minimal sketch of such a check
>>>>>>> appears below.
>>>>>>>
>>>>>>> There are some other things that would be nice to prototype or
>>>>>>> specify, like a REST protocol for exposing Arrow datasets in a
>>>>>>> client-server model (sending Arrow record batches via REST HTTP
>>>>>>> calls).
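A minimal sketch of such a version check, assuming the Flatbuffers-generated Java classes produced from format/Message.fbs and format/Schema.fbs (the exact generated names and the newest supported version are assumptions here):

```java
import java.nio.ByteBuffer;

import org.apache.arrow.flatbuf.Message;
import org.apache.arrow.flatbuf.MetadataVersion;

// Sketch: reject Arrow messages whose MetadataVersion is newer than what
// this reader understands, instead of misinterpreting the bytes.
public final class MetadataVersionCheck {
    // Assumed: V3 is the newest metadata version this reader supports.
    private static final short MAX_SUPPORTED = MetadataVersion.V3;

    public static Message readMessage(ByteBuffer serializedMessage) {
        Message message = Message.getRootAsMessage(serializedMessage);
        if (message.version() > MAX_SUPPORTED) {
            throw new UnsupportedOperationException(
                "Unsupported Arrow metadata version: " + message.version());
        }
        return message;
    }
}
```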
>>>>>>>
>>>>>>> Anything else that would need to happen to move to a 1.x mainline
>>>>>>> for development? One idea would be that if we need to make any
>>>>>>> breaking changes, we would leap from 1.x to 2.0.0 and put the 1.x
>>>>>>> branches into maintenance mode.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Wes