hermanschaaf commented on code in PR #34806: URL: https://github.com/apache/arrow/pull/34806#discussion_r1157705189
##########
go/arrow/array/diff.go:
##########
@@ -0,0 +1,248 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package array
+
+import (
+	"fmt"
+
+	"github.com/apache/arrow/go/v12/arrow"
+)
+
+// Diff compares two arrays, returning an edit script which expresses the difference
+// between them. The edit script can be applied to the base array to produce the target.
+//
+// An edit script represents the difference between two arrays.
+// Each element of "insert" determines whether an element was inserted into (true)
+// or deleted from (false) base. Each insertion or deletion is followed by a run of
+// elements which are unchanged from base to target; the length of this run is stored
+// in RunLength. (Note that the edit script begins and ends with a run of shared
+// elements but both fields of the struct must have the same length. To accommodate this
+// the first element of "insert" should be ignored.)
+//
+// For example for base "hlloo" and target "hello", the edit script would be
+// [
+//
+//	{"insert": false, "run_length": 1}, // leading run of length 1 ("h")
+//	{"insert": true, "run_length": 3}, // insert("e") then a run of length 3 ("llo")
+//	{"insert": false, "run_length": 0} // delete("o") then an empty run
+//
+// ]
+// base: baseline for comparison
+// target: an array of identical type to base whose elements differ from base's
+func Diff(base, target arrow.Array) (inserts []bool, runLengths []int64, err error) {

Review Comment:
I'm fine either way. I think the main downside is that a single `Edit` is somewhat meaningless in isolation: it needs to be part of a full edit script to be useful, as only the edits leading up to it can give you the position in the array it refers to. I also considered something like:

```
type EditScript struct {
	Inserts    []bool
	RunLengths []int64
}
```

But I ended up returning `inserts []bool, runLengths []int64` separately so we don't need to introduce a new exported type to the `array` package.

We can also do

```
type Edit struct {
	Insert    bool
	RunLength int64
}
```

like you suggested, paired with

```
type Edits []Edit
```

which could then be returned by `Diff`, and perhaps we could then add a `String()` method on the `Edits` type that returns the diff in unified format? So usage would be like:

```
edits, err := array.Diff(left, right)
if err != nil {
	...
}
diffStr := edits.String()
```

I think I like this the most :)

Lots of options here, and it also depends on how much leeway we have to go our own way in the Go implementation. Let me know your preferred solution and I'll update it.
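To make the `Edit` / `Edits` option concrete, here is a minimal sketch of how such a script could be walked. It is not the PR's code: it uses plain `[]string` slices instead of `arrow.Array` so that it stands alone, and the method name `Render` and its signature are assumptions made for illustration (the comment above proposes a `String()` method, but rendering needs access to the element values). It also shows why a single `Edit` means little in isolation: the base and target positions are only known by accumulating the run lengths of all preceding edits.

```go
package main

import (
	"fmt"
	"strings"
)

// Edit is one step of an edit script: an insertion (true) or deletion (false)
// followed by a run of RunLength elements shared by base and target.
type Edit struct {
	Insert    bool
	RunLength int64
}

// Edits is a complete edit script. The Insert field of the first element is
// ignored, because a script always starts with a (possibly empty) shared run.
type Edits []Edit

// Render is a hypothetical helper (not part of the arrow API) that walks the
// script and prints one line per element: "+x" for an insertion, "-x" for a
// deletion, " x" for an unchanged element. It assumes the script is
// consistent with base and target and does no bounds checking.
func (e Edits) Render(base, target []string) string {
	var sb strings.Builder
	var bi, ti int64 // current positions in base and target
	for i, ed := range e {
		if i > 0 { // the first element carries only the leading run
			if ed.Insert {
				fmt.Fprintf(&sb, "+%s\n", target[ti])
				ti++
			} else {
				fmt.Fprintf(&sb, "-%s\n", base[bi])
				bi++
			}
		}
		// A shared run advances both positions in lockstep.
		for j := int64(0); j < ed.RunLength; j++ {
			fmt.Fprintf(&sb, " %s\n", base[bi])
			bi++
			ti++
		}
	}
	return sb.String()
}
```

Calling `Render` with the script from the doc comment (`{false, 1}, {true, 3}, {false, 0}`) on base "hlloo" and target "hello", each split into single-character elements, yields the lines ` h`, `+e`, ` l`, ` l`, ` o`, `-o`.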
