hermanschaaf commented on code in PR #34806:
URL: https://github.com/apache/arrow/pull/34806#discussion_r1157705189


##########
go/arrow/array/diff.go:
##########
@@ -0,0 +1,248 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package array
+
+import (
+       "fmt"
+
+       "github.com/apache/arrow/go/v12/arrow"
+)
+
+// Diff compares two arrays, returning an edit script which expresses the difference
+// between them. The edit script can be applied to the base array to produce the target.
+//
+// An edit script represents the difference between two arrays.
+// Each element of "insert" determines whether an element was inserted into (true)
+// or deleted from (false) base. Each insertion or deletion is followed by a run of
+// elements which are unchanged from base to target; the length of this run is stored
+// in RunLength. (Note that the edit script begins and ends with a run of shared
+// elements but both fields of the struct must have the same length. To accommodate this
+// the first element of "insert" should be ignored.)
+//
+// For example for base "hlloo" and target "hello", the edit script would be
+// [
+//
+//     {"insert": false, "run_length": 1}, // leading run of length 1 ("h")
+//     {"insert": true, "run_length": 3}, // insert("e") then a run of length 3 ("llo")
+//     {"insert": false, "run_length": 0} // delete("o") then an empty run
+//
+// ]
+// base: baseline for comparison
+// target: an array of identical type to base whose elements differ from base's
+func Diff(base, target arrow.Array) (inserts []bool, runLengths []int64, err error) {

Review Comment:
   I'm fine either way. I think the main downside is that a single `Edit` is 
somewhat meaningless in isolation: it needs to be part of a full edit script 
to be useful, because only the edits leading up to it tell you which position 
in the array it refers to.
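
To make the "meaningless in isolation" point concrete, here is a minimal sketch of how the position of each edit has to be accumulated from everything before it. It assumes the `Edit` struct discussed in this thread; `editPosition` and `positions` are hypothetical names used only for illustration and are not part of the PR.

```go
package main

import "fmt"

// Edit mirrors the per-step struct discussed in this thread (an assumption,
// not an existing exported type).
type Edit struct {
	Insert    bool
	RunLength int64
}

// editPosition is a hypothetical helper type: it records where a single
// insertion or deletion applies in base and in target.
type editPosition struct {
	Insert      bool
	BaseIndex   int64
	TargetIndex int64
}

// positions walks the script from the start, because an edit's position is
// only known once all preceding edits and runs have been accounted for.
func positions(script []Edit) []editPosition {
	if len(script) == 0 {
		return nil
	}
	// The first entry carries only the leading run; its Insert flag is ignored.
	baseIdx, targetIdx := script[0].RunLength, script[0].RunLength
	out := make([]editPosition, 0, len(script)-1)
	for _, e := range script[1:] {
		out = append(out, editPosition{Insert: e.Insert, BaseIndex: baseIdx, TargetIndex: targetIdx})
		if e.Insert {
			targetIdx++ // an inserted element exists only in target
		} else {
			baseIdx++ // a deleted element exists only in base
		}
		// The run of shared elements advances both sides.
		baseIdx += e.RunLength
		targetIdx += e.RunLength
	}
	return out
}

func main() {
	// Script for base "hlloo" and target "hello", taken from the doc comment.
	script := []Edit{{false, 1}, {true, 3}, {false, 0}}
	for _, p := range positions(script) {
		fmt.Printf("%+v\n", p)
	}
}
```

For the "hlloo" to "hello" script, this places the insertion at index 1 on both sides and the deletion at base index 4, neither of which can be read off the individual `Edit` values alone.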
   
   I also considered something like:
   
   ```
   type EditScript struct {
       Inserts    []bool
       RunLengths []int64
   }
   ```
   
   But I ended up returning `inserts []bool, runLengths []int64` separately so 
that we don't need to introduce a new exported type to the `array` package.
   
   We can also do 
   
   ```
   type Edit struct {
       Insert bool
       RunLength int64
   }
   ```
   
   like you suggested, paired with
   
   ```
   type Edits []Edit
   ```
   
   which could then be returned by `Diff`, and perhaps we could then add a 
`String()` method on the `Edits` type that returns the diff in unified format?
   
   So usage would be like:
   
   ```
   edits, err := array.Diff(left, right)
   if err != nil { ... }
   diffStr := edits.String()
   ```
   
   I think I like this the most :) Lots of options here, and it also depends on 
how much leeway we have to go our own way in the Go implementation. Let me know 
your preferred solution and I'll update it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
