Update :
the following code should be more efficient for getting the actual
substring(s) replacing the loop starting "For($i;1;$Size)".
The following code, uses the built up 2D array to determine where the
longest substring(s) end, this gives a position in the shorter string
where the common string ends. Then we 'walk backwards' over the shorter
string by common string length -1 to build the common substring(s).
This avoids a lot of thrashing with position and substring.
ARRAY TEXT($aMatch;0)
For ($i;1;$m) //$m - is length of shorter string/dimension of 2D array
$Found:=Find in array($LCSub{$i};$longest)
If ($Found>0)
$string:=""
For ($j;1;$longest) //Build up the common substring
$String:=$shorter[[$Found-$j+1]]+$string
End for
append to array($aMatch;$String) // add to array to return
End if
End for
On Mon, 26 Aug 2019 13:55:02 -0400, Chip Scheide via 4D_Tech wrote:
> Keith,
> Thanks for you input!!
> we seem to have a nice collaboration on text utilities :)
>
> I ran the code you provided, and found an issue.
> Given:
> "This is my dog" & "My Dog does not have fleas"
>
> - Your initial code generates NO matches - you use Position with Case
> sensitivity ON, and as for position > 1, missing matches starting at
> position 1.
> - With out case sensitivity, & Position > 0, Your code generates the
> longest common string as length 5, and "my do"
> This is because you end up truncating the last character comparison due
> to using [[$i-1]]
>
> Follows, modified code to handle requests for both case sensitivity,
> and not, as well as removing the truncation issue.
>
>
> I am thinking that it might be slightly more efficient to:
> - compare the first 2 characters -- removing 1 iteration over the
> longer string. But it might not be worth the housekeeping to code that.
> - to build the final longest string as you go, rather then reiterate
> back over the strings looking for matches of the longest length.
> Where X = the number of times the longest common string length can be
> found in the shortest string... it would save X substrings, 2X
> positions X find in array, 2X assignments, and 2X comparisons + the
> loop counter incrementation and comparison.
>
>
> For relatively short strings, there would likely be no significant
> difference in execution time, however, very long strings (what length
> is 'long' I do not know), or repeated execution of this code in some
> sort of loop the difference might be large.
>
>
> Code start::
>
> // ----------------------------------------------------
> // Method: Suffix_Tree - from Wiki examples
> // -
> // INPUT1: Text - to compare
> // INPUT2: Text - to compare
> // INPUT3: Pointer - array of longest matches
> // INPUT4: Boolean - Do case sensitive comparisons (default = false)
> // OUTPUT: Longint - length
> // ----------------------------------------------------
> C_LONGINT($i;$size;$longest;$0)
> C_TEXT($longer;$shorter;$temp)
> C_POINTER($3;$Return_Array)
> C_BOOLEAN($4;$Case_Sensitive;$Same)
> $str1:=$1
> $str2:=$2
> $Return_Array:=$3
>
> If (Count parameters>=4)
> $Case_Sensitive:=$4
> End if
>
> If (Length($str1)>Length($str2))
> $longer:=$str1
> $shorter:=$str2
> Else
> $longer:=$str2
> $shorter:=$str1
> End if
>
> $m:=Length($longer) // is the longer string
> $n:=Length($shorter)
>
> ARRAY LONGINT($LCSub;$m;$n)
> $longest:=0
>
> For ($i;1;$m)
> For ($j;1;$n)
>
> If ($Case_Sensitive) //doing case sensitive comparision - yes
> $Same:=(Character code($longer[[$i]])=Character code($shorter[[$j]]))
> Else //- no
> $Same:=($longer[[$i]]=$shorter[[$j]])
> End if
>
> If ($Same) //are the current strings the same? - yes
>
> //depending on the comparision point add this character count
> Case of
> : ($i=1)
> $LCSub{$i}{$j}:=$LCSub{$i}{$j}+1
> : ($j=1)
> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j}+1
> Else
> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
> End case
>
> //longer then the previous longest?
> If ($longest<$LCSub{$i}{$j}) //
> $longest:=$LCSub{$i}{$j} //update the longest substring found
> End if
> Else
> $LCSub{$i}{$j}:=0
> End if
>
> End for
> End for
>
> // search for matches of $longest length
> ARRAY TEXT($aMatch;0)
> $size:=Length($shorter)-$longest+1
>
> //for every possible substring of the longest length
> //try to find substring matches
> For ($i;1;$size)
> $temp:=Substring($shorter;$i;$longest)
>
> If ($Case_Sensitive) //case sensitive comparison - Yes
> $Found:=Position($temp;$longer;*)
> Else //-no
> $Found:=Position($temp;$longer)
> End if
>
> If ($Found>0) //note was >1
>
> //hve we found this substrig before?
> If (Find in array($aMatch;$temp)=-1)
> APPEND TO ARRAY($aMatch;$temp) //- no
> End if
> End if
> End for
>
> COPY ARRAY($aMatch;$Return_Array->)
> $0:=$longest
>
>
> On Sat, 24 Aug 2019 16:23:56 -0500, Keith Culotta via 4D_Tech wrote:
>> That's great. It's fast, and reduces the number of searches.
>>
>> Keith - CDI
>>
>>
>> // ----------------------------------------------------
>> // Method: Suffix_Tree - from Wiki examples
>> // -
>> // INPUT1: Text - to compare
>> // INPUT2: Text - to compare
>> // INPUT3: Pointer - array of longest matches
>> // OUTPUT: Longint - length
>> // ----------------------------------------------------
>> C_LONGINT($i;$size;$longest;$0)
>> C_TEXT($longer;$shorter;$temp)
>> $str1:=$1
>> $str2:=$2
>>
>> If (Length($str1)>Length($str2))
>> $longer:=$str1
>> $shorter:=$str2
>> Else
>> $longer:=$str2
>> $shorter:=$str1
>> End if
>>
>> $m:=Length($longer) // is the longer string
>> $n:=Length($shorter)
>>
>> ARRAY LONGINT($LCSub;$m;$n)
>> $longest:=0
>>
>> For ($i;1;$m)
>> For ($j;1;$n)
>> If ($i=1) | ($j=1)
>> $LCSub{$i}{$j}:=0
>>
>> Else
>> If ($longer[[$i-1]]=$shorter[[$j-1]])
>> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
>> If ($longest<$LCSub{$i}{$j})
>> $longest:=$LCSub{$i}{$j}
>> End if
>> Else
>> $LCSub{$i}{$j}:=0
>> End if
>> End if
>>
>> End for
>> End for
>>
>> // search for matches of $longest length
>> ARRAY TEXT($aMatch;0)
>> $size:=Length($shorter)-$longest+1
>> For ($i;1;$size)
>> $temp:=Substring($shorter;$i;$longest)
>> If (Position($temp;$longer;*)>1)
>> If (Find in array($aMatch;$temp)=-1)
>> APPEND TO ARRAY($aMatch;$temp)
>> End if
>> End if
>> End for
>>
>> COPY ARRAY($aMatch;$3->)
>> $0:=$longest
>>
>>
>>
>>
>>> On Aug 24, 2019, at 2:41 PM, John DeSoi via 4D_Tech
>>> <[email protected]> wrote:
>>>
>>> See
>>>
>>> https://en.wikipedia.org/wiki/Longest_common_substring_problem
>>>
>>> John DeSoi, Ph.D.
>>>
>>>
>>>> On Aug 24, 2019, at 10:48 AM, Keith Culotta via 4D_Tech
>>>> <[email protected]> wrote:
>>>>
>>>> This version stands alone, and runs a little more efficiently.
>>>> It's not been tested in every way, but the results are
>>>> encouraging. Interesting problem. I can't think of a way to do it
>>>> without comparing every character combination. The new "Split
>>>> string" command would speed part of this up if Collections could be
>>>> used.
>>>
>>
>> **********************************************************************
>> 4D Internet Users Group (4D iNUG)
>> Archive: http://lists.4d.com/archives.html
>> Options: https://lists.4d.com/mailman/options/4d_tech
>> Unsub: mailto:[email protected]
>> **********************************************************************
> ---------------
> Gas is for washing parts
> Alcohol is for drinkin'
> Nitromethane is for racing
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> Archive: http://lists.4d.com/archives.html
> Options: https://lists.4d.com/mailman/options/4d_tech
> Unsub: mailto:[email protected]
> **********************************************************************
---------------
Gas is for washing parts
Alcohol is for drinkin'
Nitromethane is for racing
**********************************************************************
4D Internet Users Group (4D iNUG)
Archive: http://lists.4d.com/archives.html
Options: https://lists.4d.com/mailman/options/4d_tech
Unsub: mailto:[email protected]
**********************************************************************