Keith,
Thanks for you input!!
we seem to have a nice collaboration on text utilities :)
I ran the code you provided, and found an issue.
Given:
"This is my dog" & "My Dog does not have fleas"
- Your initial code generates NO matches - you use Position with Case
sensitivity ON, and as for position > 1, missing matches starting at
position 1.
- With out case sensitivity, & Position > 0, Your code generates the
longest common string as length 5, and "my do"
This is because you end up truncating the last character comparison due
to using [[$i-1]]
Follows, modified code to handle requests for both case sensitivity,
and not, as well as removing the truncation issue.
I am thinking that it might be slightly more efficient to:
- compare the first 2 characters -- removing 1 iteration over the
longer string. But it might not be worth the housekeeping to code that.
- to build the final longest string as you go, rather then reiterate
back over the strings looking for matches of the longest length.
Where X = the number of times the longest common string length can be
found in the shortest string... it would save X substrings, 2X
positions X find in array, 2X assignments, and 2X comparisons + the
loop counter incrementation and comparison.
For relatively short strings, there would likely be no significant
difference in execution time, however, very long strings (what length
is 'long' I do not know), or repeated execution of this code in some
sort of loop the difference might be large.
Code start::
// ----------------------------------------------------
// Method: Suffix_Tree - from Wiki examples
// -
// INPUT1: Text - to compare
// INPUT2: Text - to compare
// INPUT3: Pointer - array of longest matches
// INPUT4: Boolean - Do case sensitive comparisons (default = false)
// OUTPUT: Longint - length
// ----------------------------------------------------
C_LONGINT($i;$size;$longest;$0)
C_TEXT($longer;$shorter;$temp)
C_POINTER($3;$Return_Array)
C_BOOLEAN($4;$Case_Sensitive;$Same)
$str1:=$1
$str2:=$2
$Return_Array:=$3
If (Count parameters>=4)
$Case_Sensitive:=$4
End if
If (Length($str1)>Length($str2))
$longer:=$str1
$shorter:=$str2
Else
$longer:=$str2
$shorter:=$str1
End if
$m:=Length($longer) // is the longer string
$n:=Length($shorter)
ARRAY LONGINT($LCSub;$m;$n)
$longest:=0
For ($i;1;$m)
For ($j;1;$n)
If ($Case_Sensitive) //doing case sensitive comparision - yes
$Same:=(Character code($longer[[$i]])=Character code($shorter[[$j]]))
Else //- no
$Same:=($longer[[$i]]=$shorter[[$j]])
End if
If ($Same) //are the current strings the same? - yes
//depending on the comparision point add this character count
Case of
: ($i=1)
$LCSub{$i}{$j}:=$LCSub{$i}{$j}+1
: ($j=1)
$LCSub{$i}{$j}:=$LCSub{$i-1}{$j}+1
Else
$LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
End case
//longer then the previous longest?
If ($longest<$LCSub{$i}{$j}) //
$longest:=$LCSub{$i}{$j} //update the longest substring found
End if
Else
$LCSub{$i}{$j}:=0
End if
End for
End for
// search for matches of $longest length
ARRAY TEXT($aMatch;0)
$size:=Length($shorter)-$longest+1
//for every possible substring of the longest length
//try to find substring matches
For ($i;1;$size)
$temp:=Substring($shorter;$i;$longest)
If ($Case_Sensitive) //case sensitive comparison - Yes
$Found:=Position($temp;$longer;*)
Else //-no
$Found:=Position($temp;$longer)
End if
If ($Found>0) //note was >1
//hve we found this substrig before?
If (Find in array($aMatch;$temp)=-1)
APPEND TO ARRAY($aMatch;$temp) //- no
End if
End if
End for
COPY ARRAY($aMatch;$Return_Array->)
$0:=$longest
On Sat, 24 Aug 2019 16:23:56 -0500, Keith Culotta via 4D_Tech wrote:
> That's great. It's fast, and reduces the number of searches.
>
> Keith - CDI
>
>
> // ----------------------------------------------------
> // Method: Suffix_Tree - from Wiki examples
> // -
> // INPUT1: Text - to compare
> // INPUT2: Text - to compare
> // INPUT3: Pointer - array of longest matches
> // OUTPUT: Longint - length
> // ----------------------------------------------------
> C_LONGINT($i;$size;$longest;$0)
> C_TEXT($longer;$shorter;$temp)
> $str1:=$1
> $str2:=$2
>
> If (Length($str1)>Length($str2))
> $longer:=$str1
> $shorter:=$str2
> Else
> $longer:=$str2
> $shorter:=$str1
> End if
>
> $m:=Length($longer) // is the longer string
> $n:=Length($shorter)
>
> ARRAY LONGINT($LCSub;$m;$n)
> $longest:=0
>
> For ($i;1;$m)
> For ($j;1;$n)
> If ($i=1) | ($j=1)
> $LCSub{$i}{$j}:=0
>
> Else
> If ($longer[[$i-1]]=$shorter[[$j-1]])
> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
> If ($longest<$LCSub{$i}{$j})
> $longest:=$LCSub{$i}{$j}
> End if
> Else
> $LCSub{$i}{$j}:=0
> End if
> End if
>
> End for
> End for
>
> // search for matches of $longest length
> ARRAY TEXT($aMatch;0)
> $size:=Length($shorter)-$longest+1
> For ($i;1;$size)
> $temp:=Substring($shorter;$i;$longest)
> If (Position($temp;$longer;*)>1)
> If (Find in array($aMatch;$temp)=-1)
> APPEND TO ARRAY($aMatch;$temp)
> End if
> End if
> End for
>
> COPY ARRAY($aMatch;$3->)
> $0:=$longest
>
>
>
>
>> On Aug 24, 2019, at 2:41 PM, John DeSoi via 4D_Tech
>> <[email protected]> wrote:
>>
>> See
>>
>> https://en.wikipedia.org/wiki/Longest_common_substring_problem
>>
>> John DeSoi, Ph.D.
>>
>>
>>> On Aug 24, 2019, at 10:48 AM, Keith Culotta via 4D_Tech
>>> <[email protected]> wrote:
>>>
>>> This version stands alone, and runs a little more efficiently.
>>> It's not been tested in every way, but the results are
>>> encouraging. Interesting problem. I can't think of a way to do it
>>> without comparing every character combination. The new "Split
>>> string" command would speed part of this up if Collections could be
>>> used.
>>
>
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> Archive: http://lists.4d.com/archives.html
> Options: https://lists.4d.com/mailman/options/4d_tech
> Unsub: mailto:[email protected]
> **********************************************************************
---------------
Gas is for washing parts
Alcohol is for drinkin'
Nitromethane is for racing
**********************************************************************
4D Internet Users Group (4D iNUG)
Archive: http://lists.4d.com/archives.html
Options: https://lists.4d.com/mailman/options/4d_tech
Unsub: mailto:[email protected]
**********************************************************************