Re: Regex expert needed??

Chip Scheide via 4D_Tech Mon, 26 Aug 2019 10:57:25 -0700

Keith,
Thanks for you input!!
we seem to have a nice collaboration on text utilities  :)


I ran the code you provided, and found an issue.
Given:
"This is my dog" & "My Dog does not have fleas"

- Your initial code generates NO matches - you use Position with Case 
sensitivity ON, and as for position > 1, missing matches starting at 
position 1.
- With out case sensitivity, & Position > 0, Your code generates the 
longest common string as length 5, and "my do"
This is because you end up truncating the last character comparison due 
to using [[$i-1]] 

Follows, modified code to handle requests for both case sensitivity, 
and not, as well as removing the truncation issue.


I am thinking that it might be slightly more efficient to:
- compare the first 2 characters -- removing 1 iteration over the 
longer string. But it might not be worth the housekeeping to code that.
- to build the final longest string as you go, rather then reiterate 
back over the strings looking for matches of the longest length.  
Where X = the number of times the longest common string length can be 
found in the shortest string... it would save X substrings, 2X 
positions X find in array, 2X assignments, and 2X comparisons + the 
loop counter incrementation and comparison.


For relatively short strings, there would likely be no significant 
difference in execution time, however, very long strings (what length 
is 'long' I do not know), or repeated execution of this code in some 
sort of loop the difference might be large.


Code start::

   // ----------------------------------------------------
  // Method: Suffix_Tree - from Wiki examples
  // - 
  // INPUT1: Text - to compare
  // INPUT2: Text - to compare
  // INPUT3: Pointer - array of longest matches
  // INPUT4: Boolean - Do case sensitive comparisons (default = false)
  // OUTPUT: Longint - length 
  // ---------------------------------------------------- 
C_LONGINT($i;$size;$longest;$0)
C_TEXT($longer;$shorter;$temp)
C_POINTER($3;$Return_Array)
C_BOOLEAN($4;$Case_Sensitive;$Same)
$str1:=$1
$str2:=$2
$Return_Array:=$3

If (Count parameters>=4)
$Case_Sensitive:=$4
End if 

If (Length($str1)>Length($str2))
$longer:=$str1
$shorter:=$str2
Else 
$longer:=$str2
$shorter:=$str1
End if 

$m:=Length($longer)  // is the longer string
$n:=Length($shorter)

ARRAY LONGINT($LCSub;$m;$n)
$longest:=0

For ($i;1;$m)
For ($j;1;$n)

If ($Case_Sensitive)  //doing case sensitive comparision - yes
$Same:=(Character code($longer[[$i]])=Character code($shorter[[$j]]))
Else   //- no
$Same:=($longer[[$i]]=$shorter[[$j]])
End if 

If ($Same)  //are the current strings the same? - yes

  //depending on the comparision point add this character count
Case of 
: ($i=1)
$LCSub{$i}{$j}:=$LCSub{$i}{$j}+1
: ($j=1)
$LCSub{$i}{$j}:=$LCSub{$i-1}{$j}+1
Else 
$LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
End case 

  //longer then the previous longest?
If ($longest<$LCSub{$i}{$j})  //
$longest:=$LCSub{$i}{$j}  //update the longest substring found
End if 
Else 
$LCSub{$i}{$j}:=0
End if 

End for 
End for 

  // search for matches of $longest length
ARRAY TEXT($aMatch;0)
$size:=Length($shorter)-$longest+1

  //for every possible substring of the longest length
  //try to find substring matches
For ($i;1;$size)
$temp:=Substring($shorter;$i;$longest)

If ($Case_Sensitive)  //case sensitive comparison - Yes
$Found:=Position($temp;$longer;*)
Else   //-no
$Found:=Position($temp;$longer)
End if 

If ($Found>0)  //note was >1

  //hve we found this substrig before?
If (Find in array($aMatch;$temp)=-1)
APPEND TO ARRAY($aMatch;$temp)  //- no
End if 
End if 
End for 

COPY ARRAY($aMatch;$Return_Array->)
$0:=$longest


On Sat, 24 Aug 2019 16:23:56 -0500, Keith Culotta via 4D_Tech wrote:
> That's great.  It's fast, and reduces the number of searches.
> 
> Keith - CDI
> 
> 
>   // ----------------------------------------------------
>   // Method: Suffix_Tree - from Wiki examples
>   // - 
>   // INPUT1: Text - to compare
>   // INPUT2: Text - to compare
>   // INPUT3: Pointer - array of longest matches
>   // OUTPUT: Longint - length 
>   // ---------------------------------------------------- 
> C_LONGINT($i;$size;$longest;$0)
> C_TEXT($longer;$shorter;$temp)
> $str1:=$1
> $str2:=$2
> 
> If (Length($str1)>Length($str2))
>       $longer:=$str1
>       $shorter:=$str2
> Else 
>       $longer:=$str2
>       $shorter:=$str1
> End if 
> 
> $m:=Length($longer)  // is the longer string
> $n:=Length($shorter)
> 
> ARRAY LONGINT($LCSub;$m;$n)
> $longest:=0
> 
> For ($i;1;$m)
>       For ($j;1;$n)
>               If ($i=1) | ($j=1)
>                       $LCSub{$i}{$j}:=0
> 
>               Else 
>                       If ($longer[[$i-1]]=$shorter[[$j-1]])
>                               $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
>                               If ($longest<$LCSub{$i}{$j})
>                                       $longest:=$LCSub{$i}{$j}
>                               End if 
>                       Else 
>                               $LCSub{$i}{$j}:=0
>                       End if 
>               End if 
> 
>       End for 
> End for 
> 
>   // search for matches of $longest length
> ARRAY TEXT($aMatch;0)
> $size:=Length($shorter)-$longest+1
> For ($i;1;$size)
>       $temp:=Substring($shorter;$i;$longest)
>       If (Position($temp;$longer;*)>1)
>               If (Find in array($aMatch;$temp)=-1)
>                       APPEND TO ARRAY($aMatch;$temp) 
>               End if 
>       End if 
> End for 
> 
> COPY ARRAY($aMatch;$3->)
> $0:=$longest
> 
> 
> 
> 
>> On Aug 24, 2019, at 2:41 PM, John DeSoi via 4D_Tech 
>> <[email protected]> wrote:
>> 
>> See 
>> 
>> https://en.wikipedia.org/wiki/Longest_common_substring_problem
>> 
>> John DeSoi, Ph.D.
>> 
>> 
>>> On Aug 24, 2019, at 10:48 AM, Keith Culotta via 4D_Tech 
>>> <[email protected]> wrote:
>>> 
>>> This version stands alone, and runs a little more efficiently.  
>>> It's not been tested in every way, but the results are 
>>> encouraging.  Interesting problem.  I can't think of a way to do it 
>>> without comparing every character combination.  The new "Split 
>>> string" command would speed part of this up if Collections could be 
>>> used.
>> 
> 
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> Archive:  http://lists.4d.com/archives.html
> Options: https://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:[email protected]
> **********************************************************************
---------------
Gas is for washing parts
Alcohol is for drinkin'
Nitromethane is for racing 
**********************************************************************
4D Internet Users Group (4D iNUG)
Archive:  http://lists.4d.com/archives.html
Options: https://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[email protected]
**********************************************************************

Re: Regex expert needed??

Reply via email to