Re: Regex expert needed??

Chip Scheide via 4D_Tech Mon, 26 Aug 2019 11:33:00 -0700

Update  :

the following code should be more efficient for getting the actual 
substring(s) replacing the loop starting "For($i;1;$Size)".


The following code, uses the built up 2D array to determine where the 
longest substring(s) end, this gives a position in the shorter string 
where the common string ends. Then we 'walk backwards' over the shorter 
string by common string length -1 to build the common substring(s).
This avoids a lot of thrashing with position and substring.

ARRAY TEXT($aMatch;0)


For ($i;1;$m)  //$m - is length of shorter string/dimension of 2D array
$Found:=Find in array($LCSub{$i};$longest)

If ($Found>0)
$string:=""

For ($j;1;$longest)  //Build up the common substring
$String:=$shorter[[$Found-$j+1]]+$string
End for 
append to array($aMatch;$String)  // add to array to return
End if 
End for 


On Mon, 26 Aug 2019 13:55:02 -0400, Chip Scheide via 4D_Tech wrote:
> Keith,
> Thanks for you input!!
> we seem to have a nice collaboration on text utilities  :)
> 
> I ran the code you provided, and found an issue.
> Given:
> "This is my dog" & "My Dog does not have fleas"
> 
> - Your initial code generates NO matches - you use Position with Case 
> sensitivity ON, and as for position > 1, missing matches starting at 
> position 1.
> - With out case sensitivity, & Position > 0, Your code generates the 
> longest common string as length 5, and "my do"
> This is because you end up truncating the last character comparison due 
> to using [[$i-1]] 
> 
> Follows, modified code to handle requests for both case sensitivity, 
> and not, as well as removing the truncation issue.
> 
> 
> I am thinking that it might be slightly more efficient to:
> - compare the first 2 characters -- removing 1 iteration over the 
> longer string. But it might not be worth the housekeeping to code that.
> - to build the final longest string as you go, rather then reiterate 
> back over the strings looking for matches of the longest length.  
> Where X = the number of times the longest common string length can be 
> found in the shortest string... it would save X substrings, 2X 
> positions X find in array, 2X assignments, and 2X comparisons + the 
> loop counter incrementation and comparison.
> 
> 
> For relatively short strings, there would likely be no significant 
> difference in execution time, however, very long strings (what length 
> is 'long' I do not know), or repeated execution of this code in some 
> sort of loop the difference might be large.
> 
> 
> Code start::
> 
>    // ----------------------------------------------------
>   // Method: Suffix_Tree - from Wiki examples
>   // - 
>   // INPUT1: Text - to compare
>   // INPUT2: Text - to compare
>   // INPUT3: Pointer - array of longest matches
>   // INPUT4: Boolean - Do case sensitive comparisons (default = false)
>   // OUTPUT: Longint - length 
>   // ---------------------------------------------------- 
> C_LONGINT($i;$size;$longest;$0)
> C_TEXT($longer;$shorter;$temp)
> C_POINTER($3;$Return_Array)
> C_BOOLEAN($4;$Case_Sensitive;$Same)
> $str1:=$1
> $str2:=$2
> $Return_Array:=$3
> 
> If (Count parameters>=4)
> $Case_Sensitive:=$4
> End if 
> 
> If (Length($str1)>Length($str2))
> $longer:=$str1
> $shorter:=$str2
> Else 
> $longer:=$str2
> $shorter:=$str1
> End if 
> 
> $m:=Length($longer)  // is the longer string
> $n:=Length($shorter)
> 
> ARRAY LONGINT($LCSub;$m;$n)
> $longest:=0
> 
> For ($i;1;$m)
> For ($j;1;$n)
> 
> If ($Case_Sensitive)  //doing case sensitive comparision - yes
> $Same:=(Character code($longer[[$i]])=Character code($shorter[[$j]]))
> Else   //- no
> $Same:=($longer[[$i]]=$shorter[[$j]])
> End if 
> 
> If ($Same)  //are the current strings the same? - yes
> 
>   //depending on the comparision point add this character count
> Case of 
> : ($i=1)
> $LCSub{$i}{$j}:=$LCSub{$i}{$j}+1
> : ($j=1)
> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j}+1
> Else 
> $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
> End case 
> 
>   //longer then the previous longest?
> If ($longest<$LCSub{$i}{$j})  //
> $longest:=$LCSub{$i}{$j}  //update the longest substring found
> End if 
> Else 
> $LCSub{$i}{$j}:=0
> End if 
> 
> End for 
> End for 
> 
>   // search for matches of $longest length
> ARRAY TEXT($aMatch;0)
> $size:=Length($shorter)-$longest+1
> 
>   //for every possible substring of the longest length
>   //try to find substring matches
> For ($i;1;$size)
> $temp:=Substring($shorter;$i;$longest)
> 
> If ($Case_Sensitive)  //case sensitive comparison - Yes
> $Found:=Position($temp;$longer;*)
> Else   //-no
> $Found:=Position($temp;$longer)
> End if 
> 
> If ($Found>0)  //note was >1
> 
>   //hve we found this substrig before?
> If (Find in array($aMatch;$temp)=-1)
> APPEND TO ARRAY($aMatch;$temp)  //- no
> End if 
> End if 
> End for 
> 
> COPY ARRAY($aMatch;$Return_Array->)
> $0:=$longest
> 
> 
> On Sat, 24 Aug 2019 16:23:56 -0500, Keith Culotta via 4D_Tech wrote:
>> That's great.  It's fast, and reduces the number of searches.
>> 
>> Keith - CDI
>> 
>> 
>>   // ----------------------------------------------------
>>   // Method: Suffix_Tree - from Wiki examples
>>   // - 
>>   // INPUT1: Text - to compare
>>   // INPUT2: Text - to compare
>>   // INPUT3: Pointer - array of longest matches
>>   // OUTPUT: Longint - length 
>>   // ---------------------------------------------------- 
>> C_LONGINT($i;$size;$longest;$0)
>> C_TEXT($longer;$shorter;$temp)
>> $str1:=$1
>> $str2:=$2
>> 
>> If (Length($str1)>Length($str2))
>>      $longer:=$str1
>>      $shorter:=$str2
>> Else 
>>      $longer:=$str2
>>      $shorter:=$str1
>> End if 
>> 
>> $m:=Length($longer)  // is the longer string
>> $n:=Length($shorter)
>> 
>> ARRAY LONGINT($LCSub;$m;$n)
>> $longest:=0
>> 
>> For ($i;1;$m)
>>      For ($j;1;$n)
>>              If ($i=1) | ($j=1)
>>                      $LCSub{$i}{$j}:=0
>> 
>>              Else 
>>                      If ($longer[[$i-1]]=$shorter[[$j-1]])
>>                              $LCSub{$i}{$j}:=$LCSub{$i-1}{$j-1}+1
>>                              If ($longest<$LCSub{$i}{$j})
>>                                      $longest:=$LCSub{$i}{$j}
>>                              End if 
>>                      Else 
>>                              $LCSub{$i}{$j}:=0
>>                      End if 
>>              End if 
>> 
>>      End for 
>> End for 
>> 
>>   // search for matches of $longest length
>> ARRAY TEXT($aMatch;0)
>> $size:=Length($shorter)-$longest+1
>> For ($i;1;$size)
>>      $temp:=Substring($shorter;$i;$longest)
>>      If (Position($temp;$longer;*)>1)
>>              If (Find in array($aMatch;$temp)=-1)
>>                      APPEND TO ARRAY($aMatch;$temp) 
>>              End if 
>>      End if 
>> End for 
>> 
>> COPY ARRAY($aMatch;$3->)
>> $0:=$longest
>> 
>> 
>> 
>> 
>>> On Aug 24, 2019, at 2:41 PM, John DeSoi via 4D_Tech 
>>> <[email protected]> wrote:
>>> 
>>> See 
>>> 
>>> https://en.wikipedia.org/wiki/Longest_common_substring_problem
>>> 
>>> John DeSoi, Ph.D.
>>> 
>>> 
>>>> On Aug 24, 2019, at 10:48 AM, Keith Culotta via 4D_Tech 
>>>> <[email protected]> wrote:
>>>> 
>>>> This version stands alone, and runs a little more efficiently.  
>>>> It's not been tested in every way, but the results are 
>>>> encouraging.  Interesting problem.  I can't think of a way to do it 
>>>> without comparing every character combination.  The new "Split 
>>>> string" command would speed part of this up if Collections could be 
>>>> used.
>>> 
>> 
>> **********************************************************************
>> 4D Internet Users Group (4D iNUG)
>> Archive:  http://lists.4d.com/archives.html
>> Options: https://lists.4d.com/mailman/options/4d_tech
>> Unsub:  mailto:[email protected]
>> **********************************************************************
> ---------------
> Gas is for washing parts
> Alcohol is for drinkin'
> Nitromethane is for racing 
> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> Archive:  http://lists.4d.com/archives.html
> Options: https://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:[email protected]
> **********************************************************************
---------------
Gas is for washing parts
Alcohol is for drinkin'
Nitromethane is for racing 
**********************************************************************
4D Internet Users Group (4D iNUG)
Archive:  http://lists.4d.com/archives.html
Options: https://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:[email protected]
**********************************************************************

Re: Regex expert needed??

Reply via email to