Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Louis de Forcrand Sun, 30 Jun 2019 02:13:49 -0700

Hi,

Try this:


maxs=: 1 : 0
 for_i. i.-#y do.
  if. (i{y) +./@:u (<<<i){y do.
   y=. (<<<i){y
  end.
 end.
 y
)

cmp=: +./@E.&>

fast=: cmp maxs       NB. fast
fstr=: fast@(\: #@>)    NB. faster?

maxs is a generic adverb that takes a boolean-valued verb and an array.
The verb represents a partial order on the elements of the array, and maxs 
returns the set of maximal elements of the array, that is those which are 
strictly less than no other element (and also removes duplicates of those 
elements).

This applies to the current problem as substring inclusion <= is a partial 
order (reflexivity: x <= x is always true; antisymmetry: x <= y <= x implies x 
= y; and transitivity: x <= y <= z implies x <= z).
cmp is the partial order at hand, and fast is the corresponding verb.

The key observation is that if we look at one element y and find that y <= z, 
we can safely remove y from the array and continue our search for maximal 
elements without comparing against y as, if for some x we have x <= y, then by 
transitivity x <= z and x will be removed anyway even if y is already gone.

This should give a pretty good speedup when there is a lot of repetition / 
inclusion.

Another thing to try for this particular problem when you have lines with a 
wide range of different sizes, is to sort the lines in decreasing order of 
lengths. This could give some benefit as short lines are more probable to be 
included in others, and maxs removes array elements starting from the end of 
the array and so with a little luck quickly does much less work.
The corresponding verb is fstr.

In the end though I don’t know how much better than quadratic time you can hope 
for, as with the current specification it doesn’t seem obvious to keep from 
comparing most elements to all the others.

Might come back with some more ideas.

Cheers!
Louis

> On 28 Jun 2019, at 20:20, Raul Miller <rauldmil...@gmail.com> wrote:
> 
> Does order of the substrings matter? (In other words, would it be ok to
> treat 'B002|A001' like 'A001|B002'?)  -- Why or why not?
> 
> Are all the part numbers equal width?
> 
> Would any of the lines be duplicates of other lines? If so, how should that
> be handled?
> 
> My inclination would be to build a list of unique part numbers, replace
> each line with a list of indices into then treat this as a numeric problem
> (probably using sorting, possibly grouping the lines based on length). But
> to do that properly, I would need a deeper understanding of the
> requirements.
> 
> Thanks,
> 
> -- 
> Raul
> 
>> On Friday, June 28, 2019, vadim . <vadim3128m...@gmail.com> wrote:
>> 
>> Example: given a file (a string), where each line is list of part
>> numbers with separators, how to exclude lines which are substrings of
>> other lines (cut on separators)? E.g. (first line is to be excluded):
>> 
>>   z =: 0 : 0
>> A001|B002
>> C003|A001|B002
>> B002|A001
>> C003|D004|A001
>> E005|F006
>> D004|C003
>> )
>>   [lines =: <;._2 z
>> +---------+--------------+---------+--------------+---------+---------+
>> |A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
>> +---------+--------------+---------+--------------+---------+---------+
>>   [syms =: (s:@:('|'&,))&.> lines
>> +-----------+-----------------+-----------+-----------------
>> +-----------+-----------+
>> |`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005
>> `F006|`D004 `C003|
>> +-----------+-----------------+-----------+-----------------
>> +-----------+-----------+
>>   (+./@:E.)&.(> :.])/~ syms
>> 1 1 0 0 0 0
>> 0 1 0 0 0 0
>> 0 0 1 0 0 0
>> 0 0 0 1 0 0
>> 0 0 0 0 1 0
>> 0 0 0 0 0 1
>>   [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms
>> 0 1 1 1 1 1
>>   [result =: idx#lines
>> +--------------+---------+--------------+---------+---------+
>> |C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
>> +--------------+---------+--------------+---------+---------+
>> 
>>   I'm worried about performance of this line:
>> 
>> (+./@:E.)&.(> :.])/~ syms
>> 
>> other details are not very important, as I'm only learning. Phrase
>> above uses form "+./@:E.", recommended for speed in J Wiki. I think
>> table adverb must be optimized, too. But, adding some weight:
>> 
>>   z =: 0 : 0
>> 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>> 6Y1352|4P0489|2757803
>> 3419308|3514531|3525716|3557019|3586192|3635776|3783741
>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>> 3H5788|1F7854|8W1152|8R0721
>> 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>> 6Y1352|4P0489|1336934
>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
>> 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>> 6Y1352|4P0489|2757803
>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>> 6Y1352|4P0489|1336934
>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>> 3H5788|1F7854|8W1152
>> )
>>   lines =: <;._2 z
>>   syms =: (s:@:('|'&,))&.> lines
>>   syms10 =: ,(i.10) ]"0 _ syms
>>   syms100 =: ,(i.100) ]"0 _ syms
>>   syms1000 =: ,(i.1000) ]"0 _ syms
>>   10 (6!:2) '(+./@:E.)&.(> :.])/~ syms'
>> 3.936e_5
>>   10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10'
>> 0.00652701
>>   10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100'
>> 0.283609
>>   1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000'
>> 28.7405
>> 
>> 28 seconds for 11000 short lines is unacceptable. Am I doing something
>> totally wrong? For example, for this task Perl shows close to linear
>> (definitely not quadratic) dependency, runs hundreds and thousands
>> times faster.
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>> 
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Reply via email to