Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Louis de Forcrand Sun, 30 Jun 2019 06:57:27 -0700

Hi again,

I haven’t much time right now, but I’ve come up with two new programs:


fastmax=: 1 : 0
 B=. C=. 1"_1 y
 while. +./C do.
  i=. C i.1
  B=. 0 i} B
  B=. -.@:u&(i{y)&.(B&#) y
  C=. C *. B
  B=. 1 i} B
 end.
 B#y
)

recmax=: 1 : 0
 if. 1>:n=. #y do. y return. end.
 'a b'=. (>.-:n) ({. ,&<&(u mxs3) }.) y
 b=. b #~ -. +./"1 b u/ a
 a=. a #~ -. +./"1 a u/ b
 a,b
)

cmp=: +./@E.&>

fastincl=: cmp fastmax
recincl =: cmp recmax

I’m on my phone so times are quite slow anyhow, but fastincl is very fast on 
your example data:

   ts=: timespacex
   ts 'fastincl syms1000'
0.044349 203264
   ts 'recincl syms1000'
0.316018 515072

As you can see recincl is slower. However I can’t tell you for sure which is 
better in The Real World.
fastmax is a lot like maxs: it loops through each array element x, comparing it 
against all others. However instead of removing x if it is less than another 
element, it removes all items less than x, which makes for far less wasted 
computation and removes items at a much faster rate.
recmax is recursive and hinges on the following fact:
Split your argument list X into two disjoint parts, A and B. Then an element a 
of A is maximal in X (there is no item of X larger than it) _if and only if_ it 
is maximal in A and none of the maximal elements of B are larger than it.

fastmax is so quick on syms1000 simply because of the amount of repetition. My 
guess is that recmax would compare much more favorably on less particular data.

Cheers,
Louis

> On 30 Jun 2019, at 14:09, vadim . <vadim3128m...@gmail.com> wrote:
> 
> Thank you for valuable answers. Louis, it will take me some time to
> study your answer, it looks like the "J way" of solving problems which
> I'm only hoping to grasp. Raul, I'll think on your suggestion to
> convert the task into numeric problem.
> 
> For me it's just learning exercise, a "real life" problem posted on
> another language forum, with incomplete specification. Data sample was
> provided by OP, part numbers can have different lengths. It's
> reasonable to assume that one of duplicate lines, if any happen, must
> be preserved. Whether to treat lines as unordered sets or strings --
> both interpretations are interesting, for now let them be strings.
> 
> I don't have large enough data sample, OP said "millions of lines",
> for lack of anything better I'm using concatenated copies. Now I see
> why fast existing solution (Perl, actually -- concatenate lines into
> one string using separators as guards, search for lines) is so fast:
> it rejects a line as soon as second copy of it is found, i.e. almost
> immediately. Real life "millions of lines" will scale quadratically,
> of course.
> 
> With J, the "(+./@:E.)" is optimized to stop searching if found once.
> 
> However, it may be interesting for you what I found, one approach is
> to work with characters, another to convert part numbers into symbols,
> as I did above.
> 
> Let "z" and "syms1000" be nouns as above.
> 
>    lines =: (<@:('|'&,@:,&'|'));._2 z    NB. Lines with guards
>    lines1000 =: ,(i.1000) ]"0 _ lines
>    (6!:2) '1&= +/"1 lines1000 ((>@:[)E.])"0 _ ;lines1000'
> 48.2911
>    (6!:2) '1&= +/"1 syms1000 ((>@:[)E.])"0 _ ;(,&(s:'' ''))&.>syms1000'
> 3.1777
> 
> Using symbols against one huge concatenated array of symbols is
> fastest solution so far. Not much of "J way", maybe. Let's see if
> "(+./@:E.)" really aborts further search, so time must drop to that of
> Perl's solution, ~ 20 times per second, even though it will produce
> wrong answer:
> 
> 
>    (6!:2) '1&= +/"1 lines1000 ((>@:[)(+./@:E.)])"0 _ ;lines1000'
> 0.013242
>    (6!:2) '1&= +/"1 syms1000 ((>@:[)(+./@:E.)])"0 _ ;(,&(s:'' ''))&.>syms1000'
> 0.810288
> 
> That's curios. In case of symbols, execution time doesn't decrease as 
> expected.
> 
> Unfortunately, I can't modify "E." so that it stops searching after 2
> matches, or can I? Or, stop after match and return index? I think it
> can be done with pcrp2, but then it all becomes more like Perl and not
> J. Not very interesting.
> 
>> On Sun, Jun 30, 2019 at 12:13 PM Louis de Forcrand <ol...@bluewin.ch> wrote:
>> 
>> Hi,
>> 
>> Try this:
>> 
>> maxs=: 1 : 0
>> for_i. i.-#y do.
>>  if. (i{y) +./@:u (<<<i){y do.
>>   y=. (<<<i){y
>>  end.
>> end.
>> y
>> )
>> 
>> cmp=: +./@E.&>
>> 
>> fast=: cmp maxs       NB. fast
>> fstr=: fast@(\: #@>)    NB. faster?
>> 
>> maxs is a generic adverb that takes a boolean-valued verb and an array.
>> The verb represents a partial order on the elements of the array, and maxs 
>> returns the set of maximal elements of the array, that is those which are 
>> strictly less than no other element (and also removes duplicates of those 
>> elements).
>> 
>> This applies to the current problem as substring inclusion <= is a partial 
>> order (reflexivity: x <= x is always true; antisymmetry: x <= y <= x implies 
>> x = y; and transitivity: x <= y <= z implies x <= z).
>> cmp is the partial order at hand, and fast is the corresponding verb.
>> 
>> The key observation is that if we look at one element y and find that y <= 
>> z, we can safely remove y from the array and continue our search for maximal 
>> elements without comparing against y as, if for some x we have x <= y, then 
>> by transitivity x <= z and x will be removed anyway even if y is already 
>> gone.
>> 
>> This should give a pretty good speedup when there is a lot of repetition / 
>> inclusion.
>> 
>> Another thing to try for this particular problem when you have lines with a 
>> wide range of different sizes, is to sort the lines in decreasing order of 
>> lengths. This could give some benefit as short lines are more probable to be 
>> included in others, and maxs removes array elements starting from the end of 
>> the array and so with a little luck quickly does much less work.
>> The corresponding verb is fstr.
>> 
>> In the end though I don’t know how much better than quadratic time you can 
>> hope for, as with the current specification it doesn’t seem obvious to keep 
>> from comparing most elements to all the others.
>> 
>> Might come back with some more ideas.
>> 
>> Cheers!
>> Louis
>> 
>>> On 28 Jun 2019, at 20:20, Raul Miller <rauldmil...@gmail.com> wrote:
>>> 
>>> Does order of the substrings matter? (In other words, would it be ok to
>>> treat 'B002|A001' like 'A001|B002'?)  -- Why or why not?
>>> 
>>> Are all the part numbers equal width?
>>> 
>>> Would any of the lines be duplicates of other lines? If so, how should that
>>> be handled?
>>> 
>>> My inclination would be to build a list of unique part numbers, replace
>>> each line with a list of indices into then treat this as a numeric problem
>>> (probably using sorting, possibly grouping the lines based on length). But
>>> to do that properly, I would need a deeper understanding of the
>>> requirements.
>>> 
>>> Thanks,
>>> 
>>> --
>>> Raul
>>> 
>>>> On Friday, June 28, 2019, vadim . <vadim3128m...@gmail.com> wrote:
>>>> 
>>>> Example: given a file (a string), where each line is list of part
>>>> numbers with separators, how to exclude lines which are substrings of
>>>> other lines (cut on separators)? E.g. (first line is to be excluded):
>>>> 
>>>>  z =: 0 : 0
>>>> A001|B002
>>>> C003|A001|B002
>>>> B002|A001
>>>> C003|D004|A001
>>>> E005|F006
>>>> D004|C003
>>>> )
>>>>  [lines =: <;._2 z
>>>> +---------+--------------+---------+--------------+---------+---------+
>>>> |A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
>>>> +---------+--------------+---------+--------------+---------+---------+
>>>>  [syms =: (s:@:('|'&,))&.> lines
>>>> +-----------+-----------------+-----------+-----------------
>>>> +-----------+-----------+
>>>> |`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005
>>>> `F006|`D004 `C003|
>>>> +-----------+-----------------+-----------+-----------------
>>>> +-----------+-----------+
>>>>  (+./@:E.)&.(> :.])/~ syms
>>>> 1 1 0 0 0 0
>>>> 0 1 0 0 0 0
>>>> 0 0 1 0 0 0
>>>> 0 0 0 1 0 0
>>>> 0 0 0 0 1 0
>>>> 0 0 0 0 0 1
>>>>  [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms
>>>> 0 1 1 1 1 1
>>>>  [result =: idx#lines
>>>> +--------------+---------+--------------+---------+---------+
>>>> |C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
>>>> +--------------+---------+--------------+---------+---------+
>>>> 
>>>>  I'm worried about performance of this line:
>>>> 
>>>> (+./@:E.)&.(> :.])/~ syms
>>>> 
>>>> other details are not very important, as I'm only learning. Phrase
>>>> above uses form "+./@:E.", recommended for speed in J Wiki. I think
>>>> table adverb must be optimized, too. But, adding some weight:
>>>> 
>>>>  z =: 0 : 0
>>>> 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>> 6Y1352|4P0489|2757803
>>>> 3419308|3514531|3525716|3557019|3586192|3635776|3783741
>>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
>>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>> 3H5788|1F7854|8W1152|8R0721
>>>> 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>> 6Y1352|4P0489|1336934
>>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
>>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
>>>> 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>> 6Y1352|4P0489|2757803
>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>> 6Y1352|4P0489|1336934
>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>> 3H5788|1F7854|8W1152
>>>> )
>>>>  lines =: <;._2 z
>>>>  syms =: (s:@:('|'&,))&.> lines
>>>>  syms10 =: ,(i.10) ]"0 _ syms
>>>>  syms100 =: ,(i.100) ]"0 _ syms
>>>>  syms1000 =: ,(i.1000) ]"0 _ syms
>>>>  10 (6!:2) '(+./@:E.)&.(> :.])/~ syms'
>>>> 3.936e_5
>>>>  10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10'
>>>> 0.00652701
>>>>  10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100'
>>>> 0.283609
>>>>  1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000'
>>>> 28.7405
>>>> 
>>>> 28 seconds for 11000 short lines is unacceptable. Am I doing something
>>>> totally wrong? For example, for this task Perl shows close to linear
>>>> (definitely not quadratic) dependency, runs hundreds and thousands
>>>> times faster.
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>> 
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Reply via email to