Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Louis de Forcrand Sun, 30 Jun 2019 11:26:01 -0700

woops!

Sorry, I changed the name of the adverb before sending the message (to 
something more descriptive than mxs3 :) but I forgot to switch the recursive 
call. Replace mxs3 by recmax and you should be good!


Louis

> On 30 Jun 2019, at 17:59, vadim . <vadim3128m...@gmail.com> wrote:
> 
> It's amazing. Thank you so much. I'm not yet comfortable with defining
> my own adverbs, and you show me constructs like "&.(B&#)" are
> possible. That will keep me busy for a while. One thing though, what's
> definition of "mxs3"?
> 
> 
>> On Sun, Jun 30, 2019 at 4:57 PM Louis de Forcrand <ol...@bluewin.ch> wrote:
>> 
>> Hi again,
>> 
>> I haven’t much time right now, but I’ve come up with two new programs:
>> 
>> fastmax=: 1 : 0
>> B=. C=. 1"_1 y
>> while. +./C do.
>>  i=. C i.1
>>  B=. 0 i} B
>>  B=. -.@:u&(i{y)&.(B&#) y
>>  C=. C *. B
>>  B=. 1 i} B
>> end.
>> B#y
>> )
>> 
>> recmax=: 1 : 0
>> if. 1>:n=. #y do. y return. end.
>> 'a b'=. (>.-:n) ({. ,&<&(u mxs3) }.) y
>> b=. b #~ -. +./"1 b u/ a
>> a=. a #~ -. +./"1 a u/ b
>> a,b
>> )
>> 
>> cmp=: +./@E.&>
>> 
>> fastincl=: cmp fastmax
>> recincl =: cmp recmax
>> 
>> I’m on my phone so times are quite slow anyhow, but fastincl is very fast on 
>> your example data:
>> 
>>   ts=: timespacex
>>   ts 'fastincl syms1000'
>> 0.044349 203264
>>   ts 'recincl syms1000'
>> 0.316018 515072
>> 
>> As you can see recincl is slower. However I can’t tell you for sure which is 
>> better in The Real World.
>> fastmax is a lot like maxs: it loops through each array element x, comparing 
>> it against all others. However instead of removing x if it is less than 
>> another element, it removes all items less than x, which makes for far less 
>> wasted computation and removes items at a much faster rate.
>> recmax is recursive and hinges on the following fact:
>> Split your argument list X into two disjoint parts, A and B. Then an element 
>> a of A is maximal in X (there is no item of X larger than it) _if and only 
>> if_ it is maximal in A and none of the maximal elements of B are larger than 
>> it.
>> 
>> fastmax is so quick on syms1000 simply because of the amount of repetition. 
>> My guess is that recmax would compare much more favorably on less particular 
>> data.
>> 
>> Cheers,
>> Louis
>> 
>>> On 30 Jun 2019, at 14:09, vadim . <vadim3128m...@gmail.com> wrote:
>>> 
>>> Thank you for valuable answers. Louis, it will take me some time to
>>> study your answer, it looks like the "J way" of solving problems which
>>> I'm only hoping to grasp. Raul, I'll think on your suggestion to
>>> convert the task into numeric problem.
>>> 
>>> For me it's just learning exercise, a "real life" problem posted on
>>> another language forum, with incomplete specification. Data sample was
>>> provided by OP, part numbers can have different lengths. It's
>>> reasonable to assume that one of duplicate lines, if any happen, must
>>> be preserved. Whether to treat lines as unordered sets or strings --
>>> both interpretations are interesting, for now let them be strings.
>>> 
>>> I don't have large enough data sample, OP said "millions of lines",
>>> for lack of anything better I'm using concatenated copies. Now I see
>>> why fast existing solution (Perl, actually -- concatenate lines into
>>> one string using separators as guards, search for lines) is so fast:
>>> it rejects a line as soon as second copy of it is found, i.e. almost
>>> immediately. Real life "millions of lines" will scale quadratically,
>>> of course.
>>> 
>>> With J, the "(+./@:E.)" is optimized to stop searching if found once.
>>> 
>>> However, it may be interesting for you what I found, one approach is
>>> to work with characters, another to convert part numbers into symbols,
>>> as I did above.
>>> 
>>> Let "z" and "syms1000" be nouns as above.
>>> 
>>>   lines =: (<@:('|'&,@:,&'|'));._2 z    NB. Lines with guards
>>>   lines1000 =: ,(i.1000) ]"0 _ lines
>>>   (6!:2) '1&= +/"1 lines1000 ((>@:[)E.])"0 _ ;lines1000'
>>> 48.2911
>>>   (6!:2) '1&= +/"1 syms1000 ((>@:[)E.])"0 _ ;(,&(s:'' ''))&.>syms1000'
>>> 3.1777
>>> 
>>> Using symbols against one huge concatenated array of symbols is
>>> fastest solution so far. Not much of "J way", maybe. Let's see if
>>> "(+./@:E.)" really aborts further search, so time must drop to that of
>>> Perl's solution, ~ 20 times per second, even though it will produce
>>> wrong answer:
>>> 
>>> 
>>>   (6!:2) '1&= +/"1 lines1000 ((>@:[)(+./@:E.)])"0 _ ;lines1000'
>>> 0.013242
>>>   (6!:2) '1&= +/"1 syms1000 ((>@:[)(+./@:E.)])"0 _ ;(,&(s:'' 
>>> ''))&.>syms1000'
>>> 0.810288
>>> 
>>> That's curios. In case of symbols, execution time doesn't decrease as 
>>> expected.
>>> 
>>> Unfortunately, I can't modify "E." so that it stops searching after 2
>>> matches, or can I? Or, stop after match and return index? I think it
>>> can be done with pcrp2, but then it all becomes more like Perl and not
>>> J. Not very interesting.
>>> 
>>>> On Sun, Jun 30, 2019 at 12:13 PM Louis de Forcrand <ol...@bluewin.ch> 
>>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Try this:
>>>> 
>>>> maxs=: 1 : 0
>>>> for_i. i.-#y do.
>>>> if. (i{y) +./@:u (<<<i){y do.
>>>>  y=. (<<<i){y
>>>> end.
>>>> end.
>>>> y
>>>> )
>>>> 
>>>> cmp=: +./@E.&>
>>>> 
>>>> fast=: cmp maxs       NB. fast
>>>> fstr=: fast@(\: #@>)    NB. faster?
>>>> 
>>>> maxs is a generic adverb that takes a boolean-valued verb and an array.
>>>> The verb represents a partial order on the elements of the array, and maxs 
>>>> returns the set of maximal elements of the array, that is those which are 
>>>> strictly less than no other element (and also removes duplicates of those 
>>>> elements).
>>>> 
>>>> This applies to the current problem as substring inclusion <= is a partial 
>>>> order (reflexivity: x <= x is always true; antisymmetry: x <= y <= x 
>>>> implies x = y; and transitivity: x <= y <= z implies x <= z).
>>>> cmp is the partial order at hand, and fast is the corresponding verb.
>>>> 
>>>> The key observation is that if we look at one element y and find that y <= 
>>>> z, we can safely remove y from the array and continue our search for 
>>>> maximal elements without comparing against y as, if for some x we have x 
>>>> <= y, then by transitivity x <= z and x will be removed anyway even if y 
>>>> is already gone.
>>>> 
>>>> This should give a pretty good speedup when there is a lot of repetition / 
>>>> inclusion.
>>>> 
>>>> Another thing to try for this particular problem when you have lines with 
>>>> a wide range of different sizes, is to sort the lines in decreasing order 
>>>> of lengths. This could give some benefit as short lines are more probable 
>>>> to be included in others, and maxs removes array elements starting from 
>>>> the end of the array and so with a little luck quickly does much less work.
>>>> The corresponding verb is fstr.
>>>> 
>>>> In the end though I don’t know how much better than quadratic time you can 
>>>> hope for, as with the current specification it doesn’t seem obvious to 
>>>> keep from comparing most elements to all the others.
>>>> 
>>>> Might come back with some more ideas.
>>>> 
>>>> Cheers!
>>>> Louis
>>>> 
>>>>> On 28 Jun 2019, at 20:20, Raul Miller <rauldmil...@gmail.com> wrote:
>>>>> 
>>>>> Does order of the substrings matter? (In other words, would it be ok to
>>>>> treat 'B002|A001' like 'A001|B002'?)  -- Why or why not?
>>>>> 
>>>>> Are all the part numbers equal width?
>>>>> 
>>>>> Would any of the lines be duplicates of other lines? If so, how should 
>>>>> that
>>>>> be handled?
>>>>> 
>>>>> My inclination would be to build a list of unique part numbers, replace
>>>>> each line with a list of indices into then treat this as a numeric problem
>>>>> (probably using sorting, possibly grouping the lines based on length). But
>>>>> to do that properly, I would need a deeper understanding of the
>>>>> requirements.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> --
>>>>> Raul
>>>>> 
>>>>>> On Friday, June 28, 2019, vadim . <vadim3128m...@gmail.com> wrote:
>>>>>> 
>>>>>> Example: given a file (a string), where each line is list of part
>>>>>> numbers with separators, how to exclude lines which are substrings of
>>>>>> other lines (cut on separators)? E.g. (first line is to be excluded):
>>>>>> 
>>>>>> z =: 0 : 0
>>>>>> A001|B002
>>>>>> C003|A001|B002
>>>>>> B002|A001
>>>>>> C003|D004|A001
>>>>>> E005|F006
>>>>>> D004|C003
>>>>>> )
>>>>>> [lines =: <;._2 z
>>>>>> +---------+--------------+---------+--------------+---------+---------+
>>>>>> |A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
>>>>>> +---------+--------------+---------+--------------+---------+---------+
>>>>>> [syms =: (s:@:('|'&,))&.> lines
>>>>>> +-----------+-----------------+-----------+-----------------
>>>>>> +-----------+-----------+
>>>>>> |`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005
>>>>>> `F006|`D004 `C003|
>>>>>> +-----------+-----------------+-----------+-----------------
>>>>>> +-----------+-----------+
>>>>>> (+./@:E.)&.(> :.])/~ syms
>>>>>> 1 1 0 0 0 0
>>>>>> 0 1 0 0 0 0
>>>>>> 0 0 1 0 0 0
>>>>>> 0 0 0 1 0 0
>>>>>> 0 0 0 0 1 0
>>>>>> 0 0 0 0 0 1
>>>>>> [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms
>>>>>> 0 1 1 1 1 1
>>>>>> [result =: idx#lines
>>>>>> +--------------+---------+--------------+---------+---------+
>>>>>> |C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
>>>>>> +--------------+---------+--------------+---------+---------+
>>>>>> 
>>>>>> I'm worried about performance of this line:
>>>>>> 
>>>>>> (+./@:E.)&.(> :.])/~ syms
>>>>>> 
>>>>>> other details are not very important, as I'm only learning. Phrase
>>>>>> above uses form "+./@:E.", recommended for speed in J Wiki. I think
>>>>>> table adverb must be optimized, too. But, adding some weight:
>>>>>> 
>>>>>> z =: 0 : 0
>>>>>> 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>>>> 6Y1352|4P0489|2757803
>>>>>> 3419308|3514531|3525716|3557019|3586192|3635776|3783741
>>>>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
>>>>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>>>> 3H5788|1F7854|8W1152|8R0721
>>>>>> 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>>>> 6Y1352|4P0489|1336934
>>>>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
>>>>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
>>>>>> 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>>>> 6Y1352|4P0489|2757803
>>>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
>>>>>> 6Y1352|4P0489|1336934
>>>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
>>>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
>>>>>> 3H5788|1F7854|8W1152
>>>>>> )
>>>>>> lines =: <;._2 z
>>>>>> syms =: (s:@:('|'&,))&.> lines
>>>>>> syms10 =: ,(i.10) ]"0 _ syms
>>>>>> syms100 =: ,(i.100) ]"0 _ syms
>>>>>> syms1000 =: ,(i.1000) ]"0 _ syms
>>>>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms'
>>>>>> 3.936e_5
>>>>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10'
>>>>>> 0.00652701
>>>>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100'
>>>>>> 0.283609
>>>>>> 1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000'
>>>>>> 28.7405
>>>>>> 
>>>>>> 28 seconds for 11000 short lines is unacceptable. Am I doing something
>>>>>> totally wrong? For example, for this task Perl shows close to linear
>>>>>> (definitely not quadratic) dependency, runs hundreds and thousands
>>>>>> times faster.
>>>>>> ----------------------------------------------------------------------
>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>>>> 
>>>>> ----------------------------------------------------------------------
>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Reply via email to