Re: [Jprogramming] How to improve speed searching for substrings using "E."?

vadim . Sun, 30 Jun 2019 09:01:11 -0700

It's amazing. Thank you so much. I'm not yet comfortable with defining
my own adverbs, and you show me constructs like "&.(B&#)" are
possible. That will keep me busy for a while. One thing though, what's
definition of "mxs3"?



On Sun, Jun 30, 2019 at 4:57 PM Louis de Forcrand <[email protected]> wrote:
>
> Hi again,
>
> I haven’t much time right now, but I’ve come up with two new programs:
>
> fastmax=: 1 : 0
>  B=. C=. 1"_1 y
>  while. +./C do.
>   i=. C i.1
>   B=. 0 i} B
>   B=. -.@:u&(i{y)&.(B&#) y
>   C=. C *. B
>   B=. 1 i} B
>  end.
>  B#y
> )
>
> recmax=: 1 : 0
>  if. 1>:n=. #y do. y return. end.
>  'a b'=. (>.-:n) ({. ,&<&(u mxs3) }.) y
>  b=. b #~ -. +./"1 b u/ a
>  a=. a #~ -. +./"1 a u/ b
>  a,b
> )
>
> cmp=: +./@E.&>
>
> fastincl=: cmp fastmax
> recincl =: cmp recmax
>
> I’m on my phone so times are quite slow anyhow, but fastincl is very fast on 
> your example data:
>
>    ts=: timespacex
>    ts 'fastincl syms1000'
> 0.044349 203264
>    ts 'recincl syms1000'
> 0.316018 515072
>
> As you can see recincl is slower. However I can’t tell you for sure which is 
> better in The Real World.
> fastmax is a lot like maxs: it loops through each array element x, comparing 
> it against all others. However instead of removing x if it is less than 
> another element, it removes all items less than x, which makes for far less 
> wasted computation and removes items at a much faster rate.
> recmax is recursive and hinges on the following fact:
> Split your argument list X into two disjoint parts, A and B. Then an element 
> a of A is maximal in X (there is no item of X larger than it) _if and only 
> if_ it is maximal in A and none of the maximal elements of B are larger than 
> it.
>
> fastmax is so quick on syms1000 simply because of the amount of repetition. 
> My guess is that recmax would compare much more favorably on less particular 
> data.
>
> Cheers,
> Louis
>
> > On 30 Jun 2019, at 14:09, vadim . <[email protected]> wrote:
> >
> > Thank you for valuable answers. Louis, it will take me some time to
> > study your answer, it looks like the "J way" of solving problems which
> > I'm only hoping to grasp. Raul, I'll think on your suggestion to
> > convert the task into numeric problem.
> >
> > For me it's just learning exercise, a "real life" problem posted on
> > another language forum, with incomplete specification. Data sample was
> > provided by OP, part numbers can have different lengths. It's
> > reasonable to assume that one of duplicate lines, if any happen, must
> > be preserved. Whether to treat lines as unordered sets or strings --
> > both interpretations are interesting, for now let them be strings.
> >
> > I don't have large enough data sample, OP said "millions of lines",
> > for lack of anything better I'm using concatenated copies. Now I see
> > why fast existing solution (Perl, actually -- concatenate lines into
> > one string using separators as guards, search for lines) is so fast:
> > it rejects a line as soon as second copy of it is found, i.e. almost
> > immediately. Real life "millions of lines" will scale quadratically,
> > of course.
> >
> > With J, the "(+./@:E.)" is optimized to stop searching if found once.
> >
> > However, it may be interesting for you what I found, one approach is
> > to work with characters, another to convert part numbers into symbols,
> > as I did above.
> >
> > Let "z" and "syms1000" be nouns as above.
> >
> >    lines =: (<@:('|'&,@:,&'|'));._2 z    NB. Lines with guards
> >    lines1000 =: ,(i.1000) ]"0 _ lines
> >    (6!:2) '1&= +/"1 lines1000 ((>@:[)E.])"0 _ ;lines1000'
> > 48.2911
> >    (6!:2) '1&= +/"1 syms1000 ((>@:[)E.])"0 _ ;(,&(s:'' ''))&.>syms1000'
> > 3.1777
> >
> > Using symbols against one huge concatenated array of symbols is
> > fastest solution so far. Not much of "J way", maybe. Let's see if
> > "(+./@:E.)" really aborts further search, so time must drop to that of
> > Perl's solution, ~ 20 times per second, even though it will produce
> > wrong answer:
> >
> >
> >    (6!:2) '1&= +/"1 lines1000 ((>@:[)(+./@:E.)])"0 _ ;lines1000'
> > 0.013242
> >    (6!:2) '1&= +/"1 syms1000 ((>@:[)(+./@:E.)])"0 _ ;(,&(s:'' 
> > ''))&.>syms1000'
> > 0.810288
> >
> > That's curios. In case of symbols, execution time doesn't decrease as 
> > expected.
> >
> > Unfortunately, I can't modify "E." so that it stops searching after 2
> > matches, or can I? Or, stop after match and return index? I think it
> > can be done with pcrp2, but then it all becomes more like Perl and not
> > J. Not very interesting.
> >
> >> On Sun, Jun 30, 2019 at 12:13 PM Louis de Forcrand <[email protected]> 
> >> wrote:
> >>
> >> Hi,
> >>
> >> Try this:
> >>
> >> maxs=: 1 : 0
> >> for_i. i.-#y do.
> >>  if. (i{y) +./@:u (<<<i){y do.
> >>   y=. (<<<i){y
> >>  end.
> >> end.
> >> y
> >> )
> >>
> >> cmp=: +./@E.&>
> >>
> >> fast=: cmp maxs       NB. fast
> >> fstr=: fast@(\: #@>)    NB. faster?
> >>
> >> maxs is a generic adverb that takes a boolean-valued verb and an array.
> >> The verb represents a partial order on the elements of the array, and maxs 
> >> returns the set of maximal elements of the array, that is those which are 
> >> strictly less than no other element (and also removes duplicates of those 
> >> elements).
> >>
> >> This applies to the current problem as substring inclusion <= is a partial 
> >> order (reflexivity: x <= x is always true; antisymmetry: x <= y <= x 
> >> implies x = y; and transitivity: x <= y <= z implies x <= z).
> >> cmp is the partial order at hand, and fast is the corresponding verb.
> >>
> >> The key observation is that if we look at one element y and find that y <= 
> >> z, we can safely remove y from the array and continue our search for 
> >> maximal elements without comparing against y as, if for some x we have x 
> >> <= y, then by transitivity x <= z and x will be removed anyway even if y 
> >> is already gone.
> >>
> >> This should give a pretty good speedup when there is a lot of repetition / 
> >> inclusion.
> >>
> >> Another thing to try for this particular problem when you have lines with 
> >> a wide range of different sizes, is to sort the lines in decreasing order 
> >> of lengths. This could give some benefit as short lines are more probable 
> >> to be included in others, and maxs removes array elements starting from 
> >> the end of the array and so with a little luck quickly does much less work.
> >> The corresponding verb is fstr.
> >>
> >> In the end though I don’t know how much better than quadratic time you can 
> >> hope for, as with the current specification it doesn’t seem obvious to 
> >> keep from comparing most elements to all the others.
> >>
> >> Might come back with some more ideas.
> >>
> >> Cheers!
> >> Louis
> >>
> >>> On 28 Jun 2019, at 20:20, Raul Miller <[email protected]> wrote:
> >>>
> >>> Does order of the substrings matter? (In other words, would it be ok to
> >>> treat 'B002|A001' like 'A001|B002'?)  -- Why or why not?
> >>>
> >>> Are all the part numbers equal width?
> >>>
> >>> Would any of the lines be duplicates of other lines? If so, how should 
> >>> that
> >>> be handled?
> >>>
> >>> My inclination would be to build a list of unique part numbers, replace
> >>> each line with a list of indices into then treat this as a numeric problem
> >>> (probably using sorting, possibly grouping the lines based on length). But
> >>> to do that properly, I would need a deeper understanding of the
> >>> requirements.
> >>>
> >>> Thanks,
> >>>
> >>> --
> >>> Raul
> >>>
> >>>> On Friday, June 28, 2019, vadim . <[email protected]> wrote:
> >>>>
> >>>> Example: given a file (a string), where each line is list of part
> >>>> numbers with separators, how to exclude lines which are substrings of
> >>>> other lines (cut on separators)? E.g. (first line is to be excluded):
> >>>>
> >>>>  z =: 0 : 0
> >>>> A001|B002
> >>>> C003|A001|B002
> >>>> B002|A001
> >>>> C003|D004|A001
> >>>> E005|F006
> >>>> D004|C003
> >>>> )
> >>>>  [lines =: <;._2 z
> >>>> +---------+--------------+---------+--------------+---------+---------+
> >>>> |A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
> >>>> +---------+--------------+---------+--------------+---------+---------+
> >>>>  [syms =: (s:@:('|'&,))&.> lines
> >>>> +-----------+-----------------+-----------+-----------------
> >>>> +-----------+-----------+
> >>>> |`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005
> >>>> `F006|`D004 `C003|
> >>>> +-----------+-----------------+-----------+-----------------
> >>>> +-----------+-----------+
> >>>>  (+./@:E.)&.(> :.])/~ syms
> >>>> 1 1 0 0 0 0
> >>>> 0 1 0 0 0 0
> >>>> 0 0 1 0 0 0
> >>>> 0 0 0 1 0 0
> >>>> 0 0 0 0 1 0
> >>>> 0 0 0 0 0 1
> >>>>  [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms
> >>>> 0 1 1 1 1 1
> >>>>  [result =: idx#lines
> >>>> +--------------+---------+--------------+---------+---------+
> >>>> |C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003|
> >>>> +--------------+---------+--------------+---------+---------+
> >>>>
> >>>>  I'm worried about performance of this line:
> >>>>
> >>>> (+./@:E.)&.(> :.])/~ syms
> >>>>
> >>>> other details are not very important, as I'm only learning. Phrase
> >>>> above uses form "+./@:E.", recommended for speed in J Wiki. I think
> >>>> table adverb must be optimized, too. But, adding some weight:
> >>>>
> >>>>  z =: 0 : 0
> >>>> 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
> >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
> >>>> 6Y1352|4P0489|2757803
> >>>> 3419308|3514531|3525716|3557019|3586192|3635776|3783741
> >>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
> >>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
> >>>> 3H5788|1F7854|8W1152|8R0721
> >>>> 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
> >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
> >>>> 6Y1352|4P0489|1336934
> >>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
> >>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
> >>>> 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
> >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
> >>>> 6Y1352|4P0489|2757803
> >>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
> >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|
> >>>> 6Y1352|4P0489|1336934
> >>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
> >>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|
> >>>> 3H5788|1F7854|8W1152
> >>>> )
> >>>>  lines =: <;._2 z
> >>>>  syms =: (s:@:('|'&,))&.> lines
> >>>>  syms10 =: ,(i.10) ]"0 _ syms
> >>>>  syms100 =: ,(i.100) ]"0 _ syms
> >>>>  syms1000 =: ,(i.1000) ]"0 _ syms
> >>>>  10 (6!:2) '(+./@:E.)&.(> :.])/~ syms'
> >>>> 3.936e_5
> >>>>  10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10'
> >>>> 0.00652701
> >>>>  10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100'
> >>>> 0.283609
> >>>>  1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000'
> >>>> 28.7405
> >>>>
> >>>> 28 seconds for 11000 short lines is unacceptable. Am I doing something
> >>>> totally wrong? For example, for this task Perl shows close to linear
> >>>> (definitely not quadratic) dependency, runs hundreds and thousands
> >>>> times faster.
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
> >>>>
> >>> ----------------------------------------------------------------------
> >>> For information about J forums see http://www.jsoftware.com/forums.htm
> >> ----------------------------------------------------------------------
> >> For information about J forums see http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Reply via email to