woops! Sorry, I changed the name of the adverb before sending the message (to something more descriptive than mxs3 :) but I forgot to switch the recursive call. Replace mxs3 by recmax and you should be good!
Louis > On 30 Jun 2019, at 17:59, vadim . <vadim3128m...@gmail.com> wrote: > > It's amazing. Thank you so much. I'm not yet comfortable with defining > my own adverbs, and you show me constructs like "&.(B&#)" are > possible. That will keep me busy for a while. One thing though, what's > definition of "mxs3"? > > >> On Sun, Jun 30, 2019 at 4:57 PM Louis de Forcrand <ol...@bluewin.ch> wrote: >> >> Hi again, >> >> I haven’t much time right now, but I’ve come up with two new programs: >> >> fastmax=: 1 : 0 >> B=. C=. 1"_1 y >> while. +./C do. >> i=. C i.1 >> B=. 0 i} B >> B=. -.@:u&(i{y)&.(B&#) y >> C=. C *. B >> B=. 1 i} B >> end. >> B#y >> ) >> >> recmax=: 1 : 0 >> if. 1>:n=. #y do. y return. end. >> 'a b'=. (>.-:n) ({. ,&<&(u mxs3) }.) y >> b=. b #~ -. +./"1 b u/ a >> a=. a #~ -. +./"1 a u/ b >> a,b >> ) >> >> cmp=: +./@E.&> >> >> fastincl=: cmp fastmax >> recincl =: cmp recmax >> >> I’m on my phone so times are quite slow anyhow, but fastincl is very fast on >> your example data: >> >> ts=: timespacex >> ts 'fastincl syms1000' >> 0.044349 203264 >> ts 'recincl syms1000' >> 0.316018 515072 >> >> As you can see recincl is slower. However I can’t tell you for sure which is >> better in The Real World. >> fastmax is a lot like maxs: it loops through each array element x, comparing >> it against all others. However instead of removing x if it is less than >> another element, it removes all items less than x, which makes for far less >> wasted computation and removes items at a much faster rate. >> recmax is recursive and hinges on the following fact: >> Split your argument list X into two disjoint parts, A and B. Then an element >> a of A is maximal in X (there is no item of X larger than it) _if and only >> if_ it is maximal in A and none of the maximal elements of B are larger than >> it. >> >> fastmax is so quick on syms1000 simply because of the amount of repetition. >> My guess is that recmax would compare much more favorably on less particular >> data. >> >> Cheers, >> Louis >> >>> On 30 Jun 2019, at 14:09, vadim . <vadim3128m...@gmail.com> wrote: >>> >>> Thank you for valuable answers. Louis, it will take me some time to >>> study your answer, it looks like the "J way" of solving problems which >>> I'm only hoping to grasp. Raul, I'll think on your suggestion to >>> convert the task into numeric problem. >>> >>> For me it's just learning exercise, a "real life" problem posted on >>> another language forum, with incomplete specification. Data sample was >>> provided by OP, part numbers can have different lengths. It's >>> reasonable to assume that one of duplicate lines, if any happen, must >>> be preserved. Whether to treat lines as unordered sets or strings -- >>> both interpretations are interesting, for now let them be strings. >>> >>> I don't have large enough data sample, OP said "millions of lines", >>> for lack of anything better I'm using concatenated copies. Now I see >>> why fast existing solution (Perl, actually -- concatenate lines into >>> one string using separators as guards, search for lines) is so fast: >>> it rejects a line as soon as second copy of it is found, i.e. almost >>> immediately. Real life "millions of lines" will scale quadratically, >>> of course. >>> >>> With J, the "(+./@:E.)" is optimized to stop searching if found once. >>> >>> However, it may be interesting for you what I found, one approach is >>> to work with characters, another to convert part numbers into symbols, >>> as I did above. >>> >>> Let "z" and "syms1000" be nouns as above. >>> >>> lines =: (<@:('|'&,@:,&'|'));._2 z NB. Lines with guards >>> lines1000 =: ,(i.1000) ]"0 _ lines >>> (6!:2) '1&= +/"1 lines1000 ((>@:[)E.])"0 _ ;lines1000' >>> 48.2911 >>> (6!:2) '1&= +/"1 syms1000 ((>@:[)E.])"0 _ ;(,&(s:'' ''))&.>syms1000' >>> 3.1777 >>> >>> Using symbols against one huge concatenated array of symbols is >>> fastest solution so far. Not much of "J way", maybe. Let's see if >>> "(+./@:E.)" really aborts further search, so time must drop to that of >>> Perl's solution, ~ 20 times per second, even though it will produce >>> wrong answer: >>> >>> >>> (6!:2) '1&= +/"1 lines1000 ((>@:[)(+./@:E.)])"0 _ ;lines1000' >>> 0.013242 >>> (6!:2) '1&= +/"1 syms1000 ((>@:[)(+./@:E.)])"0 _ ;(,&(s:'' >>> ''))&.>syms1000' >>> 0.810288 >>> >>> That's curios. In case of symbols, execution time doesn't decrease as >>> expected. >>> >>> Unfortunately, I can't modify "E." so that it stops searching after 2 >>> matches, or can I? Or, stop after match and return index? I think it >>> can be done with pcrp2, but then it all becomes more like Perl and not >>> J. Not very interesting. >>> >>>> On Sun, Jun 30, 2019 at 12:13 PM Louis de Forcrand <ol...@bluewin.ch> >>>> wrote: >>>> >>>> Hi, >>>> >>>> Try this: >>>> >>>> maxs=: 1 : 0 >>>> for_i. i.-#y do. >>>> if. (i{y) +./@:u (<<<i){y do. >>>> y=. (<<<i){y >>>> end. >>>> end. >>>> y >>>> ) >>>> >>>> cmp=: +./@E.&> >>>> >>>> fast=: cmp maxs NB. fast >>>> fstr=: fast@(\: #@>) NB. faster? >>>> >>>> maxs is a generic adverb that takes a boolean-valued verb and an array. >>>> The verb represents a partial order on the elements of the array, and maxs >>>> returns the set of maximal elements of the array, that is those which are >>>> strictly less than no other element (and also removes duplicates of those >>>> elements). >>>> >>>> This applies to the current problem as substring inclusion <= is a partial >>>> order (reflexivity: x <= x is always true; antisymmetry: x <= y <= x >>>> implies x = y; and transitivity: x <= y <= z implies x <= z). >>>> cmp is the partial order at hand, and fast is the corresponding verb. >>>> >>>> The key observation is that if we look at one element y and find that y <= >>>> z, we can safely remove y from the array and continue our search for >>>> maximal elements without comparing against y as, if for some x we have x >>>> <= y, then by transitivity x <= z and x will be removed anyway even if y >>>> is already gone. >>>> >>>> This should give a pretty good speedup when there is a lot of repetition / >>>> inclusion. >>>> >>>> Another thing to try for this particular problem when you have lines with >>>> a wide range of different sizes, is to sort the lines in decreasing order >>>> of lengths. This could give some benefit as short lines are more probable >>>> to be included in others, and maxs removes array elements starting from >>>> the end of the array and so with a little luck quickly does much less work. >>>> The corresponding verb is fstr. >>>> >>>> In the end though I don’t know how much better than quadratic time you can >>>> hope for, as with the current specification it doesn’t seem obvious to >>>> keep from comparing most elements to all the others. >>>> >>>> Might come back with some more ideas. >>>> >>>> Cheers! >>>> Louis >>>> >>>>> On 28 Jun 2019, at 20:20, Raul Miller <rauldmil...@gmail.com> wrote: >>>>> >>>>> Does order of the substrings matter? (In other words, would it be ok to >>>>> treat 'B002|A001' like 'A001|B002'?) -- Why or why not? >>>>> >>>>> Are all the part numbers equal width? >>>>> >>>>> Would any of the lines be duplicates of other lines? If so, how should >>>>> that >>>>> be handled? >>>>> >>>>> My inclination would be to build a list of unique part numbers, replace >>>>> each line with a list of indices into then treat this as a numeric problem >>>>> (probably using sorting, possibly grouping the lines based on length). But >>>>> to do that properly, I would need a deeper understanding of the >>>>> requirements. >>>>> >>>>> Thanks, >>>>> >>>>> -- >>>>> Raul >>>>> >>>>>> On Friday, June 28, 2019, vadim . <vadim3128m...@gmail.com> wrote: >>>>>> >>>>>> Example: given a file (a string), where each line is list of part >>>>>> numbers with separators, how to exclude lines which are substrings of >>>>>> other lines (cut on separators)? E.g. (first line is to be excluded): >>>>>> >>>>>> z =: 0 : 0 >>>>>> A001|B002 >>>>>> C003|A001|B002 >>>>>> B002|A001 >>>>>> C003|D004|A001 >>>>>> E005|F006 >>>>>> D004|C003 >>>>>> ) >>>>>> [lines =: <;._2 z >>>>>> +---------+--------------+---------+--------------+---------+---------+ >>>>>> |A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003| >>>>>> +---------+--------------+---------+--------------+---------+---------+ >>>>>> [syms =: (s:@:('|'&,))&.> lines >>>>>> +-----------+-----------------+-----------+----------------- >>>>>> +-----------+-----------+ >>>>>> |`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005 >>>>>> `F006|`D004 `C003| >>>>>> +-----------+-----------------+-----------+----------------- >>>>>> +-----------+-----------+ >>>>>> (+./@:E.)&.(> :.])/~ syms >>>>>> 1 1 0 0 0 0 >>>>>> 0 1 0 0 0 0 >>>>>> 0 0 1 0 0 0 >>>>>> 0 0 0 1 0 0 >>>>>> 0 0 0 0 1 0 >>>>>> 0 0 0 0 0 1 >>>>>> [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms >>>>>> 0 1 1 1 1 1 >>>>>> [result =: idx#lines >>>>>> +--------------+---------+--------------+---------+---------+ >>>>>> |C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003| >>>>>> +--------------+---------+--------------+---------+---------+ >>>>>> >>>>>> I'm worried about performance of this line: >>>>>> >>>>>> (+./@:E.)&.(> :.])/~ syms >>>>>> >>>>>> other details are not very important, as I'm only learning. Phrase >>>>>> above uses form "+./@:E.", recommended for speed in J Wiki. I think >>>>>> table adverb must be optimized, too. But, adding some weight: >>>>>> >>>>>> z =: 0 : 0 >>>>>> 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377| >>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| >>>>>> 6Y1352|4P0489|2757803 >>>>>> 3419308|3514531|3525716|3557019|3586192|3635776|3783741 >>>>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854 >>>>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| >>>>>> 3H5788|1F7854|8W1152|8R0721 >>>>>> 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| >>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| >>>>>> 6Y1352|4P0489|1336934 >>>>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377 >>>>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 >>>>>> 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377| >>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| >>>>>> 6Y1352|4P0489|2757803 >>>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| >>>>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| >>>>>> 6Y1352|4P0489|1336934 >>>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377 >>>>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| >>>>>> 3H5788|1F7854|8W1152 >>>>>> ) >>>>>> lines =: <;._2 z >>>>>> syms =: (s:@:('|'&,))&.> lines >>>>>> syms10 =: ,(i.10) ]"0 _ syms >>>>>> syms100 =: ,(i.100) ]"0 _ syms >>>>>> syms1000 =: ,(i.1000) ]"0 _ syms >>>>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms' >>>>>> 3.936e_5 >>>>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10' >>>>>> 0.00652701 >>>>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100' >>>>>> 0.283609 >>>>>> 1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000' >>>>>> 28.7405 >>>>>> >>>>>> 28 seconds for 11000 short lines is unacceptable. Am I doing something >>>>>> totally wrong? For example, for this task Perl shows close to linear >>>>>> (definitely not quadratic) dependency, runs hundreds and thousands >>>>>> times faster. >>>>>> ---------------------------------------------------------------------- >>>>>> For information about J forums see http://www.jsoftware.com/forums.htm >>>>>> >>>>> ---------------------------------------------------------------------- >>>>> For information about J forums see http://www.jsoftware.com/forums.htm >>>> ---------------------------------------------------------------------- >>>> For information about J forums see http://www.jsoftware.com/forums.htm >>> ---------------------------------------------------------------------- >>> For information about J forums see http://www.jsoftware.com/forums.htm >> ---------------------------------------------------------------------- >> For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm