It's amazing. Thank you so much. I'm not yet comfortable with defining my own adverbs, and you show me constructs like "&.(B&#)" are possible. That will keep me busy for a while. One thing though, what's definition of "mxs3"?
On Sun, Jun 30, 2019 at 4:57 PM Louis de Forcrand <ol...@bluewin.ch> wrote: > > Hi again, > > I haven’t much time right now, but I’ve come up with two new programs: > > fastmax=: 1 : 0 > B=. C=. 1"_1 y > while. +./C do. > i=. C i.1 > B=. 0 i} B > B=. -.@:u&(i{y)&.(B&#) y > C=. C *. B > B=. 1 i} B > end. > B#y > ) > > recmax=: 1 : 0 > if. 1>:n=. #y do. y return. end. > 'a b'=. (>.-:n) ({. ,&<&(u mxs3) }.) y > b=. b #~ -. +./"1 b u/ a > a=. a #~ -. +./"1 a u/ b > a,b > ) > > cmp=: +./@E.&> > > fastincl=: cmp fastmax > recincl =: cmp recmax > > I’m on my phone so times are quite slow anyhow, but fastincl is very fast on > your example data: > > ts=: timespacex > ts 'fastincl syms1000' > 0.044349 203264 > ts 'recincl syms1000' > 0.316018 515072 > > As you can see recincl is slower. However I can’t tell you for sure which is > better in The Real World. > fastmax is a lot like maxs: it loops through each array element x, comparing > it against all others. However instead of removing x if it is less than > another element, it removes all items less than x, which makes for far less > wasted computation and removes items at a much faster rate. > recmax is recursive and hinges on the following fact: > Split your argument list X into two disjoint parts, A and B. Then an element > a of A is maximal in X (there is no item of X larger than it) _if and only > if_ it is maximal in A and none of the maximal elements of B are larger than > it. > > fastmax is so quick on syms1000 simply because of the amount of repetition. > My guess is that recmax would compare much more favorably on less particular > data. > > Cheers, > Louis > > > On 30 Jun 2019, at 14:09, vadim . <vadim3128m...@gmail.com> wrote: > > > > Thank you for valuable answers. Louis, it will take me some time to > > study your answer, it looks like the "J way" of solving problems which > > I'm only hoping to grasp. Raul, I'll think on your suggestion to > > convert the task into numeric problem. > > > > For me it's just learning exercise, a "real life" problem posted on > > another language forum, with incomplete specification. Data sample was > > provided by OP, part numbers can have different lengths. It's > > reasonable to assume that one of duplicate lines, if any happen, must > > be preserved. Whether to treat lines as unordered sets or strings -- > > both interpretations are interesting, for now let them be strings. > > > > I don't have large enough data sample, OP said "millions of lines", > > for lack of anything better I'm using concatenated copies. Now I see > > why fast existing solution (Perl, actually -- concatenate lines into > > one string using separators as guards, search for lines) is so fast: > > it rejects a line as soon as second copy of it is found, i.e. almost > > immediately. Real life "millions of lines" will scale quadratically, > > of course. > > > > With J, the "(+./@:E.)" is optimized to stop searching if found once. > > > > However, it may be interesting for you what I found, one approach is > > to work with characters, another to convert part numbers into symbols, > > as I did above. > > > > Let "z" and "syms1000" be nouns as above. > > > > lines =: (<@:('|'&,@:,&'|'));._2 z NB. Lines with guards > > lines1000 =: ,(i.1000) ]"0 _ lines > > (6!:2) '1&= +/"1 lines1000 ((>@:[)E.])"0 _ ;lines1000' > > 48.2911 > > (6!:2) '1&= +/"1 syms1000 ((>@:[)E.])"0 _ ;(,&(s:'' ''))&.>syms1000' > > 3.1777 > > > > Using symbols against one huge concatenated array of symbols is > > fastest solution so far. Not much of "J way", maybe. Let's see if > > "(+./@:E.)" really aborts further search, so time must drop to that of > > Perl's solution, ~ 20 times per second, even though it will produce > > wrong answer: > > > > > > (6!:2) '1&= +/"1 lines1000 ((>@:[)(+./@:E.)])"0 _ ;lines1000' > > 0.013242 > > (6!:2) '1&= +/"1 syms1000 ((>@:[)(+./@:E.)])"0 _ ;(,&(s:'' > > ''))&.>syms1000' > > 0.810288 > > > > That's curios. In case of symbols, execution time doesn't decrease as > > expected. > > > > Unfortunately, I can't modify "E." so that it stops searching after 2 > > matches, or can I? Or, stop after match and return index? I think it > > can be done with pcrp2, but then it all becomes more like Perl and not > > J. Not very interesting. > > > >> On Sun, Jun 30, 2019 at 12:13 PM Louis de Forcrand <ol...@bluewin.ch> > >> wrote: > >> > >> Hi, > >> > >> Try this: > >> > >> maxs=: 1 : 0 > >> for_i. i.-#y do. > >> if. (i{y) +./@:u (<<<i){y do. > >> y=. (<<<i){y > >> end. > >> end. > >> y > >> ) > >> > >> cmp=: +./@E.&> > >> > >> fast=: cmp maxs NB. fast > >> fstr=: fast@(\: #@>) NB. faster? > >> > >> maxs is a generic adverb that takes a boolean-valued verb and an array. > >> The verb represents a partial order on the elements of the array, and maxs > >> returns the set of maximal elements of the array, that is those which are > >> strictly less than no other element (and also removes duplicates of those > >> elements). > >> > >> This applies to the current problem as substring inclusion <= is a partial > >> order (reflexivity: x <= x is always true; antisymmetry: x <= y <= x > >> implies x = y; and transitivity: x <= y <= z implies x <= z). > >> cmp is the partial order at hand, and fast is the corresponding verb. > >> > >> The key observation is that if we look at one element y and find that y <= > >> z, we can safely remove y from the array and continue our search for > >> maximal elements without comparing against y as, if for some x we have x > >> <= y, then by transitivity x <= z and x will be removed anyway even if y > >> is already gone. > >> > >> This should give a pretty good speedup when there is a lot of repetition / > >> inclusion. > >> > >> Another thing to try for this particular problem when you have lines with > >> a wide range of different sizes, is to sort the lines in decreasing order > >> of lengths. This could give some benefit as short lines are more probable > >> to be included in others, and maxs removes array elements starting from > >> the end of the array and so with a little luck quickly does much less work. > >> The corresponding verb is fstr. > >> > >> In the end though I don’t know how much better than quadratic time you can > >> hope for, as with the current specification it doesn’t seem obvious to > >> keep from comparing most elements to all the others. > >> > >> Might come back with some more ideas. > >> > >> Cheers! > >> Louis > >> > >>> On 28 Jun 2019, at 20:20, Raul Miller <rauldmil...@gmail.com> wrote: > >>> > >>> Does order of the substrings matter? (In other words, would it be ok to > >>> treat 'B002|A001' like 'A001|B002'?) -- Why or why not? > >>> > >>> Are all the part numbers equal width? > >>> > >>> Would any of the lines be duplicates of other lines? If so, how should > >>> that > >>> be handled? > >>> > >>> My inclination would be to build a list of unique part numbers, replace > >>> each line with a list of indices into then treat this as a numeric problem > >>> (probably using sorting, possibly grouping the lines based on length). But > >>> to do that properly, I would need a deeper understanding of the > >>> requirements. > >>> > >>> Thanks, > >>> > >>> -- > >>> Raul > >>> > >>>> On Friday, June 28, 2019, vadim . <vadim3128m...@gmail.com> wrote: > >>>> > >>>> Example: given a file (a string), where each line is list of part > >>>> numbers with separators, how to exclude lines which are substrings of > >>>> other lines (cut on separators)? E.g. (first line is to be excluded): > >>>> > >>>> z =: 0 : 0 > >>>> A001|B002 > >>>> C003|A001|B002 > >>>> B002|A001 > >>>> C003|D004|A001 > >>>> E005|F006 > >>>> D004|C003 > >>>> ) > >>>> [lines =: <;._2 z > >>>> +---------+--------------+---------+--------------+---------+---------+ > >>>> |A001|B002|C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003| > >>>> +---------+--------------+---------+--------------+---------+---------+ > >>>> [syms =: (s:@:('|'&,))&.> lines > >>>> +-----------+-----------------+-----------+----------------- > >>>> +-----------+-----------+ > >>>> |`A001 `B002|`C003 `A001 `B002|`B002 `A001|`C003 `D004 `A001|`E005 > >>>> `F006|`D004 `C003| > >>>> +-----------+-----------------+-----------+----------------- > >>>> +-----------+-----------+ > >>>> (+./@:E.)&.(> :.])/~ syms > >>>> 1 1 0 0 0 0 > >>>> 0 1 0 0 0 0 > >>>> 0 0 1 0 0 0 > >>>> 0 0 0 1 0 0 > >>>> 0 0 0 0 1 0 > >>>> 0 0 0 0 0 1 > >>>> [idx =: 1&= +/"1 (+./@:E.)&.(> :.])/~ syms > >>>> 0 1 1 1 1 1 > >>>> [result =: idx#lines > >>>> +--------------+---------+--------------+---------+---------+ > >>>> |C003|A001|B002|B002|A001|C003|D004|A001|E005|F006|D004|C003| > >>>> +--------------+---------+--------------+---------+---------+ > >>>> > >>>> I'm worried about performance of this line: > >>>> > >>>> (+./@:E.)&.(> :.])/~ syms > >>>> > >>>> other details are not very important, as I'm only learning. Phrase > >>>> above uses form "+./@:E.", recommended for speed in J Wiki. I think > >>>> table adverb must be optimized, too. But, adding some weight: > >>>> > >>>> z =: 0 : 0 > >>>> 2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377| > >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| > >>>> 6Y1352|4P0489|2757803 > >>>> 3419308|3514531|3525716|3557019|3586192|3635776|3783741 > >>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854 > >>>> 3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| > >>>> 3H5788|1F7854|8W1152|8R0721 > >>>> 3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| > >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| > >>>> 6Y1352|4P0489|1336934 > >>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377 > >>>> 4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788 > >>>> 6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377| > >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| > >>>> 6Y1352|4P0489|2757803 > >>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| > >>>> 3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908| > >>>> 6Y1352|4P0489|1336934 > >>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377 > >>>> 6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377| > >>>> 3H5788|1F7854|8W1152 > >>>> ) > >>>> lines =: <;._2 z > >>>> syms =: (s:@:('|'&,))&.> lines > >>>> syms10 =: ,(i.10) ]"0 _ syms > >>>> syms100 =: ,(i.100) ]"0 _ syms > >>>> syms1000 =: ,(i.1000) ]"0 _ syms > >>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms' > >>>> 3.936e_5 > >>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms10' > >>>> 0.00652701 > >>>> 10 (6!:2) '(+./@:E.)&.(> :.])/~ syms100' > >>>> 0.283609 > >>>> 1 (6!:2) '(+./@:E.)&.(> :.])/~ syms1000' > >>>> 28.7405 > >>>> > >>>> 28 seconds for 11000 short lines is unacceptable. Am I doing something > >>>> totally wrong? For example, for this task Perl shows close to linear > >>>> (definitely not quadratic) dependency, runs hundreds and thousands > >>>> times faster. > >>>> ---------------------------------------------------------------------- > >>>> For information about J forums see http://www.jsoftware.com/forums.htm > >>>> > >>> ---------------------------------------------------------------------- > >>> For information about J forums see http://www.jsoftware.com/forums.htm > >> ---------------------------------------------------------------------- > >> For information about J forums see http://www.jsoftware.com/forums.htm > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm