Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Louis de Forcrand Sun, 07 Jul 2019 13:13:14 -0700

Hmm,

I’m disappointed the slight « loopiness » of my solutions makes them that much 
slower than the naïve approach :(
There probably is however little inclusion among random words, maybe try again 
with, say, the first 10000 words? :)


Also, there is special code for calculating +/x E.y as x +/@:E. y; try the 
following modification of your verb (probably won’t be much faster but will 
definitely use less space):

naiveS=: 3 : 0
 joined=. ; (,&'|')&.> y
  y #~ 1 = +/@E.&joined@> y
)

Will keep thinking!
Louis

PS: If naiveS isn’t any better try +/@:E. instead of +/@E., as only the former 
is actually listed in the special code section of the dictionary even though 
they are equivalent.

> On 7 Jul 2019, at 20:15, vadim . <vadim3128m...@gmail.com> wrote:
> 
> Hi, Louis,
> 
> I've got mixed results here. The "cmaxs" solution is, indeed, fastest
> so far, but (1) even then I'm not brave enough to test it against more
> than 10000 lines (certainly not "millions"), and, (2) perhaps because
> of "words_alpha" (simple arrays of characters) as test subject, very
> naive approach wins.
> 
>   s =. freads 'words_alpha.txt'
>   lines =: <;._2 s
>   lines10000 =: (10000 ?. #lines) { lines
> 
>   ts 'fastincl lines10000'
> 17.6488 190720
>   ts 'recincl lines10000'
> 16.871 3.3762e7
>   ts '(P cmaxs L) lines10000'
> 7.76267 395264
>   ts 'naive lines10000'
> 1.47661 1.07415e9
> 
> Where "naive" follows aforementioned Perl approach, except it doesn't
> stop searching after second match, and its huge memory appetite (my
> fault? could it be written better?) won't allow more than ~10000
> samples anyway (FWIW, Perl script, though on different randomly chosen
> shuffled 10000 lines, completes in 1.1 sec and of course doesn't eat 1
> Gb of RAM).
> 
>   naive =: 3 : 0
> joined =. ;(,&'|')&.> y
> (1&= +/"1 y ((>@:[)E.])"0 _ joined)#y
> )
> 
> On the other hand, I appreciate that your solutions are very general,
> thank you for shedding the light on correct path. E.g. "P" can be
> easily replaced to test for unordered subsets instead. Then another
> dilemma, whether repeated part numbers are allowed on one line...
> 
> Best regards,
> Vadim
> 
>> On Sat, Jul 6, 2019 at 5:35 AM Louis de Forcrand <ol...@bluewin.ch> wrote:
>> 
>> Hi again,
>> 
>> I’m back with a speedier solution.
>> comaxs is a conjunction, and if P is a boolean-valued dyadic verb and L any 
>> monadic verb on the items of the array A, P comaxs L A does the following:
>> 
>> P and L are supposed to represent orderings on the elements of y. P 
>> represents a partial order, such that x P y says whether x is less than or 
>> equal to y for x and y any items of A. L is a “labelling” of the items of A, 
>> which represents a _total_ order compatible with P, in the sense that x P y 
>> implies that (L x) <: L y (or more generally that L x comes first in /:~(L 
>> x),L y).
>> 
>> Then the result of the computation is the set of maximal items of A for the 
>> partial order P, that is, items not less than any other item.
>> 
>> As mentioned earlier, the particular problem of inclusion of part number 
>> lists is a special case of this; indeed we can take
>> 
>> P=: +./@E.&>
>> L=: #@>
>> 
>> since, for part number lists (or any kind of lists) x and y, if x is 
>> included in y, then certainly #x is no larger than #y.
>> 
>> To measure performance we need a big, rather random dataset. Unfortunately I 
>> don’t know of any place I can find huge lists of part numbers, some of which 
>> include others, but if we look instead at lists of letters, we have a great 
>> data source which is the dictionary. At the following website we can find a 
>> quite huge file words_alpha.txt containing lots (too many for any of my 
>> devices probably) of english words, many of which are included in others:
>> 
>> https://github.com/dwyl/english-words
>> 
>> Running freads on it allows us to try out cmaxs on a part of the English 
>> language. Sadly though I’m on my phone right now, and I can’t freads 
>> downloaded files, so either I’ll do it at a later time or someone else can 
>> try.
>> 
>> The code works thanks to the previous observation used in fastmax that we 
>> can safely remove an element a from A that is included in another element, 
>> and no longer check other items for inclusion in a. The reason this verb is 
>> much faster than fastmax is because now we also know that if #x is greater 
>> than #y then we never have to check if x is included in y. (More generally, 
>> (L x) > L y implies -. x P y.)
>> We therefore sort A by decreasing list length (L value) prior to running a 
>> modified version of fastmax which only checks if  (i{A) P (j{A) when i > j.
>> The last bit if speedup if achieved by manipulating bit arrays rather than 
>> the actual array A itself.
>> 
>> Cheers,
>> Louis
>> 
>> 
>> cmaxs=: 2 : 0
>> n=. #B=. 0"0 C=. 1"0 y=. (\: v) y
>> while. n>i=. C i.1 do.
>>  B=. 1 i}B
>>  C=. 0 i}C
>>  C=. -.@:u&(i{y)&.(C&#) y
>> end.
>> B#y
>> )
>> 
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Re: [Jprogramming] How to improve speed searching for substrings using "E."?

Reply via email to