Raul, Bob - I tried the blocked approach earlier on and realized the
infixes across boundaries would be a bit of a challenge. It was still
slower than I needed so I didn't fix up the boundary logic. It takes
84 seconds with a 1e6 block size and 43 seconds with a 10e6 block size
and around 43 seconds still for 20e6 block size, still under the
memory limit too which is good.  A 50e6 block size puts me over. It's
still faster than the memr method but I think that's mainly due to the
loop.

Devon - thanks for the link. I've been watching it on the Recent
Changes. I may try to apply it later

Code I was testing:

calc3 =: 4 : 0
blocks=: x[\y
uniq=:''
lim=.(# blocks)
for_i. (i. lim) do.
smoutput i
extra=.''
NB. get current block and subset from next
if. lim>(>:i) do.
extra=.(20{. (i+1)&{::blocks)
end.
blocktext=: (i{::blocks),extra
uniq=: ~. (uniq, ( 9[\ blocktext))
end.
# uniq
)

You can see how it produces the incorrect results -- I would need to
fix up the boundary logic.It's close on the large input. The correct
answer is 2,482,912 and this calculates 2482921. On a small input it's
clearly wrong

_9 calc3 'abcdefghijklmnopqrstuvwxyz'

Produces 35

Whereas the correct answer is: 3

# ~. _9[\'abcdefghijklmnopqrstuvwxyz'
3

Oh well

On Tue, Feb 3, 2015 at 4:44 PM, Devon McCormick <[email protected]> wrote:
> I've been updating work on my adverb for working with large files -
> http://www.jsoftware.com/jwiki/DevonMcCormick/Code/WorkOnLargeFiles - that
> may provide a general way to apply an arbitrary verb across a large file.
>
> On Tue, Feb 3, 2015 at 4:25 PM, Raul Miller <[email protected]> wrote:
>
>> Another approach, depending on what the processing looks like, might
>> involve selecting indices based on something smaller than nine
>> characters. For example, if certain character pairs were "valid" or
>> maybe even "better than others" you could get indices which select
>> those pairs and further refine from there.
>>
>> But the blocked approach can also be a very good one. (Just make sure
>> to handle infixes which span block boundaries.)
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Tue, Feb 3, 2015 at 3:54 PM, Robert Bernecky
>> <[email protected]> wrote:
>> > I agree that you need to have more than one cell on hand to do the nub!
>> >
>> > However,  try this approach. It's crude, but it might match or beat C:
>> >
>> >    - decide how much input you can process at once.
>> >    - Result =. ''
>> >    - while more hunks, loop over:
>> >        -    hunkresult = nub-of-infix on current hunk of the input.
>> >        -    Result = nub Result,hunkresult
>> >
>> > This lets J do most of the heavy lifting, while keeping you
>> > within the memory constraints.
>> >
>> > Bob
>> >
>> > On 15-02-03 03:45 PM, Joe Bogner wrote:
>> >>
>> >> Sorry for the double email, I realized I replied to you directly. I'm
>> >> replying to you directly since you replied to me directly, just in
>> >> case you didn't want to send it to chat...
>> >>
>> >> In any case, the whole goal is to get the nub of the 9 character
>> >> infixes. There is special code for some of the mathematical operations
>> >> on infixes (such as sum) that don't generate all the infixes, however
>> >> I having found anything for nub (distinct), which makes some sense
>> >> since it can't be reduced to an atom like the math operations
>> >>
>> >> Thanks for the ideas
>> >>
>> >>
>> >> On Tue, Feb 3, 2015 at 3:42 PM, Robert Bernecky
>> >> <[email protected]> wrote:
>> >>>
>> >>> "9 character infixes" - that looks like an input to a convolution, does
>> >>> it
>> >>> not?
>> >>> If so, then there should be some J dohickey (adverb or conjunction)
>> >>>   (My J is very very rusty, and likely obsolete...)
>> >>> to do that that will invoke your kernel verb without generating all the
>> >>> infixes.
>> >>>
>> >>> By convolution here, I mean something like moving-window inner
>> >>> product, string search, or, err, convolution.
>> >>>
>> >>> What processing do you apply to each infix?
>> >>>
>> >>> Bob
>> >>>
>> >>>
>> >>>
>> >>> On 15-02-03 03:36 PM, Joe Bogner wrote:
>> >>>>
>> >>>> Thanks Bob. I agree that compiling often helps. I have previously
>> >>>> kicked around the idea of generating byte code or machine code and
>> >>>> interpreting it or executing it. I have called dlls in other
>> >>>> circumstances, but it would be nice to have everything in J.
>> >>>>
>> >>>> In this case, the challenge put a 2GB constraint on memory and 60
>> >>>> seconds of execution time. It's an artificial constraint, which is why
>> >>>> I posted to chat. Those constraint wouldn't exist in the real world.
>> >>>>
>> >>>> I'm scanning a 240MB text file and generating 9 character infixes and
>> >>>> processing them. I hit the memory limit using 9]\ so I figured I'd try
>> >>>> looping and using memr (15!:1). In an earlier version, about 270
>> >>>> seconds were spent in the looping construct and 40 seconds in memr. It
>> >>>> seemed like there should be a faster looping construct
>> >>>>
>> >>>> Here's a simple version:
>> >>>>
>> >>>> txt=: a. {~ (97&+(i. 26))
>> >>>> addr=: mema (# txt)
>> >>>> txt memw addr,0,(#txt)
>> >>>> 6!:2 '([:<:([:0&{::(] ; smoutput&([:15!:1(addr,9,~<:@:]))    ))) ^:]
>> >>>> (<:#txt)'
>> >>>>
>> >>>>
>> >>>>      yz
>> >>>> xyz
>> >>>> wxyz
>> >>>> vwxyz
>> >>>> uvwxyz
>> >>>> tuvwxyz
>> >>>> stuvwxyz
>> >>>> rstuvwxyz
>> >>>> qrstuvwxy
>> >>>> pqrstuvwx
>> >>>> opqrstuvw
>> >>>> nopqrstuv
>> >>>> mnopqrstu
>> >>>> lmnopqrst
>> >>>> klmnopqrs
>> >>>> jklmnopqr
>> >>>> ijklmnopq
>> >>>> hijklmnop
>> >>>> ghijklmno
>> >>>> fghijklmn
>> >>>> efghijklm
>> >>>> defghijkl
>> >>>> cdefghijk
>> >>>> bcdefghij
>> >>>> abcdefghi
>> >>>>
>> >>>>
>> >>>> I would need more processing of the results, but you can see roughly
>> >>>> how it compares to 9]\
>> >>>>
>> >>>> 9[\txt
>> >>>> abcdefghi
>> >>>> bcdefghij
>> >>>> cdefghijk
>> >>>> defghijkl
>> >>>> efghijklm
>> >>>> fghijklmn
>> >>>> ghijklmno
>> >>>> hijklmnop
>> >>>> ijklmnopq
>> >>>> jklmnopqr
>> >>>> klmnopqrs
>> >>>> lmnopqrst
>> >>>> mnopqrstu
>> >>>> nopqrstuv
>> >>>> opqrstuvw
>> >>>> pqrstuvwx
>> >>>> qrstuvwxy
>> >>>> rstuvwxyz
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Feb 3, 2015 at 3:05 PM, Robert Bernecky
>> >>>> <[email protected]> wrote:
>> >>>>>
>> >>>>> Compiling often helps. When I compile looping, scalar-dominated
>> >>>>> APL code, I generally get about a factor of 1000X speedup.
>> >>>>> You should get the same sort of speedup with compiled J.
>> >>>>> [No, I do not have a J compiler, but will be happy to create one
>> >>>>> if funding were to materialize.]
>> >>>>>
>> >>>>> However, you mentioned that you were "looping due to memory
>> >>>>> constraints...". Is it possible that a different algorithm would let
>> >>>>> you do less looping? E.g., looping over 1e4 elements at once,
>> >>>>> or using a different approach, such as dynamic programming?
>> >>>>>
>> >>>>> Bob
>> >>>>>
>> >>>>>
>> >>>>> On 15-02-03 02:53 PM, Joe Bogner wrote:
>> >>>>>>
>> >>>>>> I'm playing around with a coding challenge that I'm solving with a
>> >>>>>> loop that needs to execute over 200 million times.
>> >>>>>>
>> >>>>>> Setting aside whether that's a smart approach, what is the quickest
>> >>>>>> looping construct in J? I realize that J is not optimized for this
>> >>>>>> style of programming, but I'm still curious if it can be done
>> >>>>>> efficiently.
>> >>>>>>
>> >>>>>> The quickest I've found takes about 7 seconds to iterate 100 million
>> >>>>>> times, whereas the same loop in C takes 0.03 seconds.
>> >>>>>>
>> >>>>>> 6!:2 '(] [ ])^:] 100e6'
>> >>>>>> 7.00952
>> >>>>>>
>> >>>>>> The train is taking up some time, but eventually I will need it
>> >>>>>>
>> >>>>>> No train - 2.7 seconds
>> >>>>>>
>> >>>>>> 6!:2 ']^:] 100e6'
>> >>>>>> 2.70656
>> >>>>>>
>> >>>>>>
>> >>>>>> Explicit:
>> >>>>>>
>> >>>>>> loop=: 3 : 0
>> >>>>>> ctr=:0
>> >>>>>> while. 1 do.
>> >>>>>> ctr=:>:ctr
>> >>>>>> if. y < ctr do.
>> >>>>>> smoutput 'done'
>> >>>>>> return.
>> >>>>>> end.
>> >>>>>> end.
>> >>>>>> )
>> >>>>>>
>> >>>>>> 6!:2 'loop 100e6'
>> >>>>>> 122.48
>> >>>>>>
>> >>>>>> Clearly explicit isn't the way to go
>> >>>>>>
>> >>>>>> Back to the tacit:
>> >>>>>>
>> >>>>>> log=:smoutput bind ]
>> >>>>>>
>> >>>>>>       ([: <: [: 0&{:: ] ; log )^:] 10
>> >>>>>> 10
>> >>>>>> 9
>> >>>>>> 8
>> >>>>>> 7
>> >>>>>> 6
>> >>>>>> 5
>> >>>>>> 4
>> >>>>>> 3
>> >>>>>> 2
>> >>>>>> 1
>> >>>>>> 0
>> >>>>>>
>> >>>>>> I would replace log with my function that uses the counter value
>> >>>>>>
>> >>>>>> That's pretty slow though on 100 million iterations - 88 seconds
>> >>>>>>
>> >>>>>> 6!:2 '([: <: [: 0&{:: ] ; ] )^:] 100e6'
>> >>>>>> 88.3644
>> >>>>>>
>> >>>>>> Let's try a new log
>> >>>>>>
>> >>>>>>      log=: ] <:@:[ (smoutput bind ])
>> >>>>>>       log 55
>> >>>>>> 55
>> >>>>>> 54
>> >>>>>>
>> >>>>>>
>> >>>>>> That looks like it could work. Let's put a dummy in for now
>> >>>>>>
>> >>>>>>      log=: ] <:@:[ ]
>> >>>>>>
>> >>>>>> 6!:2 'log^:] 100e6'
>> >>>>>> 40.9542
>> >>>>>>
>> >>>>>> Still 40 seconds... Any ideas on how to speed up iteration like
>> this?
>> >>>>>>
>> >>>>>>
>> >>>>>> Sidenote: I'm looping due to memory constraints placed on the coding
>> >>>>>> challenge
>> >>>>>>
>> ----------------------------------------------------------------------
>> >>>>>> For information about J forums see
>> http://www.jsoftware.com/forums.htm
>> >>>>>>
>> >>>>> --
>> >>>>> Robert Bernecky
>> >>>>> Snake Island Research Inc
>> >>>>> 18 Fifth Street
>> >>>>> Ward's Island
>> >>>>> Toronto, Ontario M5J 2B9
>> >>>>>
>> >>>>> [email protected]
>> >>>>> tel: +1 416 203 0854
>> >>>>>
>> >>>
>> >>> --
>> >>> Robert Bernecky
>> >>> Snake Island Research Inc
>> >>> 18 Fifth Street
>> >>> Ward's Island
>> >>> Toronto, Ontario M5J 2B9
>> >>>
>> >>> [email protected]
>> >>> tel: +1 416 203 0854
>> >>>
>> >
>> >
>> > --
>> > Robert Bernecky
>> > Snake Island Research Inc
>> > 18 Fifth Street
>> > Ward's Island
>> > Toronto, Ontario M5J 2B9
>> >
>> > [email protected]
>> > tel: +1 416 203 0854
>> >
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>>
>
>
>
> --
> Devon McCormick, CFA
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to