This looks like you are applying desc to an array that does not have rank 2. I don't see how that can happen if you entered this exactly, since the argument of desc must have shape 39 2:
desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
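In case it helps to debug: desc only does ⍵[⍒⍵[;2];], so its argument
needs exactly two axes. A minimal check (a sketch of mine, with u and w
as in your script):

      ⍴⍴ (⍪u),≢¨⊂⍨x[⍋x←u⍳w]   ⍝ rank: should print 2 (a matrix)
      ⍴  (⍪u),≢¨⊂⍨x[⍋x←u⍳w]   ⍝ shape: 30377 2; 39 2⍴ keeps the first 39 rows
      desc ⍳5                 ⍝ reproduces the error: a vector has no ⍵[;2]

(A note on why this line is so much faster than the ∘.≡ versions follows
after the quoted thread below.)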
Jay.

On 12 September 2016 at 18:34, Ala'a Mohammad <[email protected]> wrote:
> Thanks for the alternative. I tried to run it, but got a Rank Error:
>
> RANK ERROR
> λ1[1]  λ←⍵[⍒⍵[;2];]
>        ^         ^
>
> How can I help debug this?
>
> Regards,
>
> Ala'a
>
> On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad <[email protected]> wrote:
> > Hi Ala'a,
> >
> > How about replacing the last line with this? It runs in about 1 minute
> > on my machine:
> >
> > desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
> >
> > Jay.
> >
> > On 11 September 2016 at 19:23, Ala'a Mohammad <[email protected]> wrote:
> >>
> >> Just an update for reference: I'm now able to parse the big.txt file
> >> (without a WS FULL or a killed process), but it takes around 2 hours
> >> and 20 minutes (±10 minutes). (Around 1M words, of which 30K are
> >> unique.) The process reaches 1GiB (after parsing the words), and adds
> >> another 100MiB during the sequential 'Each' (thus a maximum of 1.1GiB).
> >>
> >> The only change is scanning each unique word against the whole words
> >> vector.
> >>
> >> Below is the code with a sample timed run.
> >>
> >> Regards,
> >>
> >> Ala'a
> >>
> >> ⍝ fhist.apl
> >> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> >> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> >> nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> >> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> >> alphamask ← { ~ ⍵ ∊ nonalpha }
> >> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> >> desc ← {⍵[⍒⍵[;2];]}
> >> ftxt ← { ⎕FIO[26] ⍵ }
> >>
> >> file ← '/misc/big.txt' ⍝ ~ 6.2M
> >> ⎕ ← ⍴w ← words ftxt file
> >> ⎕ ← ⍴u ← ∪w
> >> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> >> )OFF
> >>
> >> : time apl -s -f fhist.apl
> >> 1098281
> >> 30377
> >>  the 80003
> >>  of 40025
> >>  to 28760
> >>  in 22048
> >>  for 6936
> >>  by 6736
> >>  be 6154
> >>  or 5349
> >>  all 4141
> >>  this 4058
> >>  are 3627
> >>  other 1488
> >>  before 1363
> >>  should 1297
> >>  over 1282
> >>  your 1276
> >>  any 1204
> >>  our 1065
> >>  holmes 450
> >>  country 417
> >>  world 355
> >>  project 286
> >>  gutenberg 262
> >>  laws 233
> >>  sir 176
> >>  series 128
> >>  sure 123
> >>  sherlock 101
> >>  ebook 85
> >>  copyright 69
> >>  changing 44
> >>  check 38
> >>  arthur 30
> >>  adventures 17
> >>  redistributing 7
> >>  header 7
> >>  doyle 5
> >>  downloading 5
> >>  conan 4
> >>
> >> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
> >>
> >> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad <[email protected]>
> >> wrote:
> >> > Thanks to all for the input.
> >> >
> >> > Replacing Find and Each-OR with Match helped; now I'm parsing a 159K
> >> > (~1545 lines) text file (a sample chunk from big.txt).
> >> >
> >> > The thing I'm trying to understand is why the APL process (when fed
> >> > the 159K text file) keeps allocating memory until it reaches 2.7GiB,
> >> > then settles down to 50MiB after printing the result. Why do I need
> >> > 2.7GiB? Are there any memory utilities (e.g. a garbage-collection
> >> > facility) which could be used to mitigate this issue?
> >> >
> >> > Here is the updated code:
> >> >
> >> > a ← 'abcdefghijklmnopqrstuvwxyz'
> >> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> >> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> >> > nl ← ⎕UCS 13 ◊ cr ← ⎕UCS 10 ◊ tab ← ⎕UCS 9
> >> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> >> > alphamask ← { ~ ⍵ ∊ nonalpha }
> >> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> >> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> >> > desc ← {⍵[⍒⍵[;2];]}
> >> > ftxt ← { ⎕FIO[26] ⍵ }
> >> > fhist ← { hist words ftxt ⍵ }
> >> >
> >> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> >> > ⎕ ← ⍴w ← words ftxt file
> >> > ⎕ ← ⍴u ← ∪w
> >> > desc 39 2 ⍴ fhist file
> >> >
> >> > And here is a sample run:
> >> >
> >> > : apl -s -f fhist.apl
> >> > 30186
> >> > 4155
> >> >  the 1560
> >> >  to 804
> >> >  of 781
> >> >  in 493
> >> >  for 219
> >> >  be 173
> >> >  holmes 164
> >> >  your 132
> >> >  this 114
> >> >  all 99
> >> >  by 97
> >> >  are 97
> >> >  or 73
> >> >  other 56
> >> >  over 51
> >> >  our 48
> >> >  should 47
> >> >  before 43
> >> >  sherlock 39
> >> >  any 35
> >> >  sir 26
> >> >  sure 13
> >> >  country 9
> >> >  project 6
> >> >  gutenberg 6
> >> >  ebook 5
> >> >  adventures 5
> >> >  world 5
> >> >  arthur 4
> >> >  conan 4
> >> >  doyle 4
> >> >  series 2
> >> >  copyright 2
> >> >  laws 2
> >> >  check 2
> >> >  header 2
> >> >  changing 1
> >> >  downloading 1
> >> >  redistributing 1
> >> >
> >> > Also attached is the sample input file.
> >> >
> >> > Regards,
> >> >
> >> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski <[email protected]>
> >> > wrote:
> >> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
> >> >>> the errors happened inside the 'hist' function, and I presume mostly
> >> >>> due to the jot-dot-find (if I understand correctly, operating on a
> >> >>> matrix of size unique-length × words-length)
> >> >>
> >> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
> >> >>
> >> >> -k
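PS on the time and memory questions quoted above; the cell size below is
an assumption of mine, not a measured figure. The ∘.≡ approach compares
every unique word with every word: for the 159K sample, (∪⍵)∘.≡⍵
materialises a 4155×30186 Boolean matrix, about 125 million cells, which
at an interpreter overhead on the order of 20 bytes per cell comes to
roughly 2.3GiB, enough to account for most of the observed 2.7GiB peak.
That matrix is a temporary, which is why the process settles down again
once the result has been printed. The {+/(⊂⍵)∘.≡w}¨u variant computes one
row at a time, which caps memory near 1.1GiB but still performs all
30377×1098281 ≈ 33 billion comparisons, hence the 2.5 hours. The sorted
one-liner avoids the matrix entirely; decomposed step by step:

      x ← u⍳w     ⍝ every word as an index into the unique list u
      x ← x[⍋x]   ⍝ sort ascending, so equal indices become adjacent
      ⊂⍨x         ⍝ x⊂x: partition; a new group wherever the value increases
      ≢¨⊂⍨x       ⍝ group sizes = occurrence counts, in the order of u

The partition works because indices from ⍳ start at 1 under the default
⎕IO (a 0 in the left argument would drop items), and since x is sorted,
each distinct index forms exactly one group, in the same order as u, so
(⍪u) pastes on the matching labels. The cost is one ⍳, one grade and one
pass over the data, instead of (≢u)×(≢w) word comparisons. The same
reasoning applies to the earlier ≡-for-⍷ substitution: ≡ matches each
whole word once, while ⍷ searches every character position and still
needs the ∨/¨ reduction on top.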
