This looks like you are applying desc to an array that does not have rank
2. I don't see how that can happen if you entered this exactly, since the
argument of desc must have shape 39 2:

desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
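
One way to narrow this down, as a minimal sketch: assign the reshaped array to a name first and look at its shape and rank before calling desc. The name m is only for illustration and is not part of your script; u and w are the variables from fhist.apl.

```apl
⍝ m is a hypothetical name; u and w are the variables from fhist.apl
m ← 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
⍴m       ⍝ shape of m; desc expects 39 2
⍴⍴m      ⍝ rank of m; desc indexes with [;2], so this should be 2
desc m
```

If this works, the line in your script probably differs in some way from the one above.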

Jay.

On 12 September 2016 at 18:34, Ala'a Mohammad wrote:

> Thanks for the alternative. I tried to run it, but got a rank error:
>
> RANK ERROR
> λ1[1]  λ←⍵[⍒⍵[;2];]
>             ^    ^
>
> How can I help debug this?
>
> Regards,
>
> Ala'a
>
> On Mon, Sep 12, 2016 at 5:32 PM, Jay Foad wrote:
> > Hi Ala'a,
> >
> > How about replacing the last line with this? It runs in about 1 minute
> > on my machine:
> >
> > desc 39 2⍴(⍪u),≢¨⊂⍨x[⍋x←u⍳w]
> >
> > Jay.
> >
> > On 11 September 2016 at 19:23, Ala'a Mohammad wrote:
> >>
> >> Just an update for reference: I'm now able to parse the big.txt file
> >> (without WS FULL or a killed process), but it takes around 2 hours and
> >> 20 minutes, ±10 minutes (around 1M words, of which 30K are unique). The
> >> process reaches 1GiB (after parsing the words), and adds another
> >> 100MiB during the sequential 'Each' (thus a max of 1.1GiB).
> >>
> >> The only change is scanning each unique word against the whole words
> >> vector.
> >>
> >> Below is the code with a sample timed run.
> >>
> >> Regards,
> >>
> >> Ala'a
> >>
> >> ⍝ fhist.apl
> >> a ← 'abcdefghijklmnopqrstuvwxyz' ◊ A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> >> downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> >> nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
> >> nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> >> alphamask ← { ~ ⍵ ∊ nonalpha }
> >> words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> >> desc ← {⍵[⍒⍵[;2];]}
> >> ftxt ← { ⎕FIO[26] ⍵ }
> >>
> >> file ← '/misc/big.txt' ⍝ ~ 6.2M
> >> ⎕ ← ⍴w ← words ftxt file
> >> ⎕ ← ⍴u ← ∪w
> >> desc 39 2 ⍴ (⍪u),{+/(⊂⍵)∘.≡w}¨u
> >> )OFF
> >>
> >> : time apl -s -f fhist.apl
> >> 1098281
> >> 30377
> >>  the            80003
> >>  of             40025
> >>  to             28760
> >>  in             22048
> >>  for             6936
> >>  by              6736
> >>  be              6154
> >>  or              5349
> >>  all             4141
> >>  this            4058
> >>  are             3627
> >>  other           1488
> >>  before          1363
> >>  should          1297
> >>  over            1282
> >>  your            1276
> >>  any             1204
> >>  our             1065
> >>  holmes           450
> >>  country          417
> >>  world            355
> >>  project          286
> >>  gutenberg        262
> >>  laws             233
> >>  sir              176
> >>  series           128
> >>  sure             123
> >>  sherlock         101
> >>  ebook             85
> >>  copyright         69
> >>  changing          44
> >>  check             38
> >>  arthur            30
> >>  adventures        17
> >>  redistributing     7
> >>  header             7
> >>  doyle              5
> >>  downloading        5
> >>  conan              4
> >>
> >> apl -s -f fhist.apl  8901.96s user 5.78s system 99% cpu 2:28:38.61 total
> >>
> >> On Sat, Sep 10, 2016 at 12:02 PM, Ala'a Mohammad wrote:
> >> > Thanks to all for the input,
> >> >
> >> > Replacing Find and Each OR with Match helped. I'm now parsing a 159K
> >> > (~1545 lines) text file (a sample chunk from big.txt).
> >> >
> >> > The strange thing, which I'm trying to understand, is that the APL
> >> > process (when fed the 159K text file) starts allocating memory until it
> >> > reaches 2.7GiB, then after printing the result settles down to 50MiB.
> >> > Why do I need 2.7GiB? Are there any memory utilities (e.g. a garbage
> >> > collection utility) which can be used to mitigate this issue?
> >> >
> >> > Here is the updated code:
> >> >
> >> > a ← 'abcdefghijklmnopqrstuvwxyz'
> >> > A ← 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
> >> > downcase ← { (a,⎕AV)[(A,⎕AV)⍳⍵] }
> >> > nl ← ⎕UCS 10 ◊ cr ← ⎕UCS 13 ◊ tab ← ⎕UCS 9
> >> > nonalpha ← nl, cr, tab, ' 0123456789()[]!?%$,.:;/+*=<>-_#"`~@&'
> >> > alphamask ← { ~ ⍵ ∊ nonalpha }
> >> > words ← { (alphamask ⍵) ⊂ downcase ⍵ }
> >> > hist ← { (⍪∪⍵),+/(∪⍵)∘.≡⍵ } ⍝ as suggested by Kacper
> >> > desc ← {⍵[⍒⍵[;2];]}
> >> > ftxt ← { ⎕FIO[26] ⍵ }
> >> > fhist ← { hist words ftxt ⍵ }
> >> >
> >> > file ← '/misc/llaa' ⍝ llaa contains 1546 text lines
> >> > ⎕ ← ⍴w ← words ftxt file
> >> > ⎕ ← ⍴u ← ∪w
> >> > desc 39 2 ⍴ fhist file
> >> >
> >> > And here is a sample run
> >> > : apl -s -f fhist.apl
> >> > 30186
> >> > 4155
> >> >  the            1560
> >> >  to              804
> >> >  of              781
> >> >  in              493
> >> >  for             219
> >> >  be              173
> >> >  holmes          164
> >> >  your            132
> >> >  this            114
> >> >  all              99
> >> >  by               97
> >> >  are              97
> >> >  or               73
> >> >  other            56
> >> >  over             51
> >> >  our              48
> >> >  should           47
> >> >  before           43
> >> >  sherlock         39
> >> >  any              35
> >> >  sir              26
> >> >  sure             13
> >> >  country           9
> >> >  project           6
> >> >  gutenberg         6
> >> >  ebook             5
> >> >  adventures        5
> >> >  world             5
> >> >  arthur            4
> >> >  conan             4
> >> >  doyle             4
> >> >  series            2
> >> >  copyright         2
> >> >  laws              2
> >> >  check             2
> >> >  header            2
> >> >  changing          1
> >> >  downloading       1
> >> >  redistributing    1
> >> >
> >> > I have also attached the sample input file.
> >> >
> >> > Regards,
> >> >
> >> > On Sat, Sep 10, 2016 at 9:20 AM, Kacper Gutowski wrote:
> >> >> On 9 September 2016 at 23:39, Ala'a Mohammad wrote:
> >> >>> the errors happened inside the 'hist' function, and I presume mostly due
> >> >>> to the jot dot find (if I understand correctly, operating on a matrix
> >> >>> of size equal to: unique-length * words-length)
> >> >>
> >> >> Try (∪⍵)∘.≡⍵ instead of ∨/¨(∪⍵)∘.⍷⍵.
> >> >>
> >> >> -k
> >>
> >
>
