Re: [Jprogramming] File Cleanup

Don Guinn Wed, 21 Feb 2018 09:08:35 -0800

Defining get as a tacit verb that looks up the index of the desired word
does make get pretty much unreadable when displayed, since the whole words
array ends up embedded in it; however, there is a possible performance gain,
as the hash table for i. is built only once, when get is defined. If you
will be running get many times, this could add up to a significant saving.
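
Something along these lines is what I have in mind. It is only a sketch: the
three-word list and the small integer table are toy stand-ins for the real
words and vec, and I am assuming the usual behaviour of the bound form
words&i. (the lookup table is prepared when the verb is created).

```j
NB. Toy stand-ins for the real data (assumed values, not the numberbatch file).
words =: 'adultly' ; 'adults' ; 'adulty'
vec   =: i. 3 4

NB. words&i. binds the word list when get is defined;
NB. < boxes the argument so it can match a boxed word.
get =: vec {~ words&i.@<

get 'adults'      NB. row 1 of vec: 4 5 6 7
```

A word that is not in the list makes i. return the length of words, so the
lookup into vec then fails with an index error rather than returning a wrong
row.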


Of course, once words has been read in, it must not be modified without
rebuilding get. But if it turns out that you don't need words for anything
other than get, you could erase words after get is defined, so the storage
used by a big verb is offset by not having words around any more.
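
A small sketch of that, again with toy data standing in for the real arrays
(the values are assumptions for illustration). The tacit definition captures
the value of words at the time get is defined, so the name can be erased
afterwards:

```j
NB. Toy stand-ins for the real data (assumed values).
words =: 'adultly' ; 'adults' ; 'adulty'
vec   =: i. 3 4
get   =: vec {~ words&i.@<

erase 'words'     NB. drop the global name; get keeps its own copy
get 'adulty'      NB. still works, row 2 of vec: 8 9 10 11
```

If words is ever read in again with different contents, the definition of
get has to be re-run to pick up the new list.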

On Wed, Feb 21, 2018 at 9:31 AM, R.E. Boss <[email protected]> wrote:

>  vec {~ (<'adults') i.~ words
> is perhaps what you are looking for
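
For what it's worth, the length errors further down come from each: it opens
the boxes, so 'adults' ends up being compared character by character with
words of other lengths. Without each, = compares whole boxes, and i. gives
the position directly. A sketch with a toy word list and table (assumed
values standing in for the real words and vec):

```j
NB. Toy stand-ins for the real data (assumed values).
words =: 'adultly' ; 'adults' ; 'adulty'
vec   =: i. 3 4

(<'adults') = words            NB. 0 1 0, one boxed word compared with each box
words i. <'adults'             NB. 1, the index of the first match
vec {~ (<'adults') i.~ words   NB. 4 5 6 7, the matching row of vec
```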
>
>
> R.E. Boss
>
>
> > -----Original Message-----
> > From: Programming [mailto:[email protected]]
> > On Behalf Of Skip Cave
> > Sent: woensdag 21 februari 2018 17:09
> > To: [email protected]
> > Subject: Re: [Jprogramming] File Cleanup
> >
> > Thanks to Raul and Mike for the suggestions.
> >
> > I read in the data:
> >
> >
> > nb =: <'C:\numberbatch-en.txt'
> >
> > nbs =. fread nb
> >
> >
> > Then I tried to clean it up:
> >
> >
> > Mike's method ran out of memory:
> >
> > nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs
> >
> > |out of memory
> >
> > When I tried to run it on a smaller set:
> >
> > nbs4=: (i.&' '({.;0".}.)])every 100000{. nbs
> >
> > nbs4
> >
> > ...
> >
> > │0││
> > ├─┼┤
> > │0││
> > ├─┼┤
> > │3││
> > ├─┼┤
> > │5││
> > ├─┼┤
> > │ ││
> > ├─┼┤
> > │0││
> > ├─┼┤
> > │.││
> > ├─┼┤
> > │0││
> > ├─┼┤
> > │7││
> > ├─┼┤
> > │8││
> > ├─┼┤
> > │2││
> > ├─┼┤
> >
> > So that wasn't working for me.
> >
> > I tried Raul's suggestion:
> >
> > words=. <@({.~ i.&' ');._2 nbs
> >
> > vec =. 0 1 }. _&".;._2 nbs
> >
> >
> > $words
> >
> > 417195
> >
> >
> > Looking good....
> >
> >
> > ,.20{. 6000}. words
> >
> > ┌────────────┐
> > │adultly     │
> > ├────────────┤
> > │adultness   │
> > ├────────────┤
> > │adultoid    │
> > ├────────────┤
> > │adultress   │
> > ├────────────┤
> > │adults      │
> > ├────────────┤
> > │adultship   │
> > ├────────────┤
> > │adulty      │
> > ├────────────┤
> > │adumbral    │
> > ├────────────┤
> > │adumbrant   │
> > ├────────────┤
> > │adumbrate   │
> > ├────────────┤
> > │adumbrated  │
> > ├────────────┤
> > │adumbrates  │
> > ├────────────┤
> > │adumbrating │
> > ├────────────┤
> > │adumbration │
> > ├────────────┤
> > │adumbrations│
> > ├────────────┤
> > │adumbrative │
> > ├────────────┤
> > │adunation   │
> > ├────────────┤
> > │adunc       │
> > ├────────────┤
> > │aduncate    │
> > ├────────────┤
> > │aduncity    │
> > └────────────┘
> >
> > $vec
> >
> > 417195 300
> >
> > 3 {. }.vec
> >
> > _0.0264 0.0468 _0.0099 _0.0242 _0.0762 0.0562 0.0863 0.0115 _0.0471
> > 0.0442 _0.0875 0.0376 _0.0404 _0.0086 0.0161 _0.1689 0.1485 _0.0201
> > 0.1021 _0.0635 _0.0317 0.0142 0.0588 _0.1299 _0.0905 0.0389 _0.0452
> > 0.1352 0.0731 0.0648 0.1309 0.0493 0.0785 0.015...
> >
> > _0.0096 0.0318 _0.0095 _0.042 _0.0831 0.1103 0.075 0.024 _0.0237 0.0398
> > _0.1274 _0.0299 _0.0209 _0.0195 _0.0043 _0.1033 0.1378 _0.0499 0.0517
> > _0.0958 _0.0651 0.0214 0.0096 _0.0855 _0.1049 0.036 _0.0562 0.043 0.0616
> > 0.1124 0.152 0.0418 0.0628 _0.018...
> >
> > _0.0364 0.0254 _0.0448 _0.0327 _0.0712 0.1548 0.1004 0.0033 _0.039 0.0635
> > _0.1179 _0.0703 _0.0359 0.0296 _0.0594 _0.0954 0.1904 _0.0301 0.0078
> > _0.0607 _0.0344 0.034 _0.0059 _0.1453 _0.0429 _0.0061 _0.05 0.0377 0.0959
> > 0.1313 0.1238 0.0302 0.0043 _0.038...
> >
> >
> > So this looks good!
> >
> >
> > Now I need a verb that will let me specify a word, and it will return the
> > associated vector.
> >
> > Here's how it should work:
> >
> >
> > tst =. get 'adults'
> >
> >
> > tst
> >
> > 0.1144 0.0444 0.0574 0.0387 0.082 _0.0271 0.209 _0.006 _0.1896 0.1038
> > _0.0257 0.0646 0.0488 _0.0065 0.0486 0.0422 0.0239 _0.1006 _0.0541 0.0511
> > _0.0254 _0.0121 0.0216 0.0324 _0.1349 0.0237 0.0049 0.0061 0.0349 _0.0264
> > 0.0086 0.0919 _0.0174 0.0645 ...
> >
> >
> > To build the 'get' verb we need to find the location of the word
> > 'adults' in the boxed words array:
> >
> > 'adults' = each words
> >
> > |length error
> >
> > | 'adults' =each words
> >
> >
> > Nope, that didn't work... Do I need to box the word?
> >
> >
> > (<'adults')=each words
> >
> > |length error
> >
> > | (<'adults') =each words
> >
> >
> > Nope! How do I find a specific word in the boxed word array?
> >
> >
> >
> >
> >
> >
> >
> >
> > Skip Cave
> > Cave Consulting LLC
> >
> > On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <[email protected]>
> > wrote:
> >
> > > I read in a text file of word vectors using fread. The format looks
> > > like
> > > this:
> > >
> > > bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
> > >
> > > bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
> > >
> > > belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
> > >
> > > belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
> > >
> > > Everything is literal text.
> > >
> > > The basic layout for each line is:
> > >
> > > word(s) (could contain multiple words separated by underscores) space
> > > number (positive or negative) in text format space number (positive or
> > > negative) in text format space
> > > ......   repeat for 300 numbers (in text)
> > >
> > > the last number is followed by a line feed for the next line
> > >
> > > I need to:
> > > 1. Convert all the high minuses (-) to J's low minus (_)
> > > 2. Extract the word(s) up to the first space into a separate array
> > >    (words)
> > > 3. Convert the text numbers into a 2D array of ? x 300 floating point
> > >    numbers
> > >
> > > I know how to do #1 (string replace), and #3 (".) once I get rid of
> > > the words, but I don't know how to strip out the initial word on each
> > > line and put them in a separate array. Any help is appreciated.
> > >
> > > Skip
> > >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm