Re: [Jprogramming] File Cleanup

Don Guinn Wed, 21 Feb 2018 08:24:37 -0800

You need to convert words to a list. Also, you might use &> instead of each,
since the result needs to be unboxed before it can be used as an index.
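
A minimal sketch of that idea, using a three-word toy list in place of the
real words and vec arrays. The definitions of get and get2 below are
assumptions, only one possible way to write the verb, and have not been run
against the numberbatch file. The attempts with = fail because = compares
character by character and needs equal lengths, while -: (match) compares
whole arrays.

```j
NB. toy stand-ins for the words and vec arrays from the quoted message
words =: 'adultship';'adults';'adulty'
vec   =: 3 4 $ 0.1 * i. 12

NB. -: (match) compares whole arrays, so words of different length give 0
NB. instead of a length error; &> opens both sides, so the result is unboxed
mask =: (<'adults') -:&> words      NB. 0 1 0

NB. assumed definition of get: keep the rows of vec whose word matches y
get =: 3 : 'vec #~ (<y) -:&> words'
get 'adults'                        NB. a 1-row table: 0.4 0.5 0.6 0.7

NB. alternative (also an assumption): index-of on the boxed list
get2 =: 3 : 'vec {~ words i. <y'    NB. index error if y is not in words
```

With each in place of &> the mask would come back boxed, and # could not use
it directly.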


On Feb 21, 2018 9:09 AM, "Skip Cave" <[email protected]> wrote:

> Thanks to Raul and Mike for the suggestions.
>
> I read in the data:
>
>
> nb =: <'C:\numberbatch-en.txt'
>
> nbs =. fread nb
>
>
> Then I tried to clean it up:
>
>
> Mike's method ran out of memory:
>
> nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs
>
> |out of memory
>
> When I tried to run it on a smaller set:
>
> nbs4=: (i.&' '({.;0".}.)])every 100000{. nbs
>
> nbs4
>
> ...
>
> │0││
> ├─┼┤
> │0││
> ├─┼┤
> │3││
> ├─┼┤
> │5││
> ├─┼┤
> │ ││
> ├─┼┤
> │0││
> ├─┼┤
> │.││
> ├─┼┤
> │0││
> ├─┼┤
> │7││
> ├─┼┤
> │8││
> ├─┼┤
> │2││
> ├─┼┤
>
> So that wasn't working for me.
>
> I tried Raul's suggestion:
>
> words=. <@({.~ i.&' ');._2 nbs
>
> vec =. 0 1 }. _&".;._2 nbs
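
A small sketch of what the two expressions above do, run on a two-line toy
string instead of the real file. The sample text is taken from the format
example quoted further down, truncated to three numbers per line; everything
else is the code as posted, with =: used so it also works inside a script.

```j
NB. toy stand-in for the text returned by fread; LF ends every line
nbs =: 'bell 0.0264 -0.2927 -0.0254',LF,'bell_tower -0.1252 -0.1233 0.1351',LF

NB. ;._2 cuts at each occurrence of the last character (LF) and drops it
words =: <@({.~ i.&' ');._2 nbs     NB. each line up to its first space, boxed

NB. dyadic ". accepts - as a minus sign and substitutes _ for non-numbers,
NB. so each word becomes a leading _ that 0 1 }. removes as column 0
vec =: 0 1 }. _&".;._2 nbs

$ vec                               NB. 2 3
```

Because dyadic ". already understands the - sign, no separate string
replacement is needed with this approach.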
>
>
> $words
>
> 417195
>
>
> Looking good....
>
>
> ,.20{. 6000}. words
>
> ┌────────────┐
> │adultly     │
> ├────────────┤
> │adultness   │
> ├────────────┤
> │adultoid    │
> ├────────────┤
> │adultress   │
> ├────────────┤
> │adults      │
> ├────────────┤
> │adultship   │
> ├────────────┤
> │adulty      │
> ├────────────┤
> │adumbral    │
> ├────────────┤
> │adumbrant   │
> ├────────────┤
> │adumbrate   │
> ├────────────┤
> │adumbrated  │
> ├────────────┤
> │adumbrates  │
> ├────────────┤
> │adumbrating │
> ├────────────┤
> │adumbration │
> ├────────────┤
> │adumbrations│
> ├────────────┤
> │adumbrative │
> ├────────────┤
> │adunation   │
> ├────────────┤
> │adunc       │
> ├────────────┤
> │aduncate    │
> ├────────────┤
> │aduncity    │
> └────────────┘
>
> $vec
>
> 417195 300
>
> 3 {. }.vec
>
> _0.0264 0.0468 _0.0099 _0.0242 _0.0762 0.0562 0.0863 0.0115 _0.0471 0.0442
> _0.0875 0.0376 _0.0404 _0.0086 0.0161 _0.1689 0.1485 _0.0201 0.1021 _0.0635
> _0.0317 0.0142 0.0588 _0.1299 _0.0905 0.0389 _0.0452 0.1352 0.0731 0.0648
> 0.1309 0.0493 0.0785 0.015...
>
> _0.0096 0.0318 _0.0095 _0.042 _0.0831 0.1103 0.075 0.024 _0.0237 0.0398
> _0.1274 _0.0299 _0.0209 _0.0195 _0.0043 _0.1033 0.1378 _0.0499 0.0517
> _0.0958 _0.0651 0.0214 0.0096 _0.0855 _0.1049 0.036 _0.0562 0.043 0.0616
> 0.1124 0.152 0.0418 0.0628 _0.018...
>
> _0.0364 0.0254 _0.0448 _0.0327 _0.0712 0.1548 0.1004 0.0033 _0.039 0.0635
> _0.1179 _0.0703 _0.0359 0.0296 _0.0594 _0.0954 0.1904 _0.0301 0.0078
> _0.0607 _0.0344 0.034 _0.0059 _0.1453 _0.0429 _0.0061 _0.05 0.0377 0.0959
> 0.1313 0.1238 0.0302 0.0043 _0.038...
>
>
> So this looks good!
>
>
> Now I need a verb that will let me specify a word, and it will return the
> associated vector.
>
> Here's how it should work:
>
>
> tst =. get 'adults'
>
>
> tst
>
> 0.1144 0.0444 0.0574 0.0387 0.082 _0.0271 0.209 _0.006 _0.1896 0.1038
> _0.0257 0.0646 0.0488 _0.0065 0.0486 0.0422 0.0239 _0.1006 _0.0541 0.0511
> _0.0254 _0.0121 0.0216 0.0324 _0.1349 0.0237 0.0049 0.0061 0.0349 _0.0264
> 0.0086 0.0919 _0.0174 0.0645 ...
>
>
> To build the 'get' verb we need to try to find the location of the word
> 'adults' in the boxed words array:
>
> 'adults' = each words
>
> |length error
>
> | 'adults' =each words
>
>
> Nope, that didn't work... Do I need to box the word?
>
>
> (<'adults')=each words
>
> |length error
>
> | (<'adults') =each words
>
>
> Nope! How do I find a specific word in the boxed word array?
>
>
>
>
>
>
>
>
> Skip Cave
> Cave Consulting LLC
>
> On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <[email protected]>
> wrote:
>
> > I read in a text file of word vectors using fread. The format looks like
> > this:
> >
> > bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
> >
> > bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
> >
> > belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
> >
> > belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
> >
> > Everything is literal text.
> >
> > The basic layout for each line is:
> >
> > word(s) (could contain multiple words separated by underscores)
> > space
> > number (positive or negative) in text format
> > space
> > number (positive or negative) in text format
> > space
> > ......   repeat for 300 numbers (in text)
> >
> > the last number is followed by a line feed for the next line
> >
> > I need to:
> > 1. Convert all the high minuses (-) to J's low minus (_)
> > 2. Extract the word(s) up to the first space into a separate array
> > (words)
> > 3. Convert the text numbers into a 2D array of ? x 300 floating point
> > numbers
> >
> > I know how to do #1 (string replace), and #3 (".) once I get rid of the
> > words,
> > but I don't know how to strip out the initial word on each line and put
> > them in a separate array. Any help is appreciated.
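
The three steps listed above can be sketched on a single line of input. This
is only an illustration under stated assumptions: the names line, sp, word,
rest and nums are made up here, and rplc is assumed to be available from the
J standard library, as it is in a default session.

```j
NB. one sample line in the file's format (three numbers instead of 300)
line =: 'bell_tower -0.1252 -0.1233 0.1351'

sp   =: line i. ' '                 NB. position of the first space
word =: sp {. line                  NB. step 2: 'bell_tower'
rest =: (>: sp) }. line             NB. the numeric text after that space

NB. steps 1 and 3: swap - for _ in the numeric part only, then execute it
nums =: ". rest rplc '-';'_'        NB. _0.1252 _0.1233 0.1351
```

Applying rplc to the numeric part only leaves any hyphen inside the word
itself untouched.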
> >
> > Skip
> >
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm