Re: [Jprogramming] File Cleanup

Raul Miller Wed, 21 Feb 2018 09:40:12 -0800

Sure, that could work, but this definition only returns the word
index, not the row(s) from vec. So it should get a different name.
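
To make the difference concrete, here is a small sketch on toy data. The three-word words list, the 3 by 4 vec table and the name getidx are made up for illustration, and it uses 3 : rather than 13 : so it does not depend on tacit conversion:

```j
NB. toy stand-ins for the real data (hypothetical values)
words=: ;: 'adult adults adumbral'   NB. 3 boxed words
vec=: i. 3 4                         NB. 3 rows of 4 numbers

NB. index-only lookup (hypothetical name), same body as the boxopen definition
getidx=: 3 : '(,words) i. boxopen y'

getidx 'adults'             NB. 1 : the index of the word, not its row
vec {~ getidx 'adults'      NB. 4 5 6 7 : selecting from vec is a separate step
getidx 'adumbral';'adult'   NB. 2 0 : a boxed list gives one index per word
```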


That said, you could also do something like (what I had been
originally thinking of, though not the implementation I had posted):

   get=: 13 :'vec{~words i. ;:^:(0=L.)y'
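
As a rough illustration of what the ;:^:(0=L.) part does, here is a sketch on made-up data (toy words and vec; written with 3 : instead of 13 : so it does not rely on tacit conversion):

```j
NB. toy stand-ins for the real data (hypothetical values)
words=: ;: 'adult adults adumbral'
vec=: i. 3 4

NB. same body as the definition above
get=: 3 : 'vec {~ words i. ;:^:(0=L.) y'

$ get 'adults'           NB. 1 4 : ;: boxes the open string, giving a 1-row table
get 'adults adult'       NB. an open string is split into words: rows 1 and 0
get 'adumbral';'adult'   NB. already boxed (L. is 1), so ;: is skipped: rows 2 and 0
```

Two things to watch for: a word that is not in words makes i. return #words, so { then signals an index error; and ;: follows J's own word formation rules, so entries containing punctuation may not split on spaces alone. It is probably worth checking both against the real data.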

Thanks,

-- 
Raul


On Wed, Feb 21, 2018 at 12:35 PM, Don Guinn <[email protected]> wrote:
> How about defining get as
>
> get=:13 : '(,words)i.boxopen y'
>
>
> Then it can take a single word unboxed or a boxed list.
>
>
> On Wed, Feb 21, 2018 at 10:18 AM, Raul Miller <[email protected]> wrote:
>
>> That's an interesting point.
>>
>> That said, if you give get a large list of words to look up, it's the
>> sort of issue which might be buried in everything else that's going on
>> (the cost per word gets divided by the number of words being looked up
>> at once).
>>
>> Thanks,
>>
>> --
>> Raul
>>
>>
>> On Wed, Feb 21, 2018 at 12:08 PM, Don Guinn <[email protected]> wrote:
>> > Defining a verb get to retrieve the index of the desired word as tacit
>> > does make get pretty much unreadable; however, there is a possible
>> > performance gain as the hash table for i. gets built only once when get
>> > is defined. If you will be running get many times this could result in a
>> > significant performance gain.
>> >
>> > Of course, once read in words must not be modified without rebuilding
>> > get. But if it turns out that you don't need words for anything else than
>> > in get then you could erase words after get is defined so storage used by
>> > a big verb is offset by not having words around any more.
>> >
>> > On Wed, Feb 21, 2018 at 9:31 AM, R.E. Boss <[email protected]> wrote:
>> >
>> >>  vec {~ (<'adults') i.~ words
>> >> is perhaps what you are looking for
>> >>
>> >>
>> >> R.E. Boss
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Programming [mailto:[email protected]]
>> >> > On Behalf Of Skip Cave
>> >> > Sent: Wednesday 21 February 2018 17:09
>> >> > To: [email protected]
>> >> > Subject: Re: [Jprogramming] File Cleanup
>> >> >
>> >> > Thanks to Raul and Mike for the suggestions.
>> >> >
>> >> > I read in the data:
>> >> >
>> >> >
>> >> > nb =: <'C:\numberbatch-en.txt'
>> >> >
>> >> > nbs =. fread nb
>> >> >
>> >> >
>> >> > Then I tried to clean it up:
>> >> >
>> >> >
>> >> > Mike's method ran out of memory:
>> >> >
>> >> > nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs
>> >> >
>> >> > |out of memory
>> >> >
>> >> > When I tried to run it on a smaller set:
>> >> >
>> >> > nbs4=: (i.&' '({.;0".}.)])every 100000{. nbs
>> >> >
>> >> > nbs4
>> >> >
>> >> > ...
>> >> >
>> >> > │0││
>> >> > ├─┼┤
>> >> > │0││
>> >> > ├─┼┤
>> >> > │3││
>> >> > ├─┼┤
>> >> > │5││
>> >> > ├─┼┤
>> >> > │ ││
>> >> > ├─┼┤
>> >> > │0││
>> >> > ├─┼┤
>> >> > │.││
>> >> > ├─┼┤
>> >> > │0││
>> >> > ├─┼┤
>> >> > │7││
>> >> > ├─┼┤
>> >> > │8││
>> >> > ├─┼┤
>> >> > │2││
>> >> > ├─┼┤
>> >> >
>> >> > So that wasn't working for me.
>> >> >
>> >> > I tried Raul's suggestion:
>> >> >
>> >> > words=. <@({.~ i.&' ');._2 nbs
>> >> >
>> >> > vec =. 0 1 }. _&".;._2 nbs
>> >> >
>> >> >
>> >> > $words
>> >> >
>> >> > 417195
>> >> >
>> >> >
>> >> > Looking good....
>> >> >
>> >> >
>> >> > ,.20{. 6000}. words
>> >> >
>> >> > ┌────────────┐
>> >> > │adultly     │
>> >> > ├────────────┤
>> >> > │adultness   │
>> >> > ├────────────┤
>> >> > │adultoid    │
>> >> > ├────────────┤
>> >> > │adultress   │
>> >> > ├────────────┤
>> >> > │adults      │
>> >> > ├────────────┤
>> >> > │adultship   │
>> >> > ├────────────┤
>> >> > │adulty      │
>> >> > ├────────────┤
>> >> > │adumbral    │
>> >> > ├────────────┤
>> >> > │adumbrant   │
>> >> > ├────────────┤
>> >> > │adumbrate   │
>> >> > ├────────────┤
>> >> > │adumbrated  │
>> >> > ├────────────┤
>> >> > │adumbrates  │
>> >> > ├────────────┤
>> >> > │adumbrating │
>> >> > ├────────────┤
>> >> > │adumbration │
>> >> > ├────────────┤
>> >> > │adumbrations│
>> >> > ├────────────┤
>> >> > │adumbrative │
>> >> > ├────────────┤
>> >> > │adunation   │
>> >> > ├────────────┤
>> >> > │adunc       │
>> >> > ├────────────┤
>> >> > │aduncate    │
>> >> > ├────────────┤
>> >> > │aduncity    │
>> >> > └────────────┘
>> >> >
>> >> > $vec
>> >> >
>> >> > 417195 300
>> >> >
>> >> > 3 {. }.vec
>> >> >
>> >> > _0.0264 0.0468 _0.0099 _0.0242 _0.0762 0.0562 0.0863 0.0115 _0.0471 0.0442
>> >> > _0.0875 0.0376 _0.0404 _0.0086 0.0161 _0.1689 0.1485 _0.0201 0.1021 _0.0635
>> >> > _0.0317 0.0142 0.0588 _0.1299 _0.0905 0.0389 _0.0452 0.1352 0.0731 0.0648
>> >> > 0.1309 0.0493 0.0785 0.015...
>> >> >
>> >> > _0.0096 0.0318 _0.0095 _0.042 _0.0831 0.1103 0.075 0.024 _0.0237 0.0398
>> >> > _0.1274 _0.0299 _0.0209 _0.0195 _0.0043 _0.1033 0.1378 _0.0499 0.0517
>> >> > _0.0958 _0.0651 0.0214 0.0096 _0.0855 _0.1049 0.036 _0.0562 0.043 0.0616
>> >> > 0.1124 0.152 0.0418 0.0628 _0.018...
>> >> >
>> >> > _0.0364 0.0254 _0.0448 _0.0327 _0.0712 0.1548 0.1004 0.0033 _0.039 0.0635
>> >> > _0.1179 _0.0703 _0.0359 0.0296 _0.0594 _0.0954 0.1904 _0.0301 0.0078
>> >> > _0.0607 _0.0344 0.034 _0.0059 _0.1453 _0.0429 _0.0061 _0.05 0.0377 0.0959
>> >> > 0.1313 0.1238 0.0302 0.0043 _0.038...
>> >> >
>> >> >
>> >> > So this looks good!
>> >> >
>> >> >
>> >> > Now I need a verb that will let me specify a word, and it will return
>> >> > the associated vector.
>> >> >
>> >> > Here's how it should work:
>> >> >
>> >> >
>> >> > tst =. get 'adults'
>> >> >
>> >> >
>> >> > tst
>> >> >
>> >> > 0.1144 0.0444 0.0574 0.0387 0.082 _0.0271 0.209 _0.006 _0.1896 0.1038
>> >> > _0.0257 0.0646 0.0488 _0.0065 0.0486 0.0422 0.0239 _0.1006 _0.0541 0.0511
>> >> > _0.0254 _0.0121 0.0216 0.0324 _0.1349 0.0237 0.0049 0.0061 0.0349 _0.0264
>> >> > 0.0086 0.0919 _0.0174 0.0645 ...
>> >> >
>> >> >
>> >> > To build the 'get' verb we need to try to find the location of the
>> >> > word 'adults' in the boxed words array:
>> >> >
>> >> > 'adults' = each words
>> >> >
>> >> > |length error
>> >> >
>> >> > | 'adults' =each words
>> >> >
>> >> >
>> >> > Nope, that didn't work... Do I need to box the word?
>> >> >
>> >> >
>> >> > (<'adults')=each words
>> >> >
>> >> > |length error
>> >> >
>> >> > | (<'adults') =each words
>> >> >
>> >> >
>> >> > Nope! How do I find a specific word in the boxed word array?
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Skip Cave
>> >> > Cave Consulting LLC
>> >> >
>> >> > On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <[email protected]>
>> >> > wrote:
>> >> >
>> >> > > I read in a text file of word vectors using fread. The format looks
>> >> > > like this:
>> >> > >
>> >> > > bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
>> >> > >
>> >> > > bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
>> >> > >
>> >> > > belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
>> >> > >
>> >> > > belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
>> >> > >
>> >> > > Everything is literal text.
>> >> > >
>> >> > > The basic layout for each line is:
>> >> > >
>> >> > > word(s) (could contain multiple words separated by underscores) space
>> >> > > number (positive or negative) in text format space
>> >> > > number (positive or negative) in text format space
>> >> > > ......   repeat for 300 numbers (in text)
>> >> > >
>> >> > > the last number is followed by a line feed for the next line
>> >> > >
>> >> > > I need to:
>> >> > > 1. Convert all the high minuses (-) to J's low minus (_)
>> >> > > 2. Extract the word(s) up to the first space into a separate array
>> >> > >    (words)
>> >> > > 3. Convert the text numbers into a 2D array of ? x 300 floating point
>> >> > >    numbers
>> >> > >
>> >> > > I know how to do #1 (string replace), and #3 (".) once I get rid of
>> >> > > the words, but I don't know how to strip out the initial word on each
>> >> > > line and put them in a separate array. Any help is appreciated.
>> >> > >
>> >> > > Skip
>> >> > >
>> >> > ----------------------------------------------------------------------
>> >> > For information about J forums see http://www.jsoftware.com/forums.htm