Re: [Jprogramming] File Cleanup

Raul Miller Wed, 21 Feb 2018 08:49:46 -0800

Yes, you need to box the name when comparing it to the value of
'words', because the names in 'words' are all boxed.
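
(A small illustration, using a made-up three-word list in place of the
real 'words'. In the unboxed attempt, 'each' pairs the six characters
of 'adults' with the boxes of 'words' one for one, and the counts do
not agree. In the boxed attempt, 'each' opens both sides, so '=' ends
up comparing 'adults' character by character against a word of a
different length. Comparing box to box, without 'each', avoids both
problems:

   words=: 'adultness';'adults';'adultship'
   (<'adults') = words
0 1 0
   words i. <'adults'
1

Here '=' tests the one box against each box in the list, and 'i.'
reports the position of the first match.)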


I'd do it like this:

   get=:3 :'vec{~words i.<y'
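
For example, with small made-up data (three words and four numbers per
row, standing in for the real 417195 by 300 table):

   words=: 'adultness';'adults';'adultship'
   vec=: 3 4$i.12
   get=: 3 :'vec{~words i.<y'
   get 'adults'
4 5 6 7

Note that for a word which is not in 'words', i. returns #words, so
the { is out of range and get signals an index error instead of
returning a vector.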

You could also define this tacitly:

   get=:13 :'vec{~words i.<y'

But that has a couple of drawbacks:

(*) The tacit definition of get is much larger than the explicit
version (since it contains the values of 'words' and 'vec'), which
makes it hard to read.

(*) If you import new data, you must redefine get (since it contains
the values of 'words' and 'vec').

(And, when you combine those two issues, you've also got a bit more
memory pressure: as long as you only have one definition for 'words'
and 'vec', the definition of 'get' can share the same internal
representation of those values as the other two variables. But when
you've got conflicting definitions, J has to hold onto both the old
and new versions. Memory is cheap, but if you're not benefiting from
your use of it, what's the point?)
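
A rough sketch of the second drawback, again with made-up data. The
names getE and getT are only for illustration, and getT is a
hand-written tacit fork rather than the output of 13 : (nouns in a
tacit definition are captured by value when the definition is made,
which is the same effect as described above):

   words=: 'a';'b'
   vec=: 2 2$1 2 3 4
   getE=: 3 :'vec{~words i.<y'   NB. explicit: looks up vec and words when run
   getT=: vec {~ words i. <      NB. tacit: holds the values vec and words had here
   vec=: 2 2$10 20 30 40         NB. "import" new data
   getE 'b'
30 40
   getT 'b'
3 4

So once new data arrives, the explicit version follows it, while the
tacit one keeps answering from the old table until it is redefined.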

Thanks,

-- 
Raul


On Wed, Feb 21, 2018 at 11:08 AM, Skip Cave <[email protected]> wrote:
> Thanks to Raul and Mike for the suggestions.
>
> I read in the data:
>
>
> nb =: <'C:\numberbatch-en.txt'
>
> nbs =. fread nb
>
>
> Then I tried to clean it up:
>
>
> Mike's method ran out of memory:
>
> nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs
>
> |out of memory
>
> When I tried to run it on a smaller set:
>
> nbs4=: (i.&' '({.;0".}.)])every 100000{. nbs
>
> nbs4
>
> ...
>
> │0││
> ├─┼┤
> │0││
> ├─┼┤
> │3││
> ├─┼┤
> │5││
> ├─┼┤
> │ ││
> ├─┼┤
> │0││
> ├─┼┤
> │.││
> ├─┼┤
> │0││
> ├─┼┤
> │7││
> ├─┼┤
> │8││
> ├─┼┤
> │2││
> ├─┼┤
>
> So that wasn't working for me.
>
> I tried Raul's suggestion:
>
> words=. <@({.~ i.&' ');._2 nbs
>
> vec =. 0 1 }. _&".;._2 nbs
>
>
> $words
>
> 417195
>
>
> Looking good....
>
>
> ,.20{. 6000}. words
>
> ┌────────────┐
> │adultly     │
> ├────────────┤
> │adultness   │
> ├────────────┤
> │adultoid    │
> ├────────────┤
> │adultress   │
> ├────────────┤
> │adults      │
> ├────────────┤
> │adultship   │
> ├────────────┤
> │adulty      │
> ├────────────┤
> │adumbral    │
> ├────────────┤
> │adumbrant   │
> ├────────────┤
> │adumbrate   │
> ├────────────┤
> │adumbrated  │
> ├────────────┤
> │adumbrates  │
> ├────────────┤
> │adumbrating │
> ├────────────┤
> │adumbration │
> ├────────────┤
> │adumbrations│
> ├────────────┤
> │adumbrative │
> ├────────────┤
> │adunation   │
> ├────────────┤
> │adunc       │
> ├────────────┤
> │aduncate    │
> ├────────────┤
> │aduncity    │
> └────────────┘
>
> $vec
>
> 417195 300
>
> 3 {. }.vec
>
> _0.0264 0.0468 _0.0099 _0.0242 _0.0762 0.0562 0.0863 0.0115 _0.0471 0.0442
> _0.0875 0.0376 _0.0404 _0.0086 0.0161 _0.1689 0.1485 _0.0201 0.1021 _0.0635
> _0.0317 0.0142 0.0588 _0.1299 _0.0905 0.0389 _0.0452 0.1352 0.0731 0.0648
> 0.1309 0.0493 0.0785 0.015...
>
> _0.0096 0.0318 _0.0095 _0.042 _0.0831 0.1103 0.075 0.024 _0.0237 0.0398
> _0.1274 _0.0299 _0.0209 _0.0195 _0.0043 _0.1033 0.1378 _0.0499 0.0517
> _0.0958 _0.0651 0.0214 0.0096 _0.0855 _0.1049 0.036 _0.0562 0.043 0.0616
> 0.1124 0.152 0.0418 0.0628 _0.018...
>
> _0.0364 0.0254 _0.0448 _0.0327 _0.0712 0.1548 0.1004 0.0033 _0.039 0.0635
> _0.1179 _0.0703 _0.0359 0.0296 _0.0594 _0.0954 0.1904 _0.0301 0.0078
> _0.0607 _0.0344 0.034 _0.0059 _0.1453 _0.0429 _0.0061 _0.05 0.0377 0.0959
> 0.1313 0.1238 0.0302 0.0043 _0.038...
>
>
> So this looks good!
>
>
> Now I need a verb that will let me specify a word, and it will return the
> associated vector.
>
> Here's how it should work:
>
>
> tst =. get 'adults'
>
>
> tst
>
> 0.1144 0.0444 0.0574 0.0387 0.082 _0.0271 0.209 _0.006 _0.1896 0.1038
> _0.0257 0.0646 0.0488 _0.0065 0.0486 0.0422 0.0239 _0.1006 _0.0541 0.0511
> _0.0254 _0.0121 0.0216 0.0324 _0.1349 0.0237 0.0049 0.0061 0.0349 _0.0264
> 0.0086 0.0919 _0.0174 0.0645 ...
>
>
> To build the 'get' verb we need to try to find the location of the word
> 'adults' in the boxed words array:
>
> 'adults' = each words
>
> |length error
>
> | 'adults' =each words
>
>
> Nope, that didn't work... Do I need to box the word?
>
>
> (<'adults')=each words
>
> |length error
>
> | (<'adults') =each words
>
>
> Nope! How do I find a specific word in the boxed word array?
>
>
>
>
>
>
>
>
> Skip Cave
> Cave Consulting LLC
>
> On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <[email protected]> wrote:
>
>> I read in a text file of word vectors using fread. The format looks like
>> this:
>>
>> bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
>>
>> bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
>>
>> belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
>>
>> belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
>>
>> Everything is literal text.
>>
>> The basic layout for each line is:
>>
>> word(s) (could contain multiple words separated by underscores)
>> space
>> number (positive or negative) in text format
>> space
>> number (positive or negative) in text format
>> space
>> ......   repeat for 300 numbers (in text)
>>
>> the last number is followed by a line feed for the next line
>>
>> I need to:
>> 1. Convert all the high minuses (-) to J's low minus (_)
>> 2. Extract the word(s) up to the first space into a separate array (words)
>> 3. Convert the text numbers into a 2D array of ? x 300 floating point
>> numbers
>>
>> I know how to do #1 (string replace), and #3 (".) once I get rid of the
>> words,
>> but I don't know how to strip out the initial word on each line and put
>> them in a separate array. Any help is appreciated.
>>
>> Skip
>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
