Re: [Jprogramming] File Cleanup

Raul Miller Wed, 21 Feb 2018 11:32:59 -0800

I suppose the shape of the variable 'words' depends on how you built
the value for 'words'?


If you used
   words=: <@({.~ i.&' ');._2 text

the noun will be rank 1, and there's no need to ravel it.

If you built it at rank 2 rather than rank 1, I'm not sure why you
would want that. If the process you used gives you a single-column
rank 2 noun, you should probably ravel it before assigning it to the
variable.
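
A minimal sketch of the rank difference, assuming a two-line toy value for text (the names text and col and the sample data are made up for illustration, not taken from the real file):

```j
NB. Toy stand-in for the file contents (assumed data)
text=: 'bell 0.1 0.2',LF,'belt 0.3 0.4',LF

words=: <@({.~ i.&' ');._2 text   NB. box everything before the first space on each line
$words                            NB. 2    (a list, rank 1)

col=: ,.words                     NB. the same boxes as a one-column table (rank 2)
$col                              NB. 2 1
words -: ,col                     NB. 1    (ravel turns the column back into the list)
```

So a noun built directly with the cut expression is already a list; only a table such as col would need the ravel.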

Thanks,

-- 
Raul


On Wed, Feb 21, 2018 at 1:46 PM, Don Guinn <dongu...@gmail.com> wrote:
> So words should be a list instead of a one column table. So we would have
>    words&i.
> instead of
>    (,words)&i.
>
> Correct? Doesn't the raveling prevent sharing of the contents of words in
> the new verb?
>
> And perhaps get should be
> get=:13 : 'words&i.boxopen y'
> instead of
> get=:13 : 'words i.boxopen y'
>
> Does the building of the hash table require that i. be bound to the left
> argument with & or will it still build the hash table only once in the
> tacit definition where i. is in the dyadic form where the & is not there?
>
> It would probably be safer to put the & in.
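
A small sketch of the two forms on toy data (get1, get2 and the word lists are hypothetical names and values). It only shows when the value of words is picked up; it says nothing about how often a hash table is built:

```j
words=: ;: 'adult adults adulty'   NB. toy word list (assumed data)

get1=: words&i.@boxopen            NB. tacit: the value of words is bound in at definition time
get2=: 3 : 'words i. boxopen y'    NB. explicit: the name words is looked up on every call

get1 'adults'                      NB. 1
get2 'adults'                      NB. 1

words=: ;: 'zebra yak'             NB. reassign the name
get1 'adults'                      NB. 1  (still searches the old list)
get2 'adults'                      NB. 2  (not found in the new list, so the result is #words)
```

This is why a bound verb has to be rebuilt whenever words changes.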
>
> On Wed, Feb 21, 2018 at 11:00 AM, Henry Rich <henryhr...@gmail.com> wrote:
>
>> I don't think this prescription is accurate.  When m&i. is executed to
>> create a fast search verb, the value of m is put into the new verb.  If m
>> is a name, the value of the name is NOT copied, but instead referred to.
>> If the name m is subsequently reassigned, the old value is retained,
>> referred to by the m&i. verb, and the new value is assigned to the name m.
>>
>> So, deleting words will not actually free any memory.  On the other hand,
>> executing words&i. didn't consume any memory either.
>>
>> (this is all from memory & I haven't checked it with 7!:2)
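
One way such a check might be set up, as a sketch only: the toy list below is assumed data, no results are shown because the byte counts depend on the J version and platform, and the foreigns used are 7!:5 (space occupied by named items) and 7!:2 (space needed to execute a sentence).

```j
words=: <@":"0 i. 100000        NB. toy list of 100000 boxed strings (assumed data)

7!:5 <'words'                   NB. bytes occupied by the noun words
7!:2 'words i. <''500'''        NB. bytes needed for one plain dyadic lookup
7!:2 'words&i. <''500'''        NB. bytes needed to bind words to i. and then look up
```

Comparing the last two figures with the first gives a rough idea of whether binding copies the list.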
>>
>> Henry Rich
>>
>>
>> On 2/21/2018 12:08 PM, Don Guinn wrote:
>>
>>> Defining a verb get to retrieve the index of the desired word as tacit
>>> does make get pretty much unreadable; however, there is a possible
>>> performance gain as the hash table for i. gets built only once when get
>>> is defined. If you will be running get many times this could result in a
>>> significant performance gain.
>>>
>>> Of course, once read in, words must not be modified without rebuilding
>>> get. But if it turns out that you don't need words for anything other
>>> than get, then you could erase words after get is defined, so the storage
>>> used by a big verb is offset by not having words around any more.
>>>
>>> On Wed, Feb 21, 2018 at 9:31 AM, R.E. Boss <r.e.b...@outlook.com> wrote:
>>>
>>>>    vec {~ (<'adults') i.~ words
>>>> is perhaps what you are looking for
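
A toy illustration of that expression, assuming a three-word list and a small integer table in place of the real 417195 by 300 data (the explicit wrapper get is a hypothetical convenience, not something defined in the thread):

```j
words=: ;: 'adult adults adulty'         NB. toy word list (assumed data)
vec=: i. 3 4                             NB. one row of numbers per word (assumed data)

vec {~ (<'adults') i.~ words             NB. 4 5 6 7   (row 1 of vec)

get=: 3 : 'vec {~ words i. boxopen y'    NB. hypothetical wrapper around the same lookup
get 'adults'                             NB. 4 5 6 7
```

A word that is not in the list gives the index #words, which is out of range for vec, so the lookup fails with an index error instead of returning a wrong row.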
>>>>
>>>>
>>>> R.E. Boss
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Programming [mailto:programming-boun...@forums.jsoftware.com]
>>>>> On Behalf Of Skip Cave
>>>>> Sent: Wednesday, 21 February 2018 17:09
>>>>> To: programm...@jsoftware.com
>>>>> Subject: Re: [Jprogramming] File Cleanup
>>>>>
>>>>> Thanks to Raul and Mike for the suggestions.
>>>>>
>>>>> I read in the data:
>>>>>
>>>>>
>>>>> nb =: <'C:\numberbatch-en.txt'
>>>>>
>>>>> nbs =. fread nb
>>>>>
>>>>>
>>>>> Then I tried to clean it up:
>>>>>
>>>>>
>>>>> Mike's method ran out of memory:
>>>>>
>>>>> nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs
>>>>>
>>>>> |out of memory
>>>>>
>>>>> When I tried to run it on a smaller set:
>>>>>
>>>>> nbs4=: (i.&' '({.;0".}.)])every 100000{. nbs
>>>>>
>>>>> nbs4
>>>>>
>>>>> ...
>>>>>
>>>>> │0││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │0││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │3││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │5││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │ ││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │0││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │.││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │0││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │7││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │8││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> │2││
>>>>>
>>>>> ├─┼┤
>>>>>
>>>>> So that wasn't working for me.
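
A sketch of what the output above suggests, on toy data: fread returns one long character string, so every applies the verb to each character rather than to each line. Cutting the string into boxed lines first gives one row per line. The names text, split and lines are made up for illustration.

```j
text=: 'bell 0.5 -0.25',LF,'belt 1 2',LF   NB. toy stand-in for the fread result (assumed data)
split=: i.&' ' ({. ; 0 ". }.) ]            NB. the verb used in the message above

lines=: <;._2 text                         NB. one box per line, LF removed
split every lines                          NB. 2 by 2 table: word in column 0, numbers in column 1
$ split every lines                        NB. 2 2

$ split every text                         NB. one row per character, as in the display above
```

On the full file the per-character version builds a pair of boxes for every character, which would also be consistent with the out of memory report.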
>>>>>
>>>>> I tried Raul's suggestion:
>>>>>
>>>>> words=. <@({.~ i.&' ');._2 nbs
>>>>>
>>>>> vec =. 0 1 }. _&".;._2 nbs
>>>>>
>>>>>
>>>>> $words
>>>>>
>>>>> 417195
>>>>>
>>>>>
>>>>> Looking good....
>>>>>
>>>>>
>>>>> ,.20{. 6000}. words
>>>>>
>>>>> ┌────────────┐
>>>>> │adultly     │
>>>>> ├────────────┤
>>>>> │adultness   │
>>>>> ├────────────┤
>>>>> │adultoid    │
>>>>> ├────────────┤
>>>>> │adultress   │
>>>>> ├────────────┤
>>>>> │adults      │
>>>>> ├────────────┤
>>>>> │adultship   │
>>>>> ├────────────┤
>>>>> │adulty      │
>>>>> ├────────────┤
>>>>> │adumbral    │
>>>>> ├────────────┤
>>>>> │adumbrant   │
>>>>> ├────────────┤
>>>>> │adumbrate   │
>>>>> ├────────────┤
>>>>> │adumbrated  │
>>>>> ├────────────┤
>>>>> │adumbrates  │
>>>>> ├────────────┤
>>>>> │adumbrating │
>>>>> ├────────────┤
>>>>> │adumbration │
>>>>> ├────────────┤
>>>>> │adumbrations│
>>>>> ├────────────┤
>>>>> │adumbrative │
>>>>> ├────────────┤
>>>>> │adunation   │
>>>>> ├────────────┤
>>>>> │adunc       │
>>>>> ├────────────┤
>>>>> │aduncate    │
>>>>> ├────────────┤
>>>>> │aduncity    │
>>>>> └────────────┘
>>>>>
>>>>> $vec
>>>>>
>>>>> 417195 300
>>>>>
>>>>> 3 {. }.vec
>>>>>
>>>>> _0.0264 0.0468 _0.0099 _0.0242 _0.0762 0.0562 0.0863 0.0115 _0.0471 0.0442
>>>>> _0.0875 0.0376 _0.0404 _0.0086 0.0161 _0.1689 0.1485 _0.0201 0.1021 _0.0635
>>>>> _0.0317 0.0142 0.0588 _0.1299 _0.0905 0.0389 _0.0452 0.1352 0.0731 0.0648
>>>>> 0.1309 0.0493 0.0785 0.015...
>>>>>
>>>>> _0.0096 0.0318 _0.0095 _0.042 _0.0831 0.1103 0.075 0.024 _0.0237 0.0398
>>>>> _0.1274 _0.0299 _0.0209 _0.0195 _0.0043 _0.1033 0.1378 _0.0499 0.0517
>>>>> _0.0958 _0.0651 0.0214 0.0096 _0.0855 _0.1049 0.036 _0.0562 0.043 0.0616
>>>>> 0.1124 0.152 0.0418 0.0628 _0.018...
>>>>>
>>>>> _0.0364 0.0254 _0.0448 _0.0327 _0.0712 0.1548 0.1004 0.0033 _0.039 0.0635
>>>>> _0.1179 _0.0703 _0.0359 0.0296 _0.0594 _0.0954 0.1904 _0.0301 0.0078
>>>>> _0.0607 _0.0344 0.034 _0.0059 _0.1453 _0.0429 _0.0061 _0.05 0.0377 0.0959
>>>>> 0.1313 0.1238 0.0302 0.0043 _0.038...
>>>>>
>>>>>
>>>>> So this looks good!
>>>>>
>>>>>
>>>>> Now I need a verb that will let me specify a word, and it will return
>>>>> the associated vector.
>>>>>
>>>>> Here's how it should work:
>>>>>
>>>>>
>>>>> tst =. get 'adults'
>>>>>
>>>>>
>>>>> tst
>>>>>
>>>>> 0.1144 0.0444 0.0574 0.0387 0.082 _0.0271 0.209 _0.006 _0.1896 0.1038
>>>>> _0.0257 0.0646 0.0488 _0.0065 0.0486 0.0422 0.0239 _0.1006 _0.0541
>>>>> 0.0511
>>>>> _0.0254 _0.0121 0.0216 0.0324 _0.1349 0.0237 0.0049 0.0061 0.0349
>>>>> _0.0264
>>>>> 0.0086 0.0919 _0.0174 0.0645 ...
>>>>>
>>>>>
>>>>> To build the 'get' verb we need to try to find the location of the word
>>>>> 'adults' in the boxed words array:
>>>>>
>>>>> 'adults' = each words
>>>>>
>>>>> |length error
>>>>>
>>>>> | 'adults' =each words
>>>>>
>>>>>
>>>>> Nope, that didn't work... Do I need to box the word?
>>>>>
>>>>>
>>>>> (<'adults')=each words
>>>>>
>>>>> |length error
>>>>>
>>>>> | (<'adults') =each words
>>>>>
>>>>>
>>>>> Nope! How do I find a specific word in the boxed word array?
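
A sketch of why both attempts fail and what works instead, on a toy list (assumed data). each opens the boxes and compares their contents character by character, so words of different lengths give a length error; in the first attempt the six characters of 'adults' are also paired one for one with the items of words. Comparing the boxes themselves avoids both problems:

```j
words=: ;: 'adult adults adulty'   NB. toy word list (assumed data)

(<'adults') = words                NB. 0 1 0   (whole boxes compared, no each)
I. words = <'adults'               NB. 1       (positions of the matches)
```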
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Skip Cave
>>>>> Cave Consulting LLC
>>>>>
>>>>> On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <s...@caveconsulting.com>
>>>>> wrote:
>>>>>
>>>>>> I read in a text file of word vectors using fread. The format looks
>>>>>> like this:
>>>>>>
>>>>>> bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
>>>>>>
>>>>>> bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
>>>>>>
>>>>>> belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
>>>>>>
>>>>>> belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
>>>>>>
>>>>>> Everything is literal text.
>>>>>>
>>>>>> The basic layout for each line is:
>>>>>>
>>>>>> word(s) (could contain multiple words separated by underscores) space
>>>>>> number (positive or negative) in text format space number (positive or
>>>>>> negative) in text format space
>>>>>> ......   repeat for 300 numbers (in text)
>>>>>>
>>>>>> the last number is followed by a line feed for the next line
>>>>>>
>>>>>> I need to:
>>>>>> 1. Convert all the high minuses (-) to J's low minus (_)
>>>>>> 2. Extract the word(s) up to the first space into a separate array
>>>>>>    (words)
>>>>>> 3. Convert the text numbers into a 2D array of ? x 300
>>>>>>    floating point numbers
>>>>>>
>>>>>> I know how to do #1 (string replace), and #3 (".) once I get rid of
>>>>>> the words, but I don't know how to strip out the initial word on each
>>>>>> line and put them in a separate array. Any help is appreciated.
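
A sketch of steps 2 and 3 on two toy lines in the layout described above, using the two cut expressions that appear elsewhere in this thread. The sample values are shortened to two numbers per line. It assumes that dyadic ". accepts a leading - as a negative sign, as the results reported in this thread suggest, in which case step 1 would not be needed as a separate pass:

```j
NB. Two toy lines, each ending in LF (assumed data)
text=: 'bell 0.0264 -0.2927',LF,'bell_tower -0.1252 -0.1233',LF

words=: <@({.~ i.&' ');._2 text   NB. step 2: box the text before the first space on each line
vec=: 0 1 }. _&".;._2 text        NB. step 3: convert each line, then drop the column for the word

$words                            NB. 2
$vec                              NB. 2 2
vec                               NB. rows: 0.0264 _0.2927  and  _0.1252 _0.1233
```

The word at the start of each line is not a number, so it converts to the default _ and lands in column 0, which 0 1 }. then removes.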
>>>>>>
>>>>>> Skip
>>>>>>
>>>>>> ----------------------------------------------------------------------
>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>>>>