Re: [Jprogramming] File Cleanup

Ric Sherlock Wed, 04 Apr 2018 19:46:38 -0700

Using the English-only file
<https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-en-17.06.txt.gz>
rather than the file you link to above (which is all languages and seems to
be 3 times as big as the examples you show):


   load 'tar'
   timespacex ' labels=: }. <@('' ''&taketo);._2 gzip ''numberbatch-en-17.06.txt.gz'' '
5.28071 1.61114e9
   timespacex ' numbers=: }. (_ ". '' ''&takeafter);._2 gzip ''numberbatch-en-17.06.txt.gz'' '
28.6025 3.86091e9
   $labels
417194
   $numbers
417194 300

I did run the same code successfully on the all-languages file as well - it
took about 137 seconds to uncompress, parse and convert to numeric for
1917247 words.
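
For anyone who wants to try the two cuts without downloading the file, here is
a minimal sketch on a made-up sample. The noun sample and its contents are
assumptions for illustration only; it mimics the layout of the real file (a
header line, then one word and its numbers per line, each line ending in LF).

```j
NB. Hypothetical sample in the numberbatch layout: header line, then word + numbers
sample=: '2 3',LF,'bell 0.0264 -0.2927 -0.0254',LF,'belt 0.1332 0.0142 -0.1208',LF

NB. ;._2 cuts at each occurrence of the last character (LF) and applies the verb per line
NB. taketo keeps the text before the first space; it is boxed because words differ in length
labels=: }. <@(' '&taketo);._2 sample

NB. takeafter keeps the text after the first space; dyadic ". converts it to numbers
NB. (it accepts a leading - as the negative sign, and uses _ for fields it cannot convert)
numbers=: }. (_ ". ' '&takeafter);._2 sample

NB. }. drops the header row from both results
$labels
$numbers
```

The shapes should come out as one label and one row of numbers per data line,
in the same way as the 417194 and 417194 300 reported above for the full file.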

On Thu, Apr 5, 2018 at 12:26 PM, Skip Cave <[email protected]> wrote:

> Here's my latest cut at the numberbatch file parsing:
>
>
> nb =: <'C:\numberbatch-en.txt'
>
> load 'files'
>
> $nbs =. fread nb
>
>
> NB. nbs is the raw numberbatch character array
>
> NB. parse the raw text & get the words
>
>
> $ words=. <@({.~ i.&' ');._2 nbs
>
> 417195
>
>
> NB. Get the vectors
>
>
> $ vecs =. 0 1 }. _&".;._2 nbs
>
> 417195 300
>
> ts 'vecs =. 0 1 }. _&".;._2 nbs'
>
> 61.1788 2.78677e9
>
> <<<<>>>>
>
> Wow! the vector extraction takes over a minute on my machine.
> Is there a more efficient way to extract the vectors?
> The good news is that I only have to do this once. After that,
> I can use the vecs noun for looking up a word's vector:
>
> NB. Raul's 'get' verb:
>
> get=: 13 :'vecs{~words i. ;:^:(0=L.)y'
>
> get 'bake'
>
> 0.0666 0.0191 0.1889 0.1288 _0.0716 _0.1245 0.1423 _0.1784 0.0352 _0.0095
> 0.0624 _0.0423 _0.0237 _0.1382 _0.1238 0.085 _0.0188 _0.1 0.0225 _0.0264
> _0.1443 _0.1018 0.0856 _0.0824 0.082 _0.1749 _0.0436 0.1718 0.0024 0.0018
> _0.0029 0.1591 0.0497 0.0199 _0.031...
>
> ConceptNet Numberbatch 17.06
> <https://conceptnet.s3.amazonaws.com/downloads/2017/numberbatch/numberbatch-17.06.txt.gz>
> is the current recommended download link for the full numberbatch file.
>
> Skip
>
>
>
> Skip Cave
> Cave Consulting LLC
>
> On Wed, Feb 21, 2018 at 1:28 PM, Raul Miller <[email protected]>
> wrote:
>
> > I suppose the shape of the variable 'words' depends on how you built
> > the value for 'words'?
> >
> > If you used
> >    words=: <@({.~ i.&' ');._2 text
> >
> > the noun will be rank 1, and there's no need to ravel it.
> >
> > If you built it at rank 2, rather than rank 1... well... I'm not sure
> > why you would want to do that - it seems like if you used a process
> > that gives you a single column rank 2 noun you probably should ravel
> > it before assigning it to the variable?
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> > On Wed, Feb 21, 2018 at 1:46 PM, Don Guinn <[email protected]> wrote:
> > > So words should be a list instead of a one column table. So we would have
> > >    words&i.
> > > instead of
> > >    (,words)&i.
> > >
> > > Correct? Doesn't the raveling prevent sharing of the contents of words in
> > > the new verb?
> > >
> > > And perhaps get should be
> > > get=:13 : 'words&i.boxopen y'
> > > instead of
> > > get=:13 : 'words i.boxopen y'
> > >
> > > Does the building of the hash table require that i. be bound to the left
> > > argument with & or will it still build the hash table only once in the
> > > tacit definition where i. is in the dyadic form where the & is not there?
> > >
> > > It would probably be safer to put the & in.
> > >
> > > On Wed, Feb 21, 2018 at 11:00 AM, Henry Rich <[email protected]> wrote:
> > >
> > >> I don't think this prescription is accurate.  When m&i. is executed to
> > >> create a fast search verb, the value of m is put into the new verb.  If m
> > >> is a name, the value of the name is NOT copied, but instead referred to.
> > >> If the name m is subsequently reassigned, the old value is retained,
> > >> referred to by the m&i. verb, and the new value is assigned to the name m.
> > >>
> > >> So, deleting words will not actually free any memory.  On the other hand,
> > >> executing words&i. didn't consume any memory either.
> > >>
> > >> (this is all from memory & I haven't checked it with 7!:2)
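> > >>
> > >> A minimal sketch of one way this could be checked. The tiny words list
> > >> and the names find and find2 are hypothetical, and no measured results
> > >> are claimed here:
> > >>
> > >> ```j
> > >> words=: ;: 'bell belt belfast'   NB. hypothetical small word list
> > >> find=: words&i.                  NB. the current value of words goes into find
> > >> words=: ;: 'something else'      NB. reassigning the name leaves find unchanged
> > >> find <'belt'                     NB. still searches the original three words
> > >> 7!:2 'find2=: words&i.'          NB. space used while defining such a verb
> > >> ```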
> > >>
> > >> Henry Rich
> > >>
> > >>
> > >> On 2/21/2018 12:08 PM, Don Guinn wrote:
> > >>
> > >>> Defining a verb get to retrieve the index of the desired word as tacit
> > >>> does make get pretty much unreadable; however, there is a possible
> > >>> performance gain as the hash table for i. gets built only once when get
> > >>> is defined. If you will be running get many times this could result in
> > >>> a significant performance gain.
> > >>>
> > >>> Of course, once read in words must not be modified without rebuilding
> > >>> get. But if it turns out that you don't need words for anything else
> > >>> than in get then you could erase words after get is defined so storage
> > >>> used by a big verb is offset by not having words around any more.
> > >>>
> > >>> On Wed, Feb 21, 2018 at 9:31 AM, R.E. Boss <[email protected]> wrote:
> > >>>
> > >>>>    vec {~ (<'adults') i.~ words
> > >>>> is perhaps what you are looking for
> > >>>>
> > >>>>
> > >>>> R.E. Boss
> > >>>>
> > >>>>
> > >>>>> -----Original Message-----
> > >>>>> From: Programming [mailto:[email protected]]
> > >>>>> On Behalf Of Skip Cave
> > >>>>> Sent: Wednesday, 21 February 2018 17:09
> > >>>>> To: [email protected]
> > >>>>> Subject: Re: [Jprogramming] File Cleanup
> > >>>>>
> > >>>>> Thanks to Raul and Mike for the suggestions.
> > >>>>>
> > >>>>> I read in the data:
> > >>>>>
> > >>>>>
> > >>>>> nb =: <'C:\numberbatch-en.txt'
> > >>>>>
> > >>>>> nbs =. fread nb
> > >>>>>
> > >>>>>
> > >>>>> Then I tried to clean it up:
> > >>>>>
> > >>>>>
> > >>>>> Mike's method ran out of memory:
> > >>>>>
> > >>>>> nbs4 =. ( i.&' ' ({.;0 ". }.)] ) every nbs
> > >>>>>
> > >>>>> |out of memory
> > >>>>>
> > >>>>> When I tried to run it on a smaller set:
> > >>>>>
> > >>>>> nbs4=: (i.&' '({.;0".}.)])every 100000{. nbs
> > >>>>>
> > >>>>> nbs4
> > >>>>>
> > >>>>> ...
> > >>>>>
> > >>>>> │0││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │0││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │3││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │5││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │ ││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │0││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │.││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │0││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │7││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │8││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> │2││
> > >>>>>
> > >>>>> ├─┼┤
> > >>>>>
> > >>>>> So that wasn't working for me.
> > >>>>>
> > >>>>> I tried Raul's suggestion:
> > >>>>>
> > >>>>> words=. <@({.~ i.&' ');._2 nbs
> > >>>>>
> > >>>>> vec =. 0 1 }. _&".;._2 nbs
> > >>>>>
> > >>>>>
> > >>>>> $words
> > >>>>>
> > >>>>> 417195
> > >>>>>
> > >>>>>
> > >>>>> Looking good....
> > >>>>>
> > >>>>>
> > >>>>> ,.20{. 6000}. words
> > >>>>>
> > >>>>> ┌────────────┐
> > >>>>>
> > >>>>> │adultly │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adultness │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adultoid │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adultress │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adults │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adultship │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adulty │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbral │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrant │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrate │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrated │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrates │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrating │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbration │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrations│
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adumbrative │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adunation │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │adunc │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │aduncate │
> > >>>>>
> > >>>>> ├────────────┤
> > >>>>>
> > >>>>> │aduncity │
> > >>>>>
> > >>>>> └────────────┘
> > >>>>>
> > >>>>> $vec
> > >>>>>
> > >>>>> 417195 300
> > >>>>>
> > >>>>> 3 {. }.vec
> > >>>>>
> > >>>>> _0.0264 0.0468 _0.0099 _0.0242 _0.0762 0.0562 0.0863 0.0115 _0.0471 0.0442
> > >>>>> _0.0875 0.0376 _0.0404 _0.0086 0.0161 _0.1689 0.1485 _0.0201 0.1021 _0.0635
> > >>>>> _0.0317 0.0142 0.0588 _0.1299 _0.0905 0.0389 _0.0452 0.1352 0.0731 0.0648
> > >>>>> 0.1309 0.0493 0.0785 0.015...
> > >>>>>
> > >>>>> _0.0096 0.0318 _0.0095 _0.042 _0.0831 0.1103 0.075 0.024 _0.0237 0.0398
> > >>>>> _0.1274 _0.0299 _0.0209 _0.0195 _0.0043 _0.1033 0.1378 _0.0499 0.0517
> > >>>>> _0.0958 _0.0651 0.0214 0.0096 _0.0855 _0.1049 0.036 _0.0562 0.043 0.0616
> > >>>>> 0.1124 0.152 0.0418 0.0628 _0.018...
> > >>>>>
> > >>>>> _0.0364 0.0254 _0.0448 _0.0327 _0.0712 0.1548 0.1004 0.0033 _0.039 0.0635
> > >>>>> _0.1179 _0.0703 _0.0359 0.0296 _0.0594 _0.0954 0.1904 _0.0301 0.0078
> > >>>>> _0.0607 _0.0344 0.034 _0.0059 _0.1453 _0.0429 _0.0061 _0.05 0.0377 0.0959
> > >>>>> 0.1313 0.1238 0.0302 0.0043 _0.038...
> > >>>>>
> > >>>>>
> > >>>>> So this looks good!
> > >>>>>
> > >>>>>
> > >>>>> Now I need a verb that will let me specify a word, and it will return
> > >>>>> the associated vector.
> > >>>>>
> > >>>>> Here's how it should work:
> > >>>>>
> > >>>>>
> > >>>>> tst =. get 'adults'
> > >>>>>
> > >>>>>
> > >>>>> tst
> > >>>>>
> > >>>>> 0.1144 0.0444 0.0574 0.0387 0.082 _0.0271 0.209 _0.006 _0.1896 0.1038
> > >>>>> _0.0257 0.0646 0.0488 _0.0065 0.0486 0.0422 0.0239 _0.1006 _0.0541 0.0511
> > >>>>> _0.0254 _0.0121 0.0216 0.0324 _0.1349 0.0237 0.0049 0.0061 0.0349 _0.0264
> > >>>>> 0.0086 0.0919 _0.0174 0.0645 ...
> > >>>>>
> > >>>>>
> > >>>>> To build the 'get' verb we need to try to find the location of the word
> > >>>>> 'adults' in the boxed words array:
> > >>>>>
> > >>>>> 'adults' = each words
> > >>>>>
> > >>>>> |length error
> > >>>>>
> > >>>>> | 'adults' =each words
> > >>>>>
> > >>>>>
> > >>>>> Nope, that didn't work... Do I need to box the word?
> > >>>>>
> > >>>>>
> > >>>>> (<'adults')=each words
> > >>>>>
> > >>>>> |length error
> > >>>>>
> > >>>>> | (<'adults') =each words
> > >>>>>
> > >>>>>
> > >>>>> Nope! How do I find a specific word in the boxed word array?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> Skip Cave
> > >>>>> Cave Consulting LLC
> > >>>>>
> > >>>>> On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <[email protected]> wrote:
> > >>>>>
> > >>>>>> I read in a text file of word vectors using fread. The format looks
> > >>>>>> like this:
> > >>>>>>
> > >>>>>> bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
> > >>>>>>
> > >>>>>> bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
> > >>>>>>
> > >>>>>> belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
> > >>>>>>
> > >>>>>> belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
> > >>>>>>
> > >>>>>> Everything is literal text.
> > >>>>>>
> > >>>>>> The basic layout for each line is:
> > >>>>>>
> > >>>>>> word(s) (could contain multiple words separated by underscores) space
> > >>>>>> number (positive or negative) in text format space
> > >>>>>> number (positive or negative) in text format space
> > >>>>>> ......   repeat for 300 numbers (in text)
> > >>>>>>
> > >>>>>> The last number is followed by a line feed for the next line.
> > >>>>>>
> > >>>>>> I need to:
> > >>>>>> 1. Convert all the high minuses (-) to J's low minus (_)
> > >>>>>> 2. Extract the word(s) up to the first space into a separate array
> > >>>>>>    (words)
> > >>>>>> 3. Convert the text numbers into a 2D array of ? x 300
> > >>>>>>    floating point numbers
> > >>>>>>
> > >>>>>> I know how to do #1 (string replace), and #3 (".) once I get rid of
> > >>>>>> the words, but I don't know how to strip out the initial word on each
> > >>>>>> line and put them in a separate array. Any help is appreciated.
> > >>>>>>
> > >>>>>> Skip
> > >>>>>>
> > >>>>>> ----------------------------------------------------------------------
> > >>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
