Wow! Thanks so much for all the help on cleaning up and parsing the
numberbatch text file, as well as the various methods for extracting
words and their associated vectors from the data. It will take me a
while to digest all this and to test the various suggestions to see
which scheme works best for my application.

Once I get it all checked out, I will see how well the data lets me
discover the similarities in meaning (vector distances) between pairs
and groups of words.
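
One common way to measure the distance between two word vectors is
cosine similarity. A minimal sketch, assuming that measure is a
reasonable fit here; the names dot, norm and cossim are hypothetical,
and the two short vectors are made up for illustration rather than
taken from the numberbatch data:

```j
dot    =: +/ . *                     NB. inner product of two vectors
norm   =: %:@:(+/)@:*:               NB. Euclidean length: sqrt of sum of squares
cossim =: dot % norm@[ * norm@]      NB. (x dot y) % (norm x) * (norm y)

a =: 1 0 1
b =: 1 1 0
a cossim b                           NB. 0.5
```

A value near 1 means the two vectors point in nearly the same
direction, and a value near 0 means they are close to orthogonal.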

Skip

Skip Cave
Cave Consulting LLC

On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <s...@caveconsulting.com> wrote:

> I read in a text file of word vectors using fread. The format looks like
> this:
>
> bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
>
> bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
>
> belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
>
> belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
>
> Everything is literal text.
>
> The basic layout for each line is:
>
> word(s) (could contain multiple words separated by underscores)
> space
> number (positive or negative) in text format
> space
> number (positive or negative) in text format
> space
> ......   repeat for 300 numbers (in text)
>
> the last number is followed by a line feed for the next line
>
> I need to:
> 1. Convert all the high minuses (-) to J's low minus (_)
> 2. Extract the word(s) up to the first space into a separate array (words)
> 3. Convert the text numbers into a 2D array of ? x 300 floating-point numbers
>
> I know how to do #1 (string replace) and #3 (".) once I get rid of the
> words, but I don't know how to strip out the initial word on each line
> and put them in a separate array. Any help is appreciated.
>
> Skip
>
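
The three steps listed in the quoted message can be sketched roughly as
follows. This is only one possible approach, not necessarily any of the
schemes suggested in the thread. The names txt, lines, firstsp, words,
rest and vecs are hypothetical, and the sample has three numbers per
line instead of 300; in practice txt would be the result of fread on
the file. The sketch assumes every line ends with LF and contains at
least one space. It relies on dyadic ". accepting - as a negative sign,
so the string replacement in step 1 is not done here.

```j
NB. Two sample lines in the layout described above (3 columns, not 300)
txt =: 'bell 0.0264 -0.2927 -0.0254', LF, 'bell_tower -0.1252 -0.1233 0.1351', LF

lines   =: <;._2 txt                   NB. box each line; the trailing LF is the fret
firstsp =: i.&' '                      NB. index of the first space in a line
words   =: (firstsp {. ]) each lines   NB. boxed list: text before the first space
rest    =: (firstsp }. ]) each lines   NB. text from the first space onward
vecs    =: > 0&". each rest            NB. parse each line's numbers, open into a table

$ vecs                                 NB. 2 3
```

Here words is a boxed list of strings and vecs is a numeric table with
one row per word, so row i of vecs belongs to item i of words.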
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
