Wow! Thanks so much for all the help on cleaning up and parsing the numberbatch text file, as well as the various methods for extracting words and their associated vectors from the data. It will take me a bit to digest all this, and some time to test the various suggestions to see which scheme works best for my application.
Once I get it all checked out, I will see how well the data allows me to discover the meaning similarities (vector distances) between pairs and groups of words.

Skip

Skip Cave
Cave Consulting LLC

On Wed, Feb 21, 2018 at 2:36 AM, Skip Cave <s...@caveconsulting.com> wrote:

> I read in a text file of word vectors using fread. The format looks like
> this:
>
> bell 0.0264 -0.2927 -0.0254 -0.1034 0.1672 -0.0440 -0.0019 0.1210 ...
>
> bell_tower -0.1252 -0.1233 0.1351 0.1897 0.0242 0.0014 0.1942 -0.0237 ...
>
> belt 0.1332 0.0142 -0.1208 -0.0574 0.1451 -0.0731 -0.1293 0.0855 ...
>
> belfast 0.1190 -0.0440 -0.0254 -0.2090 0.2144 0.0348 -0.1467 0.1256 ...
>
> Everything is literal text.
>
> The basic layout for each line is:
>
> word(s) (could contain multiple words separated by underscores)
> space
> number (positive or negative) in text format
> space
> number (positive or negative) in text format
> space
> ...... repeat for 300 numbers (in text)
>
> the last number is followed by a line feed for the next line
>
> I need to:
> 1. Convert all the high minuses (-) to J's low minus (_)
> 2. Extract the word(s) up to the first space into a separate array (words)
> 3. Convert the text numbers into a 2D array of ? x 300 floating point
> numbers
>
> I know how to do #1 (string replace), and #3 (".) once I get rid of the
> words,
> but I don't know how to strip out the initial word on each line and put
> them in a separate array. Any help is appreciated.
>
> Skip
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
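
For readers following the thread, here is a minimal sketch of steps 2 and 3 from the quoted message, plus one common way to compare two word vectors. It is written in Python purely for illustration and is not one of the J solutions posted in the thread. The names `parse_vectors` and `cosine_similarity` are hypothetical, the sample rows are cut down to the first four numbers of each line shown above, and cosine similarity is assumed as the measure (the thread only says "vector distances"). Step 1 is not needed here because Python's `float` already accepts a leading `-`.

```python
import math

def parse_vectors(text):
    """Split each line into its leading word and a list of floats."""
    words, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Everything before the first space is the word (underscores kept);
        # the remaining space-separated fields are the numbers.
        word, *fields = line.split()
        words.append(word)
        rows.append([float(f) for f in fields])
    return words, rows

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat a zero vector as dissimilar

# Sample data: the first four numbers of three lines from the message above.
sample = (
    "bell 0.0264 -0.2927 -0.0254 -0.1034\n"
    "bell_tower -0.1252 -0.1233 0.1351 0.1897\n"
    "belt 0.1332 0.0142 -0.1208 -0.0574\n"
)

words, vectors = parse_vectors(sample)
similarity = cosine_similarity(vectors[0], vectors[1])  # bell vs. bell_tower
```

With the full file, each row in `vectors` would hold 300 numbers rather than four, and `words[i]` names the row `vectors[i]`.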