Hello all,

I am looking for an efficient way to represent vectors that live in an infinite-dimensional space. Specifically, I am working with large amounts of text data and expect to keep receiving data that contains previously unseen words. Each text is a vector in the space of all possible strings, and each word in the text is a dimension. As such, these vectors are extremely sparse.

Currently we handle this by using a dictionary to represent each text as a bag-of-words <http://en.wikipedia.org/wiki/Bag-of-words_model> vector. If a word does not exist in the vector, we return zero. This allows us to perform computations like so:

["the"=>3,"and"=>2,"is"=>4] + ["this"=>5,"was"=>1,"where"=>6] = ["where"=>6,"the"=>3,"is"=>4,"this"=>5,"was"=>1,"and"=>2]

euclidean(["the"=>3,"and"=>2,"is"=>4], ["this"=>5,"was"=>1,"where"=>6]) = 9.539392014169456

Is a dictionary the proper associative structure, or should we use a different data structure like a JudyArray or Trie?

-MT
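
For concreteness, the dictionary-backed representation described above could be sketched roughly as follows. This is written in Python purely for illustration and is not the code in use; the function names `add` and `euclidean` simply mirror the examples, and treating a missing word as zero via `dict.get` is an assumption about how the lookup is done.

```python
import math

def add(u, v):
    """Sum two sparse vectors stored as word -> count dictionaries."""
    out = dict(u)  # copy so the inputs are left untouched
    for word, count in v.items():
        # A word absent from `out` is treated as having count zero.
        out[word] = out.get(word, 0) + count
    return out

def euclidean(u, v):
    """Euclidean distance, iterating only over words present in u or v."""
    words = u.keys() | v.keys()
    return math.sqrt(sum((u.get(w, 0) - v.get(w, 0)) ** 2 for w in words))

a = {"the": 3, "and": 2, "is": 4}
b = {"this": 5, "was": 1, "where": 6}

print(add(a, b))        # six entries, since the two texts share no words
print(euclidean(a, b))  # sqrt(91), roughly 9.539 as in the example above
```

Both operations touch only the words that actually occur in the two texts, so their cost grows with the number of distinct words in those texts rather than with the size of the overall vocabulary.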
