CEP-30: [Approximate Nearest Neighbor (ANN) Vector Search via Storage-Attached Indexes] uses the smile-nlp library (com.github.haifengl:smile-nlp) in its tests to create word2vec embeddings that serve as valid input to the HNSW graph index.
We want this library because testing with random vectors produced very inconsistent results, whereas the smile-nlp word2vec implementation with the glove.3k.50d word vectors produces repeatable results. Does anyone have any objections to using this library as a test-only dependency?

--
Mike Adamson
Engineering, DataStax