msokolov commented on pull request #1930:
URL: https://github.com/apache/lucene-solr/pull/1930#issuecomment-707431740


   > One option could be 200-dimensional GloVe word vectors, available from 
http://ann-benchmarks.com/glove-200-angular.hdf5. I think these are trained on 
Twitter data.
   
   +1 I'm looking into adding GloVe data to the luceneutil benchmarks, initially 
just to index and retrieve the vectors; then I hope to add tasks for scoring 
lexical matches, and then for k-NN matching. I think some of the GloVe datasets 
are trained on Wikipedia (plus other text), so they should be suitable for use 
in our benchmarks, which are based on Wikipedia text.
   
   I think for initial performance comparisons we can use our own tool; it 
wouldn't be as nicely controlled as running in the same framework, but if we 
are careful the results should be comparable. It's also good to know there is a 
reasonable path for integrating with ann-benchmarks using py4j, and I hadn't 
realized there was a --batch option.
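
   For reference, here is a minimal sketch of what exact k-NN matching under an angular (cosine) measure might look like, which is the kind of brute-force baseline an approximate search could be compared against. The function names and the toy vectors are hypothetical; this is not luceneutil or ann-benchmarks code, and real GloVe vectors would be loaded from the dataset file rather than written inline.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Return the indices of the k vectors most similar to query, best first.

    Brute force: every vector is scored, so the cost is linear in the
    number of vectors. Hypothetical helper for illustration only.
    """
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]


# Toy 2-dimensional stand-ins for word vectors (assumed data, not GloVe).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(exact_knn([1.0, 0.05], toy_vectors, 2))
```

   Scoring every vector is only practical for small samples, but it gives the true neighbors that an approximate index's results can be checked against.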




