msokolov commented on pull request #1930: URL: https://github.com/apache/lucene-solr/pull/1930#issuecomment-707431740
> One option could be 200-dimensional GloVe word vectors, available from http://ann-benchmarks.com/glove-200-angular.hdf5. I think these are trained on Twitter data.

+1 I'm looking into adding GloVe data to the luceneutil benchmarks, initially just to index and retrieve the vectors; after that I hope to add tasks for scoring lexical matches, and then for knn matching. I think some of the GloVe datasets are trained on Wikipedia (plus other text), so they should be suitable for our benchmarks, which are based on Wikipedia text.

I think for initial performance comparisons we can use our own tool; it wouldn't be as nicely controlled as running in the same framework, but if we are careful the results should be comparable. And it's good to know there is a reasonable path for integrating with ann-benchmarks using py4j; I hadn't realized there was a --batch option.
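
To make "comparable" a bit more concrete for the eventual knn task, here is a minimal sketch of how exact nearest neighbors under angular (cosine) similarity, and recall against them, might be computed for a small sample. This is not luceneutil or ann-benchmarks code; the function names `normalize`, `exact_knn`, and `recall_at_k` are hypothetical, and the brute-force scan is only practical for small samples.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def exact_knn(query, docs, k):
    """Return ids of the k docs most similar to query by cosine (brute force)."""
    q = normalize(query)
    scored = []
    for doc_id, vec in enumerate(docs):
        d = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    # nlargest returns the highest-scoring (score, doc_id) pairs, best first.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that an approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

The idea would be to run `exact_knn` once per query to get ground truth, then score whatever an approximate index returns with `recall_at_k`, so that recall can be reported alongside latency regardless of which harness produced the numbers.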