jtibshirani edited a comment on pull request #1930:
URL: https://github.com/apache/lucene-solr/pull/1930#issuecomment-707404485


   >  It seems to expect your algorithm to be delivered as an in-process 
extension to Python, which works OK for a native code library, but I'm not sure 
how we'd present Lucene to it. We don't want to have to call through a network 
API?
   
   I ended up using `py4j` to call out to Lucene. It sets up a 'gateway 
server' and passes data between the Python and Java processes through a socket. 
I found significant overhead in converting data between Python and Java, but 
this can be largely mitigated by using 'batch mode' (the `--batch` option), 
which passes all query vectors to Lucene at once. With the overhead amortized 
this way, I was able to get consistent and informative results. Let me know if 
you're interested in trying the py4j option and I can post set-up steps. I 
found it helpful while developing, but it's quite tricky and maybe shouldn't be 
the main way to track performance right now (as you mentioned)!
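   
   To illustrate the batching idea, here is a minimal sketch, not the actual 
set-up used for these experiments. It packs all query vectors into one byte 
buffer so that a single py4j call crosses the socket instead of one call per 
float or per vector. `JavaGateway` and `entry_point` are real py4j names; the 
helper names and the Java-side `searchBatch` method are hypothetical and would 
have to be written on the Lucene side.

```python
import struct


def pack_queries(queries):
    """Flatten equal-length query vectors into one big-endian float32 buffer.

    Big-endian matches the default byte order of java.nio.ByteBuffer, so the
    Java side could read it with ByteBuffer.wrap(bytes).asFloatBuffer().
    """
    if not queries:
        raise ValueError("need at least one query vector")
    dim = len(queries[0])
    if any(len(q) != dim for q in queries):
        raise ValueError("all query vectors must have the same dimension")
    flat = [float(x) for q in queries for x in q]
    return dim, bytearray(struct.pack(">%df" % len(flat), *flat))


def search_batch(queries, k):
    """Send every query to the JVM in a single gateway call."""
    # Imported lazily so the packing helper works without py4j installed.
    from py4j.java_gateway import JavaGateway

    # Assumes a py4j GatewayServer is already running inside the JVM.
    gateway = JavaGateway()
    dim, payload = pack_queries(queries)
    # searchBatch is a hypothetical method on the Java entry point object.
    return gateway.entry_point.searchBatch(payload, dim, k)
```

With this shape, the per-call conversion cost is paid once per batch rather 
than once per query, which is the amortization described above.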
   
   Note that it's possible to use vector data from ann-benchmarks without 
integrating with the framework. The datasets are listed 
[here](https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/datasets.py#L396)
 and made available on the website in HDF5 format. One option could be the 
200-dimensional GloVe word vectors, available from 
`http://ann-benchmarks.com/glove-200-angular.hdf5`. I think these are trained 
on Twitter data.
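
   As a rough sketch of using such a file outside the framework, the snippet 
below reads the vectors with `h5py` and scores search results against the 
precomputed neighbors. It assumes the file has already been downloaded and 
that it contains `train`, `test` and `neighbors` datasets. The function names 
are made up for illustration, and the recall here is a simple id-overlap 
measure, not necessarily the metric ann-benchmarks itself reports.

```python
def load_ann_benchmarks(path):
    """Read index vectors, query vectors and ground-truth neighbor ids."""
    # Imported lazily; h5py is a third-party package (pip install h5py).
    import h5py

    with h5py.File(path, "r") as f:
        train = f["train"][:]          # vectors to index
        test = f["test"][:]            # query vectors
        neighbors = f["neighbors"][:]  # true nearest-neighbor ids per query
    return train, test, neighbors


def recall_at_k(found_ids, true_ids, k):
    """Fraction of the true top-k ids that appear in the returned top-k."""
    if len(true_ids) == 0 or k <= 0:
        raise ValueError("need at least one query and k > 0")
    hits = 0
    for found, truth in zip(found_ids, true_ids):
        hits += len(set(found[:k]) & set(truth[:k]))
    return hits / (k * len(true_ids))
```

For example, `recall_at_k(results, neighbors, 10)` would compare the ids 
returned by Lucene for each query in `test` with the first 10 ground-truth ids.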




