mocobeta commented on pull request #238: URL: https://github.com/apache/lucene/pull/238#issuecomment-894810335
I was also thinking about how we (and users) would obtain the example vector data if we provide a "standalone" demo alongside the one integrated into IndexFiles/SearchFiles. There are two possible options we could take:

1. Sample random vectors from a uniform or normal distribution when indexing/searching. Of course, the generated vectors are not meaningful at all, but one could say that the "meaning" of vectors is up to the specific model or application, and what we provide is general "vector search" functionality anyway...
2. Generate word representations from some publicly available corpus (e.g. Project Gutenberg) using GloVe, then include a small fraction of them in the demo module distribution. While proper credits are required, distributing a dataset derived from copyright-free texts and public domain embeddings (GloVe) would not be problematic, I think. (Though if this leads to a somewhat difficult discussion, I wouldn't push this plan.)

> So what we're doing here is different from the benchmarks since we're redistributing (a portion of) the GloVe data, unlike benchmarks which requires the user to download the wikipedia data (or does it for them).

I hadn't noticed that a fraction of the GloVe data was included; yes, I think it would be great to have proper credits (or a notice?) for it.
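
As a rough illustration of option 1, here is a minimal sketch of how random demo vectors could be sampled. The class name `RandomVectors` and both method names are hypothetical and not part of Lucene; the code uses only `java.util.Random`. Scaling the normally distributed vectors to unit length is an assumption on my side (it tends to be convenient for cosine or dot-product similarity), not something this PR requires.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper, not part of Lucene: generates random demo vectors.
public class RandomVectors {

  // Samples each component from a standard normal distribution, then
  // scales the vector to unit length (an assumption, see the note above).
  static float[] gaussianUnitVector(int dim, Random random) {
    float[] v = new float[dim];
    double sumOfSquares = 0;
    for (int i = 0; i < dim; i++) {
      v[i] = (float) random.nextGaussian();
      sumOfSquares += v[i] * v[i];
    }
    float norm = (float) Math.sqrt(sumOfSquares);
    if (norm > 0) {
      for (int i = 0; i < dim; i++) {
        v[i] /= norm;
      }
    }
    return v;
  }

  // Samples each component uniformly from [-1, 1).
  static float[] uniformVector(int dim, Random random) {
    float[] v = new float[dim];
    for (int i = 0; i < dim; i++) {
      v[i] = random.nextFloat() * 2f - 1f;
    }
    return v;
  }

  public static void main(String[] args) {
    // A fixed seed keeps the generated vectors reproducible between runs.
    Random random = new Random(42);
    System.out.println(Arrays.toString(gaussianUnitVector(8, random)));
    System.out.println(Arrays.toString(uniformVector(8, random)));
  }
}
```

With a fixed seed, the indexing and searching sides of a demo could regenerate the same vectors without shipping any data file, which is the main appeal of this option over redistributing embeddings.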
