Couple ideas I guess... Rather than use queries (being so much more difficult) just make an index that contains documents that are just a list of keywords (representing a profilenet 'query'). Use the MoreLikeThis class from contrib to search that index using your source document. The hits you get back will have a score and show which profiles "matched". Kind of simplistic and would prob require some tweaking, but hey...
You might also try searching the Lucene mailing list for classification/taxonomy and you'll find some good solutions for that (I imagine better than the profilenet system). In any case, it shouldnt be that difficult to rig something. Is the profilenet system even that valuable? Sounds a bit hokey to me, but then im just a kid that has never used it <g> You might also try shoving the document in question into a MemoryIndex and applying all of your profile queries against it, saving the score for each search (does the weight system make these scores comparable?). Doubt thats very scalable though... (*heavily* doubt it, but this would appear to be most accurate in terms of an inverted search) And finally, maybe this reply this will trigger some of these brilliant dudes on the list to suggest something better.... - Mark On Jan 16, 2008 1:32 PM, Marcus Falk <[EMAIL PROTECTED]> wrote: > The norms are modded so each norm value is stored as 4 byte instead of 1 > byte, this modification is using more memory. But anyway the hw we are > running on are > 2x 8 cpu hp servers with 16 gig ram in each of them. > > We are scaling the index on daterange (and the ranking is modified to sort > by date) > > Ex: > 2007-01 - 2007-06 > ix 1 > 2007-07 - 2007-12 > ix 2 > > Each index is hosted in a separate hosting application and we have a layer > infront of all indexes to merge the results. So it is our own "search > engine" we tried solr but I didn't really liked it + we had to do some > modifications to support the business. > > Since we are running 32 bit OS (we are going to use 64 bit soon it will be > interesting) on windows with pae each process can consume just 2-2.5gbmemory > so we are having a lot of indexes on the same machine. > > We did this some years ago and we run it 24/7 without any failure, > indexing approx 200-300k new articles every day. > > > Any way... We really need to find a good api / some one that knows how to > add inverted searching to lucene. > > /Regards > Marcus > > > > > > > -----Ursprungligt meddelande----- > Från: Mark Miller [mailto:[EMAIL PROTECTED] > Skickat: den 16 januari 2008 19:14 > Till: java-user@lucene.apache.org > Ämne: Re: Inverted search / Search on profilenet > > Don't have any info to add, but out of curiosity, what kind of setup are > you > using to host the 300 mil archive? Is the index distributed? Single > machine? > Solr? > > Thanks, > > Mark > > On Jan 16, 2008 12:27 PM, Marcus Falk <[EMAIL PROTECTED]> wrote: > > > Hi again, > > > > > > > > Today we are hosting a 300 million large search index without any > > problems in a lucene environment, with just some customization in the > > lucene api for ranking etc... > > > > > > > > So we are really satisfied with lucene. > > > > > > > > We also have the demands to search with documents on profiles we are > > currently using verity (autonomy) for this, where we store the profiles > > in the index and are using the document as query. > > > > The verity api we are using seems to have some internal threading > > problems (race conditions) so we need to find another way to perform > > those kind of searches. > > > > > > > > Does anybody have any ideas of any api that could do this for us? Any > > ideas on how lucene could be modified to do this kind of searches? > > > > > > > > The volumes are around 300k full length articles distributed some what > > evenly over a 24h period on a 50 k profilenet. > > > > > > > > > > > > /Mvh > > > > Marcus > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >