I shudder to think about what the higher-ups will say about the cost of faster storage. They're very stingy on things that make our lives easier. :)
I like the idea of caching the term vectors somehow, though I wonder how that would scale with millions of documents. I will add that to the list of things to prototype for our long-term solution.

Thanks for the feedback,

---Matt

On Thu, Nov 12, 2009 at 12:14 PM, Michael Garski <mgar...@myspace-inc.com> wrote:

> Matt,
>
> Metadata can be collected during the course of a search in a HitCollector (note that 2.9 will deprecate the current HitCollector in favor of the more flexible Collector). One approach to collecting facets is to index the field with the desired facet metadata with TermVectors enabled. During collection you can retrieve the TermFreqVector for a document with Reader.GetTermFreqVector for each document and collect/aggregate metadata however you like.
>
> Making the call to get the TermFreqVector for each hit will be a performance hit due to IO and deserialization (even though it's very fast, 50K times in one search is a lot). At that point you can add faster IO (Fusion IO rules this space) or, during an index warmup, create an in-memory structure with lookups to the vectors keyed off the document id.
>
> Hope that helps.
>
> Michael
>
> -----Original Message-----
> From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> Sent: Thursday, November 12, 2009 7:13 AM
> To: lucene-net-user@incubator.apache.org
> Subject: Re: FieldLookup for field with multiple values
>
> I can elaborate a little on what our *planned* approach for utilizing SQL Server is. I don't know if this will work, but I've done similar things with SQL CLR and haven't had it explode (yet), so I'm hopeful. Anyway:
>
> Our system needs two types of output, basically: the full-text reports (I'll call them documents to reduce ambiguity), and then statistical reports built on those documents. The documents can easily be retrieved from Lucene, so the challenge is building the reports.
> For that, my plan is to submit the same query string to SQL Server that I sent to Lucene. Internally, SQL Server would then pass the query back to Lucene and retrieve a list of document IDs that matched. The communication may be over WCF or something similar, and will be compressed during transit to reduce IO overhead. Once SQL Server has the IDs, they will be loaded into a temporary table (with indexes) or a table variable, which will then be used to filter the metadata that the statistical reports are built from.
>
> I have no idea how such a system would perform. I do hope to do some feasibility tests sometime Real Soon (like in the next few weeks), and I'll post my results if I manage to get it working.
>
> If anyone has any other suggestions, please do share.
>
> On Thu, Nov 12, 2009 at 8:43 AM, Moray McConnachie <mmcco...@oxford-analytica.com> wrote:
>
> > > While we're discussing this, anyone have any advice or suggestions for a better solution? We've considered a few things for our long-term solution.
> >
> > I'd be very interested to hear thoughts on intersecting SQL and Lucene too, as in our case we have very large lists of organisations which have different permissions (stored in SQL) for different documents stored in Lucene. Showing in the search results only those documents to which the organisation has permission is quite expensive for queries with lots of results. Storing it in the documents is not manageable because they need to be updated frequently across multiple documents. Currently we precompute a list for each organisation, cache that in memory, and recache it every time that organisation is updated. However, this too is costly.
> >
> > Storing the Lucene document nos in SQL during indexing, and then passing the list of Lucene document nos matching a search to SQL for filtering seems the right way to go.
> > But Matt is right, the problem is with scaling this to searches returning many thousands of documents.
> >
> > Yours,
> > M.
> > -------------------------------------
> > Moray McConnachie
> > Director of IT       +44 1865 261 600
> > Oxford Analytica     http://www.oxan.com
> >
> > -----Original Message-----
> > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > Sent: 12 November 2009 13:53
> > To: lucene-net-user@incubator.apache.org
> > Subject: Re: FieldLookup for field with multiple values
> >
> > Yeah, it is sort of like your standard faceting scenario, except there are about 20,000 facets (organizations), and there are complex relationships among the facets.
> >
> > The reports we're dealing with only occasionally break the funding up by organization, so we decided (for now) to just store a single funding value, then break it up after-the-fact by dividing it by the number of organizations. So no, the funding is only stored once.
> >
> > While we're discussing this, anyone have any advice or suggestions for a better solution? We've considered a few things for our long-term solution. One is to put this metadata in a SQL Server instance, and use SQL CLR to build a temporary table based on document IDs from a Lucene index (hosted over WCF or something similar), then do the reporting within SQL Server. We plan to compress the list of IDs going back from Lucene to SQL Server to cut down on IO overhead, but we're still concerned that approach won't scale as we go from hundreds of thousands to millions of reports.
> >
> > Another option we've discussed is to precompute data cubes and use these to calculate reporting information.
> > The concern here is the high dimensionality of the data (we have about 20,000 distinct organizations now, but fully expect that to increase by an order of magnitude) as well as the accuracy of the generated reports, since there's (probably) not a good way to divide the cube based on arbitrary Lucene queries.
> >
> > On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski <mgar...@myspace-inc.com> wrote:
> >
> > > Sounds like a full-text search with the results simply being facets on the organizations sorted by the funding amount?
> > >
> > > You mentioned adding the org ID once for each document. Do you do the same for the funding, with the funding for each corresponding organization?
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > > Sent: Wed 11/11/2009 10:17 PM
> > > To: lucene-net-user@incubator.apache.org
> > > Subject: Re: FieldLookup for field with multiple values
> > >
> > > Well, let me prefix what I'm about to describe by saying that I know that I'm doing something with Lucene that it wasn't meant to do. This is for a "proof of concept" system that I'm helping put together on a tight schedule with very limited resources, and we're trying to get to a mostly-working state as quickly as possible.
> > >
> > > That said, we are basically storing reports in Lucene. The reports are fairly standard documents for the most part: they have a title, body, abstract, etc., all of which we index and search with Lucene. However, they also have a few fields that aren't standard, including a list of involved organizations as well as a dollar amount for each report. The organizations are stored as IDs, and we add the org ID field multiple times, once for each organization involved in the report.
> > > The funding is also stored as a non-indexed field on the Lucene document.
> > >
> > > What I'm trying to do is build a quick-and-dirty org-by-dollar report off of the reports that match the user's query. So, a query for "aerospace" might match 50,000 documents, and I want to show the user the top 5 organizations in terms of dollars. Again, I know reporting like this isn't what Lucene was meant for, and we do have some ideas on how to handle it long-term, but for now, I'm trying to get it working as well as I can using Lucene alone, and Lucene does do a great job of finding the relevant set of documents to build a report from.
> > >
> > > On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski <mgar...@myspace-inc.com> wrote:
> > >
> > > > Matt,
> > > >
> > > > StringIndex is for use when a field has only one value in it for the purposes of sorting results, not for tokenized fields with multiple values. TermVectors might be a better approach, but for 50K docs, you'll encounter an IO hit on reading them.
> > > >
> > > > I'm curious why you are looking to grab all of the terms for a ScoreDoc... can you shed some light on that?
> > > >
> > > > Michael
> > > >
> > > > -----Original Message-----
> > > > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > > > Sent: Wednesday, November 11, 2009 4:57 PM
> > > > To: lucene-net-user@incubator.apache.org
> > > > Subject: FieldLookup for field with multiple values
> > > >
> > > > It seems that the StringIndex returned by FieldCache.Fields.Default.GetStringIndex() only indexes one value for a document even when the document has multiple values for the field. Is there a performant way to get all the values for a particular field in a ScoreDoc?
> > > > I'm having to do this across the entire result set of ScoreDocs (up to 50,000), and retrieving the values through LuceneDocument.GetFields is not going to cut it.
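To make the Collector-plus-TermVectors idea Michael describes concrete: the collector sees each hit's doc id, looks up that document's facet terms in a warmed-up in-memory structure, and aggregates. This is a minimal language-agnostic sketch in Python (the real code would be C# against Lucene.Net; `TERM_VECTORS` stands in for whatever structure the warmup builds from `GetTermFreqVector`, and all names here are illustrative):

```python
from collections import Counter

# Illustrative stand-in for the warmup structure Michael suggests:
# a map from document id to that document's org-ID facet terms,
# built once from GetTermFreqVector calls rather than per search.
TERM_VECTORS = {
    0: ["org7", "org9"],
    1: ["org7"],
    2: ["org9", "org3"],
}

def collect_facets(hit_doc_ids, top_n):
    """Aggregate facet counts over the hits, as a Collector would."""
    counts = Counter()
    for doc_id in hit_doc_ids:
        counts.update(TERM_VECTORS.get(doc_id, []))
    return counts.most_common(top_n)
```

The same loop works for the org-by-dollar report if each document's stored funding value is divided across its orgs before being added to the accumulator.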
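For the compressed ID hand-off Matt plans between Lucene and SQL Server, one cheap scheme is delta-plus-varint encoding of the sorted doc IDs: gaps between consecutive ids are small, so most ids fit in one or two bytes. A sketch (the actual wire format over WCF is an open design choice):

```python
def encode_doc_ids(doc_ids):
    """Delta-encode sorted doc ids, then varint-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - prev
        prev = doc_id
        # Varint: 7 payload bits per byte, high bit set on all but the last.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_doc_ids(data):
    """Reverse of encode_doc_ids: varints back to gaps, gaps back to ids."""
    ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            ids.append(prev)
            value, shift = 0, 0
    return ids
```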