Yeah, it is sort of like your standard faceting scenario, except there are
about 20,000 facets (organizations), and there are complex relationships
among the facets.

The reports we're dealing with only occasionally break the funding up by
organization, so we decided (for now) to just store a single funding value
per report, then break it up after the fact by dividing it evenly by the
number of organizations.  So no, the funding is only stored once.
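
A minimal sketch of that even split and the resulting top-N ranking, written
in Python for brevity rather than in the C# the system would use.  The
function name and the (funding, org_ids) input shape are hypothetical, and
the sketch assumes the funding and org IDs for each matching report have
already been loaded from the index.

```python
from collections import defaultdict
import heapq


def top_orgs_by_funding(matched_reports, n=5):
    """Return the n organizations with the largest funding totals.

    matched_reports: iterable of (funding, org_ids) pairs, one per report
    that matched the query.  This input shape is an assumption made for
    the sketch, not something taken from the thread.
    """
    totals = defaultdict(float)
    for funding, org_ids in matched_reports:
        if not org_ids:
            continue  # no organizations to credit for this report
        share = funding / len(org_ids)  # even split across organizations
        for org_id in org_ids:
            totals[org_id] += share
    # Pick the n largest totals without sorting every organization.
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])


reports = [(900.0, [1, 2, 3]), (100.0, [1]), (50.0, [2, 4])]
print(top_orgs_by_funding(reports, n=2))
```

The even split is only an approximation: a report whose funding really went
mostly to one organization still credits every listed organization equally.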

While we're discussing this, anyone have any advice or suggestions for a
better solution?  We've considered a few things for our long-term solution.
One is to put this metadata in a SQL Server instance, and use SQL CLR to
build a temporary table based on document IDs from a Lucene index (hosted
over WCF or something similar), then do the reporting within SQL Server.  We
plan to compress the list of IDs going back from Lucene to SQL Server to cut
down on IO overhead, but we're still concerned that approach won't scale as
we go from hundreds of thousands to millions of reports.
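
The compression scheme for the ID list is not decided anywhere in this
thread.  One common option for sorted integer IDs is delta encoding followed
by variable-length bytes, sketched below in Python under that assumption;
the function names are hypothetical and nothing here depends on Lucene or
SQL Server.

```python
def encode_ids(doc_ids):
    """Delta-encode sorted, de-duplicated IDs, packing each gap as a varint.

    Assumes non-negative integer IDs.
    """
    out = bytearray()
    prev = 0
    for doc_id in sorted(set(doc_ids)):
        gap = doc_id - prev
        prev = doc_id
        # Emit 7 bits per byte; the high bit marks "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_ids(data):
    """Reverse encode_ids, returning the IDs in ascending order."""
    ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            current += gap
            ids.append(current)
            gap = 0
            shift = 0
    return ids


ids = [3, 5, 300, 100000]
packed = encode_ids(ids)
assert decode_ids(packed) == ids
```

Dense result sets produce small gaps, so most IDs fit in one byte.  That
shrinks the transfer, but the receiving side still has to materialize every
ID, so the scaling concern above remains.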

Another option we've discussed is to precompute data cubes and use these to
calculate reporting information.  The concern here is the high
dimensionality of the data (we have about 20,000 distinct organizations now,
but fully expect that to increase by an order of magnitude) as well as the
accuracy of the generated reports, since there's (probably) not a good way
to divide the cube based on arbitrary Lucene queries.

On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski <mgar...@myspace-inc.com> wrote:

> Sounds like a full-text search with the results simply being facets on the
> organizations sorted by the funding amount?
>
> You mentioned adding the org ID once for each document.  Do you do the same
> for the funding, with the funding for each corresponding organization?
>
> Michael
>
>
> -----Original Message-----
> From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> Sent: Wed 11/11/2009 10:17 PM
> To: lucene-net-user@incubator.apache.org
> Subject: Re: FieldLookup for field with multiple values
>
> Well, let me preface what I'm about to describe by saying that I know that
> I'm doing something with Lucene that it wasn't meant to do.  This is for a
> "proof of concept" system that I'm helping put together on a tight schedule
> with very limited resources, and we're trying to get to a mostly-working
> state as quickly as possible.
>
> That said, we are basically storing reports in Lucene.  The reports are
> fairly standard documents for the most part: they have a title, body,
> abstract, etc., all of which we index and search with Lucene.  However, they
> also have a few fields that aren't standard, including a list of involved
> organizations as well as a dollar amount for each report.  The
> organizations are stored as IDs, and we add the org ID field multiple
> times, once for each organization involved in the report.  The funding is
> also stored as a non-indexed field on the Lucene document.
>
> What I'm trying to do is build a quick-and-dirty org-by-dollar report off
> of the reports that match the user's query.  So, a query for "aerospace"
> might match 50,000 documents, and I want to show the user the top 5
> organizations in terms of dollars.  Again, I know reporting like this isn't
> what Lucene was meant for, and we do have some ideas on how to handle it
> long-term, but for now, I'm trying to get it working as well as I can using
> Lucene alone, and Lucene does do a great job of finding the relevant set of
> documents to build a report from.
>
> On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski <mgar...@myspace-inc.com> wrote:
>
> > Matt,
> >
> > StringIndex is for use when a field has only one value in it for the
> > purposes of sorting results, not for tokenized fields with multiple
> > values.  TermVectors might be a better approach, but for 50K docs,
> > you'll encounter an IO hit on reading them.
> >
> > I'm curious why you are looking to grab all of the terms for a
> > ScoreDoc...  can you shed some light on that?
> >
> > Michael
> >
> > -----Original Message-----
> > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > Sent: Wednesday, November 11, 2009 4:57 PM
> > To: lucene-net-user@incubator.apache.org
> > Subject: FieldLookup for field with multiple values
> >
> > It seems that the StringIndex returned by
> > FieldCache.Fields.Default.GetStringIndex() only indexes one value for a
> > document even when the document has multiple values for the field.  Is
> > there a performant way to get all the values for a particular field in a
> > ScoreDoc?  I'm having to do this across the entire result set of
> > ScoreDocs (up to 50,000), and retrieving the values through
> > LuceneDocument.GetFields is not going to cut it.
> >
> >
>
>
>
