I can elaborate a little on our *planned* approach for utilizing SQL
Server.  I don't know yet whether this will work, but I've done similar
things with SQL CLR and haven't had it explode (yet), so I'm hopeful.
Anyway:

Our system needs two types of output, basically: the full-text reports (I'll
call them documents to reduce ambiguity), and then statistical reports built
on those documents.  The documents can easily be retrieved from Lucene, so
the challenge is building the statistical reports.  For that, my plan is to
submit the same query string to SQL Server that I sent to Lucene.
Internally, SQL
Server would then pass the query back to Lucene and retrieve a list of
document IDs that matched.  The communication may be over WCF or something
similar, and will be compressed during transit to reduce IO overhead.  Once
SQL Server has the IDs, they will be loaded into a temporary table (with
indexes) or a table variable, which will then be used to filter the metadata
that the statistical reports are built from.
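
To make that a bit more concrete, here is a rough sketch of the SQL Server
side as I currently picture it: a SQL CLR table-valued function that hands
the query off to the search service and streams the matching IDs back as
rows.  The names below are made up and the service call is just a
placeholder.

    using System.Collections;
    using System.Collections.Generic;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public class LuceneSearchBridge
    {
        // CLR table-valued function: returns one row per matching Lucene
        // document ID.  SQL can SELECT the rows into a #temp table (or a
        // table variable) and join them against the metadata tables.
        [SqlFunction(FillRowMethodName = "FillRow", TableDefinition = "DocId int")]
        public static IEnumerable Search(SqlString queryString)
        {
            // Placeholder: the real version would call the Lucene search
            // service (over WCF or similar) and decompress the ID list.
            return GetMatchingIds(queryString.Value);
        }

        public static void FillRow(object row, out SqlInt32 docId)
        {
            docId = new SqlInt32((int)row);
        }

        private static IEnumerable<int> GetMatchingIds(string queryString)
        {
            yield break; // stub
        }
    }

On the T-SQL side the idea is then just to select from that function into an
indexed temp table (or table variable) and join it to the metadata when
building the statistical reports.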

I have no idea how such a system would perform.  I do hope to do some
feasibility tests sometime Real Soon (like in the next few weeks), and I'll
post my results if I manage to get it working.

If anyone has any other suggestions, please do share.

On Thu, Nov 12, 2009 at 8:43 AM, Moray McConnachie
<mmcco...@oxford-analytica.com> wrote:

> >While we're discussing this, anyone have any advice or suggestions for
> a better solution?  We've considered a few things for our long-term
> solution.
>
> I'd be very interested to hear thoughts on intersecting SQL and Lucene
> too, as in our case we have very large lists of organisations which have
> different permissions (stored in SQL) for different documents stored in
> Lucene. Showing in the search results only those documents to which the
> organisation has permission is quite expensive for queries with lots of
> results.  Storing the permissions in the documents themselves is not
> manageable, because permission changes have to be applied frequently
> across multiple documents.  Currently we
> precompute a list for each organisation, cache that in memory, and
> recache it every time that organisation is updated. However this too is
> costly.
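>
> Roughly speaking, the cached list is just a set of Lucene doc numbers per
> organisation that we intersect with the raw hits, along these lines (the
> type and method names here are only illustrative):
>
>     using System.Collections.Generic;
>     using System.Linq;
>
>     static class PermissionFilter
>     {
>         // organisation ID -> Lucene doc numbers that organisation may see;
>         // rebuilt whenever that organisation's permissions change.
>         static readonly Dictionary<int, HashSet<int>> PermittedDocs =
>             new Dictionary<int, HashSet<int>>();
>
>         public static IEnumerable<int> FilterHits(int orgId,
>                                                   IEnumerable<int> hitDocIds)
>         {
>             HashSet<int> permitted;
>             return PermittedDocs.TryGetValue(orgId, out permitted)
>                 ? hitDocIds.Where(permitted.Contains)
>                 : Enumerable.Empty<int>();
>         }
>     }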
>
> Storing the Lucene document nos in SQL during indexing, and then passing
> the list of Lucene document nos matching a search to SQL for filtering
> seems the right way to go. But Matt is right, the problem is with
> scaling this to searches returning many thousands of documents.
>
> Yours,
> M.
> -------------------------------------
> Moray McConnachie
> Director of IT    +44 1865 261 600
> Oxford Analytica  http://www.oxan.com
>
> -----Original Message-----
> From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> Sent: 12 November 2009 13:53
> To: lucene-net-user@incubator.apache.org
> Subject: Re: FieldLookup for field with multiple values
>
> Yeah, it is sort of like your standard faceting scenario, except there
> are about 20,000 facets (organizations), and there are complex
> relationships among the facets.
>
> The reports we're dealing with only occasionally break the funding up by
> organization, so we decided (for now) to just store a single funding
> value, then break it up after-the-fact by dividing it by the number of
> organizations.  So no, the funding is only stored once.
>
> While we're discussing this, anyone have any advice or suggestions for a
> better solution?  We've considered a few things for our long-term
> solution.
> One is to put this metadata in a SQL Server instance, and use SQL CLR to
> build a temporary table based on document IDs from a Lucene index
> (hosted over WCF or something similar), then do the reporting within SQL
> Server.  We plan to compress the list of IDs going back from Lucene to
> SQL Server to cut down on IO overhead, but we're still concerned that
> approach won't scale as we go from hundreds of thousands to millions of
> reports.
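>
> For the compression, one simple option would be to delta-encode the sorted
> IDs and run them through GZipStream; a sketch (the class name is just
> illustrative):
>
>     using System.IO;
>     using System.IO.Compression;
>
>     static class IdListCompressor
>     {
>         // Delta-encode a sorted list of doc IDs and gzip the result, so
>         // the mostly-small gaps between IDs compress well on the wire.
>         public static byte[] Compress(int[] sortedIds)
>         {
>             using (var buffer = new MemoryStream())
>             {
>                 using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
>                 using (var writer = new BinaryWriter(gzip))
>                 {
>                     writer.Write(sortedIds.Length);
>                     int previous = 0;
>                     foreach (int id in sortedIds)
>                     {
>                         writer.Write(id - previous); // write the gap, not the ID
>                         previous = id;
>                     }
>                 }
>                 return buffer.ToArray();
>             }
>         }
>     }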
>
> Another option we've discussed is to precompute data cubes and use these
> to calculate reporting information.  The concern here is the high
> dimensionality of the data (we have about 20,000 distinct organizations
> now, but fully expect that to increase by an order of magnitude) as well
> as the accuracy of the generated reports, since there's (probably) not a
> good way to divide the cube based on arbitrary Lucene queries.
>
> On Thu, Nov 12, 2009 at 1:03 AM, Michael Garski
> <mgar...@myspace-inc.com> wrote:
>
> > Sounds like a full-text search with the results simply being facets on
> > the organizations sorted by the funding amount?
> >
> > You mentioned adding the org ID once for each document.  Do you do the
> > same for the funding, with the funding for each corresponding
> > organization?
> >
> > Michael
> >
> >
> > -----Original Message-----
> > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > Sent: Wed 11/11/2009 10:17 PM
> > To: lucene-net-user@incubator.apache.org
> > Subject: Re: FieldLookup for field with multiple values
> >
> > Well, let me prefix what I'm about to describe by saying that I know
> > that I'm doing something with Lucene that it wasn't meant to do.  This
> > is for a "proof of concept" system that I'm helping put together on a
> > tight schedule with very limited resources, and we're trying to get to
> > a mostly-working state as quickly as possible.
> >
> > That said, we are basically storing reports in Lucene.  The reports
> > are fairly standard documents for the most part: they have a title,
> > body, abstract, etc, all of which we index and search with Lucene.
> > However, they also have a few fields that aren't standard, including a
> > list of involved organizations as well as a dollar amount for each
> > report.  The organizations are stored as IDs, and we add the org ID
> > field multiple times, once for each organization involved in the
> > report.  The funding is also stored as a non-indexed field on the
> > Lucene document.
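> >
> > As a rough sketch, the indexing side looks something like this (2.x-era
> > Lucene.Net API and made-up field names; newer releases spell the Index
> > enum values ANALYZED/NOT_ANALYZED):
> >
> >     using Lucene.Net.Documents;
> >
> >     static class ReportIndexing
> >     {
> >         // Build one Lucene document per report.
> >         public static Document BuildDocument(string title, string body,
> >                                              string[] orgIds, decimal funding)
> >         {
> >             Document doc = new Document();
> >             doc.Add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
> >             doc.Add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
> >
> >             // The orgId field is added once per organization involved.
> >             foreach (string orgId in orgIds)
> >                 doc.Add(new Field("orgId", orgId, Field.Store.YES, Field.Index.UN_TOKENIZED));
> >
> >             // The funding amount is stored on the document but not indexed.
> >             doc.Add(new Field("funding", funding.ToString(), Field.Store.YES, Field.Index.NO));
> >             return doc;
> >         }
> >     }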
> >
> > What I'm trying to do is build a quick-and-dirty org-by-dollar report
> > off of the reports that match the user's query.  So, a query for
> > "aerospace" might match 50,000 documents, and I want to show the user
> > the top 5 organizations in terms of dollars.  Again, I know reporting
> > like this isn't what Lucene was meant for, and we do have some ideas
> > on how to handle it long-term, but for now, I'm trying to get it
> > working as well as I can using Lucene alone, and Lucene does do a
> > great job of finding the relevant set of documents to build a report
> > from.
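> >
> > The straightforward (if slow) version of that aggregation is just a sum
> > over the matching documents, roughly like the sketch below (field names
> > as above; the matching doc IDs come from whatever search call you use,
> > and funding is split evenly when a report involves several
> > organizations):
> >
> >     using System.Collections.Generic;
> >     using System.Linq;
> >     using Lucene.Net.Documents;
> >     using Lucene.Net.Search;
> >
> >     static class OrgFundingReport
> >     {
> >         // Sum funding per organization across the matching documents and
> >         // return the top few organizations by total dollars.
> >         public static IEnumerable<KeyValuePair<string, decimal>> TopOrganizations(
> >             IndexSearcher searcher, IEnumerable<int> matchingDocIds, int top)
> >         {
> >             var totals = new Dictionary<string, decimal>();
> >             foreach (int docId in matchingDocIds)
> >             {
> >                 Document doc = searcher.Doc(docId);
> >                 string[] orgIds = doc.GetValues("orgId");
> >                 string funding = doc.Get("funding");
> >                 if (orgIds == null || orgIds.Length == 0 || funding == null)
> >                     continue;
> >
> >                 decimal share = decimal.Parse(funding) / orgIds.Length;
> >                 foreach (string orgId in orgIds)
> >                 {
> >                     decimal current;
> >                     totals.TryGetValue(orgId, out current);
> >                     totals[orgId] = current + share;
> >                 }
> >             }
> >             return totals.OrderByDescending(t => t.Value).Take(top);
> >         }
> >     }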
> >
> > On Wed, Nov 11, 2009 at 8:56 PM, Michael Garski
> > <mgar...@myspace-inc.com> wrote:
> >
> > > Matt,
> > >
> > > StringIndex is for use when a field has only one value in it for the
> > > purposes of sorting results, not for tokenized fields with multiple
> > > values.  TermVectors might be a better approach, but for 50K docs,
> > > you'll encounter an IO hit on reading them.
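> > >
> > > If you do go the TermVector route, reading the values back per
> > > document is roughly this (2.x API; assumes the field was indexed with
> > > Field.TermVector.YES, and "orgId" is just a placeholder for your
> > > field name):
> > >
> > >     using Lucene.Net.Index;
> > >
> > >     static class OrgIdReader
> > >     {
> > >         // Returns the distinct org IDs recorded in the doc's term vector.
> > >         public static string[] GetOrgIds(IndexReader reader, int docId)
> > >         {
> > >             TermFreqVector vector = reader.GetTermFreqVector(docId, "orgId");
> > >             return vector == null ? new string[0] : vector.GetTerms();
> > >         }
> > >     }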
> > >
> > > I'm curious why you are looking to grab all of the terms for a
> > > ScoreDoc...  can you shed some light on that?
> > >
> > > Michael
> > >
> > > -----Original Message-----
> > > From: Matt Honeycutt [mailto:mbhoneyc...@gmail.com]
> > > Sent: Wednesday, November 11, 2009 4:57 PM
> > > To: lucene-net-user@incubator.apache.org
> > > Subject: FieldLookup for field with multiple values
> > >
> > > It seems that the StringIndex returned by
> > > FieldCache.Fields.Default.GetStringIndex() only indexes one value
> > > for a document even when the document has multiple values for the
> > > field.  Is there a performant way to get all the values for a
> > > particular field in a ScoreDoc?  I'm having to do this across the
> > > entire result set of ScoreDocs (up to 50,000), and retrieving the
> > > values through LuceneDocument.GetFields is not going to cut it.
