Modelling Relational Lucene Index

Harini Raghavan Wed, 27 Dec 2006 08:47:29 -0800

Hi Erick,

Thank you for the detailed response.

First, I would like to mention that my application already indexes the company id & name with each article, for the following reasons:

1. A search interface where we search across articles and companies.

2. Paging - I need to page the results after loading the hits, so I don't want to separate the text search from the article-company matching logic. I want to load the articles using one single Lucene query.

I am using a MySQL database to store the relations. But since I need to search across companies & keywords in articles, I am also storing the company name and id in the index. Option 3 looks good to me, but I am concerned about degrading the performance of the existing system if I make the search a two-step process.
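
The single-query requirement described above can be sketched roughly as follows. This is plain Python standing in for the index, not the Lucene API; the `articles` data, the `company_ids` field, and the `search_page` helper are all hypothetical names chosen for illustration. The point is only that when company ids sit on the article document, keyword matching, company matching, and paging happen in one pass.

```python
# Hypothetical article documents: each carries its own set of company ids,
# so no second lookup is needed to apply the company restriction.
articles = [
    {"artid": "a1", "text": "acme buys widget corp", "company_ids": {"c10", "c20"}},
    {"artid": "a2", "text": "widget sales fall", "company_ids": {"c20"}},
    {"artid": "a3", "text": "acme hires new chief", "company_ids": {"c10"}},
    {"artid": "a4", "text": "widget recall widens", "company_ids": {"c20", "c30"}},
]


def search_page(articles, keyword, company_id, page, page_size):
    """Return (artids on the requested page, total hit count).

    Both conditions are checked against the same document, which is the
    stand-in here for a single combined query. Pages are numbered from 0.
    """
    keyword = keyword.lower()
    hits = [
        a["artid"]
        for a in articles
        if keyword in a["text"].lower().split() and company_id in a["company_ids"]
    ]
    start = page * page_size
    return hits[start:start + page_size], len(hits)


first_page, total = search_page(articles, "widget", "c20", page=0, page_size=2)
second_page, _ = search_page(articles, "widget", "c20", page=1, page_size=2)
```

Because the total hit count comes from the same pass, page boundaries stay stable without any join between the text hits and the company mapping.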


However, I will try to evaluate your suggestions in detail.

Thank you again,
Harini

Erick Erickson wrote:

First, it probably would have been a good thing to start a new thread on
this topic, since it's only vaguely related to disk space <G>...

That said, sure. Note that there's no requirement in Lucene that all
documents in an index have the same fields. Also, there's no reason you
can't use two separate indexes. Finally, you have to think about how many
times you are going to add or update a given article when choosing your
approach. Here are several possibilities.

1> Add a field (tokenized) to each article in your index that contains
IDs of the companies you want to associate with that article. The
downside here is that you need to delete and re-add the document every
time you want to add a company to that article.

2> Create a separate index that contains that relationship.

3> Have two kinds of documents in your index, one that indexes articles
and one that relates those to companies. Something like this:

Articles are indexed with "text" and "artid" fields. (NOTE: artid is NOT
the Lucene document ID; those change.)
Relations are indexed with "id" and "company id" fields.

id and artid are your relationship. You *don't* want to name the field
the same for both kinds of documents since they would be indexed together.

Now, given a search over some text, you get back a bunch of article IDs.
You then search on the id field of the relations documents to extract
company id fields.
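
The two-step lookup in option 3 can be sketched roughly like this. It is a toy in-memory inverted index written in plain Python, not the Lucene API; `TinyIndex`, `companies_for_text`, and the sample data are hypothetical, and the relation field is spelled `company_id` here only because a Python keyword argument cannot contain a space.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index mapping (field, term) to a set of document numbers."""

    def __init__(self):
        self.docs = []                    # stored fields, by document number
        self.postings = defaultdict(set)  # (field, term) -> document numbers

    def add(self, **fields):
        doc_num = len(self.docs)
        self.docs.append(fields)
        for field, value in fields.items():
            # Lower-cased whitespace tokens stand in for a real analyzer.
            for term in str(value).lower().split():
                self.postings[(field, term)].add(doc_num)
        return doc_num

    def search(self, field, term):
        doc_nums = self.postings.get((field, term.lower()), ())
        return [self.docs[n] for n in sorted(doc_nums)]


index = TinyIndex()
# Article documents carry "text" and "artid".
index.add(text="Acme buys Widget Corp", artid="a1")
index.add(text="Widget sales fall", artid="a2")
# Relation documents carry "id" (the article's artid) and "company_id".
index.add(id="a1", company_id="c10")
index.add(id="a1", company_id="c20")
index.add(id="a2", company_id="c20")


def companies_for_text(index, term):
    # Step 1: the text search yields article IDs.
    article_ids = [doc["artid"] for doc in index.search("text", term)]
    # Step 2: each article ID is looked up in the relation documents.
    return {
        aid: sorted(rel["company_id"] for rel in index.search("id", aid))
        for aid in article_ids
    }


result = companies_for_text(index, "widget")
```

Because the article side uses `artid` and the relation side uses `id`, a search on one field never returns the other kind of document. Adding a company to an article means adding one small relation document rather than deleting and re-adding the article.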

You may be able to do some interesting things with termdocs/termenums to
make this efficient, but don't go there unless you need to.

At this point, though, I've got to ask if you have access to a database
in your application. If you do, why not store the relations there? Lucene
is a text-search engine, not a relational database. This kind of relation
may be perfectly valid to implement in Lucene, but you want to be careful
if you find yourself trying to do any more RDBMS-like things.

Best
Erick

On 12/26/06, Harini Raghavan <[EMAIL PROTECTED]> wrote:


Hi,

I have another related problem. I am adding news articles for a company
to the Lucene index. As of now, if an article is mapped to more than
one company, it is added to the index once per company. As the number of
companies mapped to each article increases, this will not be a scalable
implementation, as documents will be duplicated in the index. Is there a
way to model the Lucene index in a relational way such that the articles
can be stored in an index and the article-company mapping can be modelled
separately?

Thanks,
Harini


Harini Raghavan
Software Engineer
Office : +91-40-23556255
[EMAIL PROTECTED]
we think, you sell
www.InsideView.com

InsideView


