First, it probably would have been a good thing to start a new thread on this topic, since it's only vaguely related to disk space <G>...
That said, sure. Note that there's no requirement in Lucene that all documents in an index have the same fields. Also, there's no reason you can't use two separate indexes. Finally, you have to think about how many times you are going to add or update a given article when choosing your approach. Here are several possibilities:

1> Add a (tokenized) field to each article in your index that contains the IDs of the companies you want to associate with that article. The downside here is that you need to delete and re-add the document every time you want to add a company to that article.

2> Create a separate index that contains that relationship.

3> Have two kinds of documents in your index, one that indexes articles and one that relates those articles to companies. Something like this: articles are indexed with "text" and "artid" fields (NOTE: artid is NOT the Lucene document ID; those change), and relations are indexed with "id" and "companyid" fields. id and artid are your relationship. You *don't* want to use the same field name for both kinds of documents, since the two would be indexed together. Now, given a search over some text, you get back a bunch of article IDs. You then search on the id field of the relation documents to extract the companyid fields (see the sketch below). You may be able to do some interesting things with TermDocs/TermEnum to make this more efficient, but don't go there unless you need to.
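For what it's worth, here is a minimal sketch of option 3> against the Lucene 1.9/2.0 API (the index path, the ID values, and the class name are made up for illustration):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ArticleCompanyDemo {
    public static void main(String[] args) throws IOException {
        IndexWriter writer =
            new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);

        // Article document: tokenized text plus a stable application-level
        // ID (NOT the Lucene doc ID, which changes on merges).
        Document article = new Document();
        article.add(new Field("text", "Acme Corp ships a new widget",
                Field.Store.NO, Field.Index.TOKENIZED));
        article.add(new Field("artid", "42",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(article);

        // Relation document: maps that article to one company. It uses
        // "id", not "artid", so the two document types never mix.
        Document rel = new Document();
        rel.add(new Field("id", "42",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        rel.add(new Field("companyid", "acme-7",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(rel);
        writer.close();

        // Step 1: a text search returns article IDs.
        // Step 2: look up relation docs by those IDs to get company IDs.
        IndexSearcher searcher = new IndexSearcher("/tmp/demo-index");
        Hits articles =
            searcher.search(new TermQuery(new Term("text", "widget")));
        for (int i = 0; i < articles.length(); i++) {
            String artid = articles.doc(i).get("artid");
            Hits rels = searcher.search(new TermQuery(new Term("id", artid)));
            for (int j = 0; j < rels.length(); j++) {
                System.out.println("article " + artid
                        + " -> company " + rels.doc(j).get("companyid"));
            }
        }
        searcher.close();
    }
}

Note that mapping article 42 to a second company is just one more tiny relation document, so the article itself never has to be deleted and re-added.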
At this point, though, I've got to ask whether you have access to a database in your application. If you do, why not store the relations there? Lucene is a text-search engine, not a relational database. This kind of relation may be perfectly valid to implement in Lucene, but you want to be careful if you find yourself trying to do any more RDBMS-like things.

Best
Erick

On 12/26/06, Harini Raghavan <[EMAIL PROTECTED]> wrote:

Hi,

I have another related problem. I am adding news articles for a company to the Lucene index. As of now, if an article is mapped to more than one company, it is added to the index that many times. As the number of companies mapped to each article increases, this will not be a scalable implementation, since documents will be duplicated in the index. Is there a way to model the Lucene index in a relational way, such that the articles are stored once and the article-company mapping is modeled separately?

Thanks,
Harini

Mark Miller wrote:
> A Searcher uses a Reader to read the index for searching.
>
> - Mark
>
> Harini Raghavan wrote:
>
>> Hi Mike,
>>
>> Thank you for the response. I don't have readers open on the index,
>> but while the optimize/merge was running I was searching on the
>> index. Would that make any difference?
>> Also, after optimizing the index I had some .tmp files which were
>> > 10G and did not get merged. Could that also be related to having
>> searchers open while running optimize?
>>
>> -Harini
>>
>> Michael McCandless wrote:
>>
>>> Harini Raghavan wrote:
>>>
>>>> I am using Lucene 1.9.1 for search functionality in my J2EE
>>>> application, using JBoss as the app server. The Lucene index
>>>> directory size is almost 20G right now. There is a Quartz job that
>>>> adds data to the index every minute, and around 20,000 documents
>>>> get added to the index every day. When the documents are added and
>>>> the segments are merged, the index size increases and sometimes
>>>> grows to more than double its original size. This results in
>>>> filling up the disk space. We have allotted a f/s size of 50G and
>>>> even that is not sufficient at times. Is there an optimum value
>>>> for the f/s size to be allotted in such a scenario?
>>>> Any suggestions would be appreciated.
>>>
>>> I believe optimize should use at most 2X the starting index size,
>>> transiently, if there are no readers open against the index. And then
>>> when optimize is done the size should be around the starting size, or
>>> less.
>>>
>>> If there are open readers against the index when the optimize occurs,
>>> then the segments that were merged cannot actually be deleted until
>>> those readers close. Even on Unix, where it will look like the
>>> segments were deleted, they are still consuming disk space because
>>> open file handles keep them allocated ("delete on last close").
>>>
>>> This means if you have open readers you should see at most 3X the
>>> starting index size. Worse, if some (but not all) readers are
>>> re-opening while the merge is underway, it's possible to peak at even
>>> more than 3X the starting size.
>>>
>>> Do you have readers running against your index?
>>>
>>> I will call this out in the javadocs for optimize, addDocument,
>>> addIndexes...
>>>
>>> Mike

--
Harini Raghavan
Software Engineer
Office : +91-40-23556255
[EMAIL PROTECTED]
we think, you sell
www.InsideView.com
InsideView
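A footnote on Mike's point above: the practical fix is to close (or stop re-opening) searchers for the duration of the optimize, which caps the transient disk usage at roughly 2X the index size. Here is a rough sketch against the Lucene 1.9/2.0 API (the class, method, and path names are illustrative, not part of the thread):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;

public class OptimizeJob {
    // Close the searcher before optimizing so the OS can actually
    // reclaim the merged-away segments ("delete on last close"),
    // keeping peak disk usage near 2X the index size rather than 3X+.
    static IndexSearcher optimizeAndReopen(IndexSearcher searcher,
                                           String indexDir)
            throws IOException {
        searcher.close();  // release handles on soon-to-be-deleted segments

        IndexWriter writer =
            new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.optimize();  // transiently uses up to ~2X the starting size
        writer.close();

        return new IndexSearcher(indexDir);  // re-open on the merged index
    }
}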