Re: Combining results of multiple indexes

Erick Erickson Wed, 17 Dec 2008 10:21:53 -0800

Well, maybe if I'd read the original post more carefully I'd have figured
that out, sorry 'bout that.


I *think* I remember reading somewhere on the email lists that your indexing
speed goes up pretty linearly as the number of indexing tasks approaches
the number of CPUs. Are you, perhaps, on a dual-core machine? But do search
the mail archives because my memory may not be accurate.

You can easily combine indexes with IndexWriter.addIndexes, BTW. Personally
I prefer fewer indexes if you can get away with it. But I'd only try this
after Michael's suggestion of using multiple threads on a single underlying
writer.
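
As a rough illustration of the "multiple threads, one writer" idea (a sketch,
not code from this thread): IndexWriter.addDocument is safe to call from
several threads, so the workers can share one writer instead of each owning an
index. To keep the example runnable without Lucene on the classpath, DocSink
and CountingSink below are hypothetical stand-ins for the shared IndexWriter,
and the events are plain strings rather than Ev objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

public class SharedWriterDemo {

    /** Hypothetical stand-in for a thread-safe writer such as Lucene's IndexWriter. */
    interface DocSink {
        void addDocument(String doc) throws Exception;
    }

    /** Only counts documents; real code would call indexWriter.addDocument(...) here. */
    static final class CountingSink implements DocSink {
        final AtomicLong count = new AtomicLong();

        @Override
        public void addDocument(String doc) {
            count.incrementAndGet();
        }
    }

    /** Feeds every event to the single shared sink from nThreads worker threads. */
    static void indexAll(List<String> events, DocSink sink, int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int offset = t;
                // Worker t handles events t, t + nThreads, t + 2 * nThreads, ...
                pending.add(pool.submit(() -> {
                    for (int i = offset; i < events.size(); i += nThreads) {
                        sink.addDocument(events.get(i));
                    }
                    return null;
                }));
            }
            // Wait for all workers; get() rethrows any failure from a worker.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            events.add("event-" + i);
        }
        CountingSink sink = new CountingSink();
        // One worker per CPU, following the rule of thumb mentioned above.
        int nThreads = Runtime.getRuntime().availableProcessors();
        indexAll(events, sink, nThreads);
        System.out.println(sink.count.get() + " documents added by " + nThreads + " threads");
    }
}
```

With a real IndexWriter, the sink would build the Document and call
addDocument on the one shared writer, so all fields land in a single index and
an AND query across them needs no cross-index join.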

You could even think about using N machines to create M fragments then
combining them all afterwards if your logs are static enough to make that
reasonable. Combining indexes may take a while though.....

Best
Erick

On Wed, Dec 17, 2008 at 10:46 AM, Preetham Kajekar <preet...@cisco.com> wrote:

> Hi Erick,
> Thanks for the response. Replies inline.
>
> Erick Erickson wrote:
>
>> The very first question is always "are you opening a new searcher
>> each time you query"? But you've looked at the Wiki so I assume not.
>> This question is closely tied to what kind of latency you can tolerate.
>>
>> A few more details, please. What's slow? Queries? Indexing?
>>
>>
> Indexing. Again, it is not slow. It is just faster with two separate
> indexers in two threads.
>
>> How slow? 100ms? 100s? What are your target times and
>> what are you seeing?
>>
>>
> With a single indexer in a single thread, I can index about 20,000 event
> objects per second. With 2 threads and 2 indexers, it is close to 50,000. :-)
>
>> How big is your index? 100M? 100G? What kind of VM
>> parameters are you specifying?
>>
>>
> The index will have about 20 million entries. The size of the index ends up
> being about 500M.
> I start the VM with 1G of heap. No other options for GC etc. are used.
>
>> As an aside, do note that there's no requirement in Lucene that
>> each document have the same fields, so it's unclear why you
>> need two indexes, but perhaps some of the answers to the above
>> will help us understand.
>>
>>
> Like I mentioned, Lucene does the job much faster with two indexes.
>
>> Also, be very very careful what you measure when you measure
>> queries. You absolutely *have* to put some instrumentation in
>> the code since "slow queries" can result from things other than
>> searching. For instance, iterating over a Hits object for 100s of
>> documents....
>>
>>
> The query speeds are much faster than what I need :-) So no complaints here.
>
>> Show the code, man <G>!
>>
>>
> Code below. EvIndexer is the base class. There are two subclasses which
> implement addEvFieldsToIndexDoc() (template pattern) to add different fields
> to the index. That code is also pasted below.
>
> --- Code ---
>
> Base class:
>
>   public EvIndexer(String indexName) throws Exception {
>       this.name = indexName;
>       a = new KeywordAnalyzer();
>       INDEX_PATH = System.getProperty(StoreManager.PROP_DB_DB_LOC, "./index/");
>       FSDirectory directory = FSDirectory.getDirectory(
>               INDEX_PATH + File.separatorChar + indexName,
>               NoLockFactory.getNoLockFactory());
>       indexWriter = new IndexWriter(directory, a,
>               IndexWriter.MaxFieldLength.LIMITED);
>       //indexWriter.setUseCompoundFile(false);
>       //indexWriter.setRAMBufferSizeMB(256);
>   }
>
>   /**
>    * Method implemented by extending classes to add data into the
>    * index document for the given event.
>    *
>    * @param d
>    */
>   protected abstract void addEvFieldsToIndexDoc(Document d, Ev event);
>
>   public void addToIndex(Ev ev) throws Exception {
>       noOfEventsIndexed++;
>       Document d = new Document();
>       addEvFieldsToIndexDoc(d, ev);
>       indexWriter.addDocument(d);
>       // Commit once every COMMIT_INTERVAL events.
>       if ((noOfEventsIndexed % COMMIT_INTERVAL) == 0) {
>           System.out.println(name + " indexed "
>                   + NumberFormat.getInstance().format(noOfEventsIndexed)
>                   + " Committing them");
>           commit();
>       }
>   }
>
> Derived class 1:
>
>   protected void addEvFieldsToIndexDoc(Document d, Ev ev) {
>       //noOfEventsIndexed++;
>       // Only the id is stored; the other fields are indexed but not stored.
>       Field id = new Field(EV_ID, Long.toString(ev.getId()),
>               Field.Store.YES, Field.Index.NO);
>       Field src = new Field(EV_SRC, Long.toString(ev.getSrcId()),
>               Field.Store.NO, Field.Index.NOT_ANALYZED);
>       Field type = new Field(EV_TYPE, Integer.toString(ev.getEventTypeId()),
>               Field.Store.NO, Field.Index.NOT_ANALYZED);
>       Field pri = new Field(EV_PRI, Short.toString(ev.getPriority()),
>               Field.Store.NO, Field.Index.NOT_ANALYZED);
>       Field time = new Field(EV_TIME, getHexString(ev.getRecvTime()),
>               Field.Store.NO, Field.Index.NOT_ANALYZED);
>       d.add(id);
>       d.add(src);
>       d.add(type);
>       d.add(pri);
>       d.add(time);
>       //noOfFieldsIndexed +=  4;
>   }
>
>
>
>
> Thanks for the support.
> ~preetham
>
>
>> Best
>> Erick
>>
>>
>> On Wed, Dec 17, 2008 at 9:40 AM, Preetham Kajekar <preet...@cisco.com> wrote:
>>
>>
>>
>>> Hi Grant,
>>> Thanks for your response. Replies inline.
>>>
>>> Grant Ingersoll wrote:
>>>
>>>
>>>
>>>> On Dec 17, 2008, at 12:57 AM, Preetham Kajekar wrote:
>>>>
>>>>> Hi,
>>>>
>>>>
>>>>> I am new to Lucene. I am not using it as a pure text indexer.
>>>>>
>>>>> I am trying to index a Java object which has about 10 fields (like id,
>>>>> time, srcIp, dstIp) - most of them being numerical values.
>>>>> In order to speed up indexing, I figured that having two separate
>>>>> indexers, each of them indexing a different set of fields, works great.
>>>>> So I have the first 5 fields in index1 and the remaining in index2.
>>>>>
>>>>>
>>>>>
>>>> Can you explain this a bit more?  Are those two fields really large or
>>>> something?  How are you obtaining them?  How are you correlating the
>>>> documents between the two indexes?  Did you actually try a single index
>>>> and it was too slow?
>>>>
>>>>
>>>>
>>> I have a Java object which has about 10 fields. However, the fields are
>>> not fixed. The Java object is essentially a representation of syslogs from
>>> network devices, so different syslogs have different fields. Each field
>>> has a unique id and a value (mostly numeric types, so I convert it to a
>>> string). There are some fixed fields. So the object is a list of fields
>>> which is produced by a parser.
>>> I am trying to index using two indexers in two separate threads: one for
>>> the fixed and another for the non-fixed fields. Except for a unique id, I
>>> do not store the fields in Lucene; I just index them. From the index, I
>>> get the unique id, which is all I care about (the objects are stored
>>> elsewhere and can be looked up based on this unique id).
>>> I did try using a single indexer, but things were quite slow. Getting
>>> high throughput is crucial and having two indexers seemed to do very well
>>> (more than twice as fast).
>>>
>>> Further, the index will never be modified and I can have just one thread
>>> writing to the index. Any other performance tips would be very helpful.
>>> I have already looked at the wiki link regarding performance and am
>>> using some of the suggestions.
>>>
>>> Thanks,
>>> ~preetham
>>>
>>>
>>>
>>>
>>>>> Now, I want to run boolean AND queries looking for values in both
>>>>> indexes, like f1=1234 AND f7=ABCD, where f1 and f7 are present in two
>>>>> separate indexes. Would using the MultiIndexReader help? Since I am
>>>>> doing an AND, I don't expect that it would work.
>>>>>
>>>>> Thanks,
>>>>> ~preetham
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>>
>>>> Lucene Helpful Hints:
>>>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>>>
