Re: stream of events never to know when it ends? how to index such things & search

Christian Brennsteiner Thu, 19 Feb 2009 05:25:59 -0800

hi erick,

nr of events are 107/sec in average with 400/sec peak and 20/sec low.
between searchable should be less than 20 minutes. we are planning to
index IN RAM only for a duration of one day MAX. per lucene process on
the operating system.


currently we need 500 M RAM for indexing one day (just storing the
eventids and indexing (without storing) highly redundant event
descriptions. collecting all eventdescriptions costs us additionally
3G ram (which is very much :-( for us.)

@PositionIncrementGap or SpanQueries or the proximity operator ...
sorry i am bloody beginner i don't really kow what you are talking
about.

a real update would be perfect... but i think from the current design
it is not possible to extract all unstemmed keywords from a HIT? or is
this possible?
update then would be:

search eventid
get hits (should be one)
extract all keywords from hit
add new information plus hits newly to the index
delete the hit.

is there a possibility to gather detailed information about the index
itself, that i can give you a detailed idea how big / and in which
condition it is?

regards chris





On Wed, Feb 18, 2009 at 5:38 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> You could always sort by EVENTID, that way at least
> you'd have all the events for a particular ID together
> in your results. You'd have to post-filter the results to
> determine whether all the necessary descriptions were
> present. But I don't think this works all that well because,
> as you pointed out, you may have a lot of records to
> sort through so I don't think this is a very good idea...
>
>
>
> How many events are we talking about here and what
> kind of lag between an event and being able to search it
> can you tolerate? I guess what I'm really asking is whether
> it's possible to recreate your index "often enough" to
> satisfy your users. If so, you can index multiple
> descriptions in a single document, something like
>
> doc.add("EVENTDESCRIPTION", "STARTING EVENT")
> doc.add("EVENTDESCRIPTION", "XYZ")
> doc.add("EVENTDESCRIPTION", "ABC")
> doc.add("EVENTID", "1")
> IndexWriter.addDocument(doc);
>
>
> You'd have to gather all the descriptions related
> to each EVENTID before you were able to index the doc.....
>
> By manipulating the PositionIncrementGap you could also
> keep searches from matching across different EVENTDESCRIPTIONs,
> e.g. if you didn't want to match +STARTING +ABC you could use
> SpanQueries or the proximity operator, but going into details
> depends upon whether you can rebuild your index so we'll defer
> that part....
>
> You could also think about updating the document when new events
> were added, but since an update is really a delete/add under the
> covers you'd have to either gather enough information from what I
> assume is your log or store enough information with the document to
> recreate it.
>
> How big is your index currently and what kind of throughput do you
> require?
>
> Best
> Erick
>
>
> On Wed, Feb 18, 2009 at 10:20 AM, Christian Brennsteiner <
> christ...@brennsteiner.at> wrote:
>
>> dear lucene community,
>>
>> i am playing around with lucene right now. and have come to very bad
>> problem.
>>
>> given environment:
>>
>> a signal source gives signals with eventids ans eventdescriptions
>>
>> for example EVENTID=1 and EVENTDESCRIPTION="STARTING EVENT"
>>
>> those events can be running very long (e.g. one month) during this
>> period we will receive for example
>>
>> EVENTID=1 and EVENTDESCRIPTION="EXECUTING XYZ"
>> 10 minutes later
>> EVENTID=1 and EVENTDESCRIPTION="EXECUTING YZA"
>> 10 minutes later
>> EVENTID=1 and EVENTDESCRIPTION="PASSED MILESTONE1"
>> 10 minutes later
>> EVENTID=1 and EVENTDESCRIPTION="EXECUTING ZAB"
>>
>> after e.g. 1 week we receive
>> EVENTID=1 and EVENTDESCRIPTION="STOPING EVENT"
>>
>> what i want:
>> i want to be able to search e.g. which eventids are connected to "XYZ"
>> AND "ZAB" AND have already passed "MILESTONE1"
>>
>> so my current try is to index all events by full indexing (without
>> storing) eventdescriptions AND stemming e.g. EXECUTING
>>
>> then searching for "+XYZ +ZAB +MILESTONE1"
>> --> result no document since those are all seperated documents
>> when i search
>>  "XYZ ZAB MILESTONE1"
>> i am getting 3 times EVENTID 3
>> --> this is bad since when i get 1000000 of such events how do i rank them?
>>
>> CONCLUSION:
>> my biggest problem is that my lucene document given to the index
>> currently is not in a final state BUT i have to index and search it
>> also while it is in progress.
>> as a result of this the ranking as i do it now has no real value since
>> the ranking is just based on a "line of a whole event"
>>
>> QUESTION:
>> is there a solution within lucene to combine search results? e.g. merge
>> them OR
>> is there a better workaround how i would do such updates to the index
>> without storing the original docmuent inside the index (since this
>> consumes so many space)? e.g. extracting the keywords that were stored
>> for the item?
>>
>> any hints appreciated.
>>
>> regards chris
>>
>>
>> ----------
>> Christian Brennsteiner
>> Salzburg / Austria / Europe
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>



-- 
----------
Christian Brennsteiner
Salzburg / Austria / Europe

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: stream of events never to know when it ends? how to index such things & search

Reply via email to