Re: one huge index or many small ones?

javier muguruza Thu, 04 Nov 2004 11:15:10 -0800

Justin, 

Yes, I wanted as little info as possible in the index. The body and
attachments will be stored outside lucene. As I mentioned, I only
need lucene for the body/attachment contents; from, to, subject,
dates etc. are dealt with beforehand. My idea was:


Unstored: Body + attachment (after extracting text)
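
As a rough sketch of what would go into that single unstored field (plain Java, no lucene calls; the class and method names are made up for illustration, and the text is assumed to have been extracted from the attachments already):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the one text value that would be fed to
// the single unstored body+attachments field of an email's document.
class EmailTextSketch {

    // Joins the body and the already-extracted attachment texts.
    // Attachment boundaries are deliberately not preserved, since a hit
    // only has to identify the email, not the attachment.
    static String indexableText(String body, List<String> attachmentTexts) {
        StringBuilder sb = new StringBuilder(body == null ? "" : body);
        for (String text : attachmentTexts) {
            if (text == null || text.isEmpty()) {
                continue; // nothing could be extracted from this attachment
            }
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = indexableText("Please see the attached invoice.",
                Arrays.asList("Invoice total: 100 EUR", ""));
        System.out.println(text);
    }
}
```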

I don't need to know in which attachment the words I am looking for are;
it's enough to know they are in the email. I will have a look at
MultiSearcher or ParallelMultiSearcher.
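
To make the flow concrete, here is a minimal sketch in plain Java of how I picture it: the database step yields the candidate email ids, each per-period index is searched for the word, and only ids found by both are kept. The PeriodIndex interface is a stand-in I invented for whatever ends up wrapping the real lucene search (presumably a MultiSearcher over the same indexes could take the place of the loop); none of these names come from lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ArchiveSearchSketch {

    // Hypothetical stand-in for one per-period index (a day, a month...).
    interface PeriodIndex {
        Set<String> idsContaining(String word);
    }

    // Keeps only the database-selected ids that some index also reports.
    static Set<String> narrow(Set<String> dbCandidates,
                              List<PeriodIndex> indexes, String word) {
        Set<String> result = new LinkedHashSet<String>();
        for (PeriodIndex index : indexes) {
            for (String id : index.idsContaining(word)) {
                if (dbCandidates.contains(id)) {
                    result.add(id);
                }
            }
        }
        return result;
    }

    // Toy in-memory index so the sketch runs without lucene; it returns
    // the same fixed ids whatever the word is.
    static PeriodIndex fixed(final String... ids) {
        return new PeriodIndex() {
            public Set<String> idsContaining(String word) {
                return new LinkedHashSet<String>(Arrays.asList(ids));
            }
        };
    }

    public static void main(String[] args) {
        Set<String> fromDb = new LinkedHashSet<String>(Arrays.asList("m1", "m4"));
        List<PeriodIndex> days = Arrays.asList(fixed("m1", "m2"), fixed("m3", "m4"));
        System.out.println(narrow(fromDb, days, "money")); // prints [m1, m4]
    }
}
```

With one index per day this means one small search per day in the range; the narrowing step stays the same if the time frame per index is coarser.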

thanks!


On Thu, 4 Nov 2004 10:28:18 -0700, Justin Swanhart <[EMAIL PROTECTED]> wrote:
> First off, I think you should make a decision about what you want to
> store in your index and how you go about searching it.
> 
> The less information you store in your index, the better, for
> performance reasons.  If you can store the messages in an external
> database you probably should.  I would create a table that contains a
> clob and an associated id that can be used to get the message at any
> time.
> 
> Assuming mail is in SMTP RFC format:
> 
> I would suggest:
> Unstored: Subject
> Keyword: From
> Keyword: To
> Stored,Unindexed: ID  <-- this would be the ID to the message in your database
> Unstored: Body
> Keyword: Month
> Keyword: Day
> Keyword: Year
> (and any other keywords you might use)
> 
> Your lucene query would then look something like:
> +From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004
> 
> Use the stored ID field to get the message contents from your database.
> 
> If you want to break your index down into multiple indexes, based on
> some criteria such as time frame you could do that too.  You would
> then use a MultiSearcher or ParallelMultiSearcher to process the
> multiple indexes.
> 
> 
> 
> 
> On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza <[EMAIL PROTECTED]> wrote:
> > Thanks Erik and Giulio for the fast reply.
> >
> > I am just starting to look at lucene so forgive me if I got some ideas
> > wrong. I understand your concerns about one index per email. But
> > having one index only is also (I guess) out of the question.
> >
> > I am building an email archive. Email will be kept indefinitely
> > available for search, adding new email every day. Imagine a company
> > with millions of emails per day (been there), keep it growing for
> > years, adding stuff to the index while using it for searches
> > continuously...
> >
> > That's why my idea is to decide on a time frame (a day, a month...an
> > extreme would be an instant, that is a single email, my original idea)
> > and build the index for all the email in that timeframe. After the
> > timeframe is finished no more stuff will ever be added.
> >
> > Before the lucene search, emails are selected based on other conditions
> > (we store the from, to, date etc in the database as well, and these
> > conditions are enforced with a sql query first, so I would not need to
> > enforce them in the lucene search again; also that query can be quite
> > sophisticated and I guess it would not be easy to do in
> > lucene by itself). That first db step gives me a group of emails that
> > maybe I have to further narrow down based on a lucene search (of body
> > and attachment contents). Having one index for more than one email
> > means that after the search I would have to keep only the emails
> > common to the two searches... Maybe this is better than keeping the
> > same info I have in the db in lucene fields as well.
> >
> > An example: I want all the email from [EMAIL PROTECTED] from Jan
> > to Dec containing the word 'money'. I run the db query that returns a
> > list with john's email for that period of time, then (lets assume I
> > have one index per day) I iterate on every day, looking for emails
> > that contain 'money'; from the results returned by lucene I keep only
> > those that are also in the first list.
> >
> > Does that sound better?
> >
> > On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli <[EMAIL PROTECTED]> wrote:
> > > Hi Javier,
> > >
> > > I suggest you build a single index, with all the information you
> > > need to find the right mail you are looking for. You can then use
> > > Lucene alone to find your messages.
> > >
> > > Giulio Cesare
> > >
> > >
> > >
> > >
> > > On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > We are going to move from a just-in-time perl based search to using
> > > > lucene in our project. I have to index emails (bodies and also
> > > > attachments). I keep all the bodies and attachments in the filesystem
> > > > for a long period of time. I have to find emails that fulfil certain
> > > > conditions; some of the conditions are taken care of at a different
> > > > level, so in the end I have a SUBSET of emails I have to run through
> > > > lucene.
> > > >
> > > > I was assuming that the best way would be to create an index for each
> > > > email. Having a single index for a group of emails (say a day's worth
> > > > of email) seems too coarse grained: imagine a day has 10000 emails,
> > > > and some queries will want to look in only a handful of the
> > > > emails... But the problem with having one index per email is the
> > > > massive number of emails... imagine having 100000 indexes.
> > > >
> > > > Anyway, any ideas about that? I just wanted to check whether someone
> > > > feels I am wrong.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------
> > >
> > >
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> >
> >
> >
>

