Justin,

Yes, I want as little info as possible in the index. The body and attachments will be stored outside Lucene. As I mentioned, I only need Lucene for the contents of the body and attachments; from, to, subject, dates etc. are dealt with beforehand. My idea was:
Unstored: Body + attachments (after extracting text)

I don't need to know which attachment the words I am looking for are in; it's enough to know they are in the email. I will have a look at MultiSearcher or ParallelMultiSearcher, thanks!

On Thu, 4 Nov 2004 10:28:18 -0700, Justin Swanhart <[EMAIL PROTECTED]> wrote:
> First off, I think you should make a decision about what you want to
> store in your index and how you go about searching it.
>
> The less information you store in your index, the better, for
> performance reasons. If you can store the messages in an external
> database you probably should. I would create a table that contains a
> CLOB and an associated ID that can be used to get the message at any
> time.
>
> Assuming mail is in SMTP RFC format:
>
> I would suggest:
> Unstored: Subject
> Keyword: From
> Keyword: To
> Stored,Unindexed: ID <-- this would be the ID to the message in your database
> Unstored: Body
> Keyword: Month
> Keyword: Day
> Keyword: Year
> (and any other keywords you might use)
>
> Your Lucene query would then look something like:
> +From:[EMAIL PROTECTED] +(Subject:money Body:money) +Year:2004
>
> Use the stored ID field to get the message contents from your database.
>
> If you want to break your index down into multiple indexes, based on
> some criteria such as time frame, you could do that too. You would
> then use a MultiSearcher or ParallelMultiSearcher to process the
> multiple indexes.
>
> On Thu, 4 Nov 2004 18:03:49 +0100, javier muguruza <[EMAIL PROTECTED]> wrote:
> > Thanks Erik and Giulio for the fast reply.
> >
> > I am just starting to look at Lucene so forgive me if I got some ideas
> > wrong. I understand your concerns about one index per email. But
> > having one index only is also (I guess) out of the question.
> >
> > I am building an email archive. Email will be kept indefinitely
> > available for search, adding new email every day.
> > Imagine a company
> > with millions of emails per day (been there), keep it growing for
> > years, adding stuff to the index while using it for searches
> > continuously...
> >
> > That's why my idea is to decide on a time frame (a day, a month...an
> > extreme would be an instant, that is a single email, my original idea)
> > and build the index for all the email in that time frame. After the
> > time frame is finished nothing more will ever be added.
> >
> > Before the Lucene search, emails are selected based on other conditions
> > (we store the from, to, date etc. in a database as well, and these
> > conditions are enforced with a SQL query first, so I would not need to
> > enforce them in the Lucene search again; also, that query can be quite
> > sophisticated and I guess it would not be easy to do in Lucene by
> > itself). That first db step gives me a group of emails that
> > I may have to narrow down further with a Lucene search (of body
> > and attachment contents). Having an index for more than one email
> > means that after the search I would have to keep only the emails that
> > overlap between the two searches...Maybe this is better than keeping the
> > same info I have in the db in Lucene fields as well.
> >
> > An example: I want all the email from [EMAIL PROTECTED] from Jan
> > to Dec containing the word 'money'. I run the db query that returns a
> > list with john's email for that period of time, then (let's assume I
> > have one index per day) I iterate over every day, looking for emails
> > that contain 'money'; from the results returned by Lucene I keep only
> > those that are also in the first list.
> >
> > Does that sound better?
> >
> > On Thu, 4 Nov 2004 17:26:21 +0100, Giulio Cesare Solaroli
> > <[EMAIL PROTECTED]> wrote:
> > > Hi Javier,
> > >
> > > I suggest you build a single index, with all the information you
> > > need to find the right mail you are looking for. You can then use
> > > Lucene alone to find your messages.
> > >
> > > Giulio Cesare
> > >
> > > On Thu, 4 Nov 2004 17:00:35 +0100, javier muguruza <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > We are going to move from a just-in-time perl based search to using
> > > > Lucene in our project. I have to index emails (bodies and also
> > > > attachments). I keep all the bodies and attachments in the filesystem
> > > > for a long period of time. I have to find emails that fulfil certain
> > > > conditions; some of the conditions are taken care of at a different
> > > > level, so in the end I have a SUBSET of emails I have to run through
> > > > Lucene.
> > > >
> > > > I was assuming that the best way would be to create an index for each
> > > > email. Having a single index for a group of emails (say a day's worth
> > > > of email) seems too coarse grained: imagine a day has 10000 emails,
> > > > and some queries will want to look at only a handful of the
> > > > emails...But the problem with having one index per email is the
> > > > massive number of emails...imagine having 100000 indexes.
> > > >
> > > > Anyway, any ideas about that? I just wanted to check whether someone
> > > > feels I am wrong.
> > > >
> > > > Thanks
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
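
The two-step filtering discussed in this thread (a database query first, then a full-text search over one index per day, keeping only the emails found by both) could be sketched roughly as below. This is only an illustration and does not use Lucene: each per-day index is modelled as a plain dictionary from word to email IDs, and the names search_day_index, filter_candidates, example_day_indexes and example_candidates are all hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for one per-day full-text index:
# maps a lower-cased word to the IDs of emails containing it.
DayIndex = Dict[str, Set[str]]


def search_day_index(day_index: DayIndex, word: str) -> Set[str]:
    """Return the IDs of emails in this day's index that contain `word`."""
    return set(day_index.get(word.lower(), set()))


def filter_candidates(
    candidate_ids: Iterable[str],
    day_indexes: Dict[str, DayIndex],
    word: str,
) -> List[str]:
    """Keep only the candidates (from the database step) that also match `word`."""
    candidates = set(candidate_ids)
    matched: Set[str] = set()
    # Iterate over every day, as described in the thread, and keep the overlap.
    for day in sorted(day_indexes):
        matched |= search_day_index(day_indexes[day], word) & candidates
    return sorted(matched)


# Made-up example data: two per-day indexes and a candidate list that
# would normally come from the SQL query on from/to/date.
example_day_indexes: Dict[str, DayIndex] = {
    "2004-01-05": {"money": {"m1", "m2"}, "lunch": {"m3"}},
    "2004-02-10": {"money": {"m7"}},
}
example_candidates = ["m1", "m7", "m9"]

if __name__ == "__main__":
    print(filter_candidates(example_candidates, example_day_indexes, "money"))
```

With a real search library, only search_day_index would need to change; the intersection with the database results stays the same.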