Re: why Apache doesnt create a nice forum like the others???
Mohammad Norouzi wrote: I registered at Nabble, but to post a message you have to subscribe to the lucene mailing list, and if you subscribe to the mailing list your inbox will fill up with messages. This is very bad! You're using Gmail, aren't you? Why don't you set up a filter to handle mail from the list? If you really don't want to fill up Google's disks you can delete it immediately. For those not using Gmail, any client worth its salt will allow you to automatically process messages in some way. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: index word files ( doc )
Daniel Noll wrote: The only screenshots I can see look like plain text to me, and I'm currently working on something which needs to convert Word to HTML, which is why I ask. wvWare, which I mentioned earlier, can convert Word to HTML and does a pretty good job of maintaining formatting. abiword is better though (because it goes through a different internal representation). jch
Re: why Apache doesnt create a nice forum like the others???
karl wettin wrote: The way I see it (and probably many others) mailing lists are superior in many ways, especially when following multiple forums. It's true. Any forum that I need to subscribe to I find an RSS feed for so that I can get mail messages. Forums are a pain in the neck once you're subscribing to more than a handful -- I can manage all the lists I subscribe to through a single, central, fast interface. jch
Re: why Apache doesnt create a nice forum like the others???
Grant Ingersoll wrote: I like the mailing list approach much better. With a good set of rules and folders in place (which takes about 15 minutes to set up), one can easily manage large volumes of mail w/o batting an eye, whereas forums require large amounts of navigation, IMO. Glad I'm not the only one that thinks that -- I was wondering if there was something about managing forums (I still want to say fora) that I was missing. jch
Re: index word files ( doc )
Sami Siren wrote: There's also antiword [1] which can convert your .doc to plain text or PS, not sure how good it is. antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/) directly, but you may find that using abiword is better for you (abiword is an editor, but it also does conversions and is actually quite fast at them). It also deals with fast-saved Word docs. We used to use POI, but it's a little fragile and people were getting all upset when a Word document gummed up the works. Using an external executable seems to be no slower and is certainly less problematic. jch
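Running an external converter from Java is straightforward with ProcessBuilder. A minimal sketch; the wvText invocation in the comment is an assumption (check the options your wvWare build actually provides), and the demo command in main is just a portable stand-in:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class ExternalConverter {
    /** Run an external converter and capture its stdout as text. */
    public static String run(String... cmd) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectErrorStream(true); // fold stderr into stdout so nothing blocks
        Process p = pb.start();
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) out.append(line).append('\n');
        }
        int status = p.waitFor();
        if (status != 0) throw new IOException("converter exited with " + status);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // With wvWare installed you might call something like
        // run("wvText", "input.doc", "/dev/stdout") -- hypothetical invocation.
        System.out.println(run("echo", "hello"));
    }
}
```

A crash in the external process then costs you one document, not the whole indexing run, which is the robustness win over in-process parsing mentioned above.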
Re: index word files ( doc )
John Haxby wrote: I use wvWare (http://wvware.sourceforge.net/) directly, but you may find that using abiword is better for you ... It also deals with fast-saved Word docs. Sigh. Must remember to read messages before sending. abiword uses wvWare -- both will deal with fast-saved Word docs, not just abiword. jch
Re: Clearing locks
MC Moisei wrote: Is there an easy way to clear locks? If I redeploy my war file while indexing is in progress, the lock is not cleared. I know I can tell the JVM to run the finalizers before it exits, but in this case the JVM is not exiting since it's a hot deploy. I'd do this by having a destroy() method in the servlet to explicitly shut down any operations. Tomcat (or whatever the servlet container is) will call destroy() for you when it shuts down the servlet. jch
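A sketch of that pattern. LockCleanupServlet and its init signature are hypothetical stand-ins: a real servlet would extend javax.servlet.http.HttpServlet and the writer would be a Lucene IndexWriter, but the lifecycle idea is the same:

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch only: in a real webapp this would extend HttpServlet, and `writer`
// would be an org.apache.lucene.index.IndexWriter whose close() releases the lock.
public class LockCleanupServlet {
    private Closeable writer; // stand-in for the index writer holding the lock

    public void init(Closeable indexWriter) {
        this.writer = indexWriter;
    }

    /** Called by the container (e.g. Tomcat) when the servlet is taken out of
     *  service, including on hot redeploy. */
    public void destroy() {
        if (writer == null) return;
        try {
            writer.close(); // releases the index lock before the new war deploys
        } catch (IOException e) {
            // log and continue; the container is shutting us down regardless
        } finally {
            writer = null;
        }
    }

    public static void main(String[] args) {
        LockCleanupServlet s = new LockCleanupServlet();
        s.init(() -> System.out.println("index lock released"));
        s.destroy();
    }
}
```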
Re: Building lucene index using 100 Gb Mobile HardDisk
maureen tanuwidjaja wrote: Oh, is it? I didn't know about that... so does that mean I can't use this mobile HDD? Damien McCarthy [EMAIL PROTECTED] wrote: FAT32 imposes a lower file size limit than NTFS. Attempts to create files greater than 4GB on FAT32 will throw the error you are seeing. Not at all, you just need to re-format the disk using a sensible file system. If you're using Linux, that's ext3; if you're using Windows, NTFS. If you have data on there you really want to keep and you can't back it up elsewhere while you do the re-format, then hunt down gparted -- it has a live CD that you can use to grow, shrink and move partitions, so you'll be able to move stuff around your mobile disk while you shrink the FAT32 partition and grow the NTFS partition. The shrink/grow/move cycle can take several hours though, depending on how often you have to do it (which in turn depends on how full the disk is). jch
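For reference, FAT32 stores file sizes in a 32-bit field, so the hard ceiling is 2^32 - 1 bytes. A quick sanity check before writing large index segments might look like this; fitsOnFat32 is an illustrative helper, not a Lucene API:

```java
public class Fat32Check {
    // FAT32 records file sizes in a 32-bit field, so the largest
    // possible file is 2^32 - 1 bytes (one byte short of 4 GiB).
    static final long FAT32_MAX_FILE = (1L << 32) - 1;

    public static boolean fitsOnFat32(long sizeBytes) {
        return sizeBytes <= FAT32_MAX_FILE;
    }

    public static void main(String[] args) {
        System.out.println(fitsOnFat32(3L * 1024 * 1024 * 1024)); // 3 GiB: true
        System.out.println(fitsOnFat32(5L * 1024 * 1024 * 1024)); // 5 GiB: false
    }
}
```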
Re: Websphere and Dark Matter
Nadav Har'El wrote: On Tue, Jan 16, 2007, Rollo du Pre wrote about Re: Websphere and Dark Matter: I was hoping it would, yes. Does websphere not release memory back to the OS when it no longer needs it? I'm concerned that if the memory spiked for some reason (indexing a large document) then that would hamper the rest of the OS because it'd hold on to far more memory than is needed. This is a well known, may I say infamous, Java issue. Java could, in theory, easily shrink its heap as soon as it needs less memory, because in Java's GC model memory can be moved around, so fragmentation is not a problem. But unfortunately the JVM's heap rarely does shrink by default: once the JVM's heap grows, it rarely ever shrinks back. Are you implying that the process memory shrinks, that memory is returned to the kernel? I didn't read the page you referenced that way. I know that if I allocate memory by memory-mapping anonymous regions with Linux/Unix I can give it back, but is that the technique that JVMs use? It's not generally a problem though. Provided you have a compacting garbage collector (and the Sun Java GC is one) the unused memory will just get paged out. It may be a different story on Windows and it's certainly a different story on an embedded platform, but releasing memory to the kernel under Linux is not generally necessary or desirable. jch
Re: Websphere and Dark Matter
Rollo du Pre wrote: We have a scenario where a web search app using Lucene causes Websphere 5.1 allocated memory to grow but not shrink. JProfiler shows the heap shrinks back ok, leaving the JVM with over 1GB allocated to the jvm but only 400MB in use. Websphere does not perform a level 2 garbage collection and there is significant dark matter in the allocated memory. Are you expecting the process to shrink? They don't normally. jch
Re: DateTools again
Volodymyr Bychkoviak wrote: The user has an input (a JavaScript calendar) on the page where he can choose a date to include in the search. Search resolution is day resolution. If the user enters the same date at different times of day he will get different results (because the calendar will also set the current hour and minute in the date). But this is not the right behaviour. I propose not to use GMT when rounding time to the selected resolution. This will prevent the situation described above. This can be done by replacing the two lines Calendar cal = Calendar.getInstance(GMT); with Calendar cal = Calendar.getInstance(); I don't think you're improving matters there: you might be cancelling out the effects of timezone adjustment when everyone is in the same timezone, but if you have users on a browser in one timezone and the server is in a different timezone then you're in for interestingly broken results. There's also the interesting decision about when a day starts. You're using Etc/GMT-2 instead of (for example) Europe/Moscow -- do you have daylight savings time? What happens on the day the clocks change? Is the answer different for spring and autumn? If a document is dated, let's say, 00:30 (half an hour after midnight) is its day number dependent on the time zone? What's half an hour after midnight when the clocks change? You say you're using JavaScript to get a date in a browser -- would it not be better to remove the time of day there and just leave you with the date? And have the date as a string so you're not dealing with boundary conditions? When I was struggling with this for mail messages I eventually decided that it really only makes sense to deal with GMT. If some client wants messages delivered on, let's say, 14-Jul-2006 then the client has to produce the range of times that make most sense for it to be 14-Jul-2006. Here in the UK that's 13-Jul-2006 23:00:00 UTC to 14-Jul-2006 22:59:59 UTC. In San Francisco it would be 7am to 7am UTC; in Moscow, well, you work it out. Of course, users in San Francisco, Wokingham (where I am) and Moscow wouldn't see the same set of documents dated 14-Jul-2006, but none of them would see documents dated the day before or the day after in their local timezone. If you want everyone to see the same set of documents for Bastille Day then use UTC throughout. I'm not sure what you're doing in JavaScript, but it may be enough to pass the timezone correction along with the time and use that to get the search that you want. jch
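The client-produces-the-range idea can be sketched with the stock JDK calendar classes; utcStartOfDay is an illustrative helper, not part of DateTools:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class LocalDayRange {
    /** Start of the given local calendar day, expressed in UTC as "yyyy-MM-dd HH:mm".
     *  The end of the range is simply 24 hours later (DST days excepted). */
    public static String utcStartOfDay(int year, int month, int day, String tzId) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(tzId));
        cal.clear();
        cal.set(year, month - 1, day, 0, 0, 0); // local midnight
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(cal.getTime());
    }

    public static void main(String[] args) {
        // UK in July is BST (UTC+1), so 14-Jul starts at 13-Jul 23:00 UTC.
        System.out.println(utcStartOfDay(2006, 7, 14, "Europe/London"));
        // San Francisco in July is PDT (UTC-7), so 14-Jul starts at 14-Jul 07:00 UTC.
        System.out.println(utcStartOfDay(2006, 7, 14, "America/Los_Angeles"));
    }
}
```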
Re: DateTools again
Volodymyr Bychkoviak wrote: I'm using DateTools with Resolution.DAY. I know that dates are internally converted to GMT. Converting the dates 2006-10-01 00:00 and 2006-10-01 15:00 from the Etc/GMT-2 timezone will give us 20060930 and 20061001 respectively. But these dates are identical at day resolution. Is this a bug or am I missing something? They're not identical. The first one is 2006-09-30 22:00:00 UTC and the second 2006-10-01 13:00:00 UTC. I ran across the problem with DateTools not using UTC when I tried to use an index created in California from the UK: I was looking for documents with a particular date stamp but I found documents with a date stamp from the wrong day. Even more interesting and bizarre things happen around the change from daylight savings time to normal time. jch
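The behaviour is easy to reproduce with the JDK alone, no DateTools needed; dayString is an illustrative helper that rounds to day resolution in UTC, the way DateTools does:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class DayResolution {
    /** Format a local wall-clock time from the given zone as a UTC yyyyMMdd day. */
    public static String dayString(int year, int month0, int day, int hour, String tzId) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone(tzId));
        cal.clear();
        cal.set(year, month0, day, hour, 0, 0);
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(cal.getTime());
    }

    public static void main(String[] args) {
        // Note: Etc/GMT-2 is UTC+2 (the POSIX sign convention is inverted).
        // Local midnight falls on the previous UTC day...
        System.out.println(dayString(2006, Calendar.OCTOBER, 1, 0, "Etc/GMT-2"));
        // ...while 15:00 local is still the same UTC day.
        System.out.println(dayString(2006, Calendar.OCTOBER, 1, 15, "Etc/GMT-2"));
    }
}
```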
Re: DateTools again
John Haxby wrote: I ran across the problem with DateTools not using UTC when I tried to use an index created in California from the UK: I was looking for documents with a particular date stamp but I found documents with a date stamp from the wrong day. Even more interesting and bizarre things happen around the change from daylight savings time to normal time. That's confusing, isn't it? Originally DateTools didn't use UTC for its conversions: I submitted a patch some time ago (well before 2.0) that made it use UTC. Does that make it less confusing? jch
Re: Best Practice: emails and file-attachments
lude wrote: Hi John, thanks for the detailed answer. You wrote: If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart. Does this mean you index just the first file-attachment? What do you advise if you have to index multipart bodies (== more than one file-attachment)? One lucene-document for each part (== file)? How do you handle the queries? MIME has no concept of attachment; that's something that the user agent programs have a concept of -- you attach a file to a message. The file might be a picture, a Word document, a compressed tar archive -- as far as MIME is concerned they're all the same (well, apart from the content-* headers that describe what's attached). The MIME type for a message with attachments is multipart. There are several subtypes though. If you're typing a plain text message (whose MIME type is text/plain, a message like this one) and you attach a JPEG image to it you'll be sending a message whose type is multipart/mixed; the first part will have type text/plain and the second image/jpeg. In Google Mail, under "more options" you can "show original" to see the complete MIME message and you'll see the different parts separated by a boundary. OK. Now I'm in a position to answer your question. Often, when you send an HTML-formatted message the content of the message is sent twice: once as text/plain and once as text/html (or multipart/related if it has pictures and stuff). The two parts are alternatives; apart from the formatting (and pictures) there's no difference between the two parts, you can read either. The best fidelity of the alternatives (and there can be more than two) is last, the poorest fidelity first, but the intent of the sender is that you can read any of them. This is a multipart/alternative bodypart. Because all parts of the multipart/alternative have the same text you can index any of them, so index the first as that's going to be the easiest to process (it's almost always going to be text/plain). I've skipped loads. You need to read the RFCs. Start with RFC2045 (http://www.rfc.net/rfc2045.html) and keep going. If you get stuck with the details of how messages are constructed, go back and read RFC2822 first, or at least skim it (it's quite long). Note that RFC2045 references RFC822 in its abstract; wherever you see references to RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822 respectively -- the newer ones are a little more precise when they need to be and have rather more explanation of awkward cases that you need to know about. Someone earlier (and I'm sorry, I deleted the message before realising I should reply) said something about attached files really being in an attached .tar.gz file. Well, yes and no. An attached compressed tar archive is a bodypart like any other and will need to be indexed like any other. That will involve breaking it open and indexing the files that it contains. It's not really any different to indexing an open office document (which is actually a zip file). You also mentioned indexing each bodypart (attachment) separately. Why? When I'm searching, am I going to look for the word xyzzy in the first bodypart? What if it was a multipart/alternative and Thunderbird (in my case) suppressed the first bodypart and xyzzy is something that couldn't be rendered in the (first) text/plain alternative? To my mind, there is no use case where it makes sense to search a particular bodypart. There *might* be a case for searching the prime bodypart and attachments, but when you read the MIME spec you'll realise that detecting what the user sees as an attachment is not easy: it gets even harder when you discover that different mail user agents have different and legal (and sometimes reasonable) ways of deciding whether to treat something as in-line or as an attachment. To be honest, people don't remember whether something was an attachment. They think "I remember reading about xyzzy in a mail message" and go off looking for that. They often can't tell, and remember even less, that the xyzzy was in something that you decided was an attachment. And if your rules for deciding whether you have something that's intended to be viewed as an attachment or in-line are different to the rules that the user's mail reader is using then you'll have Awkward Bugs to explain. You'll read about Content-Disposition in the RFCs, but don't believe that it's a foolproof way of deciding whether or not something is an attachment: lack of a content-disposition header doesn't mean inline or attachment, and Microsoft, bless, have weird rules all of their own for deciding whether to display something in-line or not. jch
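The "index only the first alternative" rule is easy to express once you model the bodypart tree. Part and collectIndexable below are illustrative stand-ins for what javamail's MimeMultipart would give you:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BodyPartWalker {
    /** Minimal stand-in for a MIME bodypart: a content type plus nested children. */
    static class Part {
        final String type;
        final List<Part> children;
        Part(String type, Part... children) {
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    /** Collect the content types worth indexing: for multipart/alternative take
     *  only the first (simplest) child; for other multiparts recurse into all. */
    public static void collectIndexable(Part p, List<String> out) {
        if (p.type.startsWith("multipart/")) {
            List<Part> toVisit = p.type.equals("multipart/alternative") && !p.children.isEmpty()
                    ? p.children.subList(0, 1) : p.children;
            for (Part c : toVisit) collectIndexable(c, out);
        } else {
            out.add(p.type);
        }
    }

    public static void main(String[] args) {
        // The HTML-mail shape discussed above: two alternatives for one body.
        Part msg = new Part("multipart/alternative",
                new Part("text/plain"),
                new Part("multipart/related",
                        new Part("text/html"),
                        new Part("image/gif")));
        List<String> out = new ArrayList<>();
        collectIndexable(msg, out);
        System.out.println(out); // only the text/plain alternative is indexed
    }
}
```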
Re: Best Practice: emails and file-attachments
lude wrote: You also mentioned indexing each bodypart (attachment) separately. Why? To my mind, there is no use case where it makes sense to search a particular bodypart. I will give you the use case: [snip] 3.) The result list would show this: 1. mail-1 'subject' 'Abstract of the message-text' 2. mail-2 'subject' Attachment with name 'filename.doc' contains 'Abstract of file-content' Another use case would be an extended search, which allows the user to select whether attached files should be searched (yes or no). That's a good use case. File it as a bug and close it WONTFIX :-) The problem that you have is trying to determine whether something is going to be inline or an attachment. I'll give you a real-life example that caught out some old code the other day. We had a message with this structure: multipart/alternative text/plain multipart/related text/html image/gif image/gif application/msword Is there an attached file in there? Think before you read on. The answer should be no. Are you surprised that at least one client decided that there was? What we have is three representations of the same document: plain text, HTML (with two pictures) and MS Word. The original, the Word document, obviously has the best fidelity and comes last. The one client I'm thinking of (and I've lost track of which one it was) correctly suppressed the display of the text/plain alternative, displayed the HTML with its pictures in-line and then mistakenly displayed the Word document as an attachment. This is a fictional example, but it could exist: multipart/related text/html image/gif application/msword The gif image (and let's assume it can be indexed sensibly) is obviously a picture in the HTML bodypart. What's the Word document? It's referenced from the HTML as a link just like the picture is. Is it an attachment? What's the difference between the Word document referenced as a link within the multipart/related (by content-id) and a link to an external document (by http URL)?
From a user perspective both are the same, but is one an attachment and the other not? I'm being unfair: this is not only an unrealistic problem, but there isn't a right or a wrong answer. The Word document isn't an attachment because it doesn't (or shouldn't) appear in the list of attachments, and it's not in-line because you have to click on something to see it. So yes, I agree, your use-cases are good; I'm just not sure how you're going to identify an attachment :-) I do like the idea, though, that when you do a search for xyzzy you get the abstract of the bodypart that contains xyzzy rather than the abstract (or subject) of the entire message, and I'm going to think about that one some more. The problem that immediately springs to mind though is that a message can have an arbitrary number of bodyparts, so if I have BODY-1, BODY-2, ..., BODY-N (where N is unknown) how hard is it for me to construct the search? I think I probably should construct the search that way because the score depends upon the size of the document, and it seems to make sense that the document is the bodypart, not the entire message, but it seems more complex than is useful for mail messages. jch
Re: Best Practice: emails and file-attachments
Oh rats. Thunderbird ate the indenting. The two examples should be:

multipart/alternative
    text/plain
    multipart/related
        text/html
        image/gif
        image/gif
    application/msword

and

multipart/related
    text/html
    image/gif
    application/msword

The indenting indicates nesting. A message isn't just a bodypart followed by attachments; it has structure like a file system -- something which escapes most mail readers. Sigh.

John Haxby wrote: ... We had a message with this structure: multipart/alternative text/plain multipart/related text/html image/gif image/gif application/msword ...

jch
Re: Best Practice: emails and file-attachments
lude wrote: Does anybody have an idea of the best design approach for realizing the following: The goal is to index emails and their corresponding file attachments. One email could contain for example: I put a fair amount of thought into this when I was doing the design for our mail server -- I know about mail :-) After a little trial and error I came up with the following scheme: 1. All header fields indexed under their own name with the name converted to lower case. 2. Almost all bodyparts indexed in a single field called BODY (in upper case) 3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed with uppercase fields 4. Extensions for other bodypart-specific or application-specific fields indexed as something with an initial uppercase letter and at least one lowercase letter That gives an extensible set of fields and doesn't require that the index know ahead of time which header fields will be present or relevant. It means that there are potentially a lot of fields: we're running at about 60 depending on the user. Some header fields are special. The various message-id fields (Message-Id, Resent-Message-Id, In-Reply-To and References) need to have their message-ids carefully extracted and then indexed untokenized. Recipient fields (to, cc, from, etc) need to be parsed and then have their addresses re-assembled as a friendly-name and an RFC822 address -- the reason for the re-assembly is that addresses can be presented in equivalent but odd fashions. Most header fields can have RFC2047-encoded text which needs to be decoded. When indexing the bodyparts you need to be a little careful.
In general, the MIME headers for each part are all indexed as other message headers (content-id is a message-id field) and I also indexed the canonical content type under a CONTENT-TYPE field, again to get rid of fluff so that I can search for, say, CONTENT-TYPE:application/x-vnd-powerpoint to find all those annoyingly huge messages :-) An attached message probably doesn't want all its headers indexed: subject is good; recipients are probably bad as they'll confuse the normal search and give unexpected results; message-id fields are almost certainly a bad idea. If you're indexing a multipart/alternative bodypart then index all the MIME headers, but only index the content of the *first* bodypart. Does that all make sense? Javamail is great for this; it's good at parsing and extracting the content of messages. However, it's not enough to just read what I've said and the javamail doc. You need to be intimately familiar with the MIME RFCs (I think the first one is RFC2045, but they're not difficult to find as they're all around RFC2047) and RFC2822, the message structure RFC itself. If you just guess because the structure seems obvious you'll come unstuck. jch
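A trivial sketch of how that naming convention partitions the field space; kindOf is an illustrative helper, not part of the scheme itself:

```java
public class FieldKind {
    /** Classify a field name under the scheme above:
     *  all lower-case  = a message header indexed under its own name,
     *  all upper-case  = metadata (or the reserved BODY field),
     *  initial capital = a bodypart/application-specific extension. */
    public static String kindOf(String field) {
        if (field.equals(field.toLowerCase())) return "header";
        if (field.equals(field.toUpperCase())) return "metadata";
        if (Character.isUpperCase(field.charAt(0))) return "extension";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(kindOf("subject"));   // header
        System.out.println(kindOf("SIZE"));      // metadata
        System.out.println(kindOf("Thumbnail")); // extension (hypothetical field)
    }
}
```

The nice property is that a new header or application field can never collide with the reserved upper-case names.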
Re: email libraries
Andrzej Bialecki wrote: Just for the record - I've been using javamail POP and IMAP providers in the past, and they were prone to hanging with some servers, and resource intensive. I've also been using Outlook (proper, not Outlook Express - that is AFAIK impossible to work with) via a Java-COM bridge such as Jawin or JNIWrapper plus Redemption. This also tends to be rather unstable, and requires a lot of fine-tuning ... We use javamail a *lot* with the Scalix IMAP server (the web access part uses IMAP underneath). We have had performance problems with the way that javamail works, although for just scanning a message store to index messages it's OK. We have tuned the web access code to make it behave better, but we've also re-engineered the IMAP server, partly with javamail in mind, and performance and resource usage on the server are now under control. So, be prepared to suffer quite a bit. ;) If you're doing complicated things, yes, but if it's simple access for the purposes of indexing then you probably don't need to worry too much. jch
Re: email libraries
Suba Suresh wrote: Anyone know of good free email libraries I can use for lucene indexing of Windows Outlook Express and Unix emails? javamail. Not sure how you get hold of the messages from Outlook Express, but getting hold of the MIME message in most Unix-based message stores is relatively easy. You might, however, prefer to go down the POP or IMAP route for getting hold of the messages to index -- either way, javamail is your friend. jch
Re: Matching accented with non-accented characters
Rajan, Renuka wrote: I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages. The use case is that the users may type letters without accents in error and we still want to be able to retrieve valid matches. One idea, albeit naïve, is to normalize the data on the inbound side as well as the data in the database (prior to full text indexing) and retrieve matches. Look back through the archives a bit for ISOLatin1AccentFilter. It almost does the job and works reasonably well for Western European characters. You'll also find a posting of mine that presents a somewhat more complete filter based on the Unicode decompositions. If you can't find it I'll dig out the stuff I wrote and re-post it (and then maybe some kind soul will add it alongside ISOLatin1AccentFilter). Eric Jain's comment about ä being converted to a instead of ae is a fair one, but it probably doesn't much matter. Although I have seen Müller written as both Muller and Mueller, so you're not going to be able to please everyone all the time without injecting synonyms and being very clever. And if you're that clever you might catch both encyclopedia and encyclopædia -- the latter converted to encyclopaedia, which isn't the same as encyclopedia! jch
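The Unicode-decomposition approach can be sketched with java.text.Normalizer: NFD-decompose, then strip combining marks. Note this handles ü → u but deliberately not ligatures like æ, which have no canonical decomposition -- exactly the encyclopædia wrinkle above:

```java
import java.text.Normalizer;

public class AccentFolder {
    /** Strip accents by canonical (NFD) decomposition followed by removal of
     *  combining marks. Ligatures such as æ survive untouched -- mapping
     *  them to "ae" needs an explicit table, as in ISOLatin1AccentFilter. */
    public static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("naïve"));  // naive
        System.out.println(fold("Müller")); // Muller
    }
}
```

Applying the same fold at index time and at query time is what makes unaccented queries match accented text.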
Re: indexing emails
Michael J. Prichard wrote: I am working on indexing emails and want to have a to field. I am currently putting all the emails on one line separated with spaces... example: [EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED] Then I index that with a StandardAnalyzer as follows: doc.add(new Field(to, (String) itemContent.get(to), Field.Store.YES, Field.Index.UN_TOKENIZED)); Question is... is this the best way to do it? I want to be able to search for [EMAIL PROTECTED] and pick out just those Documents, etc. I took a slightly different approach. Using javamail, given a To: line like this: To: Fred Smith [EMAIL PROTECTED], =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= [EMAIL PROTECTED] I re-constructed the address list to look like this: Fred Smith [EMAIL PROTECTED] Keld Jørn Simonsen [EMAIL PROTECTED] and fed that to the analyser. I forget which analyser we eventually settled on, but the [EMAIL PROTECTED] turns into the tokens fred, example and com. This actually gives rise to a remarkably natural way of searching for addresses. People do things like searching for lucene.apache.org to look for mail sent to the lucene lists; they search for me variously as jch, john haxby and haxby; they even, occasionally, search for complete mail addresses. They all work. The RFC2047 syntax in the example above gives one hint as to the minefield that address parsing can be. If you look at the javamail spec, you'll also see reference to group syntax -- it's often seen as undisclosed-recipients:; but you'll also occasionally see example-group: [EMAIL PROTECTED], [EMAIL PROTECTED]; Javamail knows how to parse these; I threw away the group name and just indexed the addresses. It might've been better to keep the group name, but groups aren't that widely used so it probably doesn't make much difference. Other headers cause headaches as well. Things like the subject can be RFC2047-encoded so you'll need to decode them. The various message-id headers are also slightly problematic.
If you're using message-id and references and in-reply-to you'll need to be careful -- the individual message-ids will need their angle brackets removed and they really ought not to be tokenized. It's also worth indexing *all* the message headers. People do do searches on some odd things. I also index the raw content-type as well -- those huge presentations can be found and deleted by searching for content-type:application/vnd.ms-powerpoint. Or at least I could. It seems to be broken at the moment :-( jch
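As a rough illustration of why "fred example com" style searches work once the address list is re-assembled -- this is a crude stand-in for whatever analyzer you settle on, not Lucene's StandardAnalyzer:

```java
import java.util.Arrays;
import java.util.List;

public class AddressTokens {
    /** Approximate what a word-oriented analyzer yields for a re-assembled
     *  address line: lower-cased tokens split at punctuation, so the friendly
     *  name, local part, and domain labels all become searchable terms. */
    public static List<String> tokens(String addressLine) {
        return Arrays.asList(addressLine.toLowerCase().split("[^a-z0-9]+"));
    }

    public static void main(String[] args) {
        // Hypothetical re-assembled To: entry.
        System.out.println(tokens("Fred Smith <fred@example.com>"));
        // [fred, smith, fred, example, com]
    }
}
```

Note the contrast with the UN_TOKENIZED field in the quoted code above: untokenized, only an exact whole-field match would find the address.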
Re: indexing emails
Michael J. Prichard wrote: We are actually grabbing emails by becoming part of the SMTP stream. This part is figured out and we have archived over 600k emails into a mysql database. The problem is that since we currently store the blobs in the DB, these databases are getting large and searching takes plenty of time. We want to convert the searching to lucene to add more advanced features. In which case, javamail is your friend. Can I have multiple to, from and bcc fields? Yes. And it's definitely worth your while to study not only javamail but the MIME RFCs (RFC2047 deals with headers, a nearby one deals with the main MIME format, I forget the number) and RFC2822 for the base mail format -- understanding the structure of a mail message is more than half the battle to knowing how to index it. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
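javamail's InternetAddress.parse() is the proper tool for splitting a To: header. As a rough, self-contained sketch of the idea only -- the class and method names here are invented, and a naive comma split breaks on the quoted strings, RFC2047 words and group syntax that javamail handles correctly:

```java
import java.util.ArrayList;
import java.util.List;

public class ToFields {
    // Crude stand-in for javamail's InternetAddress.parse(): split a
    // simple To: header on commas and trim the pieces.
    static List<String> splitAddresses(String header) {
        List<String> out = new ArrayList<String>();
        for (String part : header.split(",")) {
            String p = part.trim();
            if (!p.isEmpty()) out.add(p);
        }
        return out;
    }

    public static void main(String[] args) {
        // Each element would then go into its own "to" field -- Lucene
        // happily takes several fields with the same name on one document.
        for (String a : splitAddresses("fred@example.com, jane@example.org")) {
            System.out.println(a);
        }
    }
}
```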
Re: Encryption
George Washington wrote: Is it possible to reconstruct a complete source document from the data stored in the index, even if the fields are only indexed but not stored? Because if the answer is yes there is no point in encrypting, unless the index itself can be encrypted. Is it feasible to encrypt the index? As Otis said, no, in general it's not possible to reconstruct the document, but that might not be needed to discover interesting things about the documents. For example, it's interesting to know if lots of documents contain the word zimbra or the words eliminate and competition -- smile -- near each other. I'm sure you can dream up your own examples. If your documents need to be encrypted to protect them, then I would say that your index does as well. Although you can't reconstruct the original document you can certainly find out all kinds of interesting things even just by searching for phrases, let alone looking at what terms were indexed. As Otis also suggested, your best bet is going to be to use an encrypted file system; both Linux and Windows offer this capability. Of course, and I presume you've already done this for the documents, you're going to have to worry about the safety of the key used to encrypt the file system and all the other associated security issues. jch (I wouldn't be paranoid if they weren't out to get me) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance and FS block size
Andrzej Bialecki wrote: None of you mentioned yet the aspect that 4k is the memory page size on IA32 hardware. This in itself would favor any operations using multiple of this size, and penalize operations using amounts below this size. For normal I/O it will rarely make any difference at all: the return results from read(2) are copied from kernel space to user space. Under some rare conditions it can make a difference if the copy causes a page fault for user-space memory, but that can happen with any buffer size. Memory-mapped I/O does take into account the VM page size, but that's entirely in the kernel's domain. I believe (though I haven't checked lately) that memory mapping does avoid the final copy, and it certainly does avoid system calls so it has the potential to be as fast as the underlying I/O subsystem allows it to be. However, there are a few pathological cases where memory-mapped I/O is slower and you have to be very careful about the size of the file you're dealing with (unless you're running in a 64 bit process). As Paul Elschot mentioned, the design of Lucene is the most important thing: it knows about locality of reference and does the right thing. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
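A minimal sketch of the memory-mapped I/O mentioned above, using java.nio (class name and file invented). The Integer.MAX_VALUE cap on a single map() call is exactly the file-size care a 32-bit process needs:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapDemo {
    // Read the whole of a (small) file through a read-only mapping: the
    // bytes come straight from the page cache, with no read(2) calls and
    // no kernel-to-user copy.  One map() covers at most Integer.MAX_VALUE
    // bytes, so huge files need several mappings.
    static String readMapped(File f) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] b = new byte[(int) ch.size()];
            buf.get(b);
            return new String(b, "US-ASCII");
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("map", ".bin");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("hello".getBytes("US-ASCII"));
        }
        System.out.println(readMapped(f)); // hello
    }
}
```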
Re: Performance and FS block size
Otis Gospodnetic wrote: I'm somewhat familiar with ext3 vs. ReiserFS stuff, but that's not really what I'm after (finding a better/faster FS). What I'm wondering is about different block sizes on a single (ext3) FS. If I understand block sizes correctly, they represent a chunk of data that the FS will read in a single read. - If the block size is 1K, and Lucene needs to read 4K of data, then the disk will have to do 4 reads, and will read in a total of 4K. - If the block size is 4K, and Lucene needs to read 3K of data, then the disk will have to do 1 read, and will read a total of 3K, although that will actually consume 4K, because that's the size of a block. That's correct Otis. Applications generally get the best performance when they read data in the file system block size (or small multiples thereof), which for ext2 and ext3 is almost always 4k. It might be interesting to try making file systems with different block sizes and see what the effect on performance is and also, perhaps, trying larger block sizes in Lucene, but always keeping Lucene's block size a multiple of the file system block size. For an educated guess, I'd say that 4k/4k gives better performance than smaller file system block sizes and 8k/4k is not likely to have much of an effect either way. Does any of this sound right? I recall Paul Elschot talking about disk reads and disk arm movement, and Robert Engels talking about Nio and block sizes, so they might know more about this stuff. It depends very much on the type of disk: 15,000 rpm ultra-scsi 320 disks on a 64 bit PCI card will probably be faster than a 4200rpm disk in a laptop :-) Seriously, disk configuration makes a lot of difference: striped RAID arrays will give the best I/O performance (given a controller and whatnot that can exploit that). Once you get into huge amounts of I/O there are other, more complex issues that affect performance. java.nio has the right features to exploit the I/O subsystem of the OS to good advantage. 
We haven't done the performance measurements yet, but memory mapped I/O should yield the best performance (as well as freeing you from worrying about what block size is best). It will also be interesting to try the different I/O schedulers under Linux: cfq is the default for the 2.6 kernel that Red Hat ships, but I can imagine the deadline scheduler may give interesting results. As I say, at some stage over the next few months we're likely to be looking at this in more detail. The one thing that makes more difference than anything else though is locality of reference; this seems to be well understood by the Lucene index format and is probably why the performance is generally good! jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
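To make the block-size point concrete, here's a hedged sketch of keeping the application's read buffer a multiple of the (assumed 4k) file-system block size; the class and method names are invented:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BlockReads {
    // 4k matches the usual ext2/ext3 block size (and the IA32 page size);
    // keeping the buffer a multiple of it means each underlying read
    // lines up with whole file-system blocks.
    static final int FS_BLOCK = 4096;

    static long countBytes(String path) throws IOException {
        long n = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path), FS_BLOCK)) {
            while (in.read() != -1) n++;
        }
        return n;
    }
}
```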
Re: encoding
petite_abeille wrote: I would love to see this. I presently have a somewhat unwieldy conversion table [1] that I would love to get rid of :)) [snip] [1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt I've attached the perl script -- feed http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it. It's based on a slightly different principle to yours. You seem to look for things like mumble mumble LETTER X mumble and take X as the base letter. That means that, for example, ɖ (a d with a hook) gets converted to d. My script, on the other hand, deals with things like Ǣ (LATIN CAPITAL LETTER AE WITH MACRON) and converts it to AE. There are some differences of opinion though: you have ß mapped to s whereas I have ss (straße to strasse instead of strase seems right). I think I'm also over-enthusiastic when it comes to mapping characters to spaces: I know that there are some Arabic characters that get mapped to spaces. For the purposes of converting to an ASCII approximation, though, I suspect a combination of your approach and mine would be best. What do you think? Of course, it's still unwieldy -- the code uses a huge great switch statement. It would be more aesthetically pleasing to have a class representing UnicodeData.txt and work out the mapping on the fly. IBM have some Unicode stuff that deals with decomposition and uses a similar algorithm (I think) to the one I use. The standard java.lang.Character has everything but the decompositions to implement what I do in perl in Java: generating a map of decompositions isn't difficult though. However, I doubt whether the reduction in code size would make it run faster and certainly looking at the name of the letter to determine the nearest ASCII equivalent is going to be slow. jch mkswitch.pl Description: Perl program - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
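For what it's worth, in newer JDKs the decomposition half of this is available via java.text.Normalizer -- a sketch, not a replacement for the perl script: NFD splits u-umlaut into u plus a combining mark, but characters with no decomposition (AE, ß, the dotless ı) pass through untouched, so they still need the explicit table discussed above:

```java
import java.text.Normalizer;

public class AsciiFold {
    // Decompose to NFD, then strip the combining marks: u-umlaut becomes
    // u, g-breve becomes g.  Ligatures and one-off letters (AE, ss,
    // dotless i, ...) have no decomposition and survive unchanged.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("d\u00FCzenledi\u011Fimiz"));   // duzenledigimiz
        System.out.println(fold("kampanyam\u0131za"));          // dotless i survives
    }
}
```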
Re: encoding
arnaudbuffet wrote: if I try to index a text file encoded in Western 1252 for example with the Turkish text düzenlediğimiz kampanyamıza the lucene index will contain re-encoded data with #0;#17;k#0;#0; ISOLatin1AccentFilter.removeAccents() converts that string to duzenlediğimiz kampanyamıza. The g-breve and the dotless-i are untouched. My AsciiDecomposeFilter.decompose() converts the string to duzenledigimiz kampanyamiza. However, since you're seeing those rather odd entities, it looks as though you're not actually indexing what you think you're indexing. As Erik says, you need to make sure that you're reading files with the proper encoding and removing accents and adding dots won't help. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
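Concretely, the fix is to open the file with an explicit charset rather than the platform default. A small self-contained sketch (class and file names invented; windows-1254 is the Turkish code page that actually contains ğ and ı, which Western 1252 does not):

```java
import java.io.*;

public class CharsetDemo {
    // Write and re-read a string through a file using an explicit charset.
    static String roundTrip(String text, String charset) throws IOException {
        File f = File.createTempFile("enc", ".txt");
        f.deleteOnExit();
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), charset)) {
            w.write(text);
        }
        // Using new FileReader(f) here would silently decode with the
        // platform default encoding -- which is exactly how garbage
        // entities like the ones above end up in the index.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charset))) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        String turkish = "d\u00FCzenledi\u011Fimiz kampanyam\u0131za";
        System.out.println(turkish.equals(roundTrip(turkish, "windows-1254"))); // true
    }
}
```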
Re: Memory
Aigner, Thomas wrote: I did a man on top and sure enough there was a PPID command on Linux (f then B) for parent process. And yes, they always have the same parent command. Thanks for your help as I'm obviously still a noob on Unix. Nope, that doesn't tell you they're different threads in the same process -- look at all the obviously different processes whose PPID (parent PID) is 1 -- ps -ef | less, the first screenful or so will all have a PPID of 1. A single process can have lots of different children and many do. It does depend on which version you're running, but look for thread in the ps(1) man page. On the system I've got here it says -m shows all threads; 2.4-based machines may show all threads by default (but not RHEL3, which is closer to 2.6 than 2.4). Although it's not definitive, finding processes with exactly the same memory profile -- same resident set size (RSS), same total size (SIZE), etc. -- is a bit of a giveaway: any two processes are unlikely to be using exactly the same amount of memory; any two threads are, by definition, using the same memory and must show the same statistics (well, to within stack size, but we can ignore that mostly). Possibly a good way to look at this is with top. The H key toggles threads (where it can detect them) and you get pretty good memory reporting. If you can see all your java threads then you'll see that they are all suspiciously similar. Anyway, the reason that you see threads and processes confused is that there is very little difference between the two in Linux. Indeed, the lack of distinction caused some problems with the posix thread semantics that weren't fixed until 2.6 came out. If you're running a 2.6 kernel then look at /proc/pid where pid is the PID of the java process. 
In that you'll see a sub-directory called task which contains sub-directories for each thread; there's normally only one task entry, but multi-threaded processes will have several, e.g. $ ps -e | grep java 29918 ?        00:02:38 java $ ls /proc/29918/task 29918 29920 29922 29924 29928 29930 29937 8993 29919 29921 29923 29925 29929 29934 6875 917 jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
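For comparison with those /proc/<pid>/task entries, you can list the same threads from inside the JVM; a small sketch using ThreadMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadCount {
    public static void main(String[] args) {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
        // Counts every live Java thread, daemons included -- the
        // finalizer, reference handler and friends account for several
        // of the extra task/ entries seen above.
        System.out.println("threads: " + tmx.getThreadCount());
        for (long id : tmx.getAllThreadIds()) {
            ThreadInfo info = tmx.getThreadInfo(id);
            if (info != null) {
                System.out.println("  " + info.getThreadName());
            }
        }
    }
}
```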
Re: Cache index in RAMDirectory and evict
Kan Deng wrote: 1. Performance. Since all the cached disk data resides outside JVM heap space, the access efficiency from Java object to those cached data cannot be too high. True, but you need to compare the relative speeds. If data has to be pulled from a file, then you're talking several milliseconds to fetch from the disk. If it's in the OS's cache (and here I'm rather assuming Linux since that's what I know about) you're talking about microseconds rather than milliseconds to fetch the data from the OS. Once the data is in the JVM, but not in the CPU cache, then you're down to nanoseconds to get the data from main memory (how many depends on the hardware; some platforms take a while to get the data moving but when it comes, it's very quick; some systems are fast to get going but don't have the throughput). It's not the absolute times that are important though: once you've got the data in the OS's cache then things like network latency, display update speed and scheduling overheads begin to make themselves felt and you won't make these any less by getting data into the JVM's memory. Well, not much anyway. 2. Volatile. Since the OS caches the disk data in a common area shared by multiple processes, but not only JVM. If there are other processes doing disk IO at the same time, chances are the cached Lucene index data from disk may be wiped. What you can do by hanging on to a lot of memory is make the overall machine performance worse. In fact by denying other processes memory, you're going to force up the I/O rate and when you do need to go to the disk then it'll take much longer -- net result, things run slower. Generally speaking, because the OS has a more holistic view of resource management, you'll get better overall performance. Therefore, a more reliable and efficient cache should reside inside JVM heap space. But due to the crowded JVM heap space, we have to manually evict the less frequently used data from the cache. 
It's that last sentence that is the critical one. Yes, you can do your own cache management, but how much better are you going to be than the OS? Well, you _can_ be a lot better since you know what you're doing. You can also be a _lot_ worse when you get it wrong. Choosing the right point to flush data from the cache (evict) is not all that straightforward: the OS buffer cache was introduced into BSD Unix in the early '80s and we're still seeing work going on to improve the basic strategy 20-odd years later. If you find that you're spending an inordinate amount of time waiting for I/O for the index from the OS, then that is the time to start looking at caching strategies. My own feeling is that you're going to find easier things to fix before you get that far. Did I misunderstand anything? Probably not, it's just that performance is a holistic business and an obvious, isolated change isn't going to have the effect that you want. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: lucene and UTF-8
John Cherouvim wrote: I'm having some problems indexing my UTF-8 html pages. I am running lucene on Linux and I cannot understand why the index generated depends on the locale of my operating system. If I do set | grep LANG I get: LANG=el_GR which is Greek. If I set this to en_US the index generated will be different. Why is this the case? My HTMLs are all UTF-8. What version of Linux are you using? On Fedora Core 4 (and probably other Fedoras and RHEL) LANG=el_GR sets the character set to ISO 8859-7, e.g. (on my various machines): $ LANG=el_GR date | iconv -f iso88597 Πεμ Σεπ 29 11:59:19 BST 2005 $ LANG=el_GR.utf8 date Πεμ Σεπ 29 12:01:40 BST 2005 (Everything in FC4 is UTF-8 so it displays right and it seems that the Greek for Sep is Sep -- no surprises there I guess.) In your case, replacing date with whatever the command is that you use to generate the indexes should do the right thing. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
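The same point in code: decoding bytes with an explicit UTF-8 charset gives the same result whatever LANG says, whereas the no-charset String constructors follow the default locale (via file.encoding). A small illustration -- the bytes spell Πεμ:

```java
import java.nio.charset.Charset;

public class Utf8Demo {
    public static void main(String[] args) throws Exception {
        // UTF-8 encoding of the Greek letters Pi, epsilon, mu.
        byte[] greek = {(byte) 0xCE, (byte) 0xA0,
                        (byte) 0xCE, (byte) 0xB5,
                        (byte) 0xCE, (byte) 0xBC};
        // Depends on LANG / file.encoding:
        System.out.println("default charset: " + Charset.defaultCharset());
        // Locale-independent -- always three Greek characters:
        String s = new String(greek, "UTF-8");
        System.out.println(s.length()); // 3
    }
}
```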
Re: Search for documents where field does not exist?
Erik Hatcher wrote: On Jun 17, 2005, at 5:54 PM, [EMAIL PROTECTED] wrote: Please do not reply to a post on the list if your message actually isn't a reply. Post a new message instead. Sorry about that.. wasn't intentional.. clicked reply to get the reply address and then forgot to change the subject :) Even changing the subject after doing a reply is not sufficient as it will still end up in the same thread erroneously. You need to create a new message to start a new thread. (certainly this varies by mail client, though) The reason for this is that when you reply to a message you get an in-reply-to header that refers to the message-id of the original message; you may also get a references header that performs a similar function for the other messages in the thread. I haven't come across a mail client that drops the in-reply-to when you change the subject. If you want to find out exactly what your particular mail client does, reply to a message of your own and then look at the message source or the original message headers (what you're looking for depends on the client; sometimes it's obvious, sometimes it's hidden in message properties.) jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
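Roughly, what a well-behaved client does when you hit reply -- and why changing the subject doesn't break the thread -- can be sketched like this (the header values are invented):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplyHeaders {
    // Build the threading headers for a reply: In-Reply-To gets the
    // original Message-ID, and References is the original References
    // with that Message-ID appended.  The Subject plays no part.
    static Map<String, String> forReply(String origMessageId, String origReferences) {
        Map<String, String> h = new LinkedHashMap<String, String>();
        h.put("In-Reply-To", origMessageId);
        h.put("References", origReferences == null
                ? origMessageId
                : origReferences + " " + origMessageId);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(forReply("<1@example.com>", null));
    }
}
```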
Re: Optimizing indexes with mulitiple processors?
Chris Collins wrote: Ok, that part isn't surprising. However only about 1% of 30% of the merge was spent in the OS.flush call (not very IO bound at all with this controller). On Linux, at least, measuring the time taken in OS.flush is not a good way to determine if you're I/O bound -- all that does is transfer the data to the kernel. Later, possibly much later, the kernel will actually write the data to the disk. The upshot of this is that if the size of the index is around the size of physical memory in the system, optimizing will appear CPU bound. Once the index exceeds the size of physical memory, you'll see the effects of I/O. OS.flush will still probably be very quick, but you'll see a lot of I/O wait if you run, say, top. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NFS
Otis Gospodnetic wrote: I haven't used Lucene with NFS. My understanding is that the problem is with lock files when they reside on the NFS server. Yes, you can change the location of lock files with a system property, but if you are using NFS to make the index accessible from multiple machines, then changing the lock file directory to a local directory doesn't make sense. However, it sounds like you are using a NFS-mounted partition simply because that's where you have sufficient space, not because you need to access the index from multiple machines, so you should be OK with changing the lock file directory to a local dir. I haven't tried this, but under Linux (at least), you can specify the nolock parameter to make file locking happen locally. Of course, this will make it impossible to use NFS to share the index among several machines, but, as Otis said, that doesn't seem to be the requirement here. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: NFS
Paul Libbrecht wrote: Le 18 mai 05, à 11:51, John Haxby a écrit : I haven't tried this, but under Linux (at least), you can specify the nolock parameter to make file locking happen locally. Of course, this will make it impossible to use NFS to share the index among several machines, but, as Otis said, that doesn't seem to be the requirement here. Just a hint that we have experienced using Lucene Indexes on NFS partitions to be much much slower than local partitions... aside of the little lock issues. I had forgotten that Lucene uses lock files rather than file locks so the nolock parameter wouldn't help. If (on Linux; I don't know if this is possible on an R100) the file system is exported async then writing files is going to be rather quicker. However, frequently creating lock files is still likely to be slow -- I don't know the exact (OS level) mechanism that lucene uses, but the common mechanisms that I can think of are going to be slow regardless of the NFS settings. If you move the lock files into a local file system then you're just left with normal NFS performance issues. If you've got a gigabit network then chances are you'll be able to achieve NFS speeds roughly comparable with local IDE disks (50MB/s, ish). If you're restricted to a normal 100Mb network then you're not going to get that kind of performance. Using async will greatly help when writing an index, as, probably, will large buffer sizes. You'd need to experiment to find out what makes a big difference. The main thing, though, is that the default parameters for NFS aren't going to give sparkling performance for Lucene. jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Top most frequent words
Otis Gospodnetic wrote: Somebody asked about this today, and I just found this through Simpy: http://www.unine.ch/info/clef/ Scroll half-way through the page, look on the right side: 1,000 most frequent words for several languages. Hmm. I'm not sure how valuable that is. For English los and angeles are ranked 99 and 101 respectively and officials comes in at 125. Obviously I'm guessing, but those middle-ranking words have come from a slightly skewed source -- newspapers in a fixed interval perhaps. (I don't think Los Angeles makes it into everyday parlance in the UK, and officials suggests that we're obsessed with bureaucracy :-)) jch - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]