Re: why Apache doesnt create a nice forum like the others???

2007-03-29 Thread John Haxby

Mohammad Norouzi wrote:
I registered in Nabble, but to post a message you have to subscribe to the
lucene mailing list, and if you subscribe to the mailing list your inbox
will become full of messages. This is very bad!!!

You're using gmail aren't you?  Why don't you set up a filter to handle 
mail from the list?  If you really don't want to fill up Google's disks 
you can delete it immediately.   For those not using gmail, any client 
worth its salt will allow you to automatically process messages in some way.


jch





Re: index word files ( doc )

2007-03-28 Thread John Haxby

Daniel Noll wrote:
The only screenshots I can see look like plain text to me, and I'm 
currently working on something which needs to convert Word to HTML, 
which is why I ask.
wvWare, which I mentioned earlier, can convert word to HTML and does a 
pretty good job of maintaining formatting.  abiword is better though 
(because it goes through a different internal representation).


jch




Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread John Haxby

karl wettin wrote:
The way I see it (and probably many others), mailing lists are superior 
in many ways, especially when following multiple forums.


It's true.   For any forum that I need to follow, I find an RSS feed 
so that I can get mail messages.   Forums are a pain in the neck 
once you're subscribing to more than a handful -- I can manage all the 
lists I subscribe to through a single, central, fast interface.


jch





Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread John Haxby

Grant Ingersoll wrote:
I like the mailing list approach much better.  With a good set of 
rules and folders in place (which takes about 15 minutes to set up), 
one can easily manage large volumes of mail w/o batting an eye, 
whereas forums require large amounts of navigation, IMO.


Glad I'm not the only one that thinks that -- I was wondering if there 
was something about managing forums (I still want to say fora) that I 
was missing.


jch





Re: index word files ( doc )

2007-03-26 Thread John Haxby

Sami Siren wrote:

There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
  
antiword isn't very good.  I use wvWare (http://wvware.sourceforge.net/) 
directly, but you may find that using abiword is better for you (abiword 
is an editor, but it also does conversions and actually it's quite fast 
for that).  It also deals with fast-saved word docs.


We used to use POI, but it's a little fragile and people were getting 
all upset when a word document gummed up the works.  Using an external 
executable seems to be no slower and is certainly less problematic.
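For anyone who wants to do the same, the external-converter approach is only 
a few lines of Java.  This is a sketch under my own assumptions -- wvText 
from wvWare on the PATH and a docFile File in scope -- rather than the code 
we actually run:

    import java.io.*;

    // Run wvText to turn the .doc into plain text in a temporary file,
    // then read that file for indexing.  Cleanup and logging elided.
    File txt = File.createTempFile("wv", ".txt");
    Process p = Runtime.getRuntime().exec(
            new String[] { "wvText", docFile.getPath(), txt.getPath() });
    if (p.waitFor() != 0) {
        // conversion failed -- skip this document
    }
    Reader text = new InputStreamReader(new FileInputStream(txt), "UTF-8");

A gummed-up Word document then costs you one failed child process instead of 
a wedged indexing thread.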


jch




Re: index word files ( doc )

2007-03-26 Thread John Haxby

John Haxby wrote:

Sami Siren wrote:

There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
  
antiword isn't very good.  I use wvWare 
(http://wvware.sourceforge.net/) directly, but you may find that using 
abiword is better for you (abiword is an editor, but it also does 
conversions and actually it's quite fast for that).  It also deals 
with fast-saved word docs.


Sigh.   Must remember to read messages before sending.   abiword uses 
wvWare -- both will deal with fast-saved word docs, not just abiword.
We used to use POI, but it's a little fragile and people were getting 
all upset when a word document gummed up the works.  Using an external 
executable seems to be no slower and is certainly less problematic.


jch








Re: Clearing locks

2007-03-06 Thread John Haxby

MC Moisei wrote:

Is there an easy way to clear locks?

If I redeploy my war file and it happens that there is an indexing
happening the lock is not cleared. I know I can tell JVM to run the
finalizers before it exits, but in this case the JVM is not exiting since
it's a hot deploy.
  
I'd do this by having a destroy() method in the servlet to explicitly 
shut down any operations.  Tomcat (or whatever the servlet container is) 
will call destroy() for you when it shuts down the servlet.
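Something like this -- an untested sketch that assumes the writer is held in 
a field called writer, opened in init():

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import org.apache.lucene.index.IndexWriter;

    public class IndexingServlet extends HttpServlet {
        private IndexWriter writer;   // hypothetical: opened in init()

        public void destroy() {
            try {
                if (writer != null) {
                    writer.close();   // flushes buffered docs, releases the write lock
                }
            } catch (IOException e) {
                log("failed to close index writer", e);
            }
        }
    }

Close the IndexWriter (and any IndexReaders doing deletions) there and the 
lock goes away with the old deployment.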


jch




Re: Building lucene index using 100 Gb Mobile HardDisk

2007-02-05 Thread John Haxby

maureen tanuwidjaja wrote:

Oh, is it? I didn't know about that... so does it mean I can't use this Mobile HDD?

Damien McCarthy [EMAIL PROTECTED] wrote:  FAT32 imposes a lower file size
limit than NTFS. Attempts to create files greater than 4GB on FAT32 will
throw the error you are seeing.
  
Not at all, you just need to re-format the disk using a sensible file 
system: if you're using Linux, that's ext3; if you're using Windows, NTFS.


If you have data on there you really want to keep and you can't back it 
up elsewhere while you do the re-format, then hunt down gparted -- it 
has a live CD that you can use to grow, shrink and move partitions so 
you'll be able to move stuff around your mobile disk while you shrink 
the FAT32 partition and grow the NTFS partition.  The shrink/grow/move 
cycle can take several hours though, depending on how often you have to 
do it (which in turn depends on how full the disk is).


jch




Re: Websphere and Dark Matter

2007-01-22 Thread John Haxby

Nadav Har'El wrote:

On Tue, Jan 16, 2007, Rollo du Pre wrote about Re: Websphere and Dark Matter:
  
I was hoping it would, yes. Does websphere not release memory back to 
the OS when it no longer needs it? I'm concerned that if the memory 
spiked for some reason (indexing a large document) then that would 
hamper the rest of the OS because it'd hold on to far more memory than 
is needed.



This is a well known, may I say infamous, Java issue. Java could, in theory,
easily shrink its heap as soon as it needs less memory, because in Java's GC
model, memory can be moved around so fragmentation is not a problem. But
unfortunately, the JVM's heap rarely does shrink by default: once the JVM's
heap grows, it rarely ever shrinks back.
  
Are you implying that the process memory shrinks, that memory is 
returned to the kernel? I didn't read the page you referenced that way.


I know that if I allocate memory by memory mapping anonymous regions 
with Linux/Unix I can give it back, but is that the technique that JVMs use?


It's not generally a problem though. Provided you have a compacting 
garbage collector (and the Sun Java GC is one) then the unused memory 
will just get paged out. It may be a different story on Windows and it's 
certainly a different story on an embedded platform, but releasing 
memory to the kernel under Linux is not generally necessary or desirable.


jch




Re: Websphere and Dark Matter

2007-01-16 Thread John Haxby

Rollo du Pre wrote:

We have a scenario where a web search app using Lucene causes
Websphere 5.1 allocated memory to grow but not shrink. JProfiler shows
the heap shrinks back ok, leaving the JVM with over 1GB allocated to
the jvm but only 400MB in use. Websphere does not perform a level 2
garbage collection and there is significant dark matter in the
allocated memory.

Are you expecting the process to shrink?   Processes don't, normally.

jch




Re: DateTools again

2006-10-03 Thread John Haxby

Volodymyr Bychkoviak wrote:
The user has an input (a JavaScript calendar) on a page where he can choose 
some date to include in the search. Search resolution is day resolution.


If the user enters the same date at different times of day he will get 
different results (because the calendar will also set the current hour and 
minute in the date). But this is not the right behavior.


I propose not to use GMT when rounding times to the selected resolution. 
This will prevent the situation described above.


This can be done by replacing the line "Calendar cal = 
Calendar.getInstance(GMT);" with "Calendar cal = Calendar.getInstance();"


I don't think you're improving matters there: you might be 
cancelling-out the effects of timezone adjustment when everyone is in 
the same timezone, but if you have users on a browser in one timezone 
and the server is in a different timezone then you're in for 
interestingly broken results.


There's also the interesting decision about when a day starts.   You're 
using "Etc/GMT-2" instead of (for example) "Europe/Moscow" -- do you 
have daylight savings time?   What happens on the day the clocks 
change?   Is the answer different for spring and autumn?   If a document 
is dated, let's say, 00:30 (half an hour after midnight) is its day 
number dependent on the time zone?   What's half an hour after midnight 
when the clocks change?


You say you're using javascript to get a date in a browser -- would it 
not be better to remove the time of day there and just leave you with 
the date?   And have the date as a string so you're not dealing with 
boundary conditions?


When I was struggling with this for mail messages I eventually decided 
that it really only makes sense to deal with GMT.   If some client wants 
messages delivered on, let's say, 14-Jul-2006 then the client has to 
produce the range of times that makes most sense for it to be 
14-Jul-2006.   Here in the UK that's 13-Jul-2006 23:00:00 UTC to 
14-Jul-2006 22:59:59.   In San Francisco it would be 14-Jul-2006 
07:00:00 UTC to 15-Jul-2006 06:59:59; in Moscow, well, you work it out.   
Of course, users in San Francisco, Wokingham (where I am) and Moscow 
wouldn't see the same set of documents dated 14-Jul-2006, but none of 
them would see documents dated the day before or the day after in their 
local timezone.   If you want everyone to see the same set of documents 
for Bastille Day then use UTC throughout.


I'm not sure what you're doing in javascript, but it may be enough to 
pass the timezone correction along with the time and use that to 
get the search that you want.
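For example, a sketch (the timezone here is my assumption; pick up the real 
one from the client) of turning a local calendar day into a pair of UTC 
terms for a range query:

    import java.util.Calendar;
    import java.util.TimeZone;
    import org.apache.lucene.document.DateTools;

    // Local midnight to local midnight for 14-Jul-2006...
    TimeZone tz = TimeZone.getTimeZone("Europe/London");   // assumed client zone
    Calendar start = Calendar.getInstance(tz);
    start.clear();
    start.set(2006, Calendar.JULY, 14);                    // 14-Jul-2006 00:00 local
    Calendar end = (Calendar) start.clone();
    end.add(Calendar.DAY_OF_MONTH, 1);                     // 15-Jul-2006 00:00 local

    // ...becomes two UTC terms, because DateTools renders dates in UTC.
    String lower = DateTools.dateToString(start.getTime(), DateTools.Resolution.MINUTE);
    String upper = DateTools.dateToString(end.getTime(), DateTools.Resolution.MINUTE);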


jch




Re: DateTools again

2006-10-02 Thread John Haxby

Volodymyr Bychkoviak wrote:

I'm using DateTools with Resolution.DAY.
I know that dates internally are converted to GMT.

Converting dates 2006-10-01 00:00 and 2006-10-01 15:00 from 
Etc/GMT-2 timezone will give us

20060930 and 20061001 respectively.

But these dates are identical with day resolution.

Is this a bug or am I missing something?

They're not identical.   The first one is 2006-09-30 22:00:00 UTC and 
the second 2006-10-01 13:00:00 UTC.


I ran across the problem with DateTools not using UTC when I tried to 
use an index created in California from the UK: I was looking for 
documents with a particular date stamp but I found documents with a date 
stamp from the wrong day.  Even more interesting and bizarre things 
happen around the change from daylight savings time to normal time.


jch




Re: DateTools again

2006-10-02 Thread John Haxby

John Haxby wrote:
I ran across the problem with DateTools not using UTC when I tried to 
use an index created in California from the UK: I was looking for 
documents with a particular date stamp but I found documents with a 
date stamp from the wrong day.  Even more interesting and bizarre 
things happen around the change from daylight savings time to normal 
time.
That's confusing, isn't it?   Originally DateTools didn't use UTC for its 
conversions: I submitted a patch some time ago (well before 2.0) that 
made it use UTC.   Does that make it less confusing?


jch




Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby

lude wrote:

Hi John,

thanks for the detailed answer.

You wrote:

If you're indexing a
multipart/alternative bodypart then index all the MIME headers, but only
index the content of the *first* bodypart.


Does this mean you index just the first file-attachment?
What do you advise, if you have to index multipart bodies (== more than
one file-attachment)?
One lucene-document for each part (== file)?
How do you handle the queries?
MIME has no concept of "attachment"; that's something that the user 
agent programs have a concept of -- you attach a file to a message.  
The file might be a picture, a word document, a compressed tar archive 
-- as far as MIME is concerned they're all the same (well, apart from the 
content-* headers that describe what's attached).   The MIME type for 
a message with attachments is "multipart".   There are several 
subtypes though.   If you're typing a plain text message (whose MIME 
type is text/plain, a message like this one) and you attach a jpeg image 
to it you'll be sending a message whose type is multipart/mixed;  the 
first part will have type text/plain and the second image/jpeg.   In 
Google Mail, under "more options" you can "show original" to see the 
complete MIME message and you'll see the different parts separated by a 
boundary.


OK.   Now I'm in a position to answer your question.   Often, when you 
send an HTML formatted message the content of the message is sent twice: 
once as text/plain and once as text/html (or multipart/related if it has 
pictures and stuff).   The two parts are alternatives, apart from the 
formatting (and pictures) there's no difference between the two parts, 
you can read either.  The best fidelity of the alternatives (and there 
can be more than two) is last, the poorest fidelity first, but the 
intent of the sender is that you can read any of them.   This is a 
multipart/alternative bodypart.   Because all parts of the 
multipart/alternative have the same text then you can index any of them, 
so index the first as that's going to be the easiest to process (it's 
almost always going to be text/plain).
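In javamail terms that's only a couple of lines; a sketch, assuming part is 
a javax.mail.Part you've already identified as multipart/alternative:

    import javax.mail.BodyPart;
    import javax.mail.Multipart;

    // Only index the first (simplest) alternative; the others say the same thing.
    Multipart alternatives = (Multipart) part.getContent();
    BodyPart first = alternatives.getBodyPart(0);
    Object content = first.getContent();   // almost always a String for text/plain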


I've skipped loads.   You need to read the RFCs.   Start with RFC2045 
(http://www.rfc.net/rfc2045.html) and keep going.  If you get stuck with 
the details of how messages are constructed, go back and read RFC2822 
first, or at least skim it (it's quite long).  Note that RFC2045 
references RFC822 in its abstract; wherever you see references to 
RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822 
respectively -- the newer ones are a little more precise when they need 
to be and have rather more explanation of awkward cases that you need to 
know about.


Someone earlier (and I'm sorry, I deleted the message before realising 
I should reply) said something about attached files really being in an 
attached .tar.gz file.   Well, yes and no.   An attached compressed tar 
archive is a bodypart like any other and will need to be indexed like 
any other.   That will involve breaking it open and indexing the files 
that it contains.   It's not really any different to indexing an open 
office document (which is actually a zip file).


You also mentioned indexing each bodypart (attachment) separately.   
Why?   When I'm searching, am I going to look for the word "xyzzy" in 
the first bodypart?   What if it was a multipart/alternative and 
Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is 
something that couldn't be rendered in the (first) text/plain 
alternative?   To my mind, there is no use case where it makes sense to 
search a particular bodypart.  There *might* be a case for searching the 
"prime" bodypart and attachments, but when you read the MIME spec 
you'll realise that detecting what the user sees as an attachment is not 
easy: it gets even harder when you discover that different mail user 
agents have different and legal (and sometimes reasonable) ways of 
deciding whether to treat something as in-line or as an attachment.   To 
be honest, people don't remember whether something was an attachment.   
They think "I remember reading about xyzzy in a mail message" and go off 
looking for that.   They often can't tell, and remember even less, whether 
the "xyzzy" was in something that you decided was an attachment.   And 
if your rules for deciding whether you have something that's intended to 
be viewed as an attachment or in-line are different to the rules that 
the user's mail reader is using then you'll have Awkward Bugs to 
explain.   You'll read about "Content-Disposition" in the RFCs, but 
don't believe that it's a foolproof way of deciding whether or not 
something is an attachment: lack of a content-disposition header doesn't 
mean "inline" or "attachment", and Microsoft, bless, have weird rules all 
of their own for deciding whether to display something in-line or not.


jch


Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby

lude wrote:

You also mentioned indexing each bodypart (attachment) separately.
Why? 
To my mind, there is no use case where it makes sense to search a 
particular bodypart


I will give you the use case:

[snip]
3.) The result list would show this:
1. mail-1 'subject'
'Abstract of the message-text'
2. mail-2 'subject'
Attachment with name 'filename.doc' contains 'Abstract of
file-content'

Another Use-Case would be an extended search, which allows to select if
attached files
should be searched (yes or no).


That's a good use case. File it as a bug and close it WONTFIX :-) The 
problem that you have is trying to determine whether something is going 
to be inline or an attachment. I'll give you a real-life example that 
caught out some old code the other day. We had a message with this 
structure:


multipart/alternative
text/plain
multipart/related
text/html
image/gif
image/gif
application/msword

Is there an attached file in there? Think before you read on.






The answer should be "no". Are you surprised that at least one client 
decided that there was? What we have is three representations of the 
same document: plain text, HTML (with two pictures) and MS Word. The 
original, the Word document, obviously has the best fidelity and comes 
last. The one client I'm thinking of (and I've lost track of which one 
it was) correctly suppressed the display of the text/plain alternative, 
displayed the HTML with its pictures in-line and then mistakenly 
displayed the Word document as an attachment.


This is a fictional example, but it could exist:

multipart/related
text/html
image/gif
application/msword

The gif image (and let's assume it can be indexed sensibly) is 
obviously a picture in the HTML bodypart. What's the word document? 
It's referenced from the HTML as a link just like the picture is. Is it 
an attachment? What's the difference between the word document 
referenced as a link within the multipart/related (by content-id) and a 
link to an external document (by http URL)? From a user perspective both 
are the same, but is one an attachment and the other not? I'm being 
unfair, this is not only an unrealistic problem but there isn't a right 
or a wrong answer. The word document isn't an attachment because it 
doesn't (or shouldn't) appear in the list of attachments and it's not 
in-line because you have to click on something to see it.


So yes, I agree, your use-cases are good; I'm just not sure how you're 
going to identify an attachment :-)


I do like the idea, though, of when you do a search for "xyzzy" that you 
get the abstract of the bodypart that contains "xyzzy" rather than the 
abstract (or subject) of the entire message and I'm going to think about 
that one some more. The problem that immediately springs to mind though 
is that a message can have an arbitrary number of bodyparts so if I have 
BODY-1, BODY-2, ..., BODY-N (where N is unknown) how hard is it for me 
to construct the search? I think I probably should construct the search 
that way because the score depends upon the size of the document and it 
seems to make sense that the document is the bodypart, not the entire 
message, but it seems more complex than is useful for mail messages.


jch




Re: Best Practice: emails and file-attachments

2006-08-16 Thread John Haxby


Oh rats. Thunderbird ate the indenting. The two examples should be:

multipart/alternative
    text/plain
    multipart/related
        text/html
        image/gif
        image/gif
    application/msword

and

multipart/related
    text/html
    image/gif
    application/msword

The indenting indicates nesting. A message isn't just a bodypart 
followed by attachments; it has structure, like a file system, something 
which escapes most mail readers. Sigh.





jch








Re: Best Practice: emails and file-attachments

2006-08-15 Thread John Haxby

lude wrote:

does anybody have an idea what the best design approach is for realizing
the following:

The goal is to index emails and their corresponding file attachments.
One email could contain, for example:
I put a fair amount of thought into this when I was doing the design for 
our mail server -- I know about mail :-)   After a little trial and 
error I came up with the following scheme:


  1. All header fields indexed under their own name with the name
 converted to lower case.
  2. Almost all bodyparts indexed in a single field called BODY (in
 upper case)
  3. Meta-data such as SIZE, DELIVERY-DATE and similar indexed with
 uppercase fields
  4. Extensions for other bodypart-specific or application-specific
 fields indexed as something with an initial uppercase letter and
 at least one lowercase letter

That gives an extensible set of fields and doesn't require that the index 
knows ahead of time what header fields will be present or relevant.   It 
means that there are potentially a lot of fields: we're running at about 
60, depending on the user.
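In Lucene terms the scheme looks something like this.  This is just a 
sketch of the shape -- message (a javamail MimeMessage) and bodyText are 
assumed to be in scope, and it's not our production code:

    import java.util.Enumeration;
    import javax.mail.Header;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // 1. header fields under their own, lower-cased, names
    for (Enumeration e = message.getAllHeaders(); e.hasMoreElements();) {
        Header h = (Header) e.nextElement();
        doc.add(new Field(h.getName().toLowerCase(), h.getValue(),
                          Field.Store.NO, Field.Index.TOKENIZED));
    }
    // 2. almost all bodypart text in a single upper-case BODY field
    doc.add(new Field("BODY", bodyText, Field.Store.NO, Field.Index.TOKENIZED));
    // 3. meta-data in upper-case fields
    doc.add(new Field("SIZE", String.valueOf(message.getSize()),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));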


Some header fields are special.   The various message-id fields 
(Message-Id, Resent-Message-Id, In-Reply-To and References) need to have 
their message-ids carefully extracted and then indexed untokenized.   
Recipient fields (to, cc, from, etc) need to be parsed and then have their 
addresses re-assembled as a friendly-name and an RFC822 address -- the 
reason for the re-assembly is that addresses can be presented in 
equivalent but odd fashions.   Most header fields can have RFC2047 
encoded text which needs to be decoded.


When indexing the bodyparts you need to be a little careful.   In 
general, the MIME headers for each part are all indexed as other message 
headers (content-id is a messge id field) and I also indexed the 
canonical content type under a CONTENT-TYPE field, again to get rid of 
fluff so that I can search for, say, 
CONTENT-TYPE:application/x-vnd-powerpoint to find all those annoyingly 
huge messages :-)  An attached message probably doesn't want all its 
headers indexed: subject is good; recipients are probably bad as it'll 
confuse the normal search and give unexpected results; message-id fields 
are almost certainly a bad idea.  If you're indexing a 
multipart/alternative bodypart then index all the MIME headers, but only 
index the content of the *first* bodypart.


Does that all make sense?  Javamail is great for this, it's good at 
parsing and extracting the content of messages.  However, it's not 
enough to just read what I've said and the javamail doc unless you're 
intimately familiar with the MIME RFCs (I think the first one is 
RFC2045, but they're not difficult to find as they're all around RFC2047) 
and RFC2822, the message structure RFC itself.   If you just guess 
because the structure is "obvious" you'll come unstuck.


jch




Re: email libraries

2006-07-30 Thread John Haxby

Andrzej Bialecki wrote:
Just for the record - I've been using javamail POP and IMAP providers 
in the past, and they were prone to hanging with some servers, and 
resource intensive. I've been also using Outlook (proper, not Outlook 
Express - this is AFAIK impossible to work with) via a Java-COM bridge 
such as Jawin or JNIWrapper plus Redemption. This also tends to be 
rather unstable, and requires a lot of fine-tuning ...
We use javamail a *lot* with the Scalix IMAP server (the web access part 
uses IMAP underneath).   We have had performance problems with the way 
that javamail works, although for just scanning a message store to index 
messages it's OK.   We have tuned the web access code to make 
it behave better, but we've also re-engineered the IMAP server, 
partly with javamail in mind, and performance and resource usage on the 
server are now largely under control.

So, be prepared to suffer quite a bit. ;)
If you're doing complicated things, yes, but if it's simple access for 
the purposes of indexing then you probably don't need to worry too much.


jch





Re: email libraries

2006-07-26 Thread John Haxby

Suba Suresh wrote:
Anyone know of good free email libraries I can use for lucene indexing 
for Windows Outlook Express and Unix emails??
javamail.   Not sure how you get hold of the messages from Outlook 
Express, but getting hold of the MIME message in most Unix-based message 
stores is relatively easy.   You might, however, prefer to go down the 
POP or IMAP route for getting hold of the messages to index -- either 
way, javamail is your friend.
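To give a flavour, fetching the messages over IMAP with javamail is only a 
handful of lines (the host, user and password below are placeholders):

    import java.util.Properties;
    import javax.mail.Folder;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Store;

    Session session = Session.getInstance(new Properties());
    Store store = session.getStore("imap");
    store.connect("imap.example.com", "user", "password");   // placeholders
    Folder inbox = store.getFolder("INBOX");
    inbox.open(Folder.READ_ONLY);
    Message[] messages = inbox.getMessages();   // hand these to your indexer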


jch




Re: Matching accented with non-accented characters

2006-07-25 Thread John Haxby

Rajan, Renuka wrote:
I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages.  The use case is that the users may type letters without accents in error and we still want to be able to retrieve valid matches.  The one idea, albeit naïve, is to normalize the data on the inbound side as well as the data in the database (prior to full text indexing) and retrieve matches.  
  
Look back through the archives a bit for ISOLatin1AccentFilter.  It 
almost does the job and works reasonably well for Western European 
characters.   You'll also find a posting of mine that presents a 
somewhat more complete filter based on the unicode decompositions.   If 
you can't find it I'll dig out the stuff I wrote and re-post it (and 
then maybe some kind soul will add it alongside ISOLatin1AccentFilter).


Eric Jain's comment about "ä" being converted to "a" instead of "ae" is 
a fair one, but it probably doesn't much matter.  Although I have seen 
"Müller" written as both "Muller" and "Mueller", so you're not going to 
be able to please everyone all the time without injecting synonyms and 
being very clever.   And if you're that clever you might catch both 
"encyclopedia" and "encyclopædia" -- the latter converted to 
"encyclopaedia", which isn't the same as "encyclopedia"!
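If you can't find that posting, the decomposition idea itself is easy to 
sketch with java.text.Normalizer on a recent JDK (an illustration only -- 
the filter I posted predates that class):

    import java.text.Normalizer;

    // Decompose to NFD, then drop the combining marks: "Müller" -> "Muller".
    // Note this maps ä to "a", not "ae" -- exactly Eric Jain's point.
    public static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }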


jch




Re: indexing emails

2006-06-19 Thread John Haxby

Michael J. Prichard wrote:
I am working on indexing emails and want to have a to field.  I am 
currently putting all the emails on one line seperated w/ 
spaces...example:


[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]

Then I index that with a StandardAnalyzer as follows:

doc.add(new Field("to", (String) itemContent.get("to"), 
Field.Store.YES, Field.Index.UN_TOKENIZED));


Question is...is this the best way to do it?  I want to be able to 
search for [EMAIL PROTECTED] and pick out just those Documents, etc.
I took a slightly different approach.   Using javamail, given a To: line 
like this:


   To: Fred Smith [EMAIL PROTECTED], 
=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= [EMAIL PROTECTED]


I re-constructed the address list to look like this:

   Fred Smith [EMAIL PROTECTED] Keld Jørn Simonsen [EMAIL PROTECTED]

and fed that to the analyser.   I forget which analyser we eventually 
settled on, but the [EMAIL PROTECTED] turns into the tokens "fred", 
"example" and "com".   This actually gives rise to a remarkably natural 
way of searching for addresses.   People do things like searching for 
"lucene.apache.org" to look for mail sent to the lucene lists; they 
search for me variously as "jch", "john haxby" and "haxby"; they even, 
occasionally, search for complete mail addresses.   They all work.
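The reconstruction itself is javamail doing all the work; a sketch, with 
rawToHeader assumed to be the raw To: header value:

    import javax.mail.internet.InternetAddress;

    // Parse the raw header, then re-assemble each address as
    // "friendly-name address" so the analyzer sees a uniform form.
    InternetAddress[] addresses = InternetAddress.parse(rawToHeader, false);
    StringBuffer to = new StringBuffer();
    for (int i = 0; i < addresses.length; i++) {
        String personal = addresses[i].getPersonal();   // RFC2047-decoded name
        if (personal != null) {
            to.append(personal).append(' ');
        }
        to.append(addresses[i].getAddress()).append(' ');
    }
    // to.toString() is what gets fed to the analyzer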


The RFC2047 syntax in the example above gives one hint as to the 
minefield that address parsing can be.   If you look at the javamail 
spec, you'll also see reference to group-syntax -- it's often seen as


   undisclosed-recipients:;

but you'll also occasionally see

   example-group: [EMAIL PROTECTED], [EMAIL PROTECTED];

Javamail knows how to parse these and I threw away the group name and 
just indexed the addresses.   It might've been better to keep the group 
name, but groups aren't that widely used so it probably doesn't make 
much difference.


Other headers cause headaches as well.   Things like the subject can be 
RFC2047 encoded so you'll need to decode them.  The various message-id 
headers are also slightly problematic.   If you're using "message-id" 
and "references" and "in-reply-to" you'll need to be careful -- the 
individual message-id's will need their angle brackets removed and they 
really ought not to be tokenized.
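That is, something along these lines (a sketch; rawId is one message-id 
value and doc is the Lucene Document):

    // Strip the angle brackets and index the message-id exactly as-is.
    String id = rawId.trim();
    if (id.startsWith("<") && id.endsWith(">")) {
        id = id.substring(1, id.length() - 1);
    }
    doc.add(new Field("message-id", id, Field.Store.NO, Field.Index.UN_TOKENIZED));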


It's also worth indexing *all* the message headers.   People do do 
searches on some odd things.   I also index the raw content-type as well 
-- those huge presentations can be found and deleted by searching for 
content-type:application/vnd.ms-powerpoint.   Or at least I could.  It 
seems to be broken at the moment :-(


jch




Re: indexing emails

2006-06-19 Thread John Haxby

Michael J. Prichard wrote:
We are actually grabbing emails by becoming part of the SMTP stream.  
This part is figured out and we have archived over 600k emails into a 
mysql database.  The problem is that since we currently store the 
blobs in the DB these databases are getting large and searching takes 
plenty of time.  We want to convert the searching to lucene to add 
more advanced features.



In which case, javamail is your friend.


Can I have multiple to, from and bcc fields?
Yes.   And it's definitely worth your while to study not only javamail 
but the MIME RFCs (RFC2047 deals with headers, a nearby one deals with 
the main MIME format, I forget the number) and RFC2822 for the base mail 
format -- understanding the structure of a mail message is more than 
half the battle to knowing how to index it.
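Lucene is quite happy with repeated fields, by the way -- the addresses 
here are made up:

    // Two "to" values on one document; a search on the to field matches either.
    doc.add(new Field("to", "fred@example.com", Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("to", "jane@example.com", Field.Store.NO, Field.Index.TOKENIZED));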


jch




Re: Encryption

2006-05-06 Thread John Haxby

George Washington wrote:
Is it possible to reconstruct a complete source document from the data 
stored in the index, even if the fields are only indexed but not 
stored? Because if the answer is yes there is no point in 
encrypting, unless the index itself can be encrypted. Is it feasible 
to encrypt the index?


As Otis said, no, in general it's not possible to reconstruct the 
document, but that might not be needed to discover interesting things 
about the documents.  For example, it's interesting to know if lots of 
documents contain the word "zimbra" or the words "eliminate" and 
"competition" -- smile -- near each other.   I'm sure you can dream up 
your own examples.


If your documents need to be encrypted to protect them, then I would say 
that your index does as well.   Although you can't reconstruct the 
original document you can certainly find out all kinds of interesting 
things even just by searching for phrases, let alone looking at what 
terms were indexed.


As Otis also suggested, your best bet is going to be to use an encrypted 
file system; both Linux and Windows offer this capability.   Of course, 
and I presume you've already done this for the documents, you're going to 
have to worry about the safety of the key used to encrypt the file 
system and all the other associated security issues.


jch (I wouldn't be paranoid if they weren't out to get me)




Re: Performance and FS block size

2006-02-13 Thread John Haxby

Andrzej Bialecki wrote:

None of you mentioned yet the aspect that 4k is the memory page size 
on IA32 hardware. This in itself would favor any operations using 
multiple of this size, and penalize operations using amounts below 
this size.


For normal I/O it will rarely make any difference at all: the return 
results from read(2) are copied from kernel space to user space.   Under 
some rare conditions it can make a difference if the copy causes a page 
fault for user-space memory, but that can happen with any buffer size.   
Memory-mapped I/O does take into account the VM page size, but that's 
entirely in the kernel's domain.   I believe (though I haven't checked 
lately) that memory mapping does avoid the final copy, and it certainly 
does avoid system calls so it has the potential to be as fast as the 
underlying I/O subsystem allows it to be.   However, there are a few 
pathological cases where memory-mapped I/O is slower and you have to be 
very careful about the size of the file you're dealing with (unless 
you're running in a 64 bit process).


As Paul Elschot mentioned, the design of Lucene is the most important 
thing: it knows about locality of reference and does the right thing.


jch




Re: Performance and FS block size

2006-02-12 Thread John Haxby

Otis Gospodnetic wrote:


I'm somewhat familiar with ext3 vs. ReiserFS stuff, but that's not really what 
I'm after (finding a better/faster FS).  What I'm wondering is about different 
block sizes on a single (ext3) FS.
If I understand block sizes correctly, they represent a chunk of data that the 
FS will read in a single read.
- If the block size is 1K, and Lucene needs to read 4K of data, then the disk 
will have to do 4 reads, and will read in a total of 4K.
- If the block size is 4K, and Lucene needs to read 3K of data, then the disk 
will have to do 1 read, and will read a total of 3K, although that will 
actually consume 4K, because that's the size of a block.
 

That's correct, Otis.   Applications generally get best performance 
when they read data in the file system block size (or small multiples 
thereof), which for ext2 and ext3 is almost always 4k.  It might be 
interesting to try making file systems with different block sizes and 
see what the effect on performance is and also, perhaps, trying larger 
block sizes in Lucene, but always keeping Lucene's block size a multiple 
of the file system block size.   For an educated guess, I'd say that 
4k/4k gives better performance than smaller file system block sizes and 
8k/4k is not likely to have much of an effect either way.



Does any of this sound right?
I recall Paul Elschot talking about disk reads and disk arm movement, and 
Robert Engels talking about Nio and block sizes, so they might know more about 
this stuff.
 

It depends very much on the type of disk: 15,000 rpm ultra-scsi 320 
disks on a 64 bit PCI card will probably be faster than a 4200rpm disk 
in a laptop :-)   Seriously, disk configuration makes a lot of 
difference: striped RAID arrays will give the best I/O performance 
(given a controller and whatnot that can exploit that).   Once you get 
into huge amounts of I/O there are other, more complex issues that affect 
performance.


java.nio has the right features to exploit the I/O subsystem of the OS 
to good advantage.   We haven't done the performance measurements yet, 
but memory-mapped I/O should yield the best performance (as well as 
freeing you from worrying about what block size is best).   It will 
also be interesting to try the different I/O schedulers under Linux: cfq 
is the default for the 2.6 kernel that Red Hat ships, but I can imagine 
the deadline scheduler may give interesting results.   As I say, at some 
stage over the next few months we're likely to be looking at this in 
more detail.
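For the curious, the java.nio call in question is just this (a fragment; 
file is assumed):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Map the file read-only: the kernel pages data in on demand, and the
    // per-read system calls and kernel-to-user copies disappear.
    FileChannel channel = new RandomAccessFile(file, "r").getChannel();
    MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());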


The one thing that makes more difference than anything else though is 
locality of reference; this seems to be well understood by the Lucene index 
format and is probably why the performance is generally good!


jch




Re: encoding

2006-01-27 Thread John Haxby

petite_abeille wrote:

I would love to see this. I presently have a somewhat unwieldy 
conversion table [1] that I would love to get rid of :))

[snip]
[1] http://dev.alt.textdrive.com/browser/lu/LUStringBasicLatin.txt


I've attached the perl script -- feed 
http://www.unicode.org/Public/4.1.0/ucd/UnicodeData.txt to it.   It's 
based on a slightly different principle to yours.   You seem to look for 
things like "mumble mumble LETTER X mumble" and take X as the base 
letter.   That means that, for example, ɖ (a d with a hook) gets 
converted to "d".   My script, on the other hand, deals with things like 
Ǣ (LATIN CAPITAL LETTER AE WITH MACRON) and converts it to "AE".   There 
are some differences of opinion though: you have ß mapped to "s" whereas 
I have "ss" (straße to strasse rather than strase seems right).  I 
think I'm also over-enthusiastic when it comes to mapping characters to 
spaces: I know that there are some Arabic characters that get mapped to 
spaces.   For the purposes of converting to an ASCII approximation, 
though, I suspect a combination of your approach and mine would be 
best.   What do you think?


Of course, it's still unwieldy -- the code uses a huge great switch 
statement.   It would be more aesthetically pleasing to have a class 
representing UnicodeData.txt and work out the mapping on the fly.   IBM 
have some Unicode stuff that deals with decomposition and uses a similar 
algorithm (I think) to the one I use.   The standard java.lang.Character 
has everything but the decompositions needed to implement what I do in 
perl in Java: generating a map of decompositions isn't difficult though.   
However, I doubt whether the reduction in code size would make it run 
faster, and certainly looking at the name of the letter to determine the 
nearest ASCII equivalent is going to be slow.


jch



mkswitch.pl
Description: Perl program

Re: encoding

2006-01-26 Thread John Haxby

arnaudbuffet wrote:


if I try to index a text file encoded in Western 1252, for example with the 
Turkish text "düzenlediğimiz kampanyamıza", the lucene index will contain 
re-encoded data with #0;#17;k#0;#0; 
 


ISOLatin1AccentFilter.removeAccents() converts that string to
"duzenlediğimiz kampanyamıza".  The g-breve and the dotless-i are
untouched.  My AsciiDecomposeFilter.decompose() converts the string to
"duzenledigimiz kampanyamiza".

However, since you're seeing those rather odd entities, it looks as
though you're not actually indexing what you think you're indexing. As
Erik says, you need to make sure that you're reading files with the
proper encoding; removing accents and adding dots won't help.

jch






Re: Memory

2006-01-17 Thread John Haxby

Aigner, Thomas wrote:


I did a man on top and sure enough there was a PPID command on
Linux (f then B) for parent process.  And yes, they always have the same
parent command.  Thanks for your help as I'm obviously still a noob on
Unix.
 

Nope, that doesn't tell you they're different threads in the same 
process -- look at all the obviously different processes whose PPID 
(parent PID) is 1 -- ps -ef | less, the first screenful or so will 
all have a PPID of 1.   A single process can have lots of different 
children and many do.


It does depend on which version you're running, but look for "thread" in 
the ps(1) man page.   On the system I've got here it says "-m" shows all 
threads; 2.4-based machines may show all threads by default (but not 
RHEL3, which is closer to 2.6 than 2.4).


Although it's not definitive, finding processes with the exactly the 
same memory profile -- same resident set size (RSS), same total size 
(SIZE), etc is a bit of a giveaway: any two processes are unlikely to be 
using exactly the same amount of memory, any two threads are, by 
definition, using the same memory and must show the same statistics 
(well, to within stack size, but we can ignore that mostly).


Possibly a good way to look at this is with top.  The "H" key toggles 
threads (where it can detect them) and you get pretty good memory 
reporting.   If you can see all your java threads then you'll see that 
they are all suspiciously similar.


Anyway, the reason that you see threads and processes confused is that 
there is very little difference between the two in Linux.   Indeed, the 
lack of distinction caused some problems with the POSIX thread semantics 
that weren't fixed until 2.6 came out.   If you're running a 2.6 kernel 
then look at /proc/<pid> where <pid> is the PID of the java process.   
In that you'll see a sub-directory called "task" which contains 
sub-directories for each thread; there's normally only one task, but 
multi-threaded processes will have several, e.g.


$ ps -e | grep java
29918 ?00:02:38 java
$ ls /proc/29918/task
29918  29920  29922  29924  29928  29930  29937  8993
29919  29921  29923  29925  29929  29934  6875   917

jch





Re: Cache index in RAMDirectory and evict

2006-01-12 Thread John Haxby

Kan Deng wrote:

1. Performance. 


  Since all the cached disk data resides outside the JVM
heap space, the access efficiency from Java objects to
that cached data cannot be very high.
 

True, but you need to compare the relative speeds.   If data has to be 
pulled from a file, then you're talking several milliseconds to fetch 
from the disk.  If it's in the OS's cache (and here I'm rather assuming 
Linux since that's what I know about) you're talking about microseconds 
rather than milliseconds to fetch the data from the OS.   Once the data 
is in the JVM, but not in the CPU cache, then you're down to nanoseconds 
to get the data from main memory (how many depends on the hardware; some 
platforms take a while to get the data moving but when it comes, it's 
very quick; some systems are fast to get going but don't have the 
throughput).   It's not the absolute times that are important though: 
once you've got the data in the OS's cache then things like network 
latency, display update speed and scheduling overheads begin to make 
themselves felt and you won't make these any less by getting data into 
the JVM's memory.   Well, not much anyway.



2. Volatile.

  Since the OS caches the disk data in a common area
shared by multiple processes, not only the JVM, if
there are other processes doing disk IO at the same
time, chances are the cached Lucene index data from
disk may be wiped. 
 

What you can do by hanging on to a lot of memory is make the overall 
machine performance worse.  In fact by denying other processes memory, 
you're going to force up the I/O rate and when you do need to go to the 
disk then it'll take much longer -- net result, things run slower.
Generally speaking, because the OS has a more holistic view of resource 
management, you'll get better overall performance.



Therefore, a more reliable and efficient cache should
reside inside JVM heap space. But due to the crowded
JVM heap space, we have to manually evict the less
frequently used data from the cache. 
 

It's that last sentence that is the critical one.   Yes, you can do your 
own cache management, but how much better are you going to be than the 
OS?Well, you _can_ be a lot better since you know what you're 
doing.   You can also be a _lot_ worse when you get it wrong.   Choosing 
the right point to flush data from the cache (evict) is not all that 
straightforward: the OS buffer cache was introduced into BSD unix in the 
early '80s and we're still seeing work going on to improve the basic 
strategy 20-odd years later.


If you find that you're spending an inordinate amount of time waiting 
for I/O for the index from the OS, then that is the time to start 
looking at caching strategies.   My own feeling is that you're going to 
find easier things to fix before you get that far.
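For completeness: if, after measuring, you really do want the whole index 
inside the JVM, Lucene already has the machinery.  A sketch (the path is 
made up):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;

    // Copy the on-disk index into the heap once, at start-up.
    RAMDirectory dir = new RAMDirectory("/path/to/index");   // made-up path
    IndexSearcher searcher = new IndexSearcher(dir);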



Did I mis-understand anything?
 

Probably not, it's just that performance is more of a holistic thing, 
and an obvious, isolated change isn't going to have the effect that you 
want.


jch





Re: lucene and UTF-8

2005-09-29 Thread John Haxby

John Cherouvim wrote:

I'm having some problems indexing my UTF-8 html pages. I am running 
lucene on Linux and I cannot understand why the index generated 
depends on the locale of my operating system.
If I do "set | grep LANG" I get LANG=el_GR, which is Greek. If I set 
this to en_US the index generated will be different. Why is this the 
case? My HTMLs are all UTF-8.


What version of Linux are you using?

On Fedora Core 4 (and probably other Fedoras and RHEL), LANG=el_GR sets 
the character set to ISO 8859-7, e.g. (on my various machines):


   $ LANG=el_GR date | iconv -f iso88597
   Πεμ Σεπ 29 11:59:19 BST 2005
   $ LANG=el_GR.utf8 date
   Πεμ Σεπ 29 12:01:40 BST 2005

(Everything in FC4 is UTF-8 so it displays right and it seems that the 
Greek for Sep is Sep -- no surprises there I guess.)


In your case, replacing "date" with whatever command you use 
to generate the indexes should do the right thing.
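And in the indexing code itself the robust fix is to never depend on the 
default locale when reading the files, e.g. (a fragment; htmlFile is 
assumed):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    // Read the HTML with an explicit charset so the result no longer
    // depends on $LANG / the platform default encoding.
    Reader reader = new InputStreamReader(new FileInputStream(htmlFile), "UTF-8");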


jch




Re: Search for documents where field does not exist?

2005-06-20 Thread John Haxby

Erik Hatcher wrote:



On Jun 17, 2005, at 5:54 PM, [EMAIL PROTECTED] wrote:

Please do not reply to a post on the list if your message actually  
isn't a

reply. Post a new message instead.



Sorry about that.. wasn't intentional.. clicked reply to get the reply
address and then forgot to change the subject :)



Even changing the subject after doing a reply is not sufficient as  
it will still end up in the same thread erroneously.  You need to  
create a new message to start a new thread.  (certainly this varies  
by mail client, though)


The reason for this is that when you reply to a message you get an 
"in-reply-to" header that refers to the message-id of the original 
message; you may also get a "references" header that performs a similar 
function for other messages in the thread.   I haven't come across a mail 
client that drops the in-reply-to when you change the subject.


If you want to find out exactly what your particular mail client does, 
reply to a message of your own and then look at the message source or 
the original message headers (what you're looking for depends on the 
client; sometimes it's obvious, sometimes it's hidden in message properties).


jch




Re: Optimizing indexes with mulitiple processors?

2005-06-10 Thread John Haxby

Chris Collins wrote:


Ok, that part isn't surprising.  However only about 1% of 30% of the merge was
spent in the OS.flush call (not very IO bound at all with this controller).
 

On Linux, at least, measuring the time taken in OS.flush is not a good 
way to determine if you're I/O bound -- all that does is transfer the 
data to the kernel.   Later, possibly much later, the kernel will 
actually write the data to the disk.


The upshot of this is that if the size of the index is around the size 
of physical memory in the system, optimizing will appear CPU bound.   
Once the index exceeds the size of physical memory, you'll see the 
effects of I/O.   OS.flush will still probably be very quick, but you'll 
see a lot of I/O wait if you run, say, top.


jch




Re: NFS

2005-05-18 Thread John Haxby
Otis Gospodnetic wrote:
I haven't used Lucene with NFS.  My understanding is that the problem
is with lock files when they reside on the NFS server.  Yes, you can
change the location of lock files with a system property, but if you
are using NFS to make the index accessible from multiple machines, then
changing the lock file directory to a local directory doesn't make
sense.  However, it sounds like you are using a NFS-mounted partition
simply because that's where you have sufficient space, not because you
need to access the index from multiple machines, so you should be OK
with changing the lock file directory to a local dir.
 

I haven't tried this, but under Linux (at least), you can specify the 
"nolock" option to make file locking happen locally.   Of course, this 
will make it impossible to use NFS to share the index among several 
machines, but, as Otis said, that doesn't seem to be the requirement here.

jch


Re: NFS

2005-05-18 Thread John Haxby
Paul Libbrecht wrote:
On 18 May 05, at 11:51, John Haxby wrote:
I haven't tried this, but under Linux (at least), you can specify the 
"nolock" option to make file locking happen locally.   Of course, 
this will make it impossible to use NFS to share the index among 
several machines, but, as Otis said, that doesn't seem to be the 
requirement here.
Just a hint that we have found using Lucene indexes on NFS 
partitions to be much, much slower than local partitions... aside from 
the little lock issues.
I had forgotten that Lucene uses lock files rather than file locks, so 
the "nolock" option wouldn't help.   If (on Linux; I don't know if 
this is possible on an R100) the file system is exported "async" then 
writing files is going to be rather quicker.   However, frequently 
creating lock files is still likely to be slow -- I don't know the exact 
(OS level) mechanism that lucene uses, but the common mechanisms that I 
can think of are going to be slow regardless of the NFS settings.

If you move the lock files into a local file system then you're just 
left with normal NFS performance issues.   If you've got a gigabit 
network then chances are you'll be able to achieve NFS speeds roughly 
comparable with local IDE disks (50MB/s, ish).   If you're restricted to 
a normal 100Mb network then you're not going to get that kind of 
performance.   Using "async" will greatly help when writing an index, 
as, probably, will large buffer sizes.   You'd need to experiment to find 
out what makes a big difference.   The main thing, though, is that the 
default parameters for NFS aren't going to give sparkling performance 
for Lucene.

jch


Re: Top most frequent words

2005-05-12 Thread John Haxby
Otis Gospodnetic wrote:
Somebody asked about this today, and I just found this through Simpy:
 http://www.unine.ch/info/clef/
Scroll half-way through the page, look on the right side:  1,000 most
frequent words for several languages.
 

Hmm.  I'm not sure how valuable that is.   For English, "los" and 
"angeles" are ranked 99 and 101 respectively and "officials" comes in at 
125.   Obviously I'm guessing, but those middle-ranking words have come 
from a slightly skewed source -- newspapers over a fixed interval 
perhaps.  (I don't think "Los Angeles" makes it into everyday parlance 
in the UK, and "officials" suggests that we're obsessed with bureaucracy 
:-))

jch