Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
Otis Gospodnetic wrote:
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, but I always hear people complaining about the speed.  A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.
 

Also... MySQL full-text search isn't perfect. If you're not a Java 
programmer it would be difficult to hack on. Another downside is that FT 
in MySQL only works with MyISAM tables, which aren't transaction aware 
and use global table locks (not fun).

I'm sure though that MySQL would do a better job at online index 
maintenance than Lucene. It falls down a bit in this area...

Kevin


Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
David Sitsky wrote:
On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
 

You are right.
Since there are C++ and now C ports of Lucene, it would be interesting
to integrate them directly with DBs, so that the RDBMS full-text search
under the hood is actually powered by one of the Lucene ports.
   

Or to see Lucene + Derby (a 100% Java embedded database donated by IBM, 
currently in Apache incubation) integrated together... that would be 
really nice and powerful.

Does anyone know if there are any integration plans?
 

Don't forget Berkeley DB Java Edition... that would be interesting too...
Kevin



Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.
You know... it looks like the problem is that TermInfosReader uses 
INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the 
offsets that I need.

If this is going to be a practical way of reducing Lucene memory 
footprint for HUGE indexes then it's going to need a way to change this 
value based on the current index that's being opened.

Is there any way to determine the INDEX_INTERVAL from the file?  
According to:

http://jakarta.apache.org/lucene/docs/fileformats.html

the .tis file (and, according to the docs, the .tii file is very similar 
to the .tis file) should have this data:

TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, 
SkipInterval, TermInfos

The only problem is that the .tii and .tis files I have on disk don't 
have a constant preamble, and it doesn't look like there's an index 
interval here...
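
Just to illustrate what I'd expect to be able to read if the preamble were 
there (a sketch only, assuming the TIVersion, TermCount, IndexInterval, 
SkipInterval layout from the file formats page and big-endian ints/longs; 
the path is a placeholder, and a pre-1.4 file would just return garbage here):

    // Dump the .tis/.tii preamble assuming the 1.4 layout documented above.
    java.io.DataInputStream in = new java.io.DataInputStream(
        new java.io.FileInputStream( "/var/ksa/index/_abc.tis" ) );   // placeholder path

    int  tiVersion     = in.readInt();    // TIVersion
    long termCount     = in.readLong();   // TermCount
    int  indexInterval = in.readInt();    // IndexInterval
    int  skipInterval  = in.readInt();    // SkipInterval
    in.close();

    System.out.println( "TIVersion: " + tiVersion );
    System.out.println( "TermCount: " + termCount );
    System.out.println( "IndexInterval: " + indexInterval );
    System.out.println( "SkipInterval: " + skipInterval );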

Kevin


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.

It looks like you're using a pre-1.4 version of Lucene.  Since 1.4 
this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather 
TermInfosWriter.indexInterval.
Yes... we're trying to be conservative and haven't migrated yet.  Though 
doing so might be required for this move I think...

Is this setting incompatible with older indexes burned with the lower 
value?

Prior to 1.4, yes.  After 1.4, no.
What happens after 1.4?  Can I take indexes burned with 256 (a greater 
value) in 1.3 and open them up correctly with 1.4?

Kevin
PS.  Once I get this working I'm going to create a wiki page documenting 
this process.

Kevin


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote:
Not without hacking things.  If your 1.3 indexes were generated with 
256 then you can modify your version of Lucene 1.4+ to use 256 instead 
of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 
today).

Prior to 1.4 this was a constant, hardwired into the index format.  In 
1.4 and later each index segment stores this value as a parameter.  So 
once 1.4 has re-written your index you'll no longer need a modified 
version.
Thanks for the feedback, Doug.

This makes more sense now. I didn't understand why the website 
documented the fact that the .tii file was storing the index interval.

I think I'm going to investigate just moving to 1.4 ...  I need to do it 
anyway.  Might as well bite the bullet now.

Kevin


1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-24 Thread Kevin A. Burton
What's the desired pattern of use for TermInfosWriter.indexInterval?
Do I have to compile my own version of Lucene to change this?  In the 
previous API it was public static final, but this is neither public nor static.

I'm wondering if we should just make this a value that can be set at 
runtime.  Considering the memory savings for larger installs this 
can/will be important.
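
Something along these lines is the shape I'd want (purely a hypothetical 
sketch -- setTermIndexInterval() does not exist in the current API; the 
point is just that the writer would pass the value down to TermInfosWriter 
instead of using a compile-time constant):

    Directory dir = FSDirectory.getDirectory( "/var/ksa/index/", false );   // placeholder path

    IndexWriter writer = new IndexWriter( dir, new StandardAnalyzer(), false );
    writer.setTermIndexInterval( 256 );   // hypothetical runtime setter
    // ... add documents / addIndexes / optimize as usual ...
    writer.close();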

Kevin


Re: Opening up one large index takes 940M of memory?

2005-02-15 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

You can increase TermInfosWriter.indexInterval.  You'll need to 
re-write the .tii file for this to take effect.  The simplest way to 
do this is to use IndexWriter.addIndexes(), adding your index to a 
new, empty, directory.  This will of course take a while for a 60GB 
index...
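
A minimal sketch of what Doug describes -- copy the existing index into a 
new, empty directory so the term index gets rewritten (paths and the 
analyzer are placeholders):

    // Add the old index to a new, empty directory; addIndexes() merges and
    // optimizes, rewriting the .tii with whatever indexInterval is in effect.
    Directory oldDir = FSDirectory.getDirectory( "/var/ksa/index-old/", false );   // placeholder
    Directory newDir = FSDirectory.getDirectory( "/var/ksa/index-new/", true );    // create empty target

    IndexWriter writer = new IndexWriter( newDir, new StandardAnalyzer(), true );
    writer.addIndexes( new Directory[] { oldDir } );
    writer.close();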

(Note... when this works I'll note my findings in a wiki page for future 
developers)

Two more questions:
1.  Do I have to do this with a NEW directory?  Our nightly index merger 
uses an existing target index, which I assume will re-use the same 
settings as before.  I did this last night and it still seems to use the 
same amount of memory.  Above you assert that I should use a new, empty 
directory, so I'll try that tonight.

2. This isn't destructive, is it?  I mean, I'll be able to move BACK to a 
TermInfosWriter.indexInterval of 128, right?

Thanks!
Kevin


DbDirectory and Berkeley DB Java Edition...

2005-02-06 Thread Kevin A. Burton
I'm reading the Lucene in Action book right now, and on page 309 they talk 
about using DbDirectory, which uses Berkeley DB for maintaining your index.

Has anyone ever considered a port to Berkeley DB Java Edition?
The only downside would be the license (I think it's GPL) but it could 
really free up the time it takes to optimize(), I think.  You could just 
rehash the hashtable and then insert rows into the new table.

Would be interesting to benchmark I think though.
Thoughts?
http://www.sleepycat.com/products/je.shtml


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Paul Elschot wrote:
This would be similar to the way the MySQL index cache works...
   

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...
 

The problem I think for everyone right now is that 32 bits just doesn't 
cut it in production systems... 2G of memory per process and you 
really start to feel it.

Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Chris Hostetter wrote:
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open
Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint settles down after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.
 

Actually I haven't, but to be honest the numbers seem dead on.  The VM 
heap wouldn't reallocate if it didn't need that much memory, and this is 
almost exactly the behavior I'm seeing in production.

Though I guess it wouldn't hurt ;)
Kevin


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
It would be interesting to know _what_exactly_ uses your memory. 
Running under an optimizer should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.
 

I loaded it into a profiler a long time ago.  Most of the memory was due 
to Term instances being loaded into memory.

I might try to get some time to load it into a profiler on monday...
Kevin


Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
We have one large index right now... it's about 60G ... When I open it 
the Java VM uses 940M of memory.  The VM does nothing else besides open 
this index.

Here's the code:
   System.out.println( "opening..." );
   long before = System.currentTimeMillis();
   Directory dir = FSDirectory.getDirectory( 
"/var/ksa/index-1078106952160/", false );
   IndexReader ir = IndexReader.open( dir );
   System.out.println( ir.getClass() );
   long after = System.currentTimeMillis();
   System.out.println( "opening...done - duration: " + 
(after-before) );

   System.out.println( "totalMemory: " + 
Runtime.getRuntime().totalMemory() );
   System.out.println( "freeMemory: " + 
Runtime.getRuntime().freeMemory() );

Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

Kevin


Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
Kevin A. Burton wrote:
We have one large index right now... it's about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides 
open this index.
After thinking about it I guess 1.5% of memory per index really isn't 
THAT bad.  What would be nice is if there were a way to do this from disk 
and then use a buffer (either via the filesystem or in-VM memory) to 
access these variables.

This would be similar to the way the MySQL index cache works...
Kevin


Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Kevin A. Burton
What in the world is up with this exception?
We've migrated to using pre-compiled JSPs in Tomcat 5.5 for performance reasons, but if 
I try to start with a FRESH webapp or try to update any of the JSPs in-place and 
recompile, I'll get this error:

Any idea?
I thought maybe the .jar files were corrupt but if I md5sum them they are identical to 
production and the Tomcat standard dist.

Thoughts?
org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1) /init.jsp(2,0) Unable to read TLD 
META-INF/c.tld from JAR file 
file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/standard.jar: 
org.apache.jasper.JasperException: Failed to load or instantiate TagLibraryValidator class: 
org.apache.taglibs.standard.tlv.JstlCoreTLV

org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:39)

org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:405)

org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:86)

org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:339)
org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:372)
org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
org.apache.jasper.compiler.Parser.parse(Parser.java:126)

org.apache.jasper.compiler.ParserController.doParse(ParserController.java:211)

org.apache.jasper.compiler.ParserController.parse(ParserController.java:100)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)

org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:556)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:296)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)


Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar

2004-12-23 Thread Kevin A. Burton
Otis Gospodnetic wrote:
Most definitely Jetty.  I can't believe you're using Tomcat for Rojo!
;)
 

I never said we were using Tomcat for Rojo ;)
Sorry about that btw... wrong list!


Re: How to index Windows' Compiled HTML Help (CHM) Format

2004-12-11 Thread Kevin A. Burton
Tom wrote:
Hi,
Does anybody know how to index chm-files? 
A possible solution I know is to convert chm-files to pdf-files (there are
converters available for this job) and then use the known tools (e.g.
PDFBox) to index the content of the pdf files (which contain the content of
the chm-files). Are there any tools which can directly grab the textual
content out of the (binary) chm-files?

I think chm-file indexing-support is really a big missing piece in the
currently supported indexable filetype-collection (XML, HTML, PDF,
MSWord-DOC, RTF, Plaintext). 
 

I believe it's just a Microsoft .cab file with an index.html inside it... 
am I right?

Just uncompress it.
The problem is that the HTML within them isn't anywhere NEAR standard and 
you can't really give it to the user in the UI...

Kevin


Re: lucene in action ebook

2004-12-09 Thread Kevin A. Burton
Erik Hatcher wrote:
I have the e-book PDF in my possession. I have been prodding Manning 
on a daily basis to update the LIA website and get the e-book 
available. It is ready, and I'm sure that it's just a matter of them 
pushing it out. There may be some administrative loose ends they are 
tying up before releasing it to the world. It should be available any 
minute now, really. :)
Send off a link to the list when it's out...
We're all holding our breath ;)
(seriously)
Kevin


Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Kevin A. Burton
Erik Hatcher wrote:
Also, there is a DBDirectory in the sandbox to store a Lucene index 
inside Berkeley DB.
I assume this would prevent prefix queries from working...
Kevin


Re: Index in RAM - is it really worthy?

2004-11-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
For the Lucene book I wrote some test cases that compare FSDirectory
and RAMDirectory.  What I found was that with certain settings
FSDirectory was almost as fast as RAMDirectory.  Personally, I would
push FSDirectory and hope that the OS and the Filesystem do their share
of work and caching for me before looking for ways to optimize my code.
 

Yes... I performed the same benchmark and in my situation RAMDirectory 
for searches was about 2% slower.

I'm willing to bet that it has to do with the fact that it's a Hashtable 
and not a HashMap (which isn't synchronized).

Also, adding a constructor for the term size could make loading a 
RAMDirectory faster since you could prevent rehashing.

If you're on a modern machine your filesystem cache will end up 
buffering your disk anyway, which I'm sure was happening in my situation.
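
For reference, the comparison I ran was basically along these lines (just a 
sketch -- the paths, field name, and query string are made up):

    Directory fsDir = FSDirectory.getDirectory( "/var/ksa/index/", false );   // placeholder path

    // Copy the index into RAM by merging it into an empty RAMDirectory.
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter loader = new IndexWriter( ramDir, new StandardAnalyzer(), true );
    loader.addIndexes( new Directory[] { fsDir } );
    loader.close();

    Query query = QueryParser.parse( "foo", "contents", new StandardAnalyzer() );   // placeholder query

    IndexSearcher fsSearcher = new IndexSearcher( fsDir );
    long start = System.currentTimeMillis();
    Hits fsHits = fsSearcher.search( query );
    System.out.println( "FSDirectory:  " + (System.currentTimeMillis() - start) + "ms, " + fsHits.length() + " hits" );

    IndexSearcher ramSearcher = new IndexSearcher( ramDir );
    start = System.currentTimeMillis();
    Hits ramHits = ramSearcher.search( query );
    System.out.println( "RAMDirectory: " + (System.currentTimeMillis() - start) + "ms, " + ramHits.length() + " hits" );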

Kevin


Re: Index in RAM - is it really worthy?

2004-11-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
For the Lucene book I wrote some test cases that compare FSDirectory
and RAMDirectory.  What I found was that with certain settings
FSDirectory was almost as fast as RAMDirectory.  Personally, I would
push FSDirectory and hope that the OS and the Filesystem do their share
of work and caching for me before looking for ways to optimize my code.
 

Another note: doing an index merge in memory is probably 
faster if you just use a RAMDirectory and perform addIndexes() into it.

This would almost certainly be faster than optimizing on disk but I 
haven't benchmarked it.
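
Roughly what I mean (a sketch; the directory paths are placeholders):

    // Merge several on-disk indexes inside a RAMDirectory, then flush the
    // merged result back out to disk in one addIndexes() call.
    Directory[] sources = new Directory[] {
        FSDirectory.getDirectory( "/var/index/part1", false ),   // placeholder paths
        FSDirectory.getDirectory( "/var/index/part2", false )
    };

    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter( ramDir, new StandardAnalyzer(), true );
    ramWriter.addIndexes( sources );   // the merge happens in memory
    ramWriter.close();

    IndexWriter diskWriter = new IndexWriter(
        FSDirectory.getDirectory( "/var/index/merged", true ), new StandardAnalyzer(), true );
    diskWriter.addIndexes( new Directory[] { ramDir } );   // write the merged index to disk
    diskWriter.close();

Obviously this only works while the merged index still fits in the heap.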

Kevin


JDBCDirectory to prevent optimize()?

2004-11-22 Thread Kevin A. Burton
It seems that when compared to other datastores, Lucene starts to 
fall down.  For example, Lucene doesn't perform online index 
optimization, so if you add 10 documents you have to run optimize() 
again, and this isn't exactly a fast operation.

I'm wondering about the potential for a generic JDBCDirectory for 
keeping the Lucene index within a database.

It sounds somewhat unconventional, but it would allow you to perform live 
addDirectory updates without performing an optimize() again.

Has anyone looked at this?  How practical would it be?
Kevin


Mozilla Desktop Search

2004-11-13 Thread Kevin A. Burton
  
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/

The Mozilla foundation may be considering a desktop search 
implementation 
http://computerworld.com/developmenttopics/websitemgmt/story/0,10801,97396,00.html?f=x10 
:

Having launched the much-awaited Version 1.0 of the Firefox
browser yesterday (see story), The Mozilla Foundation is busy
planning enhancements to the open-source product, including the
possibility of integrating it with a variety of desktop search
tools. The Mozilla Foundation also wants to place Firefox in PCs
through reseller deals with PC hardware vendors and continue to
sharpen the product's pop-up ad-blocking technology. 

I'm not sure this is a good idea. Maybe it is though. The technology 
just isn't there for cross platform search.

I'd have to suggest using Lucene but using GCJ for a native compile 
into XPCOM components but I'm not sure if GCJ is up to the job here. 
If this approach is possible then I'd be very excited.

One advantage to this approach is that an HTTP server wouldn't be 
necessary since you're already within the browser.

Good for everyone involved. No bloated Tomcat causing problems and 
blazingly fast access within the browser. Also, since TCP isn't 
involved you could gracefully fail when the search service isn't 
running; you could just start it.




Lucene external field storage contribution.

2004-11-07 Thread Kevin A. Burton
About 3 months ago I developed an external storage engine which ties into 
Lucene.

I'd like to discuss making a contribution so that this is integrated 
into a future version of Lucene.

I'm going to paste my original PROPOSAL in this email. 

There wasn't a ton of feedback first time around but I figure squeaky 
wheel gets the grease...



I created this proposal because we need this fixed at work. I want to 
go ahead and work on a vertical fix for our version of Lucene and then 
submit this back to Jakarta.
There seems to be a lot of interest here and I wanted to get feedback 
from the list before moving forward ...

Should I put this in the wiki?!
Kevin
** OVERVIEW **
Currently Lucene supports 'stored fields', where the content of these 
fields is kept within the Lucene index for use in the future.

While acceptable for small indexes, larger amounts of stored fields 
prevent:

- Fast index merges, since the full content has to be continually merged.
- Storing the indexes in memory (since a LOT of memory would be required 
  and this is cost prohibitive).
- Fast queries, since block caching can't be used on the index data.

For example, in our current setup our index size is 20G.  Nearly 90% of 
this is content.  If we could store the content outside of Lucene our 
merges and searches would be MUCH faster.  If we could store the index 
in MEMORY this could be orders of magnitude faster.

** PROPOSAL **

Provide an external field storage mechanism which supports legacy indexes
without modification.  Content is stored in a content segment.  The only
changes would be a field with 3 (or 4 if checksum enabled) values:

- CS_SEGMENT
  Logical ID of the content segment.  This is an integer value.  There is
  a global Lucene property named CS_ROOT which stores all the content.
  The segments are just flat files with pointers.  Segments are broken
  into logical pieces by time and size.  Usually 100M of content would be
  in one segment.

- CS_OFFSET
  The byte offset of the field.

- CS_LENGTH
  The length of the field.

- CS_CHECKSUM
  Optional checksum to verify that the content is correct when fetched
  from the index.

- The field value here would be exactly 'N:O:L' where N is the segment
  number, O is the offset, and L is the length.  O and L are 64bit values.
  N is a 32bit value (though 64bit wouldn't really hurt).

This mechanism allows for the external storage of any named field.

CS_OFFSET and CS_LENGTH allow use with RandomAccessFile and new NIO code 
for efficient content lookup.  (Though filehandle caching should probably 
be used.)
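
A rough sketch of the lookup side (the field name, CS_ROOT layout, and 
segment file naming are placeholders; a real version would cache the 
RandomAccessFile handles as noted above):

    // Resolve an externally stored field of the form "N:O:L"
    // (segment:offset:length) and read the raw bytes back.
    String pointer = doc.get( "content_pointer" );        // e.g. "12:493820:1024" (placeholder field)
    String[] parts = pointer.split( ":" );
    int  segment = Integer.parseInt( parts[0] );
    long offset  = Long.parseLong( parts[1] );
    long length  = Long.parseLong( parts[2] );

    java.io.RandomAccessFile raf =
        new java.io.RandomAccessFile( CS_ROOT + "/" + segment + ".content", "r" );
    byte[] data = new byte[ (int) length ];
    raf.seek( offset );
    raf.readFully( data );
    raf.close();

    String content = new String( data );   // same default-encoding caveat as below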

Since content is broken into logical 100M segments, the underlying 
filesystem can organize the file into contiguous blocks for efficient 
non-fragmented lookup.

File manipulation is easy and indexes can be merged by simply 
concatenating the second file to the end of the first.  (Though the 
segment, offset, and length need to be updated.)  (FIXME: I think I need 
to think about this more since I will have  100M per sync)

Supporting full unicode is important.  Full java.lang.String storage is 
used with String.getBytes() so we should be able to avoid unicode issues.  
If Java has a correct java.lang.String representation it's possible to 
easily add unicode support just by serializing the byte representation.  
(Note that the JDK says that the DEFAULT system char encoding is used, so 
if this is ever changed it might break the index.)

While Linux and modern versions of Windows (not sure about OSX) support 
64bit filesystems, the 4G storage boundary of 32bit filesystems (ext2 is 
an example) is an issue.  Using smaller indexes can prevent this but 
eventually segment lookup in the filesystem will be slow.  This will only 
happen within terabyte storage systems, so hopefully the developer has 
migrated to another (modern) filesystem such as XFS.

** FEATURES **

  - Must be able to replicate indexes easily to other hosts.
  - Adding content to the index must be CHEAP.
  - Deletes need to be cheap (these are cheap for older content: just
    discard older indexes).
  - Filesystem needs to be able to optimize storage.
  - Must support UNICODE and binary content (images, blobs, byte arrays,
    serialized objects, etc).
  - Filesystem metadata operations should be fast.  Since content is kept
    in LARGE indexes this is easy to avoid.
  - Migration to the new system from legacy indexes should be fast and
    painless for future developers.



Lots Of Interest in Lucene Desktop

2004-10-28 Thread Kevin A. Burton
I've made a few passive mentions of my Lucene Desktop prototype 
(http://jakarta.apache.org/lucene) here on PeerFear in the last few days 
and I'm amazed how much feedback I've had.  People really want to start 
work on an Open Source desktop search based on Lucene.


http://www.peerfear.org/rss/permalink/2004/10/28/LotsOfInterestInLuceneDesktop/



Ability to apply document age with the score?

2004-10-28 Thread Kevin A. Burton
Lets say I have an index with two documents.  They both have the same 
score but one was added 6 months ago and the other was added 2 minutes ago.

I want the score adjusted based on the age so that older documents have 
a lower score.

I don't want to sort by document age (date) because if one document is 
older but has a HIGHER score it would be better to have it rise above 
newer documents that have a lower score.

Is this possible?  The only way I could think of doing it would be to 
have a DateFilter and then apply a dampening after the query.
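
Roughly what I mean by dampening after the query (a sketch; searcher and 
query are assumed to be in scope, the "timestamp" field and the decay 
constant are made up, and it assumes the field stores 
System.currentTimeMillis() at index time):

    Hits hits = searcher.search( query );
    long now = System.currentTimeMillis();

    for ( int i = 0; i < hits.length(); i++ ) {
        Document doc = hits.doc( i );
        long ageMillis = now - Long.parseLong( doc.get( "timestamp" ) );
        double ageDays = ageMillis / (1000.0 * 60 * 60 * 24);

        // older documents get an exponentially smaller score; a higher raw
        // score can still beat a newer, lower-scoring document
        double dampened = hits.score( i ) * Math.exp( -ageDays / 30.0 );

        // ... collect (doc, dampened) pairs and re-sort by the dampened score ...
    }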

Kevin


Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/


Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
Daniel Naber wrote:
(Kevin complains about shorter documents ranked higher)
This is something that can easily be fixed. Just use a Similarity 
implementation that extends DefaultSimilarity and that overwrites 
lengthNorm: just return 1.0f there. You need to use that Similarity for 
indexing and searching, i.e. it requires reindexing.
 

What happens when I do this with an existing index? I don't want to have 
to rewrite this index as it will take FOREVER

If the current behavior is all that happens this is fine... this way I 
can just get this behavior for new documents that are added.

Also... why isn't this the default?
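
For reference, the override Daniel describes is roughly this (a sketch; the 
writer and searcher are assumed to be in scope, it has to be installed on 
both, and as he says it only affects documents indexed with it):

    // Disable length normalization by always returning 1.0f.
    public class NoLengthNormSimilarity extends DefaultSimilarity {
        public float lengthNorm( String fieldName, int numTokens ) {
            return 1.0f;
        }
    }

    // used for both indexing and searching:
    writer.setSimilarity( new NoLengthNormSimilarity() );
    searcher.setSimilarity( new NoLengthNormSimilarity() );
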
Kevin


Documents with 1 word are given unfair lengthNorm()

2004-10-27 Thread Kevin A. Burton
WRT my blog post:
It seems the problem is that the distribution for lengthNorm() starts at 
1 and moves down from there.  Returning 1.0f would work, but then HUGE 
documents wouldn't be normalized and so would distort the results.

What would you think of using this implementation for lengthNorm:
public float lengthNorm( String fieldName, int numTokens ) {
    int THRESHOLD = 50;

    int nt = numTokens;

    // below the threshold, bump the count so the curve starts below 1.0
    if ( numTokens <= THRESHOLD )
        ++nt;

    // above the threshold, restart the curve from the threshold
    if ( numTokens > THRESHOLD )
        nt -= THRESHOLD;

    float v = (float)(1.0 / Math.sqrt(nt));

    // below the threshold the norm climbs toward 1.0 instead of falling
    if ( numTokens <= THRESHOLD )
        v = 1 - v;

    return v;
}
This starts the distribution low... approaches 1.0 when 50 terms are in 
the document... then asymptotically moves to zero from here on out based 
on sqrt.

For example, values from 1 to 150 would yield (I'd graph this out 
but I'm too lazy):

1 - 0.29289323
2 - 0.42264974
3 - 0.5
4 - 0.5527864
5 - 0.5917517
6 - 0.6220355
7 - 0.6464466
8 - 0.6666667
9 - 0.6837722
10 - 0.69848865
11 - 0.7113249
12 - 0.72264993
13 - 0.73273873
14 - 0.74180114
15 - 0.75
16 - 0.7574644
17 - 0.7642977
18 - 0.7705843
19 - 0.7763932
20 - 0.7817821
21 - 0.7867993
22 - 0.7914856
23 - 0.79587585
24 - 0.8
25 - 0.80388385
26 - 0.8075499
27 - 0.81101775
28 - 0.81430465
29 - 0.81742585
30 - 0.8203947
31 - 0.8232233
32 - 0.82592237
33 - 0.8285014
34 - 0.83096915
35 - 0.8333333
36 - 0.83560103
37 - 0.83777857
38 - 0.8398719
39 - 0.8418861
40 - 0.84382623
41 - 0.8456966
42 - 0.8475014
43 - 0.84924436
44 - 0.8509288
45 - 0.852558
46 - 0.85413504
47 - 0.85566247
48 - 0.85714287
49 - 0.8585786
50 - 0.859972
51 - 1.0
52 - 0.70710677
53 - 0.57735026
54 - 0.5
55 - 0.4472136
56 - 0.4082483
57 - 0.37796447
58 - 0.35355338
59 - 0.33333334
60 - 0.31622776
61 - 0.30151135
62 - 0.28867513
63 - 0.2773501
64 - 0.26726124
65 - 0.2581989
66 - 0.25
67 - 0.24253562
68 - 0.23570226
69 - 0.22941573
70 - 0.2236068
71 - 0.2182179
72 - 0.21320072
73 - 0.2085144
74 - 0.20412415
75 - 0.2
76 - 0.19611613
77 - 0.19245009
78 - 0.18898223
79 - 0.18569534
80 - 0.18257418
81 - 0.1796053
82 - 0.17677669
83 - 0.17407766
84 - 0.17149858
85 - 0.16903085
86 - 0.16666667
87 - 0.16439898
88 - 0.16222142
89 - 0.16012815
90 - 0.15811388
91 - 0.15617377
92 - 0.15430336
93 - 0.15249857
94 - 0.15075567
95 - 0.1490712
96 - 0.14744195
97 - 0.145865
98 - 0.14433756
99 - 0.14285715
100 - 0.14142136
101 - 0.14002801
102 - 0.13867505
103 - 0.13736056
104 - 0.13608277
105 - 0.13483997
106 - 0.13363062
107 - 0.13245323
108 - 0.13130644
109 - 0.13018891
110 - 0.12909944
111 - 0.12803689
112 - 0.12700012
113 - 0.12598816
114 - 0.125
115 - 0.12403473
116 - 0.12309149
117 - 0.12216944
118 - 0.12126781
119 - 0.120385855
120 - 0.11952286
121 - 0.11867817
122 - 0.11785113
123 - 0.11704115
124 - 0.11624764
125 - 0.11547005
126 - 0.114707865
127 - 0.11396058
128 - 0.1132277
129 - 0.11250879
130 - 0.1118034
131 - 0.11111111
132 - 0.11043153
133 - 0.10976426
134 - 0.10910895
135 - 0.10846523
136 - 0.107832775
137 - 0.107211255
138 - 0.10660036
139 - 0.10599979
140 - 0.10540926
141 - 0.104828484
142 - 0.1042572
143 - 0.10369517
144 - 0.10314213
145 - 0.10259783
146 - 0.10206208
147 - 0.10153462
148 - 0.101015255
149 - 0.10050378



Google Desktop Could be Better

2004-10-15 Thread Kevin A. Burton
http://www.peerfear.org/rss/permalink/2004/10/15/GoogleDesktopCouldBeBetter/


Re: OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread Kevin A. Burton
David Spencer wrote:
Ji Kuhn wrote:
This doesn't work either!

You're right.
I'm running under JDK1.5 and trying larger values for -Xmx and it 
still fails.

Running under (Borland's) OptimizeIt shows the number of Terms and 
Terminfos (both in org.apache.lucene.index) increase every time thru 
the loop, by several hundred instances each.
Yes... I'm running into a similar situation on JDK 1.4.2 with Lucene 
1.3... I used the JMP debugger and all my memory is taken by Terms and 
TermInfo...

I can trace thru some Term instances on the reference graph of 
OptimizeIt but it's unclear to me what's right. One *guess* is that 
maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the 
problem.
Kevin


Re: OutOfMemory example

2004-09-13 Thread Kevin A. Burton
Ji Kuhn wrote:
Hi,
I think I can reproduce memory leaking problem while reopening an index. 
Lucene version tested is 1.4.1, version 1.4 final works OK. My JVM is:
$ java -version
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
The code you can test is below, there are only 3 iterations for me if I use 
-Xmx5m, the 4th fails.
 

At least this test seems tied to the Sort API... I removed the sort 
under Lucene 1.3 and it worked fine...

Kevin


Re: IRC?!

2004-09-11 Thread Kevin A. Burton
Harald Tijink wrote:
I hope your idea isn't to replace this Users List and pull the
discussions into the IRC scene. I (and most of us) can not attend to any
IRC chat because of work and other priorities. This list gives me the
opportunity to keep informed (involved).
 

Yup... I want to replace the mailing lists, wiki, website, CVS, and 
Bugzilla with IRC. And if you can't keep up that's just your fault ;) (joke).

It's just another tool ;)
Kevin


TermInfo using 300M for large index?

2004-09-10 Thread Kevin A. Burton
I'm trying to do some heap debugging of my application to find a memory 
leak.

Noticed that org.apache.lucene.index.TermInfo had 1.7M instances which 
consumed 300M ... this is of course for a 40G index.

Is this normal and is there any way I can streamline this?
We are of course caching the IndexSearchers but I want to reduce the 
memory footprint...

Kevin


IRC?!

2004-09-10 Thread Kevin A. Burton
There isn't a Lucene IRC room is there (at least there isn't according 
to Google)?

I just joined #lucene on irc.freenode.net if anyone is interested...
Kevin


Possible to remove duplicate documents in sort API?

2004-09-05 Thread Kevin A. Burton
I want to sort a result set but perform a group by as well... i.e. remove 
duplicate items.

Is this possible with the new API?  Seems like a huge drawback to Lucene 
right now.
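
Right now the only workaround I can see is deduping on the client, something 
like this sketch (searcher and query are assumed to be in scope, and 
"dedup_key" is a placeholder for whatever field has to be unique):

    Hits hits = searcher.search( query );
    Set seen = new HashSet();
    List unique = new ArrayList();

    // keep only the first (highest ranked) hit per value of the key field
    for ( int i = 0; i < hits.length(); i++ ) {
        Document doc = hits.doc( i );
        String key = doc.get( "dedup_key" );
        if ( seen.add( key ) ) {     // add() returns false for duplicates
            unique.add( doc );
        }
    }

...which of course still pays the cost of pulling back all the duplicate hits.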

Kevin


Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-05 Thread Kevin A. Burton
It looks like Document.java uses its own implementation of a LinkedList...
Why not use a HashMap to enable O(1) lookup?  Right now field lookup is 
O(N), which is certainly no fun.

Was this benchmarked?  Perhaps there's the assumption that since 
documents often have few fields, the object overhead and hashcode 
overhead would have been less this way.

Kevin


Anyone avail for Lucene consulting or employment in the SF area?

2004-09-05 Thread Kevin A. Burton
Hope no one considers this spam ;)
We're hiring either someone full-time who has strong experience with 
Java, Lucene, and Jakarta technologies or someone to act as a consultant 
working on Lucene for about a month optimizing our search infra.

This is for a startup located in downtown SF.
Send me an email including your resume (html or text only) and I'll 
respond with full details.

Kevin


Patch for IndexWriter.close which prevents NPE...

2004-09-03 Thread Kevin A. Burton
I just attached a patch which:
1. prevents multiple close() of an IndexWriter
2. prevents an NPE if the writeLock was null.
We have been noticing this from time to time and I haven't been able to 
come up with a hard test case.  This is just a bit of defensive 
programming to prevent it from happening in the first place.  It would 
happen from time to time without any reliable cause.

Anyway...
Thanks...
Kevin

--- IndexWriter.java.bak.close  2004-09-03 11:27:37.0 -0700
+++ IndexWriter.java    2004-09-03 11:32:02.0 -0700
@@ -107,6 +107,11 @@
*/  
   private boolean useCompoundFile = false;
 
+  /**
+   * True when we have closed this IndexWriter
+   */
+  protected boolean isClosed = false;
+
   /** Setting to turn on usage of a compound file. When on, multiple files
*  for each segment are merged into a single file once the segment creation
*  is finished. This is done regardless of what directory is in use.
@@ -183,15 +188,27 @@
 }.run();
 }
   }
-
+
   /** Flushes all changes to an index, closes all associated files, and closes
 the directory that the index is stored in. */
   public synchronized void close() throws IOException {
+
+if ( isClosed ) {
+  return;
+}
+
 flushRamSegments();
 ramDirectory.close();
-writeLock.release();  // release write lock
+
+if ( writeLock != null ) {
+  // release write lock
+  writeLock.release();
+}
+
 writeLock = null;
 directory.close();
+isClosed = true;
+  
   }
 
   /** Release the write lock, if needed. */


Re: Benchmark of filesystem cache for index vs RAMDirectory...

2004-08-08 Thread Kevin A. Burton
Daniel Naber wrote:
On Sunday 08 August 2004 03:40, Kevin A. Burton wrote:
 

Would a HashMap implementation of RAMDirectory beat out a cached
FSDirectory?
   

It's easy to test, so it's worth a try. Please try if the attached patch 
makes any difference for you compared to the current implementation of 
RAMDirectory.

 

True... I was just thinking out loud... was being lazy.  Now I actually 
have to do the work to create the benchmark again... damn you ;)

Kevin



Benchmark of filesystem cache for index vs RAMDirectory...

2004-08-07 Thread Kevin A. Burton
I'm not sure how Solaris or Windows perform, but the Linux block cache 
will essentially use all available memory to buffer the filesystem.

If one is using an FSDirectory to perform searches, then while the first 
search would be slow, remaining searches would be fast since Linux will 
now buffer the index in RAM.

The RAMDirectory has the advantage of pre-loading everything and can 
keep it in memory if the box is performing other operations.

In my benchmarks though RAMDirectory is slightly slower.  I assume this 
is because it's Hashtable based even though it only needs to be 
synchronized in a few places; i.e. searches should never be synchronized...

Would a HashMap implementation of RAMDirectory beat out a cached 
FSDirectory?
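
Roughly the comparison I have in mind, as an untested sketch against the 
1.3/1.4-era API (the index path, field name and query string are 
placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryBench {
  public static void main(String[] args) throws Exception {
    Query query = QueryParser.parse("linux", "contents", new StandardAnalyzer());

    // Search straight off disk; the second run benefits from the OS block cache.
    Directory fsDir = FSDirectory.getDirectory("/index/main", false);
    time("FSDirectory (cold)", new IndexSearcher(fsDir), query);
    time("FSDirectory (cached)", new IndexSearcher(fsDir), query);

    // Preload the whole index into the JVM heap and search that copy.
    Directory ramDir = new RAMDirectory(fsDir);
    time("RAMDirectory", new IndexSearcher(ramDir), query);
  }

  private static void time(String label, IndexSearcher searcher, Query query)
      throws Exception {
    long start = System.currentTimeMillis();
    Hits hits = searcher.search(query);
    System.out.println(label + ": " + hits.length() + " hits in "
        + (System.currentTimeMillis() - start) + "ms");
    searcher.close();
  }
}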

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Performance when computing computing a filter using hundreds of diff terms.

2004-08-05 Thread Kevin A. Burton
I'm trying to compute a filter to match documents in our index by a set 
of terms.

For example some documents have a given field 'category' so I need to 
compute a filter with multiple categories.

The problem is that our category list is > 200 items so it takes about 
80 seconds to compute.  We cache it of course but this seems WAY too slow.

Is there anything I could do to speed it up?  Maybe run the queries 
myself and then combine the bitsets?

We're using a BooleanQuery with nested TermQueries to build up the 
filter...
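
One way to skip the query machinery entirely is to build the BitSet 
yourself straight from the term postings, one category at a time.  A 
rough sketch (the 'category' field name and the way the values are 
supplied are just assumptions):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

/** Filter accepting any document whose 'category' field matches one of the given values. */
public class CategoryFilter extends Filter {
  private final String[] categories;

  public CategoryFilter(String[] categories) {
    this.categories = categories;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    TermDocs termDocs = reader.termDocs();
    for (int i = 0; i < categories.length; i++) {
      // Seek straight to the postings for this term and flip the matching doc bits.
      termDocs.seek(new Term("category", categories[i]));
      while (termDocs.next()) {
        bits.set(termDocs.doc());
      }
    }
    termDocs.close();
    return bits;
  }
}

Wrapped in a caching filter and reused across searches, this avoids 
scoring 200+ TermQueries just to get a bit per document.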

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Split an existing index into smaller segments without a re-index?

2004-08-04 Thread Kevin A. Burton
Is it possible to take an existing index (say 1G) and break it up into a 
number of smaller indexes (say 10 100M indexes)...

I don't think theres currently an API for this but its certainly 
possible (I think).
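
A crude version can be hacked up by reading stored fields out of an 
IndexReader and spreading them across several writers.  The big caveat: 
a Document read back from an index only contains its stored fields, so 
anything unstored would have to be re-analyzed from the original source.  
An untested sketch, with paths and analyzer as placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexSplitter {
  public static void main(String[] args) throws Exception {
    int pieces = 10;
    IndexReader reader = IndexReader.open("/index/big");

    IndexWriter[] writers = new IndexWriter[pieces];
    for (int i = 0; i < pieces; i++) {
      writers[i] = new IndexWriter("/index/piece-" + i, new StandardAnalyzer(), true);
    }

    // Round-robin the surviving documents across the target indexes.
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      if (reader.isDeleted(doc)) continue;
      writers[doc % pieces].addDocument(reader.document(doc));
    }

    for (int i = 0; i < pieces; i++) {
      writers[i].optimize();
      writers[i].close();
    }
    reader.close();
  }
}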

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


GROUP BY functionality.

2004-07-28 Thread Kevin A. Burton
In 1.4 we now have arbitrary sort support...
Is it possible to use GROUP BY without having do to this on the client 
(which would be inneficient)...

I have a field I want to make sure is unique in my search results.
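
As far as I know there's nothing built in, so the fallback is 
deduplicating on the client as the hits come back, which is at least 
cheap if you only need the first page.  A sketch, assuming the unique 
field (called 'site' here, purely as an example) is stored:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class UniqueFieldCollector {
  /** Return up to maxResults hits, keeping only the first hit per value of 'site'. */
  public static Document[] searchUnique(IndexSearcher searcher, Query query, int maxResults)
      throws Exception {
    Hits hits = searcher.search(query);
    Set seen = new HashSet();
    List results = new ArrayList();
    for (int i = 0; i < hits.length() && results.size() < maxResults; i++) {
      Document doc = hits.doc(i);
      String key = doc.get("site");
      if (key == null || seen.add(key)) {   // Set.add returns false on duplicates
        results.add(doc);
      }
    }
    return (Document[]) results.toArray(new Document[results.size()]);
  }
}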
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-14 Thread Kevin A. Burton
Doug Cutting wrote:
Aviran wrote:
I changed the Lucene 1.4 final source code and yes this is the source
version I changed.

Note that this patch won't produce a speedup on earlier releases, 
since there was another multi-thread bottleneck higher up the stack 
that was only recently removed, revealing this lower-level bottleneck.

The other patch was:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg07873.html
Both are required to see the speedup.
Thanks...
Also, is there any reason folks cannot use 1.4 final now?
No... just that I'm trying to be conservative... I'm probably going to 
look at just migrating to 1.4 ASAP but we're close to a milestone...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Field.java - STORED, NOT_STORED, etc...

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
It would be best to get the compiler to check the order.
If we change this, why not use type-safe enumerations:
http://www.javapractices.com/Topic1.cjp
The calls would look like:
new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);
Stored could be implemented as the nested class:
public final class Stored {
private Stored() {}
public static final Stored YES = new Stored();
public static final Stored NO = new Stored();
}
+1... I'm not in love with this pattern but since Java 1.4 doesn't 
support enums it's better than nothing.

I also didn't want to submit a recommendation that would break APIs. I 
assume the old API would be deprecated?

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
I was going to create a new IDField class which just calls super( 
name, value, false, true, false) but noticed I was prevented because 
Field.java is final?

You don't need to subclass to do this, just a static method somewhere.
Why is this? I can't see any harm in making it non-final...

Field and Document are not designed to be extensible. They are 
persisted in such a way that added methods are not available when the 
field is restored. In other words, when a field is read, it always 
constructs an instance of Field, not a subclass.
That's fine... I think that's acceptable behavior. I don't think anyone 
would assume that inner vars are restored or that the field is serialized.

Not a big deal but it would be nice...
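
Following Doug's suggestion, the same thing works as a static helper 
without subclassing; it just wraps the (name, value, store, index, token) 
constructor call I would have put in the subclass.  Something like:

import org.apache.lucene.document.Field;

public class Fields {
  /** An un-stored, indexed, un-tokenized field -- what I wanted IDField to be. */
  public static Field idField(String name, String value) {
    return new Field(name, value, false, true, false);
  }
}
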
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Kevin A. Burton
Aviran wrote:
Bug 30058 posted
 

Which of course is here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30058
Is this the source of the revision you modified?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html
Also what version of Lucene?
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Field.java - STORED, NOT_STORED, etc...

2004-07-11 Thread Kevin A. Burton
I've been working with the Field class doing index conversions between 
an old index format to my new external content store proposal (thus the 
email about the 14M convert).

Anyway... I find the whole Field.Keyword, Field.Text thing confusing.  
The main problem is that the constructor to Field just takes booleans 
and if you forget the ordering of the booleans it's very confusing.

new Field( name, value, true, false, true );
So looking at that you have NO idea what its doing without fetching javadoc.
So I added a few constants to my class:
new Field( name, value, NOT_STORED, INDEXED, NOT_TOKENIZED );
which IMO is a lot easier to maintain.
Why not add these constants to Field.java:
   public static final boolean STORED = true;
   public static final boolean NOT_STORED = false;
   public static final boolean INDEXED = true;
   public static final boolean NOT_INDEXED = false;
   public static final boolean TOKENIZED = true;
   public static final boolean NOT_TOKENIZED = false;
Of course you still have to remember the order but this becomes a lot 
easier to maintain.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Why is Field.java final?

2004-07-10 Thread Kevin A. Burton
I was going to create a new IDField class which just calls super( name, 
value, false, true, false) but noticed I was prevented because 
Field.java is final?

Why is this?  I can't see any harm in making it non-final...
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Increasing Linux kernel open file limits.

2004-07-09 Thread Kevin A. Burton
Don't know if anyone knew this:
http://www.hp-eloquence.com/sdb/html/linux_limits.html
The kernel allocates filehandles dynamically up to a limit specified 
by file-max.

The value in file-max denotes the maximum number of file- handles that 
the Linux kernel will allocate. When you get lots of error messages 
about running out of file handles, you might want to increase this limit.

The three values in file-nr denote the number of allocated file 
handles, the number of used file handles and the maximum number of 
file handles. When the allocated filehandles come close to the 
maximum, but the number of actually used ones is far behind, you've 
encountered a peak in your filehandle usage and you don't need to 
increase the maximum.

So while as root you can allocate as many file handles as you want without 
any limits enforced by glibc, you still have to fight against the kernel.

Just echoing a larger value into /proc/sys/fs/file-max works fine.
Then I can keep track of my file limit by doing a
cat /proc/sys/fs/file-nr
At least this works on 2.6.x...
Think this is going to save me a lot of headache!
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Peter M Cipollone wrote:
You might try merging the existing index into a new index located on a ram
disk.  Once it is done, you can move the directory from ram disk back to
your hard disk.  I think this will work as long as the old index did not
finish merging.  You might do a strings command on the segments file to
make sure the new (merged) segment is not in there, and if there's a
deletable file, make sure there are no segments from the old index listed
therein.
 

Its a HUGE index.  It won't fit in memory ;)  Right now its at 8G...
Thanks though! :)
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Also... what can I do to speed up this optimize? Ideally it wouldn't 
take 6 hours.

Was this the index with the mergeFactor of 5000? If so, that's why 
it's so slow: you've delayed all of the work until the end. Indexing 
on a ramfs will make things faster in general, however, if you have 
enough RAM...
No... I changed the mergeFactor back to 10 as you suggested.
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
So is it possible to fix this index now? Can I just delete the most 
recent segment that was created? I can find this by ls -alt

Sorry, I forgot to answer your question: this should work fine. I 
don't think you should even have to delete that segment.
I'm worried about duplicate or missing content from the original index. 
I'd rather rebuild the index and waste another 6 hours (I've probably 
blown 100 hours of CPU time on this already) and have a correct index :)

During an optimize I assume Lucene starts writing to a new segment and 
leaves all others in place until everything is done and THEN deletes them?

Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
only delays the work until the end, but it also makes the disk 
workload more seek-dominated, which is not optimal. 
The only settings I use are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
the resulting index has 230k files in it :-/
I assume this is contributing to all the disk seeks.
So I suspect a smaller merge factor, together with a larger 
minMergeDocs, will be much faster overall, including the final 
optimize(). Please tell us how it goes.

This is what I did for this last round but then I ended up with the 
highly fragmented index.

hm...
Thanks for all the help btw!
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-08 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:
Hi,
a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
smoothly, but we are experiencing some problems with that new constant limit
maxClauseCount=1024
which leads to Exceptions of type 

	org.apache.lucene.search.BooleanQuery$TooManyClauses 

when certain RangeQueries are executed (in fact, we get this Exception when
we execute certain Wildcard queries, too). Although we are working with a
fairly small index with about 35,000 documents, we encounter this Exception
when we search for the property modificationDate. For example
	modificationDate:[00 TO 0dwc970kw] 

 

We talked about this the other day.
http://wiki.apache.org/jakarta-lucene/IndexingDateFields
Find out what type of precision you need and use that.  If you only need 
days or hours or minutes then use that.   Millis is just too small. 

We're only using days and have queries for just the last 7 days as max 
so this really works out well...
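
Concretely, the idea is to index the date rounded to the day as a 
YYYYMMDD keyword and range-query on that, which keeps the number of 
distinct terms (and therefore rewritten clauses) tiny.  A sketch; the 
field name follows the thread but the format and helper are just an 
illustration:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RangeQuery;

public class DayPrecisionDates {
  // Note: SimpleDateFormat is not thread-safe; one instance per thread in real code.
  private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

  /** At index time: store the modification date at day precision only. */
  public static void addDate(Document doc, Date modified) {
    doc.add(Field.Keyword("modificationDate", DAY.format(modified)));
  }

  /** At search time: a range over days expands to at most a few hundred terms, not millions. */
  public static Query lastNDays(int days) {
    long now = System.currentTimeMillis();
    String from = DAY.format(new Date(now - days * 24L * 60 * 60 * 1000));
    String to = DAY.format(new Date(now));
    return new RangeQuery(new Term("modificationDate", from),
                          new Term("modificationDate", to), true);
  }
}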

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Kevin A. Burton
Otis Gospodnetic wrote:
Hey Kevin,
Not sure if you're aware of it, but you can specify the lock dir, so in
your example, both JVMs could use the exact same lock dir, as long as
you invoke the VMs with the same params.  

Most people won't do this or won't even understand WHY they need to do 
this :-/.

You shouldn't be writing the
same index with more than 1 IndexWriter though (not sure if this was
just a bad example or a real scenario).
 

Yes... I realize that you shouldn't use more than one IndexWriter. That 
was the point. The locks are to prevent this from happening. If one were 
to accidentally do this the locks would be in different directories and 
our IndexWriter would corrupt the index.

This is why I think it makes more sense to use our own java.io.tmpdir to 
be on the safe side.

--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
No... I changed the mergeFactor back to 10 as you suggested.

Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that 
it logs merges? If so, it would be interesting to see that output, 
especially the last entry.

No I didn't actually... If I run it again I'll be sure to do this.
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
This is why I think it makes more sense to use our own java.io.tmpdir 
to be on the safe side.

I think the bug is that Tomcat changes java.io.tmpdir. I thought that 
the point of the system property java.io.tmpdir was to have a portable 
name for /tmp on unix, c:\windows\tmp on Windows, etc. Tomcat breaks 
that. So must Lucene have its own way of finding the platform-specific 
temporary directory that everyone can write to? Perhaps, but it seems 
a shame, since Java already has a standard mechanism for this, which 
Tomcat abuses...
I've seen this done in other places as well. I think Weblogic did/does 
it. I'm wondering what some of these big EJB containers use, which is 
why I brought this up. I'm not sure the problem is just with Tomcat.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Something sounds very wrong for there to be that many files.
The maximum number of files should be around:
(7 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

With 14M documents, log_10(14M/1000) is 4, which gives, for you:
(7 + numIndexedFields) * 36 = 230k
7*36 + numIndexedFields*36 = 230k
numIndexedFields = (230k - 7*36) / 36 =~ 6k
So you'd have to have around 6k unique field names to get 230k files. 
Or something else must be wrong. Are you running on win32, where file 
deletion can be difficult?

With the typical handful of fields, one should never see more than 
hundreds of files.

We only have 13 fields... Though to be honest I'm worried that even if I 
COULD do the optimize that it would run out of file handles.

This is very strange...
I'm going to increase minMergeDocs to 1 and then run the full 
conversion on one box and then try to do an optimize (of the corrupt 
index) on another box. See which one finishes first.

I assume the speed of optimize() can be increased the same way that 
indexing is increased...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Lucene shouldn't use java.io.tmpdir

2004-07-07 Thread Kevin A. Burton
As per 1.3 (or was it 1.4) Lucene migrated to using java.io.tmpdir to 
store the locks for the index.

While under most situations this is safe, a lot of application servers 
change java.io.tmpdir at runtime.

Tomcat is a good example.  Within Tomcat this property is set to 
TOMCAT_HOME/temp..

Under this situation if I were to create two IndexWriters within two VMs 
and try to write to the same index, the index would get corrupted if one 
Lucene instance was within Tomcat and the other was within a standard VM.

I think we should consider either:
1. Using our own tmpdir property based on the given OS.
2. Go back to the old mechanism of storing the locks within the index 
basedir (if it's not readonly).

Thoughts?
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Most efficient way to index 14M documents (out of memory/file handles)

2004-07-06 Thread Kevin A. Burton
I'm trying to burn an index of 14M documents.
I have two problems.
1.  I have to run optimize() every 50k documents or I run out of file 
handles.  This takes TIME and of course is linear in the size of the 
index so it just gets slower as the index grows.  It starts to crawl 
at about 3M documents.

2.  I eventually will run out of memory in this configuration.
I KNOW this has been covered before but for the life of me I can't find 
it in the archives, the FAQ or the wiki. 

I'm using an IndexWriter with a mergeFactor of 5k and then optimizing 
every 50k documents.

Does it make sense to just create a new IndexWriter for every 50k docs 
and then do one big optimize() at the end?
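
For reference, the shape of the configuration that Doug's advice elsewhere 
in this thread points at: one long-lived IndexWriter with a small 
mergeFactor, a larger minMergeDocs, and a single optimize() at the very 
end, rather than a new writer per batch.  An untested sketch with paths, 
analyzer and the document source as placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BulkIndexer {
  public static void index(DocumentSource source) throws Exception {
    IndexWriter writer = new IndexWriter("/index/target", new StandardAnalyzer(), true);
    writer.mergeFactor = 10;        // keep segment fan-out (and open files) small
    writer.minMergeDocs = 1000;     // buffer this many docs in RAM before writing a segment
    writer.infoStream = System.out; // log merges so slow spots are visible

    Document doc;
    while ((doc = source.next()) != null) {
      writer.addDocument(doc);
    }

    writer.optimize();  // one big merge at the very end
    writer.close();
  }

  /** Hypothetical source of documents -- stands in for whatever feeds the indexer. */
  public interface DocumentSource {
    Document next() throws Exception;
  }
}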

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Preventing duplicate document insertion during optimize

2004-04-30 Thread Kevin A. Burton
Let's say you have two indexes each with the same document literal.  All 
the fields hash the same and the document is a binary duplicate of a 
different document in the second index.

What happens when you do a merge to create a 3rd index from the first 
two?  I assume you now have two documents that are identical in one 
index.  Is there any way to prevent this?

It would be nice to figure out if there's a way to flag a field as a 
primary key so that if it has already been added the duplicate is just skipped.
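
Nothing in the merge path will do this for you, but the usual workaround 
is to treat one keyword field as the primary key and delete any existing 
document with that key before adding the new one.  A sketch against the 
1.3/1.4 API, with the field name as an assumption:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class PrimaryKeyUpdater {
  /** Delete any document already carrying this key, then add the new version. */
  public static void update(String indexPath, Analyzer analyzer, Document doc)
      throws Exception {
    String key = doc.get("primaryKey");

    // Deletes go through an IndexReader in this API...
    IndexReader reader = IndexReader.open(indexPath);
    reader.delete(new Term("primaryKey", key));
    reader.close();

    // ...and adds go through an IndexWriter (create=false to append to an existing index).
    IndexWriter writer = new IndexWriter(indexPath, analyzer, false);
    writer.addDocument(doc);
    writer.close();
  }
}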

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
I've noticed this really strange problem on one of our boxes.  It's 
happened twice already.

We have indexes where when Lucene starts it says 'Lock obtain timed out' 
... however NO locks exist for the directory. 

There are no other processes present and no locks in the index dir or /tmp.

Is there anyway to figure out what's going on here?

Looking at the index it seems just fine... But this is only a brief 
glance.  I was hoping that if it was corrupt (which I don't think it is) 
that lucene would give me a better error than Lock obtain timed out

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:

It is possible that a previous operation on the index left the lock open.
Leaving the IndexWriter or Reader open without closing them ( in a finally
block ) could cause this.
 

Actually this is exactly the problem... I ran some single index tests 
and a single process seems to read from it.

The problem is that we were running under Tomcat with diff webapps for 
testing and didn't run into this problem before.  We had an 11G index 
that just took a while to open and during this open Lucene was creating 
a lock. 

I wasn't sure that Tomcat was multithreading this so maybe it is and 
it's just taking longer to open the lock in some situations.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
Kevin A. Burton wrote:

Actually this is exactly the problem... I ran some single index tests 
and a single process seems to read from it.

The problem is that we were running under Tomcat with diff webapps for 
testing and didn't run into this problem before.  We had an 11G index 
that just took a while to open and during this open Lucene was 
creating a lock.
I wasn't sure that Tomcat was multithreading this so maybe it is and 
it's just taking longer to open the lock in some situations.

This is strange... after removing all the webapps (besides 1) Tomcat 
still refuses to allow Lucene to open this index with Lock obtain timed out.

If I open it up from the console it works just fine.  I'm only doing it 
with one index and a ulimit -n so it's not a files issue.  Memory is 1G 
for Tomcat.

If I figure this out I'll be sure to send a message to the list.  This 
is a strange one...

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
James Dunn wrote:

Which version of lucene are you using?  In 1.2, I
believe the lock file was located in the index
directory itself.  In 1.3, it's in your system's tmp
folder.  
 

Yes... 1.3 and I have a script that removes the locks from both dirs... 
This is only one process so it's just fine to remove them.

Perhaps it's a permission problem on either one of
those folders.  Maybe your process doesn't have write
access to the correct folder and is thus unable to
create the lock file?  
 

I thought about that too... I have plenty of disk space so that's not an 
issue.  Also did a chmod -R so that should work too.

You can also pass lucene a system property to increase
the lock timeout interval, like so:
-Dorg.apache.lucene.commitLockTimeout=6

or 

-Dorg.apache.lucene.writeLockTimeout=6
 

I'll give that a try... good idea.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
Gus Kormeier wrote:

Not sure if our installation is the same or not, but we are also using
Tomcat.
I had a similiar problem last week, it occurred after Tomcat went through a
hard restart and some software errors had the website hammered.
I found the lock file in /usr/local/tomcat/temp/ using locate.
According to the README.txt this is a directory created for the JVM within
Tomcat.  So it is a system temp directory, just inside Tomcat.
 

Man... you ROCK!  I didn't even THINK of that... Hm... I wonder if we 
should include the name of the lock file in the Exception within 
Tomcat.  That would probably have saved me a lot of time :)

Either that or we can put this in the wiki

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Created LockObtainTimedOut wiki page

2004-04-28 Thread Kevin A. Burton
I just created a LockObtainTimedOut wiki entry... feel free to add.  I 
just entered the Tomcat issue with java.io.tmpdir as well.

http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut  

Peace!

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Does a RAMDirectory ever need to merge segments... (performanceissue)

2004-04-21 Thread Kevin A. Burton
Gerard Sychay wrote:

I've always wondered about this too.  To put it another way, how does
mergeFactor affect an IndexWriter backed by a RAMDirectory?  Can I set
mergeFactor to the highest possible value (given the machine's RAM) in
order to avoid merging segments?
 

Yes... actually I was thinking of increasing these vars on the 
RAMDirectory in the hope of avoiding this CPU overhead.

Also I think the var you want is minMergeDocs not mergeFactor.  The only 
problem is that the source to maybeMergeSegments says:

  private final void maybeMergeSegments() throws IOException {
    long targetMergeDocs = minMergeDocs;
    while (targetMergeDocs <= maxMergeDocs) {
So I guess to prevent this we would have to set minMergeDocs to 
maxMergeDocs+1 ... which makes no sense.  Also by default maxMergeDocs 
is Integer.MAX_VALUE so that will have to be changed.

Anyway... I'm still playing with this myself. It might be easier to just 
use an ArrayList of N documents if you know for sure how big your RAM 
dir will grow to.
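
For what it's worth, the ArrayList approach looks roughly like the 
following: buffer plain Document objects and only touch an IndexWriter 
(and therefore the merge logic) when it's time to flush a batch to disk.  
Path, analyzer and batch size are placeholders, and create=false assumes 
the on-disk index already exists:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BufferedIndexer {
  private final List buffer = new ArrayList();
  private final int batchSize = 50000;
  private final String indexPath = "/index/target";

  public void add(Document doc) throws Exception {
    buffer.add(doc);                 // no analysis, no merging -- just an in-memory list
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  public void flush() throws Exception {
    if (buffer.isEmpty()) return;
    IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
    for (int i = 0; i < buffer.size(); i++) {
      writer.addDocument((Document) buffer.get(i));  // analysis happens here, all at once
    }
    writer.close();
    buffer.clear();
  }
}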

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Does a RAMDirectory ever need to merge segments... (performance issue)

2004-04-20 Thread Kevin A. Burton
I've been benchmarking our indexer to find out if I can squeeze any more 
performance out of it.

I noticed one problem with RAMDirectory... I'm storing documents in 
memory and then writing them to disk every once in a while. ...

IndexWriter.maybeMergeSegments is taking up 5% of total runtime. 
DocumentWriter.addDocument is taking up another 17% of total runtime.

Notice that this doesn't == 100% because there are other tasks taking up 
CPU before and after Lucene is called.

Anyway... I don't see why RAMDirectory is trying to merge segments.  Is 
there any way to prevent this?  I could just store them in a big 
ArrayList until I'm ready to write them to a disk index but I'm not sure 
how efficient this will be.

Anyone run into this before?

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-14 Thread Kevin A. Burton
petite_abeille wrote:

On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:

He mentioned that I might be able to squeeze 5-10% out of index 
merges this way.


Talking of which... what strategy(ies) do people use to minimize 
downtime when updating an index?

This should probably be a wiki page.

Anyway... two thoughts I had on the subject a while back:

You maintain two disks (not RAID ... you get reliability through software).

Searches are load balanced between disks for performance reasons.  If 
one fails you just stop using it.

When you want to do an index merge you read from disk0 and write to 
disk1.  Then you take disk0 out of search rotation, add disk1, and 
copy the contents of disk1 back to disk0.  Users shouldn't notice much of 
a performance issue during the merge because it will be VERY fast and 
it's just reads from disk0.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: verifying index integrity

2004-04-12 Thread Kevin A. Burton
Doug Cutting wrote:

If you use this method, it is possible to corrupt things.  In 
particular, if you unlock an index that another process is modifying, 
then modify it, then these two processes might step on one another.  
So this method should only be called when you are certain that no one 
else is modifying the index.

We're handling this by using .pid files.  We use a standard initializer 
and our own lock files with process IDs.
give you the source to the JNI getpid that I created.  I've been meaning 
on Open Sourcing this anyway... putting it into commons probably.

This way you can prevent multiple initialization if a java process is 
currently running that might be working with your index.  Otherwise 
there's no real way to be sure the lock isn't stale (unless time is a 
factor but that slows things down)

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Performance of hit highlighting and finding term positions for

2004-04-01 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:

730 msecs is the correct number for 10 * 16k docs with StandardTokenizer! 
The 11ms per doc figure in my post was for highlighting using a 
lower-case-filter-only analyzer. 5ms of this figure was the cost of the 
lower-case-filter-only analyzer.

73 msecs is the cost of JUST StandardTokenizer (no highlighting).
StandardAnalyzer uses StandardTokenizer so is probably used in a lot of apps. It 
tries to keep certain text, e.g. email addresses, as one term. I can live without it and 
I suspect most apps can too. I haven't looked into why it's slow but I notice it does 
make use of Vectors. I think a lot of people's highlighter performance issues may 
extend from this.
 

Looking at StandardTokenizer I can't see anything that would slow it 
down much... can we get the source to your lower case filter?!

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


[patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Kevin A. Burton
Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:

 'diff' -uN MultiSearcher.java.bak MultiSearcher.java
--- MultiSearcher.java.bak  2004-03-30 14:57:41.660109642 -0800
+++ MultiSearcher.java  2004-03-30 14:57:46.530330183 -0800
@@ -208,4 +208,8 @@
return searchables[i].explain(query,doc-starts[i]); // dispatch to 
searcher
  }

+  public Searchable[] getSearchables() {
+return searchables;
+  }
+
}
--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster	



signature.asc
Description: OpenPGP digital signature


Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Kevin A. Burton
I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms. 

This seems very inefficient since Lucene already knows the 
frequency and position of given terms in the index.

My question is whether it's hard to find a TermPosition for a given term 
in a given document rather than the whole index.

IndexReader.termPositions( Term term ) is term specific not term and 
document specific.

Also it seems that after all this time Lucene should have efficient 
hit highlighting as a standard package.  Is there any interest in seeing 
a contribution in the sandbox for this if it uses the index positions?
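
For the record, getting at positions for one document doesn't look too 
bad with the existing API: termPositions(Term) returns postings ordered 
by document number, so you can skip to the document you care about and 
read its positions.  An untested sketch (the reader and term are supplied 
by the caller):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class DocTermPositions {
  /** Positions of 'term' within document 'docId', or an empty array if it doesn't occur. */
  public static int[] positions(IndexReader reader, Term term, int docId) throws IOException {
    TermPositions tp = reader.termPositions(term);
    try {
      // Postings are in document order, so skip straight to the doc we want.
      if (tp.skipTo(docId) && tp.doc() == docId) {
        int[] positions = new int[tp.freq()];
        for (int i = 0; i < positions.length; i++) {
          positions[i] = tp.nextPosition();
        }
        return positions;
      }
      return new int[0];
    } finally {
      tp.close();
    }
  }
}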

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Kevin A. Burton
Erik Hatcher wrote:

On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote:

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.
This seems that it's very inefficient since lucene already knows the 
frequency and position of given terms in the index.


What if the original analyzer removed stopped words, stemmed, and 
injected synonyms?
Just use the same analyzer :)... I agree it's not the best approach for 
this reason and the CPU reason.

Also it seems that after all this time that Lucene should have 
efficient hit highlighting as a standard package.  Is there any 
interest in seeing a contribution in the sandbox for this if it uses 
the index positions?


Big +1, regardless of the implementation details.  Hit hilighting is 
so commonly requested that having it available at least in the 
sandbox, or perhaps even in the core, makes a lot of sense. 
Well if we could make it efficient by using the frequency and positions 
of terms we're all set :)... I just need to figure out how to do this 
efficiently per document.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: [patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Kevin A. Burton
Erik Hatcher wrote:

On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote:

Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:


Could you elaborate on why it makes sense?  What if the caller changed 
a Searchable in the array?  Would anything bad happen?  (I don't know, 
haven't looked at the code).
Yes... something bad could happen... but that would be amazingly stupid 
... we should probably recommend that it be readonly.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Kevin A. Burton
I have a 7G index.  A query for a random term comes back fast (300ms) 
when I'm not using a DateFilter but when I add the DateFilter it takes 
2.6 seconds.  Way too long.  I assume this is because the filter API 
does a post process so it has to read fields off disk.

Is it possible to do this with a RangeQuery?  For example you could 
create a 'days since January 1, 1970' field and do a range query 
between 5 and 10... and then add the original field as well.

I have to make some app changes so I figured I would ask here before 
moving forward.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:

One way to force larger read-aheads might be to pump up Lucene's input 
buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE 
to 1024*1024 or larger.  You'll want to do this just for the merge 
process and not for searching and indexing.  That should help you 
spend more time doing transfers with less wasted on seeks.  If that 
helps, then perhaps we ought to make this settable via system property 
or somesuch.

Good suggestion... seems about 10% - 15% faster in a few strawman 
benchmarks I ran.   

Note that right now this var is final and not public... so that will 
probably need to change.  Does it make sense to also increase the 
OutputStream.BUFFER_SIZE?  This would seem to make sense since an 
optimize is a large number of reads and writes.  

I'm obviously willing to throw memory at the problem

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Tracking/Monitoring Search Terms in Lucene

2004-03-29 Thread Kevin A. Burton
Katie Lord wrote:

I am trying to figure out how to track the search terms that visitors are
using on our site on a monthly basis. Do you all have any suggestions?
 

Don't use lucene for this... just have your form record the search terms.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Kevin A. Burton
Erik Hatcher wrote:

One more point... caching is done by the IndexReader used for the 
search, so you will need to keep that instance (i.e. the 
IndexSearcher) around to benefit from the caching.

Great... Damn... looked at the source of CachingWrapperFilter and it 
makes sense.  Thanks for the pointer.  The results were pretty amazing.  
Here are the results before and after. Times are in millis:

Before caching the Field:

Searching for Jakarta:
2238
1910
1899
1901
1904
1906
After caching the field:
2253
10
6
8
6
6
That's a HUGE difference :)

I'm very happy :)
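
For reference, the pattern that produced those numbers: one long-lived 
IndexSearcher, and the date restriction expressed as a Filter wrapped in 
CachingWrapperFilter so its BitSet is computed once per reader and then 
reused.  A rough sketch; the field name, date values and query are 
placeholders:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryFilter;
import org.apache.lucene.search.RangeQuery;

public class CachedDateSearch {
  // Keep ONE searcher around -- the cache is keyed off its IndexReader.
  private final IndexSearcher searcher;
  private final Filter lastWeek;

  public CachedDateSearch(IndexSearcher searcher) {
    this.searcher = searcher;
    Query dateRange = new RangeQuery(
        new Term("dateIndexed", "20040322"),
        new Term("dateIndexed", "20040329"), true);
    // The first search pays to build the BitSet; every later search reuses it.
    this.lastWeek = new CachingWrapperFilter(new QueryFilter(dateRange));
  }

  public Hits search(Query query) throws Exception {
    return searcher.search(query, lastWeek);
  }
}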

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




signature.asc
Description: OpenPGP digital signature


Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:

How long is it taking to merge your 5GB index?  Do you have any stats 
about disk utilization during merge (seeks/second, bytes 
transferred/second)?  Did you try buffer sizes even larger than 1MB? 
Are you writing to a different disk, as suggested?
I'll do some more testing tonight and get back to you

Note that right now this var is final and not public... so that will 
probably need to change.


Perhaps.  I'm reticent to make it too easy to change this.  People 
tend to randomly tweak every available knob and then report bugs, or, 
if it doesn't crash, start recommending that everyone else tweak the 
knob as they do.  There are lots of tradeoffs with buffer size, cases 
that folks might not think of (like that a wildcard query creates a 
buffer for every term that matches), etc.
Or you can do what I do and recompile ;) 

Does it make sense to also increase the OutputStream.BUFFER_SIZE?  
This would seem to make sense since an optimize is a large number of 
reads and writes.


It might help a little if you're merging to the same disk as you're 
reading from, but probably not a lot.  If you're merging to a 
different disk then it shouldn't make much difference at all.

Right now we are merging to the same disk...  I'll perform some real 
benchmarks with this var too.  Long term we're going to migrate to using 
two SCSI disks per machine and then doing parallel queries across them 
with optimized indexes.

Also with modern disk controllers and filesystems I'm not sure how much 
difference this should make.  Both Reiser and XFS do a lot of internal 
buffering as does our disk controller.  I guess I'll find out...

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: BooleanQuery$TooManyClauses

2004-03-29 Thread Kevin A. Burton
hui wrote:

Hi,
I have a range query for the date like [20011201 To 20040201], it works fine
for Lucene API 1.3 RC1. When I upgrade to 1.3 final, I got
BooleanQuery$TooManyClauses exception sometimes, no matter whether the index is
created by 1.3RC1 or 1.3 final. Checking the email archive, it seems related
to maxClauseCount. Is increasing maxClauseCount the only way to avoid this
issue in 1.3 final? The dev mail list has some discussion on the future plan
on this.
 

I've noticed the same problem... The strange thing is that it only 
happens on some queries.  For example the query 'blog' results in this 
exception but the query 'linux' in my index works just fine. 

This is the stacktrace if anyone's interested:

org.apache.lucene.search.BooleanQuery$TooManyClauses
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
   at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:99)
   at 
org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
   at 
org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
   at 
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
   at org.apache.lucene.search.Query.weight(Query.java:120)
   at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
   at 
org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:150)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
   at org.apache.lucene.search.Hits.init(Hits.java:80)
   at org.apache.lucene.search.Searcher.search(Searcher.java:71)

For the record I'm also using a DateRange but I disabled it and still noticed the same behavior.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Lucene optimization with one large index and numerous small indexes.

2004-03-28 Thread Kevin A. Burton
We're using Lucene with one large target index which right now is 5G.  
Every night we take sub-indexes which are about 500M and merge them 
into this main index.  This merge (done via 
IndexWriter.addIndexes(Directory[])) is taking way too much time.

Looking at the stats for the box we're essentially blocked on reads.  
The disk is blocked on read IO and CPU is at 5%.  If I'm right I think 
this could be minimized by continually picking the two smallest indexes, 
merging them, then picking the next two smallest, merging them, and 
keeping this up until we're down to one index.

Does this sound about right?

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: too many files open error

2004-03-26 Thread Kevin A. Burton
Charlie Smith wrote:

/opt/famhistdev/fhstage/jbin/.docSearcher/indexes/fhstage_update/_3ff.f6 (Too
many open files)
 

Just a suggestion... why not put a URL string in the 'Too many open 
files' exception?  Tons of people keep running into this problem and 
we keep wasting both our time and their time. 

We could just link to the FAQ entry.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: too many files open error

2004-03-26 Thread Kevin A. Burton
Chad Small wrote:

Is this :) serious?  

Because we have a need/interest in the new field sorting capabilities

URL to documentation for field sorting?

and QueryParser keyword handling of dashes (-) that would be in 1.4, I believe.  It's so much easier to explain that we'll use a final release of Lucene instead of a dev build Lucene.  
 



--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: code works with 1.3-rc1 but not with 1.3-final??

2004-03-22 Thread Kevin A. Burton
Dan wrote:

I have some code that creates a lucene index. It has been working fine 
with lucene-1.3-rc1.jar but I wanted to upgrade to 
lucene-1.3-final.jar. I did this and the indexer breaks. I get the 
following error when running the index with 1.3-final:

Optimizing the index
IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many 
open files)
Indexed 884 files in 8 directories
Index creation took 242 seconds
%

No... it's you... ;)

Read the FAQ and then run

ulimit -n 10000 or so...

You need to increase your file handle limit.  Chances are you never noticed 
this before but the problem was still present.  If you're on a Linux box 
you would be amazed to find out that you're only about 200 file handles 
away from running out of your per-user file handle quota.

You might have to su to root to change this.. RedHat is more strict 
because it uses the glibc resource restrictions thingy (whose name 
slips my mind at the moment). 

Debian is configured better here by default.

Also a Google query would have solved this for you very quickly ;)..

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Lock timeout should show the index it failed on...

2004-03-22 Thread Kevin A. Burton
Just an RFE... if a lock times out we should probably include the name of 
the FSDirectory in the exception (or note that it's a RAMDirectory)...

I'm lazy so this is a reminder for either myself to do this or to wait 
until one of you guys takes care of it :)
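
Purely illustrative -- not the actual lock code, just the shape of the 
message I'm asking for (names are made up):

import java.io.File;
import java.io.IOException;

public class LockTimeoutMessage {

  /** Build a timeout error that actually names the directory it failed on. */
  public static IOException lockTimedOut(File indexDir, long waitedMillis) {
    return new IOException("Lock obtain timed out after " + waitedMillis
        + " ms on index directory " + indexDir.getAbsolutePath());
  }
}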

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Real time indexing and distribution to lucene on separate boxes (long)

2004-03-11 Thread Kevin A. Burton
I'm curious to find out what others are doing in this situation.

I have two boxes... the indexer and the searcher.  The indexer is taking
documents and indexing them and creating indexes in a RAMDirectory (for
efficiency) and is then writing these indexes to disk as we begin to run 
out of
memory.  Usually these aren't very big... 15-100M or so.

Obviously I'm dividing the indexing and searching onto dedicated boxes to
improve efficiency.  The real issue though is that the searchers need to 
be live
all the time as indexes are being added at runtime.

So if that wasn't clear.  I actually have to push out fresh indexes 
WHILE users
are searching them.  Not a very easy thing to do.

Here's my question.  What are the optimum ways to then distribute these 
index segments to the secondary searcher boxes?  I don't want to use the 
MultiSearcher because it's slow once we have too many indexes (see my PS).

Here's what I'm currently thinking:

1.  Have the indexes sync'd to the searcher as shards directly.  This 
doesn't
scale as I would have to use the MultiSearcher which is slow when it has too
many indexes.  (And ideally we would want an optimized index).

2. Merge everything into one index on the indexer.  Lock the searcher, 
then copy over the new index via rsync.  The problem here is that the 
searcher would need to lock up while the sync is happening to prevent 
reads on the index.  If I do this enough and the system is optimized I 
think I would only have to block for 5 seconds or so but that's STILL 
very long.

3. Have two directories on the searcher.  The indexer would then sync to 
a tmp
directory and then at run time swap them via a rename once the sync is over.
The downside here is that this will take up 2x disk space on the 
searcher.  The
upside is that the box will only slow down while the rsync is happening.

4. Do a LIVE index merge on the production box.  This might be an 
interesting approach.  The major question I have is whether you can do an 
optimize/merge on an index that's currently being used.  I *think* it 
might be possible but I'm not sure.  This isn't as fast as performing the 
merge on the indexer beforehand but it does have the benefits of both 
worlds.

If anyone has any other ideas I would be all ears...

PS.. Random question.  The performance of the MultiSearcher is M*log(N), 
correct?  Where N is the number of documents in the index and M is the 
number of indexes?  Is this about right?
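
To make option 3 concrete, here's roughly what I'm picturing on the 
searcher box -- just a sketch assuming the plain IndexSearcher(String) 
constructor; the class name, the swap trigger, and the handling of 
in-flight searches are all hand-waved:

import java.io.File;

import org.apache.lucene.search.IndexSearcher;

public class DoubleBufferedSearcher {

  private final File dirA;
  private final File dirB;
  private volatile IndexSearcher current;
  private boolean aIsLive = true;

  public DoubleBufferedSearcher(File dirA, File dirB) throws Exception {
    this.dirA = dirA;
    this.dirB = dirB;
    this.current = new IndexSearcher(dirA.getPath());
  }

  /** Call after rsync has finished writing into the inactive directory. */
  public synchronized void swap() throws Exception {
    File next = aIsLive ? dirB : dirA;
    IndexSearcher fresh = new IndexSearcher(next.getPath());
    IndexSearcher old = current;
    current = fresh;          // new searches pick up the new index from here on
    aIsLive = !aIsLive;
    old.close();              // NOTE: in-flight searches on 'old' would need ref-counting
  }

  public IndexSearcher searcher() {
    return current;
  }
}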

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





int vs long and document ids on 64bit machines.

2004-03-11 Thread Kevin A. Burton
A discussion I had a while back had someone note (Doug?) that the 
decision to go with 32-bit ints for document IDs was made because, on 
32-bit machines, 64-bit writes aren't atomic (thread-safe).

Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)? 

How hard would it be to build a lucene64 that used 64-bit document 
handles (longs) for 64-bit processors?!  Is it just a recompile?  Will the 
file format break and need updating?!

Also ... what are the symptoms of a Lucene build using 64-bit document 
numbers (longs) on 32-bit processors?  Right now we're personally stuck on 
32-bit machines but I would like to see us migrate to 64-bit boxes over 
the next 6 months...

Anyway... thinking out loud.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: int vs long and document ids on 64bit machines.

2004-03-11 Thread Kevin A. Burton
Doug Cutting wrote:

Someone, not me, perhaps provided that rationalization, which isn't a 
bad one.  In fact, the situation was more that, in 1997, when I 
started Lucene, 2 billion documents seemed like a lot for a Java-based 
search engine which was designed to scale to perhaps millions of 
documents, but probably not to the world.  Java was slow then, remember?
Yes... agreed.

Does anyone know how JDK 1.4.2 works on Itanium, Opteron (AMD64)?
How hard would it be to build a lucene64 that used 64bit document 
handles (longs) for 64bit procesors?!  Is it just a recompile?  Will 
the file format break and need updating?!


I think the file format is 64-bit safe.  But the code changes would be 
quite numerous.  No doubt we should make this change someday.  Do you 
anticipate more than 2 billion documents in your Lucene index sometime 
soon, e.g., this year?

Also, with Java, it's not just a recompile, it's a lot of code changes.
Well ... the refactor should at LEAST be pretty easy... just start 
changing int to long and follow up until the code compiles.  Not sure if 
it's that easy.

Also ... what are the symptoms of a Lucene build using 64bit ints on 
32bit processors.  Right now we're personally stuck on 32bit machines 
but I would like to see us migrate to 64 bit boxes over the next 6 
months...


Java's int datatype is defined as 32 bit.  So there are no 64-bit 
ints.  There are longs.  I doubt longs are much slower than ints to 
deal with on most JVMs today.  However a long[] is twice as big as an 
int[], and an array may only be indexed by an int.  Currently Lucene 
uses a byte[] indexed by document number to store normalization 
factors.  This would not work if document numbers are longs.  Filters 
index bit vectors with document numbers, and that also would not work 
if document numbers were longs.  Working around these will not only 
take some code, it may also impact performance a bit.

I suspect that Java will soon evolve to better embrace 64-bit 
machines.  Someday assignment of longs will be atomic.  (This is 
hinted at in the language spec.)  Someday arrays will probably be 
indexable by longs. I'd prefer to wait until these changes happen 
before changing Lucene's document numbers to longs.

At some point I might take a look at the code and see how hard it would 
be... Thanks for your notes... I'll probably use these in the future.

The main problem is that with indexes that have lots of SMALL documents you 
could see yourself running out of ints.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: Real time indexing and distribution to lucene on separate boxes (long)

2004-03-11 Thread Kevin A. Burton
Otis Gospodnetic wrote:

I like option 3.  I've done it before, and it worked well.  I dealt
with very small indices, though, and if your indices are several tens
or hundred gigs, this may be hard for you.
Option 4: search can be performed on an index that is being modified
(update, delete, insert, optimize).  You'd just have to make sure not
to recreate new IndexSearcher too frequently, if your index is being
modified often.  Just change it every X index modification or every X
minutes, and you'll be fine.
 

Right now I'm thinking about #4... Disk may be cheap but a fast RAID 10 
array with 100G twice isn't THAT cheap... That's the worst-case scenario 
of course and most modern search clusters use cheap hardware though...

Also... since the new indexes are SO small (~100M) the merges would 
probably be easier on the machine than just doing a whole new write.  Of 
course it's hard to make that argument with a 100G RAID array, but we're 
using rsync to avoid distributing all that network IO, so the CPU 
computation and network reads would slow things down.

The only way around this is to re-upload the whole 100G index, but even 
over gigabit ethernet this will take 15 minutes.  This doesn't scale as 
we add more searchers.

Thanks for the feedback!  I think now that I know that optimize is safe 
as long as I don't create a new reader... I'll be fine.  I do have to think 
about how I'm going to handle search result navigation.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: update performance

2004-03-11 Thread Kevin A. Burton
Chris Kimm wrote:

Unfortunately, I'm not able to batch the updates.  The application 
needs to make some decisions based on what each document looks like 
before and after the update, so I have to do it one at a time.  I 
guess this is not a common usage scenario for Lucene.  Otherwise, an 
update() might already be built in somewhere.

Is there anything in the locking/sync framework which precludes saving 
the cost of closing the Directory object and deleting the temp lock 
file each time an update is made?

Use a RAM directory... then when you're pretty sure you're done call 
IndexWriter.addIndexes() on the disk index.

Will that work for you?

You can also do this every N documents, every few minutes, or based on 
memory usage, and have the commit done by a synchronized thread.
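
Something like this is what I mean -- a sketch against the old 
IndexWriter/RAMDirectory API; the flush threshold and class name are 
mine, and it assumes the on-disk index already exists:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BufferedIndexer {

  private static final int FLUSH_EVERY = 1000;   // illustrative threshold

  private final String diskIndexPath;
  private RAMDirectory buffer = new RAMDirectory();
  private IndexWriter ramWriter;
  private int pending = 0;

  public BufferedIndexer(String diskIndexPath) throws Exception {
    this.diskIndexPath = diskIndexPath;
    this.ramWriter = new IndexWriter(buffer, new StandardAnalyzer(), true);
  }

  public synchronized void add(Document doc) throws Exception {
    ramWriter.addDocument(doc);
    if (++pending >= FLUSH_EVERY) {
      flush();
    }
  }

  /** Merge the RAM buffer into the on-disk index and start a fresh buffer. */
  public synchronized void flush() throws Exception {
    if (pending == 0) {
      return;
    }
    ramWriter.close();
    IndexWriter diskWriter = new IndexWriter(
        FSDirectory.getDirectory(diskIndexPath, false),
        new StandardAnalyzer(), false);
    diskWriter.addIndexes(new Directory[] { buffer });
    diskWriter.close();

    buffer = new RAMDirectory();
    ramWriter = new IndexWriter(buffer, new StandardAnalyzer(), true);
    pending = 0;
  }
}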

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Kevin A. Burton
Scott ganyo wrote:

I don't buy it.  HashSet is but one implementation of a Set.  By 
choosing the HashSet implementation you are not only tying the class 
to a hash-based implementation, you are tying the interface to *that 
specific* hash-based implementation or its subclasses.  In the end, 
either you buy the concept of the interface and its abstraction or you 
don't.  I firmly believe in using interfaces as they were intended to 
be used.
An interface isn't just the concept of a Java interface but ALSO the 
implied and required semantics.

TreeSet, etc. are too slow to be used with the StopFilter, thus we should 
prevent their use. 

We require HashSet/Map...

Scott

P.S. In fact, HashSet isn't always going to be the most efficient 
anyway.  Just for one example:  Consider possible implementations if I 
have only 1 or 2 entries.

HashSet is not always the most efficient... if you need to do runtime 
inserts and bulk removal TreeSet/Map might be more efficient.  Also if 
you need to sort the map then you're stuck with a tree.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Kevin A. Burton
Erik Hatcher wrote:

I will refactor again using Set with no copying this time (except for 
the String[] and Hashtable constructors).  This was my original 
preference, but I got caught up in the arguments by Kevin and lost my 
ideals temporarily :)

I expect to do this later tonight or tomorrow.
How about this as a compromise...

No copy in the constructor... use a Set, but in the documentation summarize 
this conversation and point out that the user should use a HashSet and 
NOT any other type of Set, and that anything else will result in a copy..

I think Doug's comment about a potentially faster impl in the future was 
a good point...
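
Roughly what I'm picturing -- not the real StopFilter, just the shape of 
the compromise (the class name and javadoc wording are mine):

import java.util.HashSet;
import java.util.Set;

public class StopFilterSketch {

  private final Set stopWords;

  /**
   * @param stopWords the words to drop.  Pass a HashSet and it is used
   *        as-is (no defensive copy); any other Set type gets copied into
   *        a HashSet, since stop-word lookup has to stay fast.
   */
  public StopFilterSketch(Set stopWords) {
    this.stopWords = (stopWords instanceof HashSet)
        ? stopWords
        : new HashSet(stopWords);
  }

  boolean isStopWord(String token) {
    return stopWords.contains(token);
  }
}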

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Kevin A. Burton
Erik Hatcher wrote:


Also... while you're at it... the private variable name is 'table' 
which this HashSet certainly is *not* ;)


Well, depends on your definition of 'table' I suppose :)  I changed it 
to a type-agnostic stopWords.
Did you know that internally HashSet uses a HashMap?

I sure didn't!

hashset.contains() maps to hashmap.containsKey()

It uses a key - value mapping to a generic PRESENT Object... hm. 

Probably makes sense to just call this variable 'hashset' and then 
force the type to be HashSet since it's necessary for this to be a 
HashSet to maintain any decent performance.  You'll need to update 
your second constructor to require a HashSet too.. would be very bad 
to let callers use another set impl... TreeSet and SortedSet would 
still be too slow...
I refuse to expose HashSet... sorry!  :)  But I did wrap what is 
passed in, like above, in a HashSet in my latest commit. 
Hm... You're doing this EVEN if the caller passes a HashSet directly?!

Why do you have a problem exposing a HashSet/Map... it SHOULD be a Hash 
based implementation.  Doing anything else is just wrong and would 
seriously slow down Lucene indexing.

Also... your HashSet constructor has to copy values from the original 
HashSet into the new HashSet ... not very clean, and this can just be 
removed by forcing the caller to use a HashSet (which they should).

:)

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster






Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Kevin A. Burton
Doug Cutting wrote:

Erik Hatcher wrote:

Also... your HashSet constructor has to copy values from the 
original HashSet into the new HashSet ... not very clean and this 
can just be removed by forcing the caller to use a HashSet (which 
they should).


I've caved in and gone HashSet all the way.


Did you not see my message suggesting a way to both not expose HashSet 
publicly and also not to copy values?  If not, I attached it.

For the record I didn't see it... but it echoes my points...

Thanks!

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster






Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Otis Gospodnetic wrote:

I really don't think this will make any noticable difference, but why
not.  Could you please send a diff -uN patch, please?
I made the same changes locally about a year ago, but have since thrown
away my local changes (for no good reason that I recall).
 

Just diff it locally... it's just a search-and-replace of Hashtable with 
HashMap...

Pretty trivial.

Kevin

--

Please reply using PGP:

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster






Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Erik Hatcher wrote:

I don't see any reason for this to be a Hashtable.

It seems an acceptable alternative to not share analyzer/filter  
instances across threads - they don't really take up much space, so 
is  there a reason to share them?  Or I'm guessing you're sharing it  
implicitly through an IndexWriter, huh?

I'll await further feedback before committing this change, but it seems 
reasonable to me.

Yeah... I'm using a RAMDirectory and adding documents to it across 
multiple threads... some of them index at the same time.

The patch is super small... the only difference is that it's using a 
HashMap which isn't synchronized... it can't hurt anything...

but feedback is a good thing :)

Kevin

--

Please reply using PGP:

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster






Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
Doug Cutting wrote:

Erik Hatcher wrote:

Well, one issue you didn't consider is changing a public method 
signature.  I will make this change, but leave the Hashtable 
signature method there.  I suppose we could change the signature to 
use a Map instead, but I believe there are some issues with doing 
something like this if you do not recompile your own source code 
against a new Lucene JAR so I will simply provide another 
signature too.


This is also a problem for folks who're implementing analyzers which 
use StopFilter.  For example:

public class MyAnalyzer extends Analyzer {

  private static final String[] stopWords = { "a", "an", "the" };  // for example

  private static final Hashtable stopTable =
      StopFilter.makeStopTable(stopWords);

  public TokenStream tokenStream(String field, Reader reader) {
    // ... wrap whatever tokenizer is in use, e.g.:
    return new StopFilter(new LowerCaseTokenizer(reader), stopTable);
  }
}

This would no longer compile with the change Kevin proposes.

To make things back-compatible we must:

1. Keep but deprecate StopFilter(Hashtable) constructor;
2. Keep but deprecate StopFilter.makeStopTable(String[]);
3. Add a new constructor: StopFilter(HashMap);
4. Add a new method: StopFilter.makeStopMap(String[]);
Does that make sense?
Ah... ok... good point.  If no one does this I'll take care of it...
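
Roughly, the shim would look something like this -- illustrative only, not 
the actual StopFilter source, and the filtering guts are elided:

import java.util.HashMap;
import java.util.Hashtable;

public class StopFilterCompat {

  /** @deprecated use {@link #StopFilterCompat(HashMap)} instead */
  public StopFilterCompat(Hashtable stopTable) {
    this(new HashMap(stopTable));
  }

  public StopFilterCompat(HashMap stopMap) {
    // ... keep the map and drop matching tokens in next() ...
  }

  /** @deprecated use {@link #makeStopMap(String[])} instead */
  public static Hashtable makeStopTable(String[] stopWords) {
    return new Hashtable(makeStopMap(stopWords));
  }

  public static HashMap makeStopMap(String[] stopWords) {
    HashMap map = new HashMap(stopWords.length * 2);
    for (int i = 0; i < stopWords.length; i++) {
      map.put(stopWords[i], stopWords[i]);
    }
    return map;
  }
}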

Kevin

--

Please reply using PGP:

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster






Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Kevin A. Burton
David Spencer wrote:

Maybe I missed something but I always thought the stop list should be 
a Set, not a Map (or Hashtable/Dictionary). After all, all you need to 
know is existence and that's what a Set does.
It stores the word as the key and the value...

I don't care either way... There was no HashSet back when this was 
written. I was just going to leave it as a HashMap so that in the future 
if we ever wanted to change the value we could...

Either way.

--

Please reply using PGP:

   http://peerfear.org/pubkey.asc

   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





