Hi all,
I'm working on an analyzer for Slavic languages written in the Latin alphabet (cs, sk), without stemming at first.
I would like to ask you:
The stop word analyzer often uses a HashSet implementation, but the stop words rarely (if ever) change from the list shipped in the Java code. Do you think there would be a benefit to using a perfect hash algorithm?

My guess is that you wouldn't save much time here using a perfect hash.
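To make the trade-off concrete, here is a minimal sketch of what the lookup in question looks like. The word list below is a hypothetical fragment, not Lucene's actual Czech stop list; the point is that the per-token cost is already a single hash-and-compare, so a perfect hash could shave at most a few nanoseconds per token.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordCheck {
    // Hypothetical tiny Czech stop word fragment, for illustration only.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
        Arrays.asList("a", "i", "je", "na", "se", "to", "v"));

    static boolean isStopWord(String token) {
        // HashSet.contains is O(1) on average: one hashCode() over a short
        // key plus (usually) one equals(). A perfect hash removes only the
        // rare collision probe, so the win is marginal for lists this small.
        return STOP_WORDS.contains(token);
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("na"));    // true
        System.out.println(isStopWord("kocka")); // false
    }
}
```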

I will write an ICU analyzer for Latin characters (decomposing and returning the base character). Do you have any experience with ICU (icu.sf.net) — any problems or bottlenecks?

This could be a significant performance hit. Using ICU is a good idea, but typically putting some simple front-end filtering in front can save you a lot of time.

E.g. if there are a lot of characters that don't require any decomposition, you could do some quick (and very conservative) checks to skip calls to ICU.
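As a sketch of that front-end filter idea: pure-ASCII tokens can never need decomposition, so a cheap scan lets you skip the normalizer entirely. This example uses the JDK's built-in java.text.Normalizer rather than ICU proper, but the fast-path structure is the same either way.

```java
import java.text.Normalizer;

public class AsciiFastPath {
    // Conservative check: a token that is entirely 7-bit ASCII cannot
    // contain any precomposed characters, so no decomposition is needed.
    static boolean isPlainAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) return false;
        }
        return true;
    }

    static String foldDiacritics(String s) {
        if (isPlainAscii(s)) return s; // skip the expensive call entirely
        // Decompose to NFD, then strip the combining marks (\p{M}),
        // leaving only the base characters.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(foldDiacritics("žluťoučký kůň")); // zlutoucky kun
        System.out.println(foldDiacritics("plain ascii"));   // plain ascii
    }
}
```

In a real analyzer you would measure what fraction of your tokens take the fast path; for mostly-Czech text with diacritics the savings may be smaller than for mixed-language corpora.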

But of course, measure then optimize :)

P.S.: I would also like to contribute this stuff to lucene-contrib if it's recognized as useful. Is there any how-to for setting up Eclipse for Lucene/Apache-related projects?

If you're asking about how to set up Eclipse to do development for Lucene, I found some posts to the mailing list a while back, but nothing definitive.

FWIW, my experience with Eclipse 3.1 was that trying to auto-create Eclipse projects from the Ant build file didn't work very well. So we wound up manually creating the project, setting up the classpath, etc.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
