Re: Should Token be immutable?

Dmitry Serebrennikov Mon, 06 Jan 2003 19:53:42 -0800

Otis Gospodnetic wrote:

Ah, sorry about bringing up performance, I mixed that up with another
thread.
Anyhow, I still think that setPositionIncrement offers a nice feature that some
people may want to use.  It was on the to-do list for a while, and it was
there because people requested it, so even though Lucene doesn't use
setPositionIncrement internally, maybe Lucene-based apps out there do.

Most likely it would be analyzers for additional languages that would make use of this. One example where I have considered using this feature was in a special-purpose analyzer that placed multiple forms of a token into the same position. For example, a given word "10cm" can be parsed into two: "10", "cm". This would allow a document to be found when the query includes "10 cm" or "10cm". I ended up doing just this, but I do not currently bother with positions, only because I do not run phrase queries. However, if phrase queries were needed, I think I would probably want to place them at the same position.
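The "10cm" case above could be sketched roughly as follows. This is only an illustration under stated assumptions: SketchToken and SplitSketch are hypothetical stand-ins, not Lucene's Token or any Lucene analyzer, and the choice to put "10" at the same position as "10cm" (increment 0) with "cm" one position later is just one plausible reading of the idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token; this is NOT Lucene's Token class.
final class SketchToken {
    final String text;
    // Distance from the previous token; 0 means "same position as the previous token".
    final int positionIncrement;

    SketchToken(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class SplitSketch {
    // Emits the term itself, then (if it starts with digits followed by other
    // characters) the digit part at the same position and the rest one position later.
    static List<SketchToken> expand(String term) {
        List<SketchToken> out = new ArrayList<>();
        out.add(new SketchToken(term, 1));
        int i = 0;
        while (i < term.length() && Character.isDigit(term.charAt(i))) {
            i++;
        }
        if (i > 0 && i < term.length()) {
            out.add(new SketchToken(term.substring(0, i), 0)); // e.g. "10", stacked on "10cm"
            out.add(new SketchToken(term.substring(i), 1));    // e.g. "cm", next position
        }
        return out;
    }

    // Converts increments to absolute positions, starting so that a first
    // token with increment 1 lands on position 0.
    static int[] positions(List<SketchToken> tokens) {
        int[] result = new int[tokens.size()];
        int pos = -1;
        for (int k = 0; k < tokens.size(); k++) {
            pos += tokens.get(k).positionIncrement;
            result[k] = pos;
        }
        return result;
    }

    public static void main(String[] args) {
        List<SketchToken> tokens = expand("10cm");
        int[] pos = positions(tokens);
        for (int k = 0; k < tokens.size(); k++) {
            System.out.println(tokens.get(k).text + " @ " + pos[k]);
        }
    }
}
```

With this layout a term query for "10cm" and a phrase query for "10 cm" would both have something to match, since "10" and "cm" sit at adjacent positions.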

Another example where this could be useful would be with languages where a single word can be composed of many component words - such as German. Perhaps it can also be useful in oriental languages?

Dmitry.

Otis

--- stephane vaucher <[EMAIL PROTECTED]> wrote:

I'm not sure I understand your question. I'm not trying to
optimise anything. This thread was spawned because the usage of Token was
unclear and inconsistent (I don't see the purpose of the package-scoped members here). As a result, a few of us thought that an immutable Token might be clearer.

The simplest change (I personally believe it's an essential
change) is to make the members private.
The second change needed for the object to be immutable would be to remove
the positionIncrement, but since I'm no Lucene guru, I can't tell which is
better (hence the email).

I'll test the simple changes tonight to see if there is a sizable performance hit, and I'll wait to see if a guru speaks out about the controversial second change (which is also trivial).

Stephane

Otis Gospodnetic wrote:

It sounds to me that having the ability to do what point 13 in
CHANGES states is more important than trying to only slightly decrease
the number of temporary objects instantiated.

By the way, have you observed or measured the difference in
performance, memory consumption or anything else, before and after your
local changes?
Not having those and making Token immutable for performance reasons
would be wrong.

Thanks,
Otis

--- stephane vaucher <[EMAIL PROTECTED]> wrote:

I've noticed that there is a method public void
setPositionIncrement(int positionIncrement) that would probably have to disappear for Token to
be immutable. The CHANGES.txt doc seems to mention some good reasons why
it was added, but there is no code in CVS that seems to depend on it.

From CHANGES:
13. Added new method Token.setPositionIncrement().

This permits, for the purpose of phrase searching, placing
multiple terms in a single position. This is useful with
stemmers that produce multiple possible stems for a word.

This also permits the introduction of gaps between terms, so that
terms which are adjacent in a token stream will not be matched by
an exact phrase query. This makes it possible, e.g., to build
an analyzer where phrases are not matched over stop words which
have been removed.

Finally, repeating a token with an increment of zero can also be
used to boost scores of matches on that token. (cutting)
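The gap behaviour described in point 13 can be illustrated with a small sketch. It makes several assumptions: GapToken and GapSketch are hypothetical names and do not correspond to any Lucene class, and the two-term phrase check is a simplified stand-in for a real phrase query. Each removed stop word adds one to the increment of the next surviving token, so terms on either side of it are no longer adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token; this is NOT Lucene's Token class.
final class GapToken {
    final String text;
    final int positionIncrement;

    GapToken(String text, int positionIncrement) {
        this.text = text;
        this.positionIncrement = positionIncrement;
    }
}

final class GapSketch {
    // Drops stop words but records each dropped word as an extra
    // position increment on the next token that is kept.
    static List<GapToken> removeStopWords(String[] words, Set<String> stopWords) {
        List<GapToken> out = new ArrayList<>();
        int skipped = 0;
        for (String w : words) {
            if (stopWords.contains(w)) {
                skipped++;
                continue;
            }
            out.add(new GapToken(w, 1 + skipped));
            skipped = 0;
        }
        return out;
    }

    // Simplified exact two-term phrase check: true if some occurrence of
    // 'second' sits exactly one position after some occurrence of 'first'.
    static boolean matchesExactPhrase(List<GapToken> tokens, String first, String second) {
        Set<Integer> firstPositions = new HashSet<>();
        Set<Integer> secondPositions = new HashSet<>();
        int pos = -1;
        for (GapToken t : tokens) {
            pos += t.positionIncrement;
            if (t.text.equals(first)) firstPositions.add(pos);
            if (t.text.equals(second)) secondPositions.add(pos);
        }
        for (int p : firstPositions) {
            if (secondPositions.contains(p + 1)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("and"));
        List<GapToken> withGap = removeStopWords(new String[] {"rock", "and", "roll"}, stop);
        List<GapToken> adjacent = removeStopWords(new String[] {"rock", "roll"}, stop);
        System.out.println(matchesExactPhrase(withGap, "rock", "roll"));
        System.out.println(matchesExactPhrase(adjacent, "rock", "roll"));
    }
}
```

In this sketch "rock and roll" leaves "rock" and "roll" two positions apart, so the exact phrase "rock roll" does not match it, while text that really contains "rock roll" still does.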

Any comments? With an immutable Token, does the positionIncrement
still have a reason for being there? If not, then I'll remove getPositionIncrement as well.

Stephane

Doug Cutting wrote:

stephane vaucher wrote:

1) Does anyone mind? Will it break anything?

It shouldn't break anything.

2) Are there unit tests for this? (particularly PorterStemFilter).
The changes are obviously not spectacular, but I prefer not to screw
everyone up...

I don't know of any unit tests specifically for this. Mostly this
change will affect compilation. In general though, if you don't see
unit tests for things that you think you might break, then it never
hurts to write more unit tests.

3) I've checked out the latest version of Lucene, is there anything
special I need to do if I get the go-ahead to check my stuff in (like
a dev list review)?

If you're not a regular committer then please send diffs to lucene-dev
before committing and give folks a few days to consider the changes.

Doug

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
