Re: Indexing product keys with and without spaces in them

Christoph Kaser Tue, 03 Jan 2012 06:02:14 -0800

Hi Ian,

Thank you for your reply.

Unfortunately this will be hard, as we have no way of knowing at which position the user might enter spaces, so we cannot expand the product keys at indexing time.

The other way round (triplets without spaces or hyphens) might work. However, we have no real way of knowing whether product keys really are triplets, so we would also have to make doublets and quadruplets (and maybe even quintuplets). So for every two, three, four, or five consecutive tokens in our index, we would have to include the concatenated version. If we treat the user input the same way, we should be able to find type identifiers regardless of their spelling.
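A minimal sketch of the concatenation idea described above, in plain Java rather than as a Lucene analyzer. The class name `ConcatenatedShingles`, the split on whitespace and hyphens, and the lowercasing are assumptions made for illustration; nothing in this thread prescribes them. The same function is applied to the indexed text and to the user input, and a product key counts as found when both sides produce the same concatenated term.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ConcatenatedShingles {

    /** Splits on whitespace and hyphens and lowercases (an assumed tokenization). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[\\s-]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns every run of 1..maxSize consecutive tokens, joined without a separator. */
    static Set<String> shingles(String text, int maxSize) {
        List<String> tokens = tokenize(text);
        Set<String> out = new LinkedHashSet<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder joined = new StringBuilder();
            for (int len = 1; len <= maxSize && start + len <= tokens.size(); len++) {
                joined.append(tokens.get(start + len - 1));
                out.add(joined.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> indexed = shingles("USB stick CRXUSB2.0-16GB black", 5);
        String[] queries = {"CRX USB2.0 16GB", "CRXUSB2.016GB", "CRX USB-2.0 16GB"};
        for (String query : queries) {
            // Keep only the terms that the query shares with the indexed text.
            Set<String> shared = shingles(query, 5);
            shared.retainAll(indexed);
            System.out.println(query + " -> shared terms: " + shared);
        }
    }
}
```

All three query spellings share the term `crxusb2.016gb` with the indexed text, which is the match the approach relies on. The queries also share shorter terms such as `16gb`, so a real search would still need to rank or filter these hits.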

However, this would dramatically increase the index size and lead to false positives in situations where other words, when concatenated, form a different word.

Has anybody ever tried something like this? Is there a way to do this without increasing the index to about 15 times (1+2+3+4+5) its original size?
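The arithmetic behind the "15 times" estimate can be sketched as follows. This is a rough count only, not a measurement of a real Lucene index: it ignores term-dictionary sharing, postings compression, and stored fields. The class and method names are hypothetical. `termCount` counts how many terms are emitted for `n` tokens, and `tokenVolume` weights each concatenation by the number of tokens it contains, which is what the 1+2+3+4+5 sum describes.

```java
final class ShingleGrowth {

    /** Number of terms emitted for n tokens when runs of 1..maxSize tokens are indexed. */
    static long termCount(long n, int maxSize) {
        long total = 0;
        for (int k = 1; k <= maxSize; k++) {
            total += Math.max(0, n - k + 1); // there are n - k + 1 runs of length k
        }
        return total;
    }

    /** Same count, but each run of length k is weighted by the k tokens it contains. */
    static long tokenVolume(long n, int maxSize) {
        long total = 0;
        for (int k = 1; k <= maxSize; k++) {
            total += k * Math.max(0, n - k + 1);
        }
        return total;
    }

    public static void main(String[] args) {
        long n = 1000;
        System.out.println(termCount(n, 5) + " terms, " + tokenVolume(n, 5) + " token units");
    }
}
```

For 1000 tokens this gives 4990 terms and 14960 token units, so under these assumptions the number of indexed terms grows roughly 5-fold while the amount of term text grows roughly 15-fold.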


Christoph


On 03.01.2012 11:06, Ian Lea wrote:

When indexing you could normalise them down to a standard format
without spaces or hyphens, but searching is much harder if you really
can't identify possible product ids within user queries.  Make
triplets without spaces or hyphens?  "CRX USB-2.0 16GB" ==>
CRXUSB2.016GB but also "some random words" ==>  somerandomwords.  The
latter wouldn't match, the former would if it was a valid id.
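The normalisation described above could look like the following sketch. It uses only the Java standard library; the class name `ProductKeyNormalizer` is hypothetical, and treating every whitespace character and hyphen as removable is an assumption. Case is left untouched here, on the assumption that the analyzer handles case folding separately.

```java
final class ProductKeyNormalizer {

    /** Removes all whitespace and hyphens so that differently spaced keys compare equal. */
    static String normalize(String text) {
        return text.replaceAll("[\\s-]+", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("CRX USB-2.0 16GB"));
        System.out.println(normalize("some random words"));
    }
}
```

As the reply notes, the second input collapses to `somerandomwords`, which only causes a false positive if that string happens to be indexed as well.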

Some form of synonym analysis/injection at indexing would be better if
you could do that: CRXUSB2.016GB ==>  "CRX USB2.0 16GB", to be indexed
as well as the base value.

If you can neither have a dedicated product id search field nor
standardise the product ids, this is going to be hard.


--
Ian,


On Tue, Jan 3, 2012 at 8:44 AM, Christoph Kaser <lucene_l...@iconparc.de> wrote:

Hello,

We use Lucene as the search engine in an online shop. The products in this shop
often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering their
key. The problem is that product keys sometimes contain spaces or dashes, and
customers sometimes don't enter these separators correctly. On the other
hand, some customers enter spaces where there are none. Is there an
analyzer or some other method that allows us to find the product if the user
enters things like:
- "CRX USB2.0 16GB"
- "CRXUSB2.016GB"
- "CRX USB-2.0 16GB"
...

The problem is that the product keys don't all have a common format and are
contained in the normal text, so we don't have an easy way to treat them
differently from the rest of the text.

Any help would be great!

Best regards,
Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




--
Dipl.-Inf. Christoph Kaser

IconParc GmbH
Sophienstrasse 1
80333 München

www.iconparc.de

Tel +49 -89- 15 90 06 - 21
Fax +49 -89- 15 90 06 - 49

Geschäftsleitung: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
121830, Amtsgericht München



