Re: Indexing product keys with and without spaces in them

Christoph Kaser Tue, 03 Jan 2012 07:06:06 -0800

Unfortunately, we don't have a designated field for product identifiers,
and the product identifiers are from various manufacturers. So it is
hard to normalize product keys, as we can't distinguish them from other
parts of the document.

Examples are xbox 360 (which might be searched as xbox360) or ipod
(which might be searched as i-pod).


We have about 100,000 products, so we might go with in-memory prefix matching.


I will give it a try!

Regards,
Christoph

Am 03.01.2012 15:48, schrieb Ian Lea:

My suggestion wasn't to store/index the triplets, just a normalized
version of the product key.  So if you had

id: CRXUSB2.0-16GB
desc: some 16GB USB thing

you'd index, in your searchable words field, "CRXUSB2.016GB some 16GB USB thing"

And then at search time you'd take "CRX USB2.0-16GB" and normalize that
to CRXUSB2.016GB and get a hit.  If they entered "CRX USB2.0 16G usb
thing" you'd also make usb2.016gusb and 16gusbthing, and
CRXUSB2.016Gusbthing if you're going to quintuplets. So your queries
would be more complex, but the index wouldn't be larger.  You'd need
to make the combinations optional rather than required of course, and
a long query would generate lots of combinations, but since most won't
match the search would likely still be fast.
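A minimal sketch of the query-side step described above, in plain Java with no Lucene dependency. The class name QueryCombinations, the maxWords parameter and the choice to uppercase everything are assumptions made for illustration; nothing here is from the original poster's code. It joins every run of up to maxWords adjacent query words and strips whitespace and hyphens, so each result could be added to the query as an optional clause.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class QueryCombinations {

    // Remove whitespace and hyphens and uppercase the rest.
    // Uppercasing is an assumption; the same rule must be used at index time.
    static String normalize(String s) {
        return s.replaceAll("[\\s-]+", "").toUpperCase(Locale.ROOT);
    }

    // Return the normalized concatenation of every run of 1..maxWords
    // adjacent, whitespace-separated words in the query.
    static List<String> combinations(String query, int maxWords) {
        List<String> out = new ArrayList<>();
        String trimmed = query.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder joined = new StringBuilder();
            for (int len = 1; len <= maxWords && start + len <= words.length; len++) {
                joined.append(words[start + len - 1]);
                String candidate = normalize(joined.toString());
                if (!candidate.isEmpty()) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Each printed string would become one optional clause of the query.
        for (String c : combinations("CRX USB-2.0 16GB", 3)) {
            System.out.println(c);
        }
    }
}
```

A query of n words yields at most n * maxWords strings, so the query grows linearly with its length while the index only has to hold the one normalized key per product.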

I'm not really saying this is a good idea, just something that might
work.  Personally I'd go and shout at whoever makes up the product ids
and replace them all with something simple and consistent.  Like
numbers.


Roughly how many products do you have?  If not a massive number you
could try some in memory prefix matching along the lines of

for each word in query
    if is product id
        OK, got product id
    if possible prefix
        add next word
        is product id?
            OK, got product id
        is still possible prefix?
            add next word
            etc.

Might even be able to do it with lucene term and prefix queries on a
normalized product id field.
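The prefix-matching idea above could be sketched in plain Java roughly as follows. This is only an illustration under assumptions: the class ProductIdMatcher, its method names and the uppercase normalization are hypothetical, and the ids are held in a java.util.TreeSet so that "is this still a possible prefix?" becomes a single ceiling() lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

final class ProductIdMatcher {

    // All known product ids in normalized form, sorted.
    private final TreeSet<String> ids = new TreeSet<>();

    ProductIdMatcher(Iterable<String> productIds) {
        for (String id : productIds) {
            ids.add(normalize(id));
        }
    }

    // Remove whitespace and hyphens and uppercase the rest (an assumption).
    static String normalize(String s) {
        return s.replaceAll("[\\s-]+", "").toUpperCase(Locale.ROOT);
    }

    // True if at least one known id starts with the candidate. In a sorted
    // set the smallest element >= candidate is such an id whenever one exists.
    private boolean isPossiblePrefix(String candidate) {
        String next = ids.ceiling(candidate);
        return next != null && next.startsWith(candidate);
    }

    // For each start word, keep appending following words while the joined
    // text is still a prefix of some id, and collect every exact match.
    List<String> findIds(String query) {
        String[] words = query.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder joined = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                joined.append(normalize(words[end]));
                String candidate = joined.toString();
                if (ids.contains(candidate)) {
                    found.add(candidate);
                }
                if (!isPossiblePrefix(candidate)) {
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        ProductIdMatcher m = new ProductIdMatcher(
                Arrays.asList("CRXUSB2.0-16GB", "XBOX 360", "IPOD"));
        System.out.println(m.findIds("cheap crx usb-2.0 16gb stick"));
        System.out.println(m.findIds("i-pod"));
    }
}
```

Each lookup is logarithmic in the number of ids, so the whole set of ids for a shop of moderate size fits comfortably in memory. Ids found this way could then be searched as exact terms, with the remaining words handled as ordinary text.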


--
Ian.


On Tue, Jan 3, 2012 at 2:09 PM, Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

Has somebody ever tried something like this? Is there a way to do this
without increasing the index to about 15 times (1+2+3+4+5) its
original size?

The index will not be 15 times the size, as it is an inverted index and
only indexes the unique parts of your tokens. In most cases it will be
roughly double the size. Just try it out; it depends on your data!

Uwe

Christoph


Am 03.01.2012 11:06, schrieb Ian Lea:

When indexing you could normalise them down to a standard format
without spaces or hyphens, but searching is much harder if you really
can't identify possible product ids within user queries.  Make
triplets without spaces or hyphens?  "CRX USB-2.0 16GB" ==>
CRXUSB2.016GB but also "some random words" ==> somerandomwords.  The
latter wouldn't match, the former would if it was a valid id.

Some form of synonym analysis/injection at indexing would be better if
you could do that: CRXUSB2.016GB ==>    "CRX USB2.0 16GB", to be indexed
as well as the base value.

If you can't either have a dedicated product id search field or
standardise the product ids, this is going to be hard.


--
Ian.


On Tue, Jan 3, 2012 at 8:44 AM, Christoph Kaser <lucene_l...@iconparc.de> wrote:

Hello,

We use Lucene as the search engine in an online shop. The products in
this shop often contain product keys like CRXUSB2.0-16GB.
We would like our customers to be able to find products by entering
their key. The problem is that product keys sometimes contain spaces
or dashes, and customers sometimes don't enter these separators
correctly. On the other hand, some customers enter spaces where
there are none. Is there an analyzer or some other method that allows
us to find the product if the user enters things like:
- "CRX USB2.0 16GB"
- "CRXUSB2.016GB"
- "CRX USB-2.0 16GB"
...

The problem is that the product keys don't all have a common format
and are embedded in the normal text, so we don't have an easy way to
treat them differently from the rest of the text.

Any help would be great!

Best regards,
Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
