I am working on an application that uses Tika to index text-based documents 
and stores the extracted text in Lucene.  These documents can range anywhere 
from 1 page to thousands of pages.

We are currently using Lucene 3.0.3.  I am using the StandardAnalyzer to index 
and search the text, which is contained in a single Lucene document field.

For purely alphabetic English words, searches return the expected results.  
The problem is with searching for numeric values in the indexed documents.  
Examples of text in the documents that cannot be found unless wildcards are 
used:

* 1-800-costumes.com
    - 800 does not find the text above

* $118.30
    - 118 does not find the text above

* 3tigers
    - 3 does not find the text above

* 000000123456
    - 123456 does not find the text above

* 123,abc,foo,bar,456
    - This is in a CSV file
    - Neither 123 nor 456 finds the text above
    - I realize this is because the text is separated only by commas and so 
      is treated as one token, but I think the issue is no different from 
      the others
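
To make the behaviour I am after concrete, here is a minimal sketch in plain 
Java (no Lucene classes) of the kind of splitting I would expect such a filter 
to do: break each token at punctuation and at letter/digit boundaries, and also 
emit a copy of any all-digit part with its leading zeros stripped. The names 
SubTokenSketch and split are hypothetical, made up for this example. A real 
solution would presumably have to live in the analysis chain, which this sketch 
does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not a Lucene TokenFilter.
class SubTokenSketch {

    // Splits one token into lower-cased runs of letters or digits.
    // Punctuation ends a run; so does a switch between letters and digits.
    public static List<String> split(String token) {
        List<String> parts = new ArrayList<String>();
        StringBuilder run = new StringBuilder();
        boolean runIsDigits = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            boolean digit = Character.isDigit(c);
            boolean letter = Character.isLetter(c);
            if (!digit && !letter) {
                flush(run, runIsDigits, parts);   // punctuation: close the run
                continue;
            }
            if (run.length() > 0 && digit != runIsDigits) {
                flush(run, runIsDigits, parts);   // letter/digit boundary
            }
            runIsDigits = digit;
            run.append(Character.toLowerCase(c));
        }
        flush(run, runIsDigits, parts);
        return parts;
    }

    // Adds the current run to parts; for digit runs with leading zeros,
    // also adds the zero-stripped form (an assumption about what users want).
    private static void flush(StringBuilder run, boolean digits, List<String> parts) {
        if (run.length() == 0) {
            return;
        }
        String s = run.toString();
        parts.add(s);
        if (digits) {
            String stripped = s.replaceFirst("^0+(?=\\d)", "");
            if (!stripped.equals(s)) {
                parts.add(stripped);
            }
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        String[] samples = {
            "1-800-costumes.com", "$118.30", "3tigers",
            "000000123456", "123,abc,foo,bar,456"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + split(s));
        }
    }
}
```

If sub-tokens like these were in the index, a plain search for 800, 118, 3, 
123456, 123 or 456 would match the examples above without any wildcards.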

The expectation from our users is that if they can open the document in its 
default application (Word, Adobe, Notepad, etc.) and perform a "find" within 
that application and find the text, then our application based on Lucene should 
be able to find the same text.

It is not reasonable for us to ask our users to surround their search terms 
with wildcards.  It also seems like a kludge to programmatically put wildcards 
around any numeric values the user enters for searching.
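
For reference, the kludge I mean would look roughly like the sketch below. It 
is plain Java string handling, and wrapNumericTerms is a hypothetical helper 
of my own, not a Lucene API. As far as I understand, the classic QueryParser 
also rejects leading wildcards unless setAllowLeadingWildcard(true) is called, 
and leading-wildcard queries can be slow on a large index, which is part of 
why I would rather avoid this approach.

```java
// Illustration only: rewrites the user's query string before parsing.
class WildcardKludgeSketch {

    // Wraps every all-digit term in '*' so that it matches inside larger tokens.
    public static String wrapNumericTerms(String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String term : userQuery.trim().split("\\s+")) {
            if (term.length() == 0) {
                continue;                       // empty input yields one empty term
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            if (term.matches("\\d+")) {
                out.append('*').append(term).append('*');
            } else {
                out.append(term);               // non-numeric terms are left alone
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapNumericTerms("call 800 now"));
    }
}
```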

Is there some type of numeric parser or filter that would help me out with 
these scenarios?

I've looked at Solr, but we already have a strong foundation of code built on 
Spring, Hibernate, and Lucene.  Integrating Solr into our application would 
take more refactoring and time than we have available for this release.

Also, since these numeric values are embedded within the documents, I don't 
think storing them as their own field would make sense since I want to maintain 
the context of the numeric values within the document.

Thank you.
