Problem with sorting on NumericFields
I got stuck on a problem using NumericFields with Lucene 2.9.3. I add values to the document by

doc.add(new NumericField("minprice").setDoubleValue(net_price));

If I search with a sort on this field, I get this error:

java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)
at org.apache.lucene.util.NumericUtils.prefixCodedToInt(NumericUtils.java:233)
at org.apache.lucene.search.FieldCache$8.parseFloat(FieldCache.java:256)
at org.apache.lucene.search.FieldCacheImpl$FloatCache.createValue(FieldCacheImpl.java:514)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at org.apache.lucene.search.FieldCacheImpl.getFloats(FieldCacheImpl.java:487)
at org.apache.lucene.search.FieldCacheImpl$FloatCache.createValue(FieldCacheImpl.java:504)
at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:224)
at org.apache.lucene.search.FieldCacheImpl.getFloats(FieldCacheImpl.java:487)
at org.apache.lucene.search.FieldComparator$FloatComparator.setNextReader(FieldComparator.java:269)
at org.apache.lucene.search.TopFieldCollector$MultiComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:435)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:257)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:240)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:181)
at org.apache.lucene.search.Searcher.search(Searcher.java:90)

The sort field as seen by the debugger:

* sort_fields = {org.apache.lucene.search.sortfield...@9010}
* [0] = {org.apache.lucene.search.sortfi...@9011}
* field = {java.lang.str...@8642} "minprice"
* type = 5
* locale = null
* reverse = true
* factory = null
* parser = null
* comparatorSource = null
* useLegacy = false

I have run out of ideas about what might go wrong. I looked at the index with Luke and do not see anything special. As this happens with the same code on other servers too, it looks like some kind of programming error. Any hints?

Thx
Uwe
RE: Problem with sorting on NumericFields
Thx Uwe, after sleeping on the problem... the solution just hit me ;) I index a double for the NumericField, but my SortField was set up as a float. (Maybe this is something for a NumericField FAQ.)

Thx
Uwe

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Tuesday, October 26, 2010 09:30
To: java-user@lucene.apache.org
Subject: RE: Problem with sorting on NumericFields

This happens if the field still contains other value types, maybe from deleted documents. The problem is that even if no document contains the old field encoding anymore, there can still be leftover terms in the term index. The FieldCache code loads those terms (even if no documents are attached to them any longer) and tries to parse them. So if the field previously held a different type, such as a conventional plain-text-encoded numeric, parsing those old terms fails. You should reindex everything, or at least optimize the index to get rid of the deleted documents and their terms.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
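A minimal sketch of the fix described above, assuming the "minprice" field from the thread and a pre-existing searcher, query and net_price. In Lucene 2.9, the type = 5 seen in the debugger dump is SortField.FLOAT, while the field was indexed as a trie-encoded double:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Indexing: the value is trie-encoded as a DOUBLE.
Document doc = new Document();
doc.add(new NumericField("minprice").setDoubleValue(net_price));

// Searching: the SortField type must match the indexed encoding.
// SortField.FLOAT (type 5, as in the debugger dump above) makes the
// FieldCache parse the double-encoded terms as floats and throws the
// NumberFormatException shown in the stack trace; DOUBLE works.
Sort sort = new Sort(new SortField("minprice", SortField.DOUBLE, true)); // true = reverse
TopDocs hits = searcher.search(query, null, 20, sort);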
RE: How can I merge .cfx and .cfs into a single .cfs file?
Index everything into one directory and determine the size of all files in it. From http://lucene.apache.org/java/3_0_1/fileformats.html:

"Starting with Lucene 2.3, doc store files (stored field values and term vectors) can be shared in a single set of files for more than one segment. When the compound file format is enabled, these shared files will be added into a single compound file (same format as above) but with the extension .cfx."

And on the compound file itself: Compound File (.cfs) -- an optional "virtual" file consisting of all the other index files, for systems that frequently run out of file handles.

Uwe

-Original Message-
From: 张志田 [mailto:zhitian.zh...@dianping.com]
Sent: Wednesday, May 5, 2010 08:24
To: java-user@lucene.apache.org
Subject: How can I merge .cfx and .cfs into a single .cfs file?

Hi all,

I have an index task which indexes thousands of records with Lucene 3.0.1. My confusion is that Lucene always creates a .cfx and a .cfs file in the file system, sometimes more, while I thought it should create a single .cfs file if I optimize the index data. Is this by design? If yes, is there any way or configuration to merge all of the index files into a single one?

By the way, I have a check that validates the index data: if the size of the .cfs increases dramatically compared to the file generated last time, something may be wrong and a warning message is thrown. This is the reason I want to generate a single .cfs file. Any other suggestion for the index validation?

Can anybody give me a hand? Thanks in advance.

Garry
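If the goal is the size check rather than literally one file, a sketch of the "measure the whole directory" advice above, against the Lucene 3.0.1 Directory API (the index path is hypothetical):

import java.io.File;
import org.apache.lucene.store.FSDirectory;

FSDirectory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical location
long totalBytes = 0;
for (String name : dir.listAll()) {     // segments_N, .cfs, .cfx, ...
    totalBytes += dir.fileLength(name); // sum everything the index consists of
}
dir.close();

Comparing totalBytes between runs gives the same warning signal without depending on how Lucene splits the segment files.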
RE: Relevancy Practices
Regarding part 3, data quality: for our search domain (catalog products) we very often face the problem that the search data is full of acronyms and abbreviations like

cable,nym-j,pvc,3x2.5mm²
or
dvd-/cd-/usb-carradio,4x50W,divx,bl

We solved this with a combination of normalization for better data quality (fewer variations) and a tolerant sloppy phrase search, where a search token only needs to partly match an indexed token. We use a dictionary lookup into the indexed tokens of some fields and expand the user's query with a well-weighted set of search terms. It took us some iterations to get this right and fast enough to search several million products. The next step on our list is facets.

Uwe

-Original Message-
From: mbennett.idea...@gmail.com [mailto:mbennett.idea...@gmail.com] On Behalf Of Mark Bennett
Sent: Thursday, April 29, 2010 16:59
To: java-user@lucene.apache.org
Subject: Re: Relevancy Practices

Hi Grant,

You're welcome to use any of my slides (Dave's got them), with attribution of course. BUT... have you considered a section on something like "why the hell do you think relevancy tweaking is gonna save you!?!?" Basically: as a corpus grows exponentially, so do result list sizes, so ALL relevancy tweaks will eventually fail, and FACETS (or other navigators) are the answer. I've got slides on that as well. Of course relevancy matters, but it's only ONE prong of perhaps a three-pronged approach:

1: Organic relevancy and top query suggestions
2: Result list navigators, the best the system can support
3: Data quality (spidering, metadata quality, source weighting, etc.)

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513

On Thu, Apr 29, 2010 at 7:14 AM, Grant Ingersoll gsing...@apache.org wrote:

I'm putting on a talk at Lucene Eurocon (http://lucene-eurocon.org/sessions-track1-day2.html#1) on "Practical Relevance", and I'm curious what people put into practice for testing and improving relevance. I have my own inclinations, but I don't want to muddy the water just yet. So, if you have a few moments, I'd love to hear responses to the following questions:

What worked?
What didn't work?
What didn't you understand about it?
What tools did you use?
What tools did you wish you had, either for debugging relevance or fixing it?
How much time did you spend on it?
How did you avoid over/under tuning?
At what stage of development/testing/production did you decide to do relevance tuning? Was that timing planned or not?

Thanks,
Grant
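A minimal sketch of the weighted query expansion described at the top of this thread (the field name, terms and boosts are made up, and the dictionary lookup that produces the variants is assumed to exist):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery expanded = new BooleanQuery();
// The token the user typed gets full weight...
expanded.add(new TermQuery(new Term("description", "nym-j")), BooleanClause.Occur.SHOULD);
// ...and each variant from the dictionary lookup gets a smaller, hand-tuned boost.
TermQuery variant = new TermQuery(new Term("description", "nym"));
variant.setBoost(0.4f);
expanded.add(variant, BooleanClause.Occur.SHOULD);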
MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file
We have an IndexWriter.optimize running on a 4-processor Xeon machine, Java 1.5, Win2003. We get a repeatable FileNotFoundException because the path to the file is wrong:

D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140_0.cfs

instead of

D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140\_0.cfs

I have no idea what is different here, because we use the same code successfully on other machines (even multi-core).

1. 2009.08.28 13:10:30 : [B:60043][N:org.apache.lucene.index.MergePolicy$MergeException]
org.apache.lucene.index.MergePolicy$MergeException: java.io.FileNotFoundException: D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140_0.cfs (The system cannot find the file specified)
at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:286)
Caused by: java.io.FileNotFoundException: D:\data0\impact\ordering\prod\work\search_index\s_index1251456210140_0.cfs (The system cannot find the file specified)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
at org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor.<init>(FSDirectory.java:552)
at org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:582)
at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:488)
at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:70)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:321)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:306)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:260)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4220)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3884)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260)

2. 2009.08.28 13:10:31 : [B:60043][N:java.io.IOException]
java.io.IOException: background merge hit exception: _0:c71339-_0 _1:c36232-_0 _2:c37691-_0 _3:c29335-_0 _4:c29954-_0 _5:c33617-_0 _6:c37092-_0 _7:c35483-_0 _8:c25244-_0 _9:c31566-_0 _a:c4891-_0 into _b [optimize]
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2273)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2218)
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2198)

I have looked through the code of FSDirectory:

// Inherit javadoc
public IndexInput openInput(String name, int bufferSize) throws IOException {
    ensureOpen();
    return new FSIndexInput(new File(directory, name), bufferSize);
}

Checking further, one would assume that in Win32FileSystem the following is not set:

slash = ((String) AccessController.doPrivileged(
    new GetPropertyAction("file.separator"))).charAt(0);

Which sounds more than strange to me... Any idea?

Regards
Uwe Goetzke
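As a sanity check of the java.io.File behavior questioned above: File(parent, child) inserts the platform separator itself, so the missing backslash is unlikely to come from FSDirectory's openInput. A tiny test (the path is the one from the log):

import java.io.File;

public class SeparatorCheck {
    public static void main(String[] args) {
        File dir = new File("D:\\data0\\impact\\ordering\\prod\\work\\search_index\\s_index1251456210140");
        // java.io.File adds the separator between parent and child itself;
        // this prints ...\s_index1251456210140\_0.cfs
        System.out.println(new File(dir, "_0.cfs").getPath());
    }
}

If that prints the correct path, the bad path was probably already present in the directory string that FSDirectory was opened with.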
RE: MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file
Oops, sorry: 2.4.1.

Thx
Uwe Goetzke

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de]
Sent: Monday, August 31, 2009 17:42
To: java-user@lucene.apache.org
Subject: RE: MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file

Which Lucene version? The RC2 of 2.9?

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message-
From: Uwe Goetzke [mailto:uwe.goet...@healy-hudson.com]
Sent: Monday, August 31, 2009 5:40 PM
To: java-user@lucene.apache.org
Subject: MergePolicy$MergeException caused by FileNotFoundException: wrong path to index file
RE: Most frequently indexed term
Hello Ganesh,

What about making a separate index for each day, running your analysis on it, and merging that index afterwards? I am not sure, but I think this might work. Use MultiSearcher for the search.

Regards
Uwe Goetzke

-Original Message-
From: Ganesh [mailto:emailg...@yahoo.co.in]
Sent: Monday, June 8, 2009 12:31
To: java-user@lucene.apache.org
Subject: Re: Most frequently indexed term

Thanks. This works well. The logic is:

1. Do the search; for every document get the list of terms and their frequencies.
2. Use SortedTermVectorMapper to generate a list of unique terms and their frequencies.
3. Sort them to get the top N most frequently indexed terms in a given date range (or any other criteria).

My question is: I need to get the top 20 most frequently indexed terms in a day, and one million documents could be indexed in a day. I would need to traverse the one million records and store the unique terms and their frequencies, which may consume a huge amount of memory. Is there any other way out? Without using term vectors I can get the most frequently indexed terms of a whole database; similarly, is there any way to get the most frequently indexed terms in a date range or a subset of the database?

Regards
Ganesh

- Original Message -
From: Preetham Kajekar preet...@cisco.com
To: java-user@lucene.apache.org
Sent: Tuesday, May 26, 2009 11:08 PM
Subject: Re: Most frequently indexed term

Have a look at http://stackoverflow.com/questions/195434/how-can-i-get-top-terms-for-a-subset-of-documents-in-a-lucene-index (I have not tried the above out)

Ganesh wrote:

Hello All, I need to build some stats. I need to know the top 5 most frequently indexed terms in a date range (in a day or a month). Any idea how to achieve this?

Regards
Ganesh
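If the index-per-day route is taken, the "top 20 of one day" step can be a plain TermEnum scan over that day's index with a bounded priority queue; no term vectors and no per-document traversal needed. Note that docFreq() counts documents containing a term, not total occurrences. A sketch (the field name "contents" and the dayDir variable are assumptions; contrib also ships org.apache.lucene.misc.HighFreqTerms, which does much the same):

import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

IndexReader reader = IndexReader.open(dayDir); // the one-day index suggested above
PriorityQueue<Object[]> top = new PriorityQueue<Object[]>(20, new Comparator<Object[]>() {
    public int compare(Object[] a, Object[] b) { // smallest docFreq at the head, so it can be evicted
        return ((Integer) a[0]).compareTo((Integer) b[0]);
    }
});
TermEnum terms = reader.terms();
while (terms.next()) {
    Term t = terms.term();
    if (!"contents".equals(t.field())) continue;  // only count one field
    top.add(new Object[] { Integer.valueOf(terms.docFreq()), t.text() });
    if (top.size() > 20) top.poll();              // keep only the 20 most frequent
}
terms.close();
reader.close();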
RE: Transforming German umlauts like ö, ä, ü, ß into oe, ae, ue, ss
public class UmlautFolder {

    // Note: the original post begins mid-switch, so the method head, the class
    // and method names, and the cases before '\u00EE' are reconstructed here.
    public static String foldUmlauts(String input) {
        StringBuilder output = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            switch (input.charAt(i)) {
                // ... (cases for the earlier characters, e.g. ä -> ae, are missing from the post) ...
                case '\u00EE': // î
                case '\u00EF': // ï
                    output.append("i");
                    break;
                case '\u00F0': // ð
                    output.append("d");
                    break;
                case '\u00F1': // ñ
                    output.append("n");
                    break;
                case '\u00F2': // ò
                case '\u00F3': // ó
                case '\u00F4': // ô
                case '\u00F5': // õ
                case '\u00F8': // ø
                    output.append("o");
                    break;
                case '\u00F6': // ö
                case '\u0153': // œ
                    output.append("oe");
                    break;
                case '\u00DF': // ß
                    output.append("ss");
                    break;
                case '\u00FE': // þ
                    output.append("th");
                    break;
                case '\u00F9': // ù
                case '\u00FA': // ú
                case '\u00FB': // û
                    output.append("u");
                    break;
                case '\u00FC': // ü
                    output.append("ue");
                    break;
                case '\u00FD': // ý
                case '\u00FF': // ÿ
                    output.append("y");
                    break;
                default:
                    output.append(input.charAt(i));
                    break;
            }
        }
        return output.toString();
    }
}

Regards
Uwe Goetzke
Head of Product Development
Healy Hudson GmbH Procurement Retail Solutions

-Original Message-
From: Sascha Fahl [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 13:07
To: java-user@lucene.apache.org
Subject: Transforming German umlauts like ö, ä, ü, ß into oe, ae, ue, ss

Hi,

what is the best way to transform the German umlauts ö, ä, ü, ß into oe, ae, ue, ss during analysis?

Thanks,
Sascha Fahl
Softwareentwicklung

evenity GmbH
Zu den Mühlen 19
D-35390 Gießen
Mail: [EMAIL PROTECTED]
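If writing a custom TokenFilter is overkill, one hedged alternative is to run such a mapping over the raw text before it reaches the analyzer, and over query strings as well, so both sides see the same folded form (foldUmlauts is the reconstructed method above; the field name and the Lucene 2.3-era Field API are assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// Fold before analysis; apply the same foldUmlauts() to query input so
// "Gießen" and "Giessen" produce identical tokens.
doc.add(new Field("name", foldUmlauts(rawText), Field.Store.YES, Field.Index.TOKENIZED));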
RE: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
Hi Jay,

Sorry for the confusion. I wrote NgramStemFilter in an early stage of the project; it is essentially the same as Otis's NGramTokenFilter, with the addition that I add begin and end token markers (e.g. "word" becomes "_word_", which yields the bigrams _w wo or rd d_). As I modified a lot of our Lucene code, developed since Lucene 1.2, to move to a 2.x version, I did not notice the existence of NGramTokenFilter. Stemming is not useful for our problem domain (product catalogs) anyway.

We chained a WhitespaceTokenizer with a modified version of ISOLatin1AccentFilter to normalize some character-based language aspects (e.g. ß -> ss, ö -> oe), then lowercase the tokens before taking the bigrams. The real advantage for us is the TolerantPhraseQuery (see my other post "RE: Implement a relaxed PhraseQuery?"), which gives us a first step towards less language-dependent searching.

Regards
Uwe

-Original Message-
From: yu [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 26, 2008 05:26
To: java-user@lucene.apache.org
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Sorry for my ignorance, I am looking for NgramStemFilter specifically. Are you suggesting that it's the same as NGramTokenFilter? Does it have stemming in it?

Thanks again.
Jay

Otis Gospodnetic wrote:

Sorry, I wrote this stuff, but forgot the naming. Look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/org/apache/lucene/analysis/ngram/package-summary.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: yu [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Wednesday, March 26, 2008 12:04:33 AM
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Hi Otis,

I checked that contrib before and could not find NgramStemFilter. Am I missing another contrib? Thanks for the link!
Jay

Otis Gospodnetic wrote:

Hi Jay,

Sorry, lapsus calami, that would be Lucene *contrib*. Have a look:
http://lucene.apache.org/java/2_3_1/api/contrib-analyzers/index.html

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Jay [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 6:15:54 PM
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Sorry, I could not find the filter in the 2.3 API class list (core + contrib + test). I am not aware of a Lucene config file either. Could you please tell me where it is in the 2.3 release?
Thanks!
Jay

Otis Gospodnetic wrote:

Jay,

Have a look at Lucene config, it's all there, including tests. This filter will take a token such as "foobar" and chop it up into n-grams (e.g. foobar -> fo oo ob ba ar would be a set of bigrams). You can specify the n-gram size, and even min and max n-gram sizes.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Jay [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, March 25, 2008 1:32:24 PM
Subject: Re: RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Hi Uwe,

I am curious what NGramStemFilter is. Is it a combination of Porter stemming and word n-gram identification?
Thanks!
Jay

Uwe Goetzke wrote:

Hi Ivan,
No, we do not use StandardAnalyser or StandardTokenizer.
Most data is processed by:

fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
result = new org.apache.lucene.analysis.LowerCaseFilter(result);
result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer

We use our own query parser. The bigrams are searched with a tolerant phrase query, scoring in a doc the greatest bigram clusters covering the phrase tokens.

Best Regards
Uwe
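For comparison, roughly the same chain built only from stock classes: the contrib NGramTokenFilter instead of the custom NgramStemFilter (so no begin/end markers), and the unmodified ISOLatin1AccentFilter (so ö -> o rather than oe). A sketch against the Lucene 2.3-era API:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ISOLatin1AccentFilter;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public final class BigramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new ISOLatin1AccentFilter(result); // accent folding only
        result = new LowerCaseFilter(result);
        return new NGramTokenFilter(result, 2, 2);  // bigrams only
    }
}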
RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
Jake,

With the bigram-based index we gave up the struggle to find a well-working language-based index. We had implemented soundex (or different sound-alikes) and hyphenation, but failed to deliver a search result that was explainable to users (why is this ranked higher, and so on...). One reason may be that product descriptions contain a lot of abbreviations.

The index size grew about 30%. Search performance seems a bit slower, but I have no concrete figures. The evaluation for one document is a bit more complex than a phrase query, one reason of course being that more terms are evaluated; nevertheless it is quite good. Search relevance improved tremendously: missing characters, switched letters and partial word fragments are no real problem any more (depending, of course, on the length of the search word). The search term "weekday" also finds "day of the week", and "disabigaute" finds "disambiguate". The algorithms I developed might not fit other domains, but for multi-language product catalogs they work quite well for us. So far...

Regards
Uwe

-Original Message-
From: Jake Mannix [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 25, 2008 17:13
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Uwe,

This is a little off thread-topic, but I was wondering how your search relevance and search performance have fared with this bigram-based index. Is it significantly better than before you used the NGramAnalyzer?

-jake

On 3/24/08, Uwe Goetzke [EMAIL PROTECTED] wrote:

Hi Ivan,
No, we do not use StandardAnalyser or StandardTokenizer.
RE: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
Hi Ivan,

No, we do not use StandardAnalyser or StandardTokenizer. Most data is processed by:

fTextTokenStream = result = new org.apache.lucene.analysis.WhitespaceTokenizer(reader);
result = new ISOLatin2AccentFilter(result); // ISOLatin1AccentFilter modified so that ö -> oe
result = new org.apache.lucene.analysis.LowerCaseFilter(result);
result = new org.apache.lucene.analysis.NGramStemFilter(result, 2); // just a bigram tokenizer

We use our own query parser. The bigrams are searched with a tolerant phrase query, scoring in a doc the greatest bigram clusters covering the phrase tokens.

Best Regards
Uwe

-Original Message-
From: Ivan Vasilev [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 21, 2008 16:25
To: java-user@lucene.apache.org
Subject: Re: feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1

Hi Uwe,

Could you tell what Analyzer you use when you measured such a big indexing speedup? If you use StandardAnalyzer (which uses StandardTokenizer), the reason may be in it. See the second-to-last report in the thread "Indexing Speed: 2.3 vs 2.2 (real world numbers)". According to the reporter, Jake Mannix, this is because StandardTokenizer now uses StandardTokenizerImpl, which is generated by JFlex instead of JavaCC.

I am asking because I noticed a great speedup in adding documents to the index in our system; we have time control on this in debug mode. NOW THEY ARE ADDED 5 TIMES FASTER!!! But at the same time, the total indexing process in our case improved only about 8%. As our system is very big and complex, I am wondering whether the whole indexing process really is reduced so remarkably and our system causes this slowdown, or whether Lucene does some optimizations on the index, merges or something else, and that is the reason the total indexing process is not so much faster.

Best Regards,
Ivan

Uwe Goetzke wrote:

This week I switched the Lucene library version on one customer system. The indexing time went down from 46m32s to 16m20s for the complete task including optimisation. Great job! We index product catalogs from several suppliers; in this case around 56,000 product groups and 360,000 products including descriptions were indexed.

Regards
Uwe
RE: Implement a relaxed PhraseQuery?
Hi Cuong,

I have written a TolerantPhraseScorer, starting from the code of PhraseScorer, but I think I have modified it too much for it to be generally useful. We use it with bigram clusters, so it does not need the slop factor for scoring but has a tolerance factor (depending on the length of the phrase). Here are the most relevant code fragments to start with. The idea is to keep the queue ordered (by calling firstToLast2 and moveLast). I have not yet checked the code for optimisations; if you find one, I would be glad to hear about it... ;-)

protected TolerantPhrasePositions first, last, reallast;
// "last" points to the last tpp for the doc, varying from tolerance to phrase size ("reallast")
protected int tolerance;

/**
 * Similar to PhraseScorer, but with a tolerance factor.
 *
 * @see PhraseScorer
 */
TolerantPhraseScorer(Weight weight, TermPositions[] tps, int[] positions,
                     Similarity similarity, byte[] norms, int tolerance) {
    super(similarity);
    this.norms = norms;
    this.weight = weight;
    this.value = weight.getValue();
    this.tolerance = tolerance;
    termsize = 0;
    // convert tps to a list
    for (int i = 0; i < tps.length; i++) {
        if (tps[i] != null) {
            TolerantPhrasePositions pp = new TolerantPhrasePositions(tps[i], positions[i]);
            termsize++;
            if (reallast != null) { // add next to end of list
                reallast.next = pp;
                pp.previous = reallast;
            } else {
                first = pp;
            }
            reallast = pp;
            if ((termsize >= tolerance) && (last == null))
                last = pp;
        }
    }
    pq = new TolerantPhraseQueue(termsize); // construct empty pq
}

public boolean next() throws IOException {
    if (firstTime) {
        init();
        firstTime = false;
    } else if (more) {
        int doc = last.doc;
        while (doc == last.doc) {
            more = last.next(); // trigger further scanning
            moveLast();
        }
    }
    return doNext();
}

// next without initial increment
private boolean doNext() throws IOException {
    while (more) {
        while (more && first.doc < last.doc) { // find doc w/ all the terms
            more = first.skipTo(last.doc);     // skip first up to last
            firstToLast2();                    // and move it to the end
        }
        if (more) {
            // found a doc with all of the terms
            freq = phraseFreq(); // check for phrase
            if (freq == 0.0f) {  // no match
                int doc = last.doc;
                while (doc == last.doc) {
                    more = last.next(); // trigger further scanning
                    moveLast();
                }
            } else {
                return true; // found a match
            }
        }
    }
    return false; // no more matches
}

private void firstToLast2() {
    TolerantPhrasePositions newfirst = first.next;
    TolerantPhrasePositions test = last;
    TolerantPhrasePositions insertp = test;
    while ((test != null) && (first.doc >= test.doc)) {
        insertp = test;
        test = test.next;
    }
    if (insertp == null) { // last element; should not happen
        System.out.println("firstToLast2 - insertp == null");
    } else {
        first.previous = insertp; // link in
        first.next = insertp.next;
        if (first.next != null)
            first.next.previous = first;
        insertp.next = first;
feedback: Indexing speed improvement Lucene 2.2 -> 2.3.1
This week I switched the Lucene library version on one customer system. The indexing time went down from 46m32s to 16m20s for the complete task, including optimisation. Great job! We index product catalogs from several suppliers; in this case around 56,000 product groups and 360,000 products including descriptions were indexed.

Regards
Uwe
RE: Does Lucene support partition-by-keyword indexing?
Hi,

I do not yet fully understand what you want to achieve. You want to spread the index, split by keywords, to reduce the time to distribute indexes? And you want to distribute queries to the nodes based on the same split mechanism?

You have several nodes with different kinds of documents. You want to build one index for all nodes and split and distribute the index based on a set of keywords specific to each node, so that each query involves communicating with a constant number of nodes. Do documents at the nodes contain only such keywords? I doubt it. So you need a reference to where each indexed doc can be found anyway, and have to retrieve it from its node for display.

You could index at each node, merge all indexes from all nodes, and distribute the combined index. On what criteria can you split the queries? If you have a combined index, each node can distribute queries to other nodes based on statistical data found in the term distribution. You need to merge the results anyway. I doubt that this kind of overhead is worth the trouble, because you introduce a lot of single points of failure, and the scalability seems limited because you would need to recalibrate the whole network when adding a new node.

Why don't you distribute the complete index? (We do this after zipping it locally and unzipping it later on the receiver node; the size is less than one third for transferring.) Each node should have some activity indicator; distribute the complete query to the node with the smallest activity. That way you get redundancy and do not need to split queries and merge results. OK, one evil query can bring a node down, but the network is still working.

Do you have any results using Lucene on a single node for your approach? How many queries and how many documents do you expect?

Regards
Uwe

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of ??
Sent: Sunday, March 2, 2008 03:05
To: java-user@lucene.apache.org
Subject: Re: Does Lucene support partition-by-keyword indexing?

Hi,

I agree with your point that it is easier to partition an index by document. But the partition-by-keyword approach has much greater scalability than the partition-by-document approach: each query involves communicating with a constant number of nodes, while partition-by-doc requires spreading the query across all or many of the nodes. I am actually doing some small research on this.

By the way, the documents to be indexed are not necessarily web pages; they are mostly files stored on each node's file system. Node failures are also handled by replicas: the index for each term will be replicated on multiple nodes whose nodeIDs are near each other. This mechanism is handled by the underlying DHT system.

So, any idea how I can partition an index by keyword in Lucene? Thanks.

On Sun, Mar 2, 2008 at 5:50 AM, Mathieu Lecarme [EMAIL PROTECTED] wrote:

The easiest way is to split the index by Document. In Lucene, an index contains Documents and an inverse index of Terms. If you want to put Terms in a different place, Documents will be duplicated on each index, each with only a part of its Terms. How will you manage node failure in your network? There were some attempts to build a big p2p search engine to compete with Google, but it will be easier to split by Document. If you have too many computers and want to see them working together, why not use Nutch with Hadoop?

M.

On 1 March 2008 at 19:16, Yin Qiu wrote:

Hi,

I'm planning to implement a search infrastructure on a P2P overlay.
To achieve this, I want to first distribute the indices to various nodes connected by this overlay. My approach is to partition the indices by keyword, that is, one node takes care of certain keywords (or terms). When a simple TermQuery is encountered, we just find the node associated with that term (via a distributed hash table) and get the result. And suppose a BooleanQuery is issued: we contact all the nodes involved in this query and finally merge the results. So my question is: does Lucene support partitioning the indices by keyword?

Thanks in advance.

--
Look before you leap
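A sketch of the "index at each node, merge, then distribute the combined index" alternative suggested above, against the Lucene 2.3-era API (the directory variables are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

IndexWriter writer = new IndexWriter(combinedDir, new StandardAnalyzer(), true); // create a fresh combined index
writer.addIndexes(new Directory[] { node1Dir, node2Dir }); // merge the per-node indexes
writer.optimize(); // one segment; zips well for shipping to the nodes
writer.close();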
Re: Chinese Segmentation with Phrase Query
Hi Cedric,

Although I have no idea of the Chinese language, I went a different route to overcome language-specific problems: instead of a language-specific segmentation, we use statistical segmentation with bigrams. E.g., given your sentence XYZABCDEF, suppose the segmentation is XY YZ ZA AB BC CD DE EF. A SpanNearQuery of (XY, BC, DE, EF) with a distance of 10 should then match this document. I am not sure this works in your case, because we index product information and descriptions, which are not language-friendly anyway because of the abbreviations.

Regards
Uwe Goetzke

-Original Message-
From: Cedric Ho [mailto:[EMAIL PROTECTED]]
Sent: Saturday, November 10, 2007 02:28
To: java-user@lucene.apache.org
Subject: Re: Chinese Segmentation with Phrase Query

On Nov 10, 2007 2:08 AM, Steven A Rowe [EMAIL PROTECTED] wrote:

Hi Cedric,

On 11/08/2007, Cedric Ho wrote:

For a sentence containing characters ABC, it may be segmented into AB, C or A, BC. [snip] In these cases we would like to index both segmentations:

AB offset (0,1) position 0    A  offset (0,0) position 0
C  offset (2,2) position 1    BC offset (1,2) position 1

Now the problem is, when someone searches using a PhraseQuery (AC), it will find this line ABC, because it matches A (position 0) and C (position 1). Is there any way to search for an exact match using the offset information instead of the position information?

Since you are writing the tokenizer (the Lucene term for the module that performs the segmentation), you yourself can substitute the beginning offset for the position. But I think that without the end offset, it won't get you what you want. For example, if your above example were indexed with beginning offsets as positions, a phrase query for "AB C" would fail to match -- even though it should match -- because the segments' beginning offsets (0 and 2) are not contiguous.

The new Payloads feature could provide the basis for storing the beginning and ending offsets required to determine contiguity when matching phrases, but you would have to write matching and scoring for this representation, and that may not be the quickest route available to you.

Solution #1: Create multiple fields, one for each full alternative segmentation, and then query against all of them.

Solution #2: Store the alternative segmentations in the same field, but instead of interleaving the segments' positions, as in your example, make the position ranges of the alternatives non-contiguous. Recasting your example:

Alternative #1    Alternative #2     Alternative #3
--------------    ---------------    ---------------
AB position 0     A  position 100    A  position 200
C  position 1     BC position 101    B  position 201
                                     C  position 202

There is a problem with both of the above-described solutions: in my limited experience with Chinese segmentation, substantially less than half the text has alternative segmentations. As a result, the segments on which all of the alternatives agree (call them "uncontested" segments) will have higher term frequencies than those segments which differ among the alternatives ("contested" segments). This means that document scores will be influenced by the variable density of the contested segments they contain. However, if you were to use my above-described Solution #1 along with a DisjunctionMaxQuery[1] as a wrapper around one query per alternative-segmentation field, the term frequency problem would no longer be an issue.
From the API doc for DisjunctionMaxQuery:

"A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries. This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give)."

Unlike the use case mentioned above, where each field is boosted differently, you probably don't have any information about the relative probability of the alternative segmentations, so you'll want to use the same boost for each sub-query.

Steve

[1] http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html

--
Steve Rowe
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Hi Steve,

We have actually thought about Solution #1, and in our case sorting by score is not a very important factor either. However, this would double the index size. A full index of our documents now would