RE: spanish stemmer
One more question to the group. From what I have gathered, my choices for indexing and querying Spanish content are:

1. StandardAnalyzer (I read that this analyzer can be used for "European" languages)
2. SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS) <-- custom stop words from Ernesto's class below

Can I assume that choice 2 would be the better one for Spanish content?

thanks, chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 3:31 PM
To: Lucene Users List
Subject: Re: spanish stemmer

Because SnowballAnalyzer and SpanishStemmer don't have a default stop-word set. The SnowballAnalyzer constructor:

    /** Builds the named analyzer with no stop words. */
    public SnowballAnalyzer(String name) {
        this.name = name;
    }

Note the comment.

Bye, Ernesto.

- Original Message -
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 4:57 PM
Subject: RE: spanish stemmer

Excellent, Ernesto. Was there a reason you used your own stop-word list rather than just the default constructor SnowballAnalyzer("Spanish")?

thanks, chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer

Yes, it's quite easy. You need to write a wrapper for the Spanish Snowball initialization:

    analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

Below is the complete code.

Bye, Ernesto.
--
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

    public class SpanishAnalyzer extends Analyzer {
        private SnowballAnalyzer analyzer;

        private String SPANISH_STOP_WORDS[] = {
            "un", "una", "unas", "unos", "uno", "sobre", "todo", "también", "tras",
            "otro", "algún", "alguno", "alguna", "algunos", "algunas", "ser", "es",
            "soy", "eres", "somos", "sois", "estoy", "esta", "estamos", "estais",
            "estan", "en", "para", "atras", "porque", "por qué", "estado", "estaba",
            "ante", "antes", "siendo", "ambos", "pero", "por", "poder", "puede",
            "puedo", "podemos", "podeis", "pueden", "fui", "fue", "fuimos", "fueron",
            "hacer", "hago", "hace", "hacemos", "haceis", "hacen", "cada", "fin",
            "incluso", "primero", "desde", "conseguir", "consigo", "consigue",
            "consigues", "conseguimos", "consiguen", "ir", "voy", "va", "vamos",
            "vais", "van", "vaya", "bueno", "ha", "tener", "tengo", "tiene",
            "tenemos", "teneis", "tienen", "el", "la", "lo", "las", "los", "su",
            "aqui", "mio", "tuyo", "ellos", "ellas", "nos", "nosotros", "vosotros",
            "vosotras", "si", "dentro", "solo", "solamente", "saber", "sabes",
            "sabe", "sabemos", "sabeis", "saben", "ultimo", "largo", "bastante",
            "haces", "muchos", "aquellos", "aquellas", "sus", "entonces", "tiempo",
            "verdad", "verdadero", "verdadera", "cierto", "ciertos", "cierta",
            "ciertas", "intentar", "intento", "intenta", "intentas", "intentamos",
            "intentais", "intentan", "dos", "bajo", "arriba", "encima", "usar",
            "uso", "usas", "usa", "usamos", "usais", "usan", "emplear", "empleo",
            "empleas", "emplean", "empleamos", "empleais", "valor", "muy", "era",
            "eras", "eramos", "eran", "modo", "bien", "cual", "cuando", "donde",
            "mientras", "quien", "con", "entre", "sin", "trabajo", "trabajar",
            "trabajas", "trabaja", "trabajamos", "trabajais", "trabajan", "podria",
            "podrias", "podriamos", "podrian", "podriais", "yo", "aquel", "mi",
            "de", "a", "e", "i", "o", "u"};

        public SpanishAnalyzer() {
            analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);
        }

        public SpanishAnalyzer(String stopWords[]) {
            analyzer = new SnowballAnalyzer("Spanish", stopWords);
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return analyzer.tokenStream(fieldName, reader);
        }
    }

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
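For readers following the thread: the practical effect of passing that stop-word array can be illustrated without any Lucene classes at all; the analyzer simply drops tokens found in the set before they reach the stemmer. A minimal, self-contained sketch (the class name and the tiny word subset are mine, not from the thread):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopFilterSketch {
    // A tiny subset of the SPANISH_STOP_WORDS list posted above.
    static final Set<String> STOP = new HashSet<>(Arrays.asList(
            "un", "una", "el", "la", "de", "en", "para"));

    // Drop stop words from a lowercased token stream, as a stop filter
    // would before the tokens reach the stemmer.
    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("el", "titulo", "de", "una", "pagina");
        System.out.println(filter(tokens)); // [titulo, pagina]
    }
}
```

This is why the choice of constructor matters: with the no-argument SnowballAnalyzer("Spanish"), the set above would be empty and every token would pass through.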
RE: spanish stemmer
Excellent, Ernesto. Was there a reason you used your own stop-word list rather than just the default constructor SnowballAnalyzer("Spanish")?

thanks, chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:03 PM
To: Lucene Users List
Subject: Re: spanish stemmer

Yes, it's quite easy. You need to write a wrapper for the Spanish Snowball initialization:

    analyzer = new SnowballAnalyzer("Spanish", SPANISH_STOP_WORDS);

[The complete SpanishAnalyzer code is the same as in the message above.]

Bye, Ernesto.

- Original Message -
From: "Chad Small" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, August 23, 2004 3:49 PM
Subject: RE: spanish stemmer

Do you mind sharing how you implemented your SpanishAnalyzer using Snowball? Sorry I can't help with your question. I am trying to implement Snowball Spanish or a Spanish analyzer in Lucene.

thanks, chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 8:30 AM
To: Lucene Users List
Subject: spanish stemmer

Hello,

I use the Snowball jar to implement my SpanishAnalyzer. I found that words ending in 'bol' are not stripped. For example: in Spanish, for basketball you can say basquet or basquetbol, but to SpanishStemmer these are different words. The same goes for voley and voleybol. Not so for futbol (football): we don't say fut for futbol, and 'fut' doesn't exist in Spanish.

Do you think I am correct? Can this be changed?

Ernesto.

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.737 / Virus Database: 491 - Release Date: 11/08/2004
RE: spanish stemmer
Do you mind sharing how you implemented your SpanishAnalyzer using Snowball? Sorry I can't help with your question. I am trying to implement Snowball Spanish or a Spanish analyzer in Lucene.

thanks, chad.

-Original Message-
From: Ernesto De Santis [mailto:[EMAIL PROTECTED]
Sent: Monday, August 23, 2004 8:30 AM
To: Lucene Users List
Subject: spanish stemmer

Hello,

I use the Snowball jar to implement my SpanishAnalyzer. I found that words ending in 'bol' are not stripped. For example: in Spanish, for basketball you can say basquet or basquetbol, but to SpanishStemmer these are different words. The same goes for voley and voleybol. Not so for futbol (football): we don't say fut for futbol, and 'fut' doesn't exist in Spanish.

Do you think I am correct? Can this be changed?

Ernesto.
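One possible workaround for the basquet/basquetbol issue, sketched here as an assumption rather than anything Snowball itself provides, is to conflate known variants before tokens reach the stemmer. The variant pairs below are the ones mentioned in the message; the class and its name are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class VariantNormalizer {
    // Hypothetical variant map built from the pairs discussed in the
    // thread (basquet/basquetbol, voley/voleybol); not part of Snowball.
    static final Map<String, String> VARIANTS = new HashMap<>();
    static {
        VARIANTS.put("basquetbol", "basquet");
        VARIANTS.put("voleybol", "voley");
    }

    // Map a token to its canonical variant before stemming; unknown
    // tokens pass through unchanged, so futbol stays futbol.
    static String normalize(String token) {
        return VARIANTS.getOrDefault(token, token);
    }

    public static void main(String[] args) {
        System.out.println(normalize("basquetbol")); // basquet
        System.out.println(normalize("futbol"));     // futbol
    }
}
```

A step like this would typically run inside a custom token filter, ahead of the stemmer, so both surface forms index to the same term.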
RE: Lucene with English and Spanish Best Practice?
Thanks for the info, Grant.

<< As for indexes, do you anticipate adding more fields later in Spanish? Is the content just a translation of the English, or do you have separate content in Spanish? Are your users querying in only one language (cross-lingual) or are the Spanish speakers only querying against Spanish content? >>

Our fields are pretty much going to be one-for-one between English and Spanish (a translation of current content from English to Spanish). Something like title_en and title_sp, body_en and body_sp, keywords_en and keywords_sp. Our users will be querying cross-lingual. So I see your point: it looks like it would be easier if we added the Spanish fields to our current indexes; then we wouldn't have to filter out duplicate results between English and Spanish indexes.

<< I am doing Arabic and English (and have done Spanish, French, and Japanese in the past), although our cross-lingual system supports any languages that you have resources for. >>

Did you use Snowball for the Spanish? Or is there a Lucene Spanish analyzer available (I couldn't find one)? Or do people just use something like a plain old StandardAnalyzer to index and query Spanish content? I'm a little confused about the Snowball project: is it a multi-language stemming analyzer for Lucene? We just use plain old Standard and Whitespace analyzers now for our English content. Can we use those same analyzers for Spanish content, or would it be better to use the Snowball project?

thanks, chad.

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 2:16 PM
To: [EMAIL PROTECTED]
Subject: Re: Lucene with English and Spanish Best Practice?

I think the Snowball stuff works well, although I have only used the English Porter stemmer implementation.

As for indexes, do you anticipate adding more fields later in Spanish? Is the content just a translation of the English, or do you have separate content in Spanish? Are your users querying in only one language (cross-lingual) or are the Spanish speakers only querying against Spanish content?

I am doing Arabic and English (and have done Spanish, French, and Japanese in the past), although our cross-lingual system supports any languages that you have resources for. We lean towards separate indexes, but mostly b/c they are based on separate content. The key is that you have to be able to match up the analysis of the query with the analysis of the index. Having a mixed index may make this more difficult. If you have a mixed index, would you filter out Spanish results that had hits from an English query? For instance, what if the query is a term that is common to both languages (banana, mosquito, etc.)? Or are you requiring the user to specify which fields they are searching against? I guess we really need to know more about how your users are going to be interacting.

-Grant

>>> [EMAIL PROTECTED] 8/20/2004 5:27:40 PM >>>

Hello,

I'm interested in feedback from anyone who has worked through implementing internationalization (I18N) search with Lucene or has ideas for this requirement. Currently we're using Lucene with straight English and are looking to add Spanish to the mix (with maybe more languages to follow). This is our current IndexWriter setup utilizing the PerFieldAnalyzerWrapper:

    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are English and Spanish analyzers and IndexWriters? Something like this:

    PerFieldAnalyzerWrapper analyzerEnglish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
    analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish, create);

    PerFieldAnalyzerWrapper analyzerSpanish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
    analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish, create);

Are multiple indexes, or mirrors of each index, usually created for every language? We currently have 4 indexes that are all English. Would we then create 4 more that are Spanish? At search time we would determine the language and which set of indexes to search against, English or Spanish. Or another approach could be to add a Spanish field to the existing 4 indexes, since most of the indexes have only one field that will be translated from English to Spanish.

thanks a bunch, chad.
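On the multiple-index question, the search-time routing described above (determine the language, then pick the matching index set) boils down to a small lookup. A sketch in plain Java, with hypothetical directory names and a fallback to English for unsupported languages:

```java
import java.util.HashMap;
import java.util.Map;

public class IndexRouter {
    // Hypothetical mapping from language code to index directory; the
    // directory names are illustrative, not from the original post.
    static final Map<String, String> INDEX_DIRS = new HashMap<>();
    static {
        INDEX_DIRS.put("en", "/indexes/content_en");
        INDEX_DIRS.put("es", "/indexes/content_es");
    }

    // Pick the index directory for the user's language, falling back
    // to English when the language is not supported.
    static String indexFor(String lang) {
        return INDEX_DIRS.getOrDefault(lang, INDEX_DIRS.get("en"));
    }

    public static void main(String[] args) {
        System.out.println(indexFor("es")); // /indexes/content_es
        System.out.println(indexFor("fr")); // /indexes/content_en
    }
}
```

The same lookup would also select the matching analyzer, which addresses Grant's point that query-time analysis must match index-time analysis.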
Lucene with English and Spanish Best Practice?
Hello,

I'm interested in feedback from anyone who has worked through implementing internationalization (I18N) search with Lucene or has ideas for this requirement. Currently we're using Lucene with straight English and are looking to add Spanish to the mix (with maybe more languages to follow). This is our current IndexWriter setup utilizing the PerFieldAnalyzerWrapper:

    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzer.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzer.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

Would people suggest we switch this over to Snowball so there are English and Spanish analyzers and IndexWriters? Something like this:

    PerFieldAnalyzerWrapper analyzerEnglish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
    analyzerEnglish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzerEnglish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writerEnglish = new IndexWriter(indexDir, analyzerEnglish, create);

    PerFieldAnalyzerWrapper analyzerSpanish = new PerFieldAnalyzerWrapper(new SnowballAnalyzer("Spanish"));
    analyzerSpanish.addAnalyzer(FIELD_TITLE_STARTS_WITH, new WhitespaceAnalyzer());
    analyzerSpanish.addAnalyzer(FIELD_CATEGORY, new WhitespaceAnalyzer());
    IndexWriter writerSpanish = new IndexWriter(indexDir, analyzerSpanish, create);

Are multiple indexes, or mirrors of each index, usually created for every language? We currently have 4 indexes that are all English. Would we then create 4 more that are Spanish? At search time we would determine the language and which set of indexes to search against, English or Spanish. Or another approach could be to add a Spanish field to the existing 4 indexes, since most of the indexes have only one field that will be translated from English to Spanish.

thanks a bunch, chad.
"starts with" query functionality
We have a requirement to return documents whose "title" field starts with a certain letter. Is there a way to do something like this? We're using the StandardAnalyzer.

Example title fields:

    This is the title of a document.
    And this is a title of a different document.

The query +(t*) doesn't fulfill the requirement: we just want to return the 1st document, which starts with "This", and not the 2nd document, which has "this" as its 2nd word.

Or is it just a matter of creating a field in the index called "title_starts_with" that would look like this for the example:

    T
    A

Now the query +(t) would only get a hit on the 1st document. Or is there a better way?

thanks, chad.
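The "title_starts_with" idea above boils down to deriving a one-letter key at index time and storing it in its own field. A minimal sketch of that derivation (the helper class and method names are mine, not from the post):

```java
public class TitleStartsWith {
    // Derive the single-letter "title_starts_with" key suggested in
    // the post: the uppercased first character of the trimmed title.
    static String startsWithKey(String title) {
        String trimmed = title.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.substring(0, 1).toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(startsWithKey("This is the title of a document."));             // T
        System.out.println(startsWithKey("And this is a title of a different document.")); // A
    }
}
```

Indexing that key with a non-tokenizing analyzer (e.g. a keyword or whitespace analyzer) keeps the letter intact, so a query on title_starts_with:T matches only titles that begin with "T".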
RE: Searching in "all"
See MultiFieldQueryParser, like this:

    String[] fields = getFieldsArray();
    Query multiFieldQuery = MultiFieldQueryParser.parse(this.queryString, fields, new StandardAnalyzer());
    System.out.println("multiFieldQuery: " + multiFieldQuery.toString());

-Original Message-
From: Tate Avery [mailto:[EMAIL PROTECTED]
Sent: Thu 4/1/2004 9:30 AM
To: [EMAIL PROTECTED]
Subject: Searching in "all"

Hello,

If I have, for example, 3 fields in my document (title, body, notes), is there some easy way to search 'all'? Below are the only 2 ideas I currently have/use:

1) If I want to search for 'x' in all, I do something like:

    title:x OR body:x OR notes:x

... but this does not really work if you are searching for (a AND b) and a is in the title and b is in the notes, etc., leading to an explosion of boolean combinations, it seems.

2) Actually index an 'all' field for my document by just concatenating the content from the title, body, and notes fields. ... but this doubles my index size. :(

So, is there a better way out there?

Thanks,
Tate
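For completeness, Tate's option 2 (the concatenated 'all' field) is simple at index time. A sketch of the field-building step, with illustrative names; in practice the result would be added to the document as one more indexed field:

```java
public class AllFieldBuilder {
    // Build the catch-all "all" field by concatenating the individual
    // fields, as option 2 in the message describes. The field names
    // (title, body, notes) follow the example in the post.
    static String buildAllField(String title, String body, String notes) {
        return String.join(" ", title, body, notes);
    }

    public static void main(String[] args) {
        System.out.println(buildAllField("a title", "some body", "a note"));
        // a title some body a note
    }
}
```

The index-size cost Tate mentions is the trade-off: every term is stored twice, once in its own field and once in "all".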
Lucene 1.4 - lobby for final release
thanks, Erik. Ok, this is my official lobbying effort for the release of 1.4 to final status. Anyone else need/want a 1.4 release? Does anyone have any information on 1.4 release plans?

thanks, chad.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Fri 3/26/2004 1:25 PM
To: Lucene Users List
Subject: Re: too many files open error

On Mar 26, 2004, at 1:33 PM, Chad Small wrote:
> Is this :) serious?

This is open-source. I'm only as serious as it would take for someone to push it through. I don't know what the timeline is, although lots of new features are available.

> Because we have a need/interest in the new field sorting capabilities
> and QueryParser keyword handling of dashes ("-") that would be in 1.4,
> I believe. It's so much easier to explain that we'll use a "final"
> release of Lucene instead of a "dev build" Lucene.

Why explain it?! Just show great results and let that be the explanation :)

> If so, what would an expected release date be?

*shrug* - feel free to lobby for it. I don't know what else is planned before a release.

Erik
RE: too many files open error
Is this :) serious? Because we have a need/interest in the new field sorting capabilities and QueryParser keyword handling of dashes ("-") that would be in 1.4, I believe. It's so much easier to explain that we'll use a "final" release of Lucene instead of a "dev build" Lucene. If so, what would an expected release date be?

thanks, chad.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Fri 3/26/2004 12:23 PM
To: Lucene Users List
Subject: Re: too many files open error

The compound format was added to Lucene 1.3 and was not part of 1.2. I'd definitely recommend upgrading. Heck, Lucene 1.4 could be released any day now :)

Erik

On Mar 26, 2004, at 12:25 PM, Charlie Smith wrote:
> I'm using lucene-1.2.jar as part of the build for this docSearcher application.
> Would these recommendations work for this, or should I upgrade to lucene 1.3?
> In doing so, I'm not sure if a rewrite of the docSearcher will be necessary or not.
>
> Daniel Naber wrote on 3/26/04:
> Try IndexWriter.setUseCompoundFile(true) to limit the number of files.
>
> Erik Hatcher 3/26/2004 2:32:16 AM >>>
> If you are using Lucene 1.3, try using the index in "compound" format.
> You will have to rebuild (or convert) your index to this format. The
> handy utility Luke will convert an index easily.
>
> Erik
>
> On Mar 25, 2004, at 9:34 PM, Charlie Smith wrote:
>> I need a solution to the following error ASAP. Please help me with this.
>> I'm getting the following error returned from this call:
>>
>>     try {
>>         searcher = new IndexSearcher(
>>             IndexReader.open(indexName)  // create an IndexSearcher for our page
>>         );
>>     } catch (Exception e) {
>>         // any error that happens is probably due to a permission problem
>>         // or a non-existent or otherwise corrupt index
>>         %>
>>         ERROR opening the Index - contact sysadmin!
>>         While parsing query: <%=e.getMessage()%>
>>         <% error = true;  // don't do anything up to the footer
>>     }
>>
>> Output:
>>
>>     ERROR opening the Index - contact sysadmin!
>>     While parsing query:
>>     /opt/famhistdev/fhstage/jbin/.docSearcher/indexes/fhstage_update/_3ff.f6 (Too many open files)
>>
>> Charlie
>> 3/25/04
How to order search results by Field value?
Was there any conclusion to this message regarding "Ordering by a Field"?

http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=6762

I have a similar need and didn't see the resolution in that thread. Is there a current patch against 1.3-final that I could see? My other option, I guess, is just to code a comparator on a collection built from the Hits.

thanks, chad.
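The comparator fallback mentioned above is straightforward: copy the hits into a list and sort by the stored field value. A sketch with a stand-in Hit class (plain Java only; Lucene's Hits API is not used here, and the field name is illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class HitSorter {
    // A minimal stand-in for a search hit with one stored field value.
    static class Hit {
        final String title;
        Hit(String title) { this.title = title; }
    }

    // Sort hits by a stored field value, as the poster suggests doing
    // with a comparator over a collection built from Hits.
    static List<Hit> sortByTitle(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing(h -> h.title));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("zebra"));
        hits.add(new Hit("apple"));
        System.out.println(sortByTitle(hits).get(0).title); // apple
    }
}
```

The trade-off versus index-time sort support is that every matching document's field value must be loaded before sorting, which gets expensive for large result sets.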
RE: Query syntax on Keyword field question
Ahh, without the "bin" on javacc.home, 3.2 seems to work for me too.

-Original Message-
From: Chad Small
Sent: Wed 3/24/2004 8:34 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

I'm getting this with 3.2:

    javacc-check:
    BUILD FAILED
    file:D:/applications/lucene-1.3-final/build.xml:97:
    ##################################################
    JavaCC not found.
    JavaCC Home: /applications/javacc-3.2/bin
    JavaCC JAR: D:\applications\javacc-3.2\bin\bin\lib\javacc.jar
    Please download and install JavaCC from:
    <http://javacc.dev.java.net>
    Then, create a build.properties file either in your home directory, or
    within the Lucene directory, and set the javacc.home property to the path
    where JavaCC is installed. For example, if you installed JavaCC in
    /usr/local/java/javacc-3.2, then set the javacc.home property to:
    javacc.home=/usr/local/java/javacc-3.2
    If you get an error like the one below, then you have not installed
    things correctly. Please check all your paths and try again.
    java.lang.NoClassDefFoundError: org.javacc.parser.Main
    ##################################################

even though I put a build.properties file in my root Lucene directory with this in it:

    javacc.home=/applications/javacc-3.2/bin

hmm?

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wed 3/24/2004 8:29 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

JavaCC 3.2 works for me.

Otis

--- Chad Small <[EMAIL PROTECTED]> wrote:
> thanks. I was in the process of getting JavaCC 3.2 set up. I'll have
> to hunt for 2.x.
>
> chad.
>
> -Original Message-
> From: Morus Walter [mailto:[EMAIL PROTECTED]
> Sent: Wed 3/24/2004 8:00 AM
> To: Lucene Users List
> Subject: RE: Query syntax on Keyword field question
>
> Hi Chad,
>
> > But I assume this fix won't come out for some time. Is there a way I can get this fix sooner?
> > I'm up against a deadline and would very much like this functionality.
>
> Just get Lucene's sources, change the line and recompile.
> The difficult part is to get a copy of JavaCC 2 (3 won't do), but I think
> this can be found in the archives.
>
> > And to go one more step with the KeywordAnalyzer that I wrote, changing this method to skip the escape:
> >     protected boolean isTokenChar(char c)
> >     {
> >         if (c == '\\') {
> >             return false;
> >         } else {
> >             return true;
> >         }
> >     }
> > The test then returns with a space:
> >     healthecare.domain.lucenesearch.KeywordAnalyzer:
> >     [HW-NCI_TOPICS]
> >     query.ToString = +category:"HW -NCI_TOPICS" +space
> >     junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
> >     Expected:+category:HW\-NCI_TOPICS +space
> >     Actual  :+category:"HW -NCI_TOPICS" +space  <-- note space where escape was
>
> Sure. If \ isn't a token char, it ends the token.
> So you will have to look for a different way of implementing the analyzer.
RE: Query syntax on Keyword field question
I'm getting this with 3.2:

    javacc-check:
    BUILD FAILED
    file:D:/applications/lucene-1.3-final/build.xml:97:
    ##################################################
    JavaCC not found.
    JavaCC Home: /applications/javacc-3.2/bin
    JavaCC JAR: D:\applications\javacc-3.2\bin\bin\lib\javacc.jar
    Please download and install JavaCC from:
    <http://javacc.dev.java.net>
    Then, create a build.properties file either in your home directory, or
    within the Lucene directory, and set the javacc.home property to the path
    where JavaCC is installed. For example, if you installed JavaCC in
    /usr/local/java/javacc-3.2, then set the javacc.home property to:
    javacc.home=/usr/local/java/javacc-3.2
    If you get an error like the one below, then you have not installed
    things correctly. Please check all your paths and try again.
    java.lang.NoClassDefFoundError: org.javacc.parser.Main
    ##################################################

even though I put a build.properties file in my root Lucene directory with this in it:

    javacc.home=/applications/javacc-3.2/bin

hmm?

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wed 3/24/2004 8:29 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

JavaCC 3.2 works for me.

Otis

--- Chad Small <[EMAIL PROTECTED]> wrote:
> thanks. I was in the process of getting JavaCC 3.2 set up. I'll have
> to hunt for 2.x.
>
> chad.
>
> -Original Message-
> From: Morus Walter [mailto:[EMAIL PROTECTED]
> Sent: Wed 3/24/2004 8:00 AM
> To: Lucene Users List
> Subject: RE: Query syntax on Keyword field question
>
> Hi Chad,
>
> > But I assume this fix won't come out for some time. Is there a way I can get this fix sooner?
> > I'm up against a deadline and would very much like this functionality.
>
> Just get Lucene's sources, change the line and recompile.
> The difficult part is to get a copy of JavaCC 2 (3 won't do), but I think
> this can be found in the archives.
>
> > And to go one more step with the KeywordAnalyzer that I wrote, changing this method to skip the escape:
> >     protected boolean isTokenChar(char c)
> >     {
> >         if (c == '\\') {
> >             return false;
> >         } else {
> >             return true;
> >         }
> >     }
> > The test then returns with a space:
> >     healthecare.domain.lucenesearch.KeywordAnalyzer:
> >     [HW-NCI_TOPICS]
> >     query.ToString = +category:"HW -NCI_TOPICS" +space
> >     junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
> >     Expected:+category:HW\-NCI_TOPICS +space
> >     Actual  :+category:"HW -NCI_TOPICS" +space  <-- note space where escape was
>
> Sure. If \ isn't a token char, it ends the token.
> So you will have to look for a different way of implementing the
> analyzer. Shouldn't be that difficult since you have only one token.
>
> Maybe it should be the job of the query parser to remove the escape
> character (would make more sense to me at least), but that would be
> another change of the query parser...
>
> Morus
RE: Query syntax on Keyword field question
For others' reference, here is the URL for older JavaCC versions:

https://javacc.dev.java.net/servlets/ProjectDocumentList?folderID=212

-Original Message-
From: Chad Small
Sent: Wed 3/24/2004 8:07 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

thanks. I was in the process of getting JavaCC 3.2 set up. I'll have to hunt for 2.x.

chad.

-Original Message-
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Wed 3/24/2004 8:00 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

Hi Chad,

> But I assume this fix won't come out for some time. Is there a way I can get this fix sooner?
> I'm up against a deadline and would very much like this functionality.

Just get Lucene's sources, change the line and recompile. The difficult part is to get a copy of JavaCC 2 (3 won't do), but I think this can be found in the archives.

> And to go one more step with the KeywordAnalyzer that I wrote, changing this method to skip the escape:
>     protected boolean isTokenChar(char c)
>     {
>         if (c == '\\') {
>             return false;
>         } else {
>             return true;
>         }
>     }
> The test then returns with a space:
>     healthecare.domain.lucenesearch.KeywordAnalyzer:
>     [HW-NCI_TOPICS]
>     query.ToString = +category:"HW -NCI_TOPICS" +space
>     junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
>     Expected:+category:HW\-NCI_TOPICS +space
>     Actual  :+category:"HW -NCI_TOPICS" +space  <-- note space where escape was

Sure. If \ isn't a token char, it ends the token. So you will have to look for a different way of implementing the analyzer. Shouldn't be that difficult since you have only one token.

Maybe it should be the job of the query parser to remove the escape character (would make more sense to me at least), but that would be another change of the query parser...

Morus
RE: Query syntax on Keyword field question
thanks. I was in the process of getting javacc3.2 set up. I'll have to hunt for 2.x.

chad.

-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]]
Sent: Wed 3/24/2004 8:00 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

Hi Chad,

> But I assume this fix won't come out for some time. Is there a way I can
> get this fix sooner? I'm up against a deadline and would very much like
> this functionality.

Just get Lucene's sources, change the line, and recompile. The difficult part is getting a copy of JavaCC 2 (3 won't do), but I think it can be found in the archives.

> And to go one more step with the KeywordAnalyzer that I wrote, changing
> this method to skip the escape:
>
>     protected boolean isTokenChar(char c)
>     {
>         if (c == '\\')
>         {
>             return false;
>         }
>         else
>         {
>             return true;
>         }
>     }
>
> The test then returns with a space:
>
>     healthecare.domain.lucenesearch.KeywordAnalyzer:
>         [HW-NCI_TOPICS]
>     query.ToString = +category:"HW -NCI_TOPICS" +space
>     junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
>     Expected: +category:HW\-NCI_TOPICS +space
>     Actual:   +category:"HW -NCI_TOPICS" +space
RE: Query syntax on Keyword field question
Great info Morus. After making the "escape the dash" change to the QueryParser:

    Query query = QueryParser.parse("+category:HW\\-NCI_TOPICS AND SPACE",
                                    "description", analyzer);
    Hits hits = searcher.search(query);
    System.out.println("query.ToString = " + query.toString("description"));
    assertEquals("HW-NCI_TOPICS kept as-is",
                 "+category:HW\\-NCI_TOPICS +space",
                 query.toString("description"));  // note: this passes with the escape put in, so not "as-is"
    assertEquals("doc found!", 1, hits.length());

I'm still getting this output:

    domain.lucenesearch.KeywordAnalyzer:
        [HW-NCI_TOPICS]
    query.ToString = +category:HW\-NCI_TOPICS +space
    junit.framework.AssertionFailedError: doc found! expected:<1> but was:<0>

It looks like the bug, http://issues.apache.org/bugzilla/show_bug.cgi?id=27491, was fixed today:

--- Additional Comments From Otis Gospodnetic, 2004-03-24 10:10 ---
Although tft-monitor should not really result in a phrase query "tft monitor", I agree that this is better than converting it to tft AND NOT monitor (tft -monitor). Moreover, I have seen query syntax where '-' characters are used for phrase queries instead of or in addition to quotes, so one could use either morus-walter or "morus walter". I applied your change, as it doesn't look like it breaks anything, and I hope nobody relied on the ill behaviour where tft-monitor would result in an AND NOT query.
---

But I assume this fix won't come out for some time. Is there a way I can get this fix sooner? I'm up against a deadline and would very much like this functionality.
And to go one more step with the KeywordAnalyzer that I wrote, changing this method to skip the escape:

    protected boolean isTokenChar(char c)
    {
        if (c == '\\')
        {
            return false;
        }
        else
        {
            return true;
        }
    }

The test then returns with a space:

    healthecare.domain.lucenesearch.KeywordAnalyzer:
        [HW-NCI_TOPICS]
    query.ToString = +category:"HW -NCI_TOPICS" +space
    junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
    Expected: +category:HW\-NCI_TOPICS +space
    Actual:   +category:"HW -NCI_TOPICS" +space   <-- note the space where the escape was

thanks, chad.

-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]]
Sent: Wed 3/24/2004 1:43 AM
To: Lucene Users List
Subject: RE: Query syntax on Keyword field question

Chad Small writes:
> Here is my attempt at a KeywordAnalyzer - although it's not working? Excuse
> the length of the message, but I wanted to give actual code.
>
> With this output:
>
>     Analyzing "HW-NCI_TOPICS"
>     org.apache.lucene.analysis.WhitespaceAnalyzer:
>         [HW-NCI_TOPICS]
>     org.apache.lucene.analysis.SimpleAnalyzer:
>         [hw] [nci] [topics]
>     org.apache.lucene.analysis.StopAnalyzer:
>         [hw] [nci] [topics]
>     org.apache.lucene.analysis.standard.StandardAnalyzer:
>         [hw] [nci] [topics]
>     healthecare.domain.lucenesearch.KeywordAnalyzer:
>         [HW-NCI_TOPICS]
>
>     query.ToString = category:HW -"nci topics" +space
>
>     junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
>     Expected: +category:HW-NCI_TOPICS +space
>     Actual:   category:HW -"nci topics" +space

Well, the query parser does not allow '-' within words currently. So before your analyzer is called, the query parser reads one word HW, a '-' operator, and one word NCI_TOPICS. The latter is analyzed as "nci topics" because it's not in the category field anymore, I guess. I suggested changing this. See http://issues.apache.org/bugzilla/show_bug.cgi?id=27491

Either you escape the - using category:HW\-NCI_TOPICS in your query (untested, and I don't know where the escape character will be removed), or you apply my suggested change.
Another option for using keywords with the query parser might be adding a keyword syntax to the query parser. Something like category:key("HW-NCI_TOPICS") or category="HW-NCI_TOPICS".

HTH
Morus
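Morus's proposed field="value" form could be recognised with a simple pattern check before normal parsing kicks in, so the quoted value is treated as one exact keyword term and never analyzed. A rough sketch of that idea, with invented names (Lucene's QueryParser has no such syntax):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the proposed keyword syntax: recognise field="value" in the
// query string and treat the quoted value as one exact, unanalyzed term.
// Purely illustrative; names are invented, and this is not Lucene code.
public class KeywordSyntaxDemo {
    private static final Pattern KEYWORD = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Returns {field, exactTerm} if the input uses the keyword syntax, else null.
    public static String[] parseKeyword(String query) {
        Matcher m = KEYWORD.matcher(query);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2) };
        }
        return null;
    }

    public static void main(String[] args) {
        String[] kw = parseKeyword("category=\"HW-NCI_TOPICS\"");
        // The value survives untouched: dash, underscore, case and all.
        System.out.println(kw[0] + " -> " + kw[1]);
    }
}
```

A front-end check like this would also sidestep Incze's spaces problem later in the thread, since the quoted value is never split.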
RE: Query syntax on Keyword field question
Here is my attempt at a KeywordAnalyzer - although it's not working? Excuse the length of the message, but I wanted to give actual code.

    package domain.lucenesearch;

    import java.io.*;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.TokenStream;

    public class KeywordAnalyzer extends Analyzer
    {
        public TokenStream tokenStream(String s, Reader reader)
        {
            return new KeywordTokenizer(reader);
        }

        private class KeywordTokenizer extends CharTokenizer
        {
            public KeywordTokenizer(Reader in)
            {
                super(in);
            }

            /**
             * Collects all characters.
             */
            protected boolean isTokenChar(char c)
            {
                return true;
            }
        }
    }

However, this test fails:

    public class KeywordAnalyzerTest extends TestCase
    {
        RAMDirectory directory;
        private IndexSearcher searcher;

        public void setUp() throws Exception
        {
            directory = new RAMDirectory();
            IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Keyword("category", "HW-NCI_TOPICS"));
            doc.add(Field.Text("description", "Illidium Space Modulator"));
            writer.addDocument(doc);
            writer.close();
            searcher = new IndexSearcher(directory);
        }

        public void testPerFieldAnalyzer() throws Exception
        {
            analyze("HW-NCI_TOPICS");
            PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            analyzer.addAnalyzer("category", new KeywordAnalyzer());  // |#1
            Query query = QueryParser.parse("category:HW-NCI_TOPICS AND SPACE",
                                            "description", analyzer);
            Hits hits = searcher.search(query);
            System.out.println("query.ToString = " + query.toString("description"));
            assertEquals("HW-NCI_TOPICS kept as-is",
                         "category:HW-NCI_TOPICS +space",
                         query.toString("description"));
            assertEquals("doc found!", 1, hits.length());
        }

        private void analyze(String text) throws Exception
        {
            Analyzer[] analyzers = new Analyzer[]{
                new WhitespaceAnalyzer(),
                new SimpleAnalyzer(),
                new StopAnalyzer(),
                new StandardAnalyzer(),
                new KeywordAnalyzer(),
                // new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS)
            };
            System.out.println("Analyzing \"" + text + "\"");
            for (int i = 0; i < analyzers.length; i++)
            {
                Analyzer analyzer = analyzers[i];
                System.out.println("\t" + analyzer.getClass().getName() + ":");
                System.out.print("\t\t");
                TokenStream stream = analyzer.tokenStream("category", new StringReader(text));
                while (true)
                {
                    Token token = stream.next();
                    if (token == null)
                        break;
                    System.out.print("[" + token.termText() + "] ");
                }
                System.out.println("\n");
            }
        }
    }

With this output:

    Analyzing "HW-NCI_TOPICS"
    org.apache.lucene.analysis.WhitespaceAnalyzer:
        [HW-NCI_TOPICS]
    org.apache.lucene.analysis.SimpleAnalyzer:
        [hw] [nci] [topics]
    org.apache.lucene.analysis.StopAnalyzer:
        [hw] [nci] [topics]
    org.apache.lucene.analysis.standard.StandardAnalyzer:
        [hw] [nci] [topics]
    healthecare.domain.lucenesearch.KeywordAnalyzer:
        [HW-NCI_TOPICS]

    query.ToString = category:HW -"nci topics" +space

    junit.framework.ComparisonFailure: HW-NCI_TOPICS kept as-is
    Expected: +category:HW-NCI_TOPICS +space
    Actual:   category:HW -"nci topics" +space

See anything? thanks, chad.
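The analyze() diagnostic above (run one input through several analyzers and print each token in brackets) can be mimicked without Lucene, which is handy for experimenting with tokenization rules in isolation. All names below are invented, and the "standard-like" rule (lowercase, split at non-alphanumerics) only approximates what StandardAnalyzer actually does:

```java
import java.util.Arrays;
import java.util.List;

// Lucene-free sketch of the analyze() diagnostic: run one input through
// several tokenization rules and print each token in brackets.
// Names are invented; standardLike() only approximates StandardAnalyzer.
public class AnalyzerCompareDemo {
    public static List<String> whitespace(String s) {
        return Arrays.asList(s.split("\\s+"));                     // split at runs of whitespace
    }

    public static List<String> standardLike(String s) {
        return Arrays.asList(s.toLowerCase().split("[^a-z0-9]+")); // lowercase, split at non-alphanumerics
    }

    public static List<String> keyword(String s) {
        return Arrays.asList(s);                                   // whole input as one token
    }

    private static void print(String name, List<String> tokens) {
        StringBuilder line = new StringBuilder("\t" + name + ": ");
        for (String t : tokens) {
            line.append("[").append(t).append("] ");
        }
        System.out.println(line);
    }

    public static void main(String[] args) {
        String text = "HW-NCI_TOPICS";
        System.out.println("Analyzing \"" + text + "\"");
        print("whitespace", whitespace(text));     // [HW-NCI_TOPICS]
        print("standard-like", standardLike(text)); // [hw] [nci] [topics]
        print("keyword", keyword(text));           // [HW-NCI_TOPICS]
    }
}
```

The output mirrors the thread's observation: only the whitespace and keyword strategies leave HW-NCI_TOPICS intact.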
RE: Query syntax on Keyword field question
Thank you Erik and Incze. I now understand the issue, and I'm trying to create a "KeywordAnalyzer" as suggested in your book excerpt, Erik: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=6727

However, not being all that familiar with the Analyzer framework, I'm not sure how to implement the "KeywordAnalyzer", even though it might be "trivial" :) Any hints, code, or messages to look at?

<>

Ok, here is the section from Lucene in Action. I'll leave the development of KeywordAnalyzer as an exercise for the reader (although its implementation is trivial: one of the simplest analyzers possible, it only emits one token of the entire contents). I hope this helps.

Erik
>>

thanks again, chad.

-----Original Message-----
From: Incze Lajos [mailto:[EMAIL PROTECTED]]
Sent: Tue 3/23/2004 8:08 PM
To: Lucene Users List
Subject: Re: Query syntax on Keyword field question

On Tue, Mar 23, 2004 at 08:10:15PM -0500, Erik Hatcher wrote:
> QueryParser and Field.Keyword fields are a strange mix. For some
> background, check the archives, as this has been covered pretty
> extensively.
>
> A quick answer is yes, you can use MFQP and QP with keyword fields;
> however, you need to be careful which analyzer you use.
> PerFieldAnalyzerWrapper is a good solution - you'll just need to use an
> analyzer for your keyword field which simply tokenizes the whole string
> as one chunk. Perhaps such an analyzer should be made part of the core?
>
> Erik

I've implemented such an analyzer, but it's only a partial solution if your keyword field contains spaces, as the QP would split the query, e.g.:

    NOTTOKENIZED:(term with spaces*)

would give you no hit, even with a not-tokenized field "term with spaces and other useful things". The full solution would be to be able to tell the QP not to split at spaces, either by a 'do not split till apos' syntax, or by the good ol' backslash: do\ not\ notice\ these\ spaces.
incze
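Erik's "trivial" analyzer boils down to draining the Reader into a single token. Stripped of the Lucene plumbing (in a real analyzer this logic would live inside a Tokenizer, with the class below invented purely for illustration), the core idea is just:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// The core of the "trivial" KeywordAnalyzer described above, with the
// Lucene plumbing stripped away: emit the entire contents of the Reader
// as one token. In a real analyzer this would live in a Tokenizer;
// the class and method names here are invented for illustration.
public class SingleTokenDemo {
    // Drain the reader and return its full contents as the single "token".
    public static String singleToken(Reader in) {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Dashes, underscores, case, and even spaces all survive untouched.
        System.out.println("[" + singleToken(new StringReader("HW-NCI_TOPICS")) + "]");
        System.out.println("[" + singleToken(new StringReader("term with spaces")) + "]");
    }
}
```

Note that, as Incze points out, this only fixes the index side: QueryParser still splits the query text at spaces before any analyzer sees it.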
RE: Query syntax on Keyword field question
I have since learned that using a TermQuery instead of the MultiFieldQueryParser works for the keyword field in question below (HW-NCI_TOPICS):

    apiQuery = new BooleanQuery();
    apiQuery.add(new TermQuery(new Term("category", "HW-NCI_TOPICS")), true, false);

This finds a match. I found a message that talked about having to use the Query API when searching Keyword fields in the index. Is this true? Is there not a way to get the MultiFieldQueryParser to find a match on this keyword?

thanks, chad.
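Why the TermQuery matches while the parsed query misses can be modeled with a toy "index": Field.Keyword stores the term exactly as given, TermQuery looks it up verbatim, and a parsed query runs the text through the analyzer first (roughly: lowercase and split). The class and method names below are invented for illustration; this is not Lucene code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of why TermQuery matches where the parsed query does not:
// keyword fields are indexed untokenized, but QueryParser analyzes the
// query text before lookup. Names are invented; this is not Lucene code.
public class TermVsParsedDemo {
    private final Map<String, Set<String>> index = new HashMap<>();

    public void indexKeyword(String field, String value) {
        // Keyword fields: stored untokenized, exactly as given.
        index.computeIfAbsent(field, k -> new HashSet<>()).add(value);
    }

    public boolean termQuery(String field, String term) {
        // TermQuery: looks up the exact term, no analysis at all.
        return index.getOrDefault(field, Set.of()).contains(term);
    }

    public boolean parsedQuery(String field, String queryText) {
        // Parsed query: analyzed first (lowercased and split, roughly what
        // StandardAnalyzer does), so every sub-token must match exactly.
        for (String token : queryText.toLowerCase().split("[^a-z0-9]+")) {
            if (!index.getOrDefault(field, Set.of()).contains(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TermVsParsedDemo demo = new TermVsParsedDemo();
        demo.indexKeyword("category", "HW-NCI_TOPICS");
        System.out.println(demo.termQuery("category", "HW-NCI_TOPICS"));   // true
        System.out.println(demo.parsedQuery("category", "HW-NCI_TOPICS")); // false: "hw" is not an indexed term
    }
}
```

The mismatch is one-sided: the index holds the single term HW-NCI_TOPICS, but the parsed query asks for hw, nci, and topics, none of which exist.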
Query syntax on Keyword field question
Hello,

How can I format a query to get a hit? I'm using the StandardAnalyzer() at both index and search time. If I'm indexing a field like this:

    luceneDocument.add(Field.Keyword("category", "HW-NCI_TOPICS"));

I've tried the following with no success:

    // String searchArgs = "HW\\-NCI_TOPICS";
    // String searchArgs = "HW\\-NCI_TOPICS".toLowerCase();
    // String searchArgs = "+HW+NCI+TOPICS";  // this works with .Text field
    // String searchArgs = "+hw+nci+topics";
    // String searchArgs = "hw nci topics";

thanks, chad.