Re: better stemming engine than Porter?
The Porter stemmer is not only aggressive, it is ugly, too. The generated code is too old, not object-oriented enough, and is probably too slow. If your KStem compiles with Java 1.4, why don't you suggest it for Lucene core?
M.
Wagner, Harry wrote:
Hi HH, Here's a note I sent Solr-dev a while back:
---
I've implemented a Solr plug-in that wraps KStem for Solr use (someone else had already written a Lucene wrapper for it). KStem is considered to be more appropriate for library usage since it is much less aggressive than Porter (e.g., searches for "organization" do NOT match on "organ"!). If there is any interest in feeding this back into Solr I would be happy to contribute it.
---
I believe there was interest in it, but I never opened an issue for it and I don't know if it was ever followed up on. I'd be happy to do that now. Can someone on the Solr-dev team point me in the right direction for opening an issue? Thanks... harry
-----Original Message-----
From: Hung Huynh [EMAIL PROTECTED]
Sent: Monday, April 21, 2008 11:59 AM
To: solr-user@lucene.apache.org
Subject: better stemming engine than Porter?
I recall reading somewhere in one of the mailing-list archives that someone had developed a better stemming algorithm for Solr than the built-in Porter stemmer. Does anyone have a link to that stemming module? Thanks, HH
Re: CorruptIndexException
Robert Haschart [EMAIL PROTECTED] wrote:
To answer your questions: I completely deleted the index each time before retesting, and the java command as shown by ps does show -Xbatch. The program is running on:
uname -a
Linux lab8.betech.virginia.edu 2.6.18-53.1.14.el5 #1 SMP Tue Feb 19 07:18:21 EST 2008 i686 i686 i386 GNU/Linux
more /etc/redhat-release
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
After downgrading from the originally reported version of Java:
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
to this one:
java -version
java version 1.6.0_02
Java(TM) SE Runtime Environment (build 1.6.0_02-b05)
Java HotSpot(TM) Server VM (build 1.6.0_02-b05, mixed mode)
the indexing run successfully completed processing all 112 record chunks! Yea! (That was with -Xbatch on the command line; I didn't try the 1.6.0_02 Java without -Xbatch.)
OK, that's good and bad news. Good in that this still appears to be a JVM issue (scary, really) since downgrading to 1.6.0_02 resolves it. Bad in that -Xbatch is not always a viable workaround. But at least you have a way forward...
So at this point it looks like the problem is in my MARC-8 to UTF-8 translation code. I'll look into this possibility further.
OK. Let me know if this seems to come back to a Lucene issue!
Mike
Re: XSLT transform before update?
Hi,
There is a new patch which implements these features. I shall update the wiki with the documentation. I guess we do not need to be too worried about the memory consumption. A few MB of memory should be fine (unless you are using a file that is tens of MB in size). Consider using XPathEntityProcessor if possible; it uses StAX and is pretty efficient. Thanks for your support.
--Noble
On Mon, Apr 21, 2008 at 5:57 PM, David Smiley @MITRE.org [EMAIL PROTECTED] wrote:
Cool. So you're saying that this XSLT file will operate on the entire XML document that was fetched from the URL and just pass it on to Solr? Thanks for supporting this. The XML files coming from my data source are big, but not so big as to risk an out-of-memory error. And I've found XSLT to perform fast for me. I like your proposed TemplateTransformer too... I'm tempted to use that in place of XSLT. Great job Paul. It'd be neat to have an XSLT transformer for your framework that operates on a single entity (that addresses the memory usage problem). I know your entities are HashMap based instead of XML, however.
~ David
Noble Paul നോബിള് नोब्ळ् wrote:
We are planning to incorporate both your requests in the next patch. The implementation is going to be as follows. Mention the XSL file location like this:
<entity processor="XPathEntityProcessor" xslt="file:/c:/my-own.xsl"> </entity>
So the processing will be done after the XSL transformation. If your XSL transformation produces a valid 'add' document, not even the fields are necessary. Otherwise you will need to write all the fields and their XPaths like any other XML.
<entity processor="XPathEntityProcessor" xslt="file:/c:/my-own.xsl" useSolrAddXml="true"/>
So it will assume that the schema is the same as that of the add XML and do the needful.
Another feature is going to be a TemplateTransformer, which takes a template as follows:
<entity name="e" transformer="TemplateTransformer"> <field column="field1_2" template="${e.field1} ${e.field2}"/> </entity>
Please let us know what you think about this. And keep giving us these great use cases so that we can make the tool better.
--Noble
On Mon, Apr 21, 2008 at 12:07 AM, David Smiley @MITRE.org [EMAIL PROTECTED] wrote:
Thanks Shalin. The particular XSLT processor used is not relevant; it's a spec. Just use the standard Java APIs. If I want a particular processor, then I can get that to happen by using a system property, and/or you could offer a configuration input for the standard factory class implementation for a processor of my choice.
~ David
Shalin Shekhar Mangar wrote:
Hi David, Actually you can concatenate values; however, you'll have to write a bit of code. You can write this in JavaScript (if you're using Java 6) or in Java. Basically, you need to write a Transformer to do it. Look at http://wiki.apache.org/solr/DataImportHandler#head-a6916b30b5d7605a990fb03c4ff461b3736496a9 For example, let's say you get fields first-name and last-name in the XML. But in the schema.xml you have a field called name in which you need to concatenate the values of first-name and last-name (with a space in between).
Create a Java class:
import java.util.Map;

public class ConcatenateTransformer {
    // Called once per row; adds a "name" field built from the two source fields.
    public Object transformRow(Map<String, Object> row) {
        String firstName = (String) row.get("first-name");
        String lastName = (String) row.get("last-name");
        row.put("name", firstName + " " + lastName);
        return row;
    }
}
Add this class to Solr's classpath by putting its jar in solr/WEB-INF/lib. The data-config.xml should look like this:
<entity name="myEntity" processor="XPathEntityProcessor" url="http://myurl/example.xml" transformer="com.yourpackage.ConcatenateTransformer">
    <field column="first-name" xpath="/record/first-name" />
    <field column="last-name" xpath="/record/last-name" />
    <field column="name" />
</entity>
This will call the ConcatenateTransformer.transformRow method for each row, and you can concatenate any field with any field (or constant). Note that the Solr document will keep only those fields which are in schema.xml; the rest are thrown away. If you don't want to write this in Java, you can use JavaScript via the built-in ScriptTransformer; for an example, look at http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9 However, I'm beginning to realize that XSLT is a common need; let me see how best we can accommodate it in DataImportHandler. Which XSLT processor would you prefer?
On Sat, Apr 19, 2008 at 12:13 AM, David Smiley @MITRE.org [EMAIL PROTECTED] wrote:
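
David's point in this thread, that XSLT is a spec reachable through the standard Java APIs, can be sketched with JAXP (javax.xml.transform). This is only an illustration and not DataImportHandler code: the stylesheet, the records/record element names, and the class name XsltToSolrAdd are assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltToSolrAdd {
    // Hypothetical stylesheet: turns <records><record>...</record></records>
    // into a Solr-style <add><doc>...</doc></add> document.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/records\">"
      + "<add><xsl:apply-templates select=\"record\"/></add>"
      + "</xsl:template>"
      + "<xsl:template match=\"record\"><doc><field name=\"name\">"
      + "<xsl:value-of select=\"concat(first-name, ' ', last-name)\"/>"
      + "</field></doc></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML using whichever XSLT processor
    // TransformerFactory.newInstance() resolves to on this JVM.
    static String transform(String xsl, String xml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records><record><first-name>Ada</first-name>"
                   + "<last-name>Lovelace</last-name></record></records>";
        System.out.println(transform(XSL, xml));
    }
}
```

Because the factory is obtained through TransformerFactory.newInstance(), a different processor can be selected with the javax.xml.transform.TransformerFactory system property without touching the code. A transform like this runs over the whole fetched document at once, which is where the memory question raised in this thread comes from.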
RE: better stemming engine than Porter?
Thanks Ryan. I just opened SOLR-546. Please let me know if I can provide further help. Cheers! h
-----Original Message-----
From: Ryan McKinley [EMAIL PROTECTED]
Sent: Monday, April 21, 2008 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: better stemming engine than Porter?
Hey- to create an issue, make an account on JIRA and post it... https://issues.apache.org/jira/browse/SOLR Give that a try and holler if you have trouble. ryan
Re: More Like This boost
On Apr 21, 2008, at 5:02 PM, Francisco Sanmartin wrote:
Is it possible to boost the query that MoreLikeThis returns before sending it to Solr? I mean, technically it is possible, because you can add a factor to the whole query, but... does it make sense? (Remember that MoreLikeThis can already boost each term inside the query.) For example, this could be a result of MoreLikeThis (with native boosting enabled):
queryResultMLT = (this^0.4 is^0.5 a^0.6 query^0.33 of^0.29 morelikethis^0.67)
What I want to do is:
queryResultMLT = (this^0.4 is^0.5 a^0.6 query^0.33 of^0.29 morelikethis^0.67)^0.60
(notice the boost of 0.60 for the whole query)
That last boost wouldn't change the doc ordering at all, so it'd be kinda useless. What are you trying to accomplish?
Erik
Re: More Like This boost
I know that a single query of that type does not change anything. But when there are two or more with different boosts, I hope it does. Here is the situation: my docs have Title and Description. What I want to do is give more relevancy to the MoreLikeThis terms from the title than to those from the description. So the query would be like this:
query = (words^0.4 in^0.3 the^0.56 title^0.65)^0.70 (words^0.7 in^0.33 the^0.49 description^0.43)^0.30
This way, the words in the title are more relevant than the words in the description, right? Thanks!
Pako
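
The difference between one outer boost and two different outer boosts can be illustrated with a small numeric sketch. It assumes, as a deliberate simplification, that a document's score is the boost-weighted sum of its title score and its description score; real Lucene scoring has further factors, and the per-field scores and class name below are made up for the example.

```java
final class OuterBoostSketch {
    // Simplified model: score = titleBoost * titleScore + descBoost * descScore.
    static double combined(double titleScore, double descScore,
                           double titleBoost, double descBoost) {
        return titleBoost * titleScore + descBoost * descScore;
    }

    public static void main(String[] args) {
        // Hypothetical per-field scores for two documents.
        double aTitle = 0.9, aDesc = 0.2; // doc A matches mostly in the title
        double bTitle = 0.3, bDesc = 1.0; // doc B matches mostly in the description

        // One boost applied to the whole query scales both documents equally,
        // so doc B stays ahead of doc A.
        System.out.println("single boost, A ahead: "
            + (combined(aTitle, aDesc, 0.6, 0.6) > combined(bTitle, bDesc, 0.6, 0.6)));

        // Different boosts per clause (0.70 title, 0.30 description) change the
        // relative weight of the fields, so doc A moves ahead of doc B.
        System.out.println("per-field boosts, A ahead: "
            + (combined(aTitle, aDesc, 0.70, 0.30) > combined(bTitle, bDesc, 0.70, 0.30)));
    }
}
```

Under this model a uniform factor never reorders results, while unequal factors on the two clauses can.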
Re: More Like This boost
No, the MLT feature does not have that kind of field-specific boosting capability. It sounds like it could be a useful enhancement though. Of course you do get boosts for interesting terms already, but maybe having an additional field-specific boost would be a nice touch too.
Erik
Re: More Like This boost
It should help to weight the terms with their frequency in the original document. That will distinguish between two documents with the same terms, but different focus.
wunder
Enhancing the query language
For the kind of searching we do over news content, we need a more sophisticated query language. Currently the Solr query language is not enough for our needs. I understand it is possible to add our own customized query parser to the system, but I was wondering if anybody has done that, and whether there are any ideas to share on how and where to start. For example, we need to have:
- paragraph proximity, e.g. (termgroup1) near/n (termgroup2): termgroup1 is n paragraphs apart from termgroup2
- finding terms a minimum number of times, e.g. atleast/n abcd: in the text, abcd should show up at least n times
Thanks, Kamran shadkhast
-- View this message in context: http://www.nabble.com/Enhancing-the-query-language-tp16824860p16824860.html Sent from the Solr - User mailing list archive at Nabble.com.
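
To make the two requested operators concrete, here is a minimal plain-Java sketch of their semantics only. It is not a Solr or Lucene query parser, it works on single terms rather than term groups, and it assumes that near/n means "at most n paragraphs apart" and that paragraphs are separated by blank lines; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class QueryOperatorSemantics {
    // Counts whole-word, case-insensitive occurrences of term in text.
    static int count(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // atleast/n term: the term occurs at least n times in the text.
    static boolean atLeast(String text, String term, int n) {
        return count(text, term) >= n;
    }

    // Indexes of the paragraphs that contain the term.
    static List<Integer> paragraphsContaining(String[] paragraphs, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < paragraphs.length; i++) {
            if (count(paragraphs[i], term) > 0) {
                hits.add(i);
            }
        }
        return hits;
    }

    // termA near/n termB: some occurrence of each lies at most n paragraphs apart.
    static boolean nearParagraphs(String text, String termA, String termB, int n) {
        String[] paragraphs = text.split("\\n\\s*\\n");
        for (int i : paragraphsContaining(paragraphs, termA)) {
            for (int j : paragraphsContaining(paragraphs, termB)) {
                if (Math.abs(i - j) <= n) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Inside Solr, logic like this would have to live in a custom query parser and query implementation working against the index rather than against raw text; the sketch only pins down what the operators are meant to match.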
Re: better stemming engine than Porter?
Hi Wagner, Thanks for the intro to KStem! I quickly scanned the original paper on KStem by Robert Krovetz but could not find any timing data comparing KStem and the Porter stemmer. How slow or fast is KStem compared to Porter, based on your use of it in your application?
Jay
Wagner, Harry wrote:
Mathieu, It's not my KStem. It was written by someone at UMass Amherst. More info here: http://ciir.cs.umass.edu/cgi-bin/downloads/downloads.cgi Someone else had already ported it to Lucene. I simply modified that wrapper to work with Solr. I'll open an issue for it so that it can (hopefully) be integrated into the project. Cheers... harry
RE: better stemming engine than Porter?
Hi Jay, I did not do a timing comparison either, but any change in performance after switching to Kstem was not noticeable. Cheers... h
Re: Highlighted field gets truncated
On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
Mike Klaas wrote:
Fragments are generated independently from matching (I realize this isn't an ideal algorithm).
So it could be that the match is not part of the fragment? This sounds a bit strange. Is there a way to make sure the fragment contains the match other than returning the whole field and doing the fragmenting myself?
The highlighting algorithm is as follows:
1. Fragment the whole field into N fragments.
2. Score each fragment based on the keyword matches (the more matches the better; matches on different keywords are preferred over many matches on the same keyword). Fragments that have no matching keywords do not have a positive score.
3. Return the top hl.maxSnippets fragments that score > 0.
As you can see, only fragments containing a match are returned (note that there are very often multiple matches--you seemed to assume only one).
-Mike
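
The three steps above can be sketched roughly as follows. This is a minimal illustration and not Solr's highlighter code: the fixed-size character fragmenter, the scoring weights, and all class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class FragmentScoringSketch {
    // Step 1: cut the text into fixed-size fragments, independent of any match.
    static List<String> fragment(String text, int size) {
        List<String> fragments = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            fragments.add(text.substring(i, Math.min(text.length(), i + size)));
        }
        return fragments;
    }

    // Step 2: one point per keyword match, plus a bonus of two points per
    // distinct keyword (hypothetical weights) so variety beats repetition.
    static int score(String fragment, Set<String> keywords) {
        int matches = 0;
        Set<String> distinct = new HashSet<>();
        for (String token : fragment.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (keywords.contains(token)) {
                matches++;
                distinct.add(token);
            }
        }
        return matches + 2 * distinct.size();
    }

    // Step 3: keep fragments scoring above zero, best first, at most maxSnippets.
    static List<String> topFragments(String text, int size,
                                     Set<String> keywords, int maxSnippets) {
        List<String> candidates = new ArrayList<>();
        for (String f : fragment(text, size)) {
            if (score(f, keywords) > 0) {
                candidates.add(f);
            }
        }
        candidates.sort((a, b) -> Integer.compare(score(b, keywords), score(a, keywords)));
        return new ArrayList<>(candidates.subList(0, Math.min(maxSnippets, candidates.size())));
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("solr", "lucene"));
        String text = "solr rocks" + " no match " + "solr solr ";
        System.out.println(topFragments(text, 10, keywords, 2));
    }
}
```

Because step 1 cuts the text without looking at the query, a match that straddles a fragment boundary gets split, and only the part that still matches as a whole token counts toward that fragment's score.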
logging through log4j
Hi, I'm (still) seeking more advice on this deployment issue, which is to use org.apache.log4j instead of java.util.logging. I'm not trying to restart any discussion of the respective benefits of slf4j/commons/log4j/jul; I'm looking for a way to bridge jul to log4j with the minimum of container-specific configuration or restrictions. I've failed to find a way that would work for all servlet containers (Tomcat, WebSphere, Jetty) without disrupting the Solr code. My latest attempt, which requires code modification, is posted in the last reply here: http://www.nabble.com/logging-through-log4j-to13747253.html#a16825364. Comments/experience welcome. Thanks, Henri
-- View this message in context: http://www.nabble.com/logging-through-log4j-tp16825424p16825424.html Sent from the Solr - User mailing list archive at Nabble.com.
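
One general way to bridge java.util.logging to another framework is to install a jul Handler that forwards every record. The sketch below shows only that idea and is not the code from the linked post: the Sink interface is a hypothetical stand-in for the log4j side, the level mapping is an assumption, and whether replacing root handlers is permitted in a given servlet container is exactly the open question of this message.

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class JulForwardingSketch {
    /** Hypothetical stand-in for the target logging API (log4j in this thread). */
    interface Sink {
        void log(String level, String loggerName, String message);
    }

    /** A jul Handler that passes every loggable record on to the sink. */
    static final class ForwardingHandler extends Handler {
        private final Sink sink;

        ForwardingHandler(Sink sink) {
            this.sink = sink;
        }

        @Override
        public void publish(LogRecord record) {
            if (record == null || !isLoggable(record)) {
                return;
            }
            // getMessage() is the raw message; parameters are not formatted here.
            sink.log(mapLevel(record.getLevel()), record.getLoggerName(), record.getMessage());
        }

        @Override
        public void flush() {
            // Nothing is buffered in this sketch.
        }

        @Override
        public void close() {
            // No resources to release.
        }
    }

    // Assumed mapping from jul levels to log4j-style level names.
    static String mapLevel(Level level) {
        int value = level.intValue();
        if (value >= Level.SEVERE.intValue()) {
            return "ERROR";
        }
        if (value >= Level.WARNING.intValue()) {
            return "WARN";
        }
        if (value >= Level.INFO.intValue()) {
            return "INFO";
        }
        return "DEBUG";
    }

    // Replaces the root logger's handlers so that all jul output goes to the sink.
    static void install(Sink sink) {
        Logger root = Logger.getLogger("");
        for (Handler handler : root.getHandlers()) {
            root.removeHandler(handler);
        }
        root.addHandler(new ForwardingHandler(sink));
    }
}
```

Calling install(...) once at startup, for example from a servlet context listener, would route all jul records through the sink without touching the code that does the logging.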
Re: Highlighted field gets truncated
Mike Klaas wrote:
On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
So it could be that the match is not part of the fragment? This sounds a bit strange. Is there a way to make sure the fragment contains the match other than returning the whole field and doing the fragmenting myself?
[...] As you can see, only fragments containing a match are returned (note that there are very often multiple matches--you seemed to assume only one).
Mike, thank you for the clarification. Now I understand what went wrong in the example I looked at. I am querying n-gram indexed data (Chinese text). A user enters two or three characters and expects them to be matched more or less as a substring match. The fragment I looked at contained only one of the characters (the other was cut off at the end), which is what made me wonder. From what you say, even adding quotation marks around the query will not prevent this from happening (in this case, it would simply obscure the match). Are there any plans to improve the fragmentation algorithm? Or are there other workarounds?
All the best, Christian
Re: better stemming engine than Porter?
I actually doubt Porter's is slow. From what I recall, it's a bunch of simple if/elses. KStem can't get added to Lucene core due to its license (search Lucene JIRA for an issue that covered this several years ago).
Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Spellchecker Question
I'm using the Spellchecker handler but am a little confused. The docs say to run the cmd=rebuild when building the first time. Do I need to supply a q param with that cmd=rebuild? The examples show a url with the q param set while rebuilding, but the main section on the cmd param doesn't say much about it. My hunch is that I need to supply a q? Thanks, Matt