Hello Wouter and Nick,

> Hi Ard,
>
> We are using a custom DutchAccentlessAnalyzer for the
> caption, which is a wrapper around StandardAnalyzer that adds
> an accent remover and uses some stop words. That means that
> we can't use string indexing, right? If so, I think that Nick
> will probably have to do his own post-query sorting, or do
> you have a better solution?

Of course I have :-)

There are basically two separate solutions:

1) Index the property as type="string" and use the
   'DutchAccentlessAnalyzer' (though, off the top of my head, I am not
   sure whether 'string' indexing can define a separate analyzer; if
   it cannot, the default analyzer in the indexer.xml should be set to
   'DutchAccentlessAnalyzer'). Now, if you want to search on words in
   the property, you use <s:propSearch> instead of
   <s:property-contains>.

2) Extract the property twice, once indexed as string and once as
   text. The first is for sorting, and for an equals comparison if you
   like; the second is for text-based searching.

> However, I'm wondering why there is a difference in sorting
> ability between:
>
> <analyzer
>   class="nl.hippo.slide.index.analysis.SimpleStandardAnalyzer"/>
> <property namespace="http://hippo.nl/cms/1.0" name="caption"
>   type="string"/>
>
> and
>
> <property namespace="http://hippo.nl/cms/1.0" name="caption"
>   type="text"
>   analyzer="nl.hippo.slide.index.analysis.SimpleStandardAnalyzer"/>
>
> It seems to me that in both cases the same analyzer is used

Basically, yes, but 'string' indexing does something extra: besides
indexing the string/text with the help of some analyzer, it also
indexes the text/string unanalyzed (= untokenized). That gives you a
single Lucene term for the entire property, so that you can sort on
it and do an equals comparison, for example.

> and the same sorting ability should be possible. Is there a
> hidden assumption that the default/string analyzer will
> return a single sortable value, while in the second case an
> array of values may be returned? Or perhaps using the String
> type results in a concatenation of the array values. Whatever
> the case may be, that is quite an ugly design, I think.

Hold your horses, there are of course reasons for this :-))

> Ideally, you would want the ability to define a comparator in
> the indexer.xml for a property, so every property can be
> sorted, in a flexible way.
> A simpler solution may be to treat
> text and string properties the same way when sorting. Is that
> a viable change or would it require changes to Lucene?

You can easily do this, and as a matter of fact, you already can:
just index everything as 'string' and use s:propsearch to search
within a property. But there is a tradeoff.

Some properties might be long text fields, like 'the first paragraph
of a document' or 'the summary'. Now, my question: would you ever
want to sort on the first paragraph of a document? Or ever say,
'return me all documents where the paragraph = "some very long lorem
ipsum text continuing several pages"'? As lots of projects do have
(long) properties on which sorting and equals comparisons are
completely meaningless, why store the entire text as a single Lucene
token?

This is bad for the on-disk size of your index, but it gets worse.
Imagine 100,000 documents that you are going to sort on the property
'first paragraph'. Lucene caches the terms internally, so suppose the
paragraphs are on average about 1 kB --> 100,000 x 1 kB = 100 MB of
memory. Sort on another prop, and yet another 100 MB is gone.

Or suppose a range query or postfix wildcard query (e.g. abc*) on
properties of 1 kB. Lucene generally translates such queries into OR
queries, where all terms matching the prefix are 'OR-ed'. OR-ing
10,000 terms of 1 kB per term results in terrible performance and
memory consumption.

Even worse, it is a general rule for inverted indexes that prefix
wildcard queries (e.g. *abc) perform badly. A prefix wildcard search
over 1,000 documents, where the property you search on is on average
1 kB, will kill the JVM for minutes.

So, there are certainly good reasons for having string and text
indexed differently. If you do want to sort on some long text field,
I would advise writing an analyzer that just stores the first 50
chars as a keyword, and sorting on that. If the first 50 chars of two
values are equal, Lucene cannot order them relative to each other.
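
To make the 'first 50 chars' idea concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene analyzer API; it only shows which single value would be stored for sorting and what happens with ties. The class name TruncatedSortKey, its method names and the constant MAX_KEY_LENGTH are made up for this illustration and are not part of Hippo CMS or Lucene. The sample captions are the ones from Nick's result set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene Analyzer.
class TruncatedSortKey {

    // Assumed limit, taken from the "first 50 chars" suggestion.
    static final int MAX_KEY_LENGTH = 50;

    // Returns the single, untokenized value that would be indexed as the
    // sort keyword: the property value cut off after MAX_KEY_LENGTH chars.
    static String sortKey(String propertyValue) {
        if (propertyValue.length() <= MAX_KEY_LENGTH) {
            return propertyValue;
        }
        return propertyValue.substring(0, MAX_KEY_LENGTH);
    }

    // Sorts a copy of the values by their truncated key. List.sort is
    // stable, so values whose keys are equal keep their input order:
    // the key alone cannot decide between them.
    static List<String> sortByKey(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing(TruncatedSortKey::sortKey));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> captions = Arrays.asList(
                "ZONNEBRIL 202", "AAAA leden", "DOMINICUS", "AAAA - buiten");
        for (String caption : sortByKey(captions)) {
            System.out.println(caption);
        }
    }
}
```

Run on the four captions, this prints "AAAA - buiten", "AAAA leden", "DOMINICUS", "ZONNEBRIL 202". Two long values that share their first 50 chars get the same key, which is exactly the tie described above.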

So, the reasons might not be clear from the documentation, and I am
not sure you want to burden the average developer with them, but when
the number of documents grows well beyond 100,000, you have to start
thinking about these kinds of implications.

Regards Ard

>
> Regards,
>
> Wouter
>
> On Thu, Feb 12, 2009 at 5:28 PM, Ard Schrijvers
> <[email protected]> wrote:
>
> > Hello Nick,
> >
> > Most likely you configured the h:caption to be text indexed
> > (indexer.xml). This means you cannot properly sort on it. If you
> > index it as String (this is the default way properties are
> > indexed), sorting will be correct. If you are searching in
> > captions with something like S:property-contains, you have to
> > replace this with S:propsearch, see [1]
> >
> > If you are interested in the reasoning behind text/string
> > indexing, let me know and I'll explain a little more,
> >
> > Regards Ard
> >
> > [1]
> > http://wiki.hippo.nl/display/CMS/06.+Using+DASL+Queries#06.UsingDASLQueries-%3CS%3A(not)propertycontains%2F%3E
> >
> > > I am executing a DASL query [1] against the repository 1.2.15 to
> > > select a number of results ordered by their caption. I get a
> > > number of results [2] which are wrongly ordered. I have tried to
> > > throw away the index at the repository and rebuild it, but that
> > > didn't help. What can cause this behavior?
> > >
> > > With regards,
> > >
> > > Nick Stolwijk
> > > ~Java Developer~
> > >
> > > Iprofs BV.
> > > Claus Sluterweg 125
> > > 2012 WS Haarlem
> > > www.iprofs.nl
> > >
> > > [1]
> > > <?xml version="1.0" encoding="utf-8" ?>
> > > <!-- productlist_dasl.ftl -->
> > > <d:searchrequest xmlns:s="http://jakarta.apache.org/slide/"
> > >     xmlns:h="http://hippo.nl/cms/1.0" xmlns:d="DAV:">
> > >   <d:basicsearch>
> > >     <d:select>
> > >       <d:prop>
> > >         <s:nrHits/>
> > >         <d:displayname/>
> > >         <h:caption />
> > >       </d:prop>
> > >     </d:select>
> > >     <d:from>
> > >       <d:scope>
> > >         <d:href>content</d:href>
> > >         <d:depth>Infinity</d:depth>
> > >       </d:scope>
> > >     </d:from>
> > >
> > >     <limit xmlns="DAV:">
> > >       <nresults>12</nresults>
> > >       <offset xmlns="http://jakarta.apache.org/slide/">0</offset>
> > >     </limit>
> > >
> > >     <d:orderby>
> > >       <d:order>
> > >         <d:prop>
> > >           <h:caption/>
> > >         </d:prop>
> > >         <d:ascending />
> > >       </d:order>
> > >     </d:orderby>
> > >   </d:basicsearch>
> > > </d:searchrequest>
> > >
> > > [2]
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <D:multistatus xmlns:D="DAV:">
> > >   <D:response>
> > >     <D:href>/default/files/default.www/content/aaaa-buiten.xml</D:href>
> > >     <D:propstat>
> > >       <D:prop>
> > >         <S:nrHits xmlns:S="http://jakarta.apache.org/slide/">4</S:nrHits>
> > >         <D:displayname>aaaa-buiten.xml</D:displayname>
> > >         <caption xmlns="http://hippo.nl/cms/1.0">AAAA - buiten</caption>
> > >       </D:prop>
> > >       <D:status>HTTP/1.1 200 OK</D:status>
> > >     </D:propstat>
> > >   </D:response>
> > >   <D:response>
> > >     <D:href>/default/files/default.www/content/dominicus.xml</D:href>
> > >     <D:propstat>
> > >       <D:prop>
> > >         <S:nrHits xmlns:S="http://jakarta.apache.org/slide/">4</S:nrHits>
> > >         <D:displayname>slovenie-istrie-rg-dominicus.xml</D:displayname>
> > >         <caption xmlns="http://hippo.nl/cms/1.0">DOMINICUS</caption>
> > >       </D:prop>
> > >       <D:status>HTTP/1.1 200 OK</D:status>
> > >     </D:propstat>
> > >   </D:response>
> > >   <D:response>
> > >     <D:href>/default/files/default.www/content/appdata/webwinkel/categorien/reis-en-cadeauartikelen/reisartikelen/aaaa-leden.xml</D:href>
> > >     <D:propstat>
> > >       <D:prop>
> > >         <S:nrHits xmlns:S="http://jakarta.apache.org/slide/">4</S:nrHits>
> > >         <D:displayname>anwb-werelstekker-voordeel-voor-leden.xml</D:displayname>
> > >         <caption xmlns="http://hippo.nl/cms/1.0">AAAA leden</caption>
> > >       </D:prop>
> > >       <D:status>HTTP/1.1 200 OK</D:status>
> > >     </D:propstat>
> > >   </D:response>
> > >   <D:response>
> > >     <D:href>/default/files/default.www/content/zonnebril.xml</D:href>
> > >     <D:propstat>
> > >       <D:prop>
> > >         <S:nrHits xmlns:S="http://jakarta.apache.org/slide/">4</S:nrHits>
> > >         <D:displayname>zonnebril-202.xml</D:displayname>
> > >         <caption xmlns="http://hippo.nl/cms/1.0">ZONNEBRIL 202</caption>
> > >       </D:prop>
> > >       <D:status>HTTP/1.1 200 OK</D:status>
> > >     </D:propstat>
> > >   </D:response>
> > > </D:multistatus>
> > > ********************************************
> > > Hippocms-dev: Hippo CMS development public mailinglist
> > >
> > > Searchable archives can be found at:
> > > MarkMail: http://hippocms-dev.markmail.org
> > > Nabble: http://www.nabble.com/Hippo-CMS-f26633.html
>
> --
> Kind regards,
>
> Wouter Zelle
