Re: Difference between SortedDocValues and SortedSetDocValues
On Thu, Oct 12, 2017 at 8:53 AM, Chellasamy G wrote:
> Could anyone please explain the difference between SortedDocValues and
> SortedSetDocValues.

SortedDocValues has at most one value per document (single-valued).
SortedSetDocValues supports a set of values per document (multi-valued).

-Yonik
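For illustration, a minimal sketch of reading both types from a leaf reader, assuming Lucene 7.x's iterator-style doc values API (the field names "category" and "tags" are hypothetical):

    import java.io.IOException;
    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.index.SortedSetDocValues;
    import org.apache.lucene.util.BytesRef;

    void readValues(LeafReader leaf, int docId) throws IOException {
      // single-valued: at most one value per document
      SortedDocValues single = DocValues.getSorted(leaf, "category");
      if (single.advanceExact(docId)) {
        BytesRef value = single.binaryValue();
        // ... use the single value ...
      }

      // multi-valued: a set of ordinals per document
      SortedSetDocValues multi = DocValues.getSortedSet(leaf, "tags");
      if (multi.advanceExact(docId)) {
        long ord;
        while ((ord = multi.nextOrd()) != SortedSetDocValues.NO_MORE_ORDS) {
          BytesRef value = multi.lookupOrd(ord);
          // ... use each value in the set ...
        }
      }
    }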
Re: no concurrent merging?
On Thu, Aug 4, 2016 at 9:35 AM, Michael McCandless wrote:
> Lucene's merging is concurrent, but Solr unfortunately uses
> UninvertingReader on each DBQ ... I'm not sure why.

It looks like DeleteByQueryWrapper was added by
https://issues.apache.org/jira/browse/LUCENE-5666
But other than perhaps changing how long a DBQ takes to execute, it
should be unrelated to the question of whether other merges can proceed
in parallel.

A quick look at the Lucene IndexWriter code says no... Lucene DBQ
processing cannot proceed in parallel. IndexWriter.mergeInit is
synchronized (on IW). The DBQ processing is called from there, and thus
anything else that needs the IW monitor will block.

-Yonik
Re: Port of Custom value source from v4.10.3 to v6.1.0
Use getSortedDocValues for a single-valued field, or
getSortedSetDocValues for a multi-valued one.

-Yonik

On Fri, Jul 8, 2016 at 12:29 PM, paule_lecuyer wrote:
> Many thanks Yonik, I will try that.
>
> For my understanding, what is the difference between SortedSetDocValues
> getSortedSetDocValues(String field) and SortedDocValues
> getSortedDocValues(String field)?
>
> Paule.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Upgrade-of-Custom-value-source-code-from-v4-10-3-to-v6-1-0-tp4286236p4286387.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Port of Custom value source from v4.10.3 to v6.1.0
Use the docValues interface by calling getSortedSetDocValues on the leaf
reader. That will either:
1) use real docValues if you have indexed them, or
2) use the FieldCache to uninvert an indexed field and make it look like
docValues.

-Yonik

On Thu, Jul 7, 2016 at 1:33 PM, paule_lecuyer wrote:
> Hi all,
> I wrote some time ago a ValueSourceParser + ValueSource to allow using
> results produced by an external system as a facet query:
> - in solrconfig.xml : added my parser :
> http://lucene.472066.n3.nabble.com/Port-of-Custom-value-source-from-v4-10-3-to-v6-1-0-tp4286236.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
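A minimal sketch of the uninverting path, assuming Lucene 6.x's UninvertingReader from the misc module (the field name "tags" is hypothetical):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.SortedSetDocValues;
    import org.apache.lucene.uninverting.UninvertingReader;

    // "tags" is an indexed (but non-docValues) field in this sketch
    Map<String, UninvertingReader.Type> mapping = new HashMap<>();
    mapping.put("tags", UninvertingReader.Type.SORTED_SET_BINARY);
    DirectoryReader wrapped = UninvertingReader.wrap(directoryReader, mapping);

    for (LeafReaderContext ctx : wrapped.leaves()) {
      // real docValues are used if present; otherwise the field is uninverted
      SortedSetDocValues dv = ctx.reader().getSortedSetDocValues("tags");
      // ... consume dv ...
    }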
Re: Lucene 5: Mutable/Immutable interface of BitSet
On Sun, Sep 13, 2015 at 4:23 PM, Selva Kumar wrote:
> Mutable, "Immutable" interface of BitSet seems to be defined based on
> specific things like live docs and documents with DocValues etc. Any
> plan to add a general-purpose readonly interface to BitSet?

We already have the "Bits" interface:

public interface Bits {
  public boolean get(int index);
  public int length();
}

-Yonik
Re: Lucene 5: Mutable/Immutable interface of BitSet
On Sun, Sep 13, 2015 at 5:55 PM, Selva Kumar <selva.kumar.at.w...@gmail.com> wrote:
> BitSet has many more readonly methods compared to Bits.

Ah, I see what you're saying now. If you have a need/usecase for certain
methods on Bits, perhaps open a JIRA issue and propose them.

-Yonik

> Similarly, BitSet has many more write methods compared to MutableBits.
> So, as I said, this seems to be based on internal requirements like
> live docs, documents with DocValues etc.
>
> Thanks for your time, Yonik
>
> On Sun, Sep 13, 2015 at 4:43 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>> On Sun, Sep 13, 2015 at 4:23 PM, Selva Kumar
>> <selva.kumar.at.w...@gmail.com> wrote:
>> > Mutable, "Immutable" interface of BitSet seems to be defined based on
>> > specific things like live docs and documents with DocValues etc. Any
>> > plan to add a general-purpose readonly interface to BitSet?
>>
>> We already have the "Bits" interface:
>>
>> public interface Bits {
>>   public boolean get(int index);
>>   public int length();
>> }
>>
>> -Yonik
Re: Lucene nrt
Yes, if you do a commit with waitSearcher=true (and it succeeds), then
any adds before that point will be visible.

-Yonik

On Mon, Jul 20, 2015 at 8:25 PM, Bhawna Asnani bhawna.asn...@gmail.com wrote:
> Hi, I am using Solr to update a document and read it back immediately
> through search. I do softCommit my changes, which claims to open
> lucene's IndexReader from the IndexWriter that was used to write the
> document. But there are times when I get a stale document back, even
> with waitSearcher=true. Does lucene's NRT (i.e. DirectoryReader
> open(IndexWriter writer, boolean applyAllDeletes)) guarantee that the
> changes made through the writer will be visible to the reader
> immediately?
>
> Thanks.
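At the Lucene level, a minimal sketch of the NRT reopen loop being discussed (assuming the Lucene 5.x API; writer setup omitted):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;

    // open an NRT reader directly from the writer
    DirectoryReader reader = DirectoryReader.open(writer, true); // applyAllDeletes

    // ... after more updates through the same writer ...
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
    if (newReader != null) {  // null means nothing changed
      reader.close();
      reader = newReader;     // sees everything indexed before this point
    }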
Lucene/Solr Revolution 2015 Voting
Hey Folks,

If you're interested in going to Lucene/Solr Revolution this year in
Austin, please vote for the sessions you would like to see!
https://lucenerevolution.uservoice.com/

-Yonik
Re: Query with many clauses
For queries with many terms, where each term matches few documents
(actually a single document for the ID filters in my tests), I saw
speedups between 4x and 8x: http://heliosearch.org/solr-terms-query/
(the 3rd chart)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

On Wed, Oct 29, 2014 at 9:42 AM, Michael McCandless
luc...@mikemccandless.com wrote:
> I suggested TermsFilter, not TermFilter :) Note the sneaky extra s
>
> Mike McCandless
> http://blog.mikemccandless.com
>
> On Wed, Oct 29, 2014 at 8:20 AM, Pawel Rog pawelro...@gmail.com wrote:
>> Hi, I already tried to transform queries to filters (TermQuery ->
>> TermFilter) but didn't see much speedup. As I wrote, I wrapped this
>> filter in a ConstantScoreQuery, and in another test I used
>> FilteredQuery with MatchAllDocsQuery and BooleanFilter. Both cases
>> seem to perform about the same as a simple BooleanQuery. But of course
>> I'll also try TermsFilter. Maybe it will speed up the filters.
>>
>> Michael Sokolov: I haven't prepared any statistics about the number
>> of BooleanClauses used or whether there are repeating sets of terms.
>> I think I have to collect some stats for a better understanding of
>> what can be improved.
>>
>> --
>> Paweł Róg
>>
>> On Wed, Oct 29, 2014 at 12:30 PM, Michael Sokolov
>> msoko...@safaribooksonline.com wrote:
>>> I'm curious to know more about your use case, because I have an idea
>>> for something that addresses this, but haven't found the opportunity
>>> to develop it yet - maybe somebody else wants to :). The basic idea
>>> is to reduce the number of terms needed to be looked up by collapsing
>>> commonly-occurring collections of terms into synthetic "tiles". If
>>> your queries have a lot of overlap, this could greatly reduce the
>>> number of terms in a query rewritten to use tiles. It's sort of
>>> complex, requires indexing support, or a filter cache, and there's no
>>> working implementation as yet, so this is probably not really going
>>> to be helpful for you in the short term, but if you can share some
>>> information I'd love to know: what kind of things are you searching?
>>> how many terms do your larger queries have? do the query terms
>>> overlap among your queries?
>>>
>>> -Mike Sokolov
>>>
>>> On 10/28/14 9:40 PM, Pawel Rog wrote:
>>>> Hi, I have to run queries with a lot of boolean should clauses.
>>>> Queries like these were of course slow, so I decided to change the
>>>> query to a filter wrapped by ConstantScoreQuery, but it also didn't
>>>> help. The profiler shows that most of the time is spent in seekExact
>>>> in BlockTreeTermsReader$FieldReader$SegmentTermsEnum. When I go
>>>> deeper in the trace, I see that inside seekExact most time is spent
>>>> in loadBlock and, even deeper, ByteBufferIndexInput.clone. Do you
>>>> have any ideas how I can make it faster, or is that not possible and
>>>> I have to live with it?
>>>>
>>>> --
>>>> Paweł Róg
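For reference, a minimal sketch of the TermsFilter approach Mike suggested, assuming Lucene 4.x's TermsFilter from the queries module (idValues and searcher are assumed to exist elsewhere):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.TermsFilter;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;

    // many-term ID filter as a single TermsFilter instead of a huge BooleanQuery
    List<Term> ids = new ArrayList<Term>();
    for (String id : idValues) {      // idValues: hypothetical collection of IDs
      ids.add(new Term("id", id));
    }
    Query q = new ConstantScoreQuery(new TermsFilter(ids));
    TopDocs hits = searcher.search(q, 10);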
Re: Square of Idf
On Thu, Mar 6, 2014 at 6:28 PM, Furkan KAMACI furkankam...@gmail.com wrote:
> Hi;
> The tf-idf explanation says that idf(t) appears for t in both the query
> and the document, hence it is squared in the equation.
> DefaultSimilarity does not square it. What is the explanation for that?

I think you explained it yourself. The similarity doesn't square it...
what is returned from Similarity.idf(t) is used twice (and hence ends up
effectively squared). The code has gotten more complex over time, but
look at the class IDFStats to see the squaring of idf. There is an idf
factor in the queryWeight, and then in normalize() it's multiplied by
the idf factor again.

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr
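Schematically, the flow described above looks like this (a sketch only, not the actual Lucene source; the method and parameter names are illustrative):

    /**
     * Schematic sketch: the same idf value enters the weight once via the
     * query weight and once more in normalize(), squaring it overall.
     */
    float effectiveTermWeight(float idf, float queryBoost, float queryNorm) {
      float queryWeight = idf * queryBoost; // first use of idf
      queryWeight *= queryNorm;
      return queryWeight * idf;             // second use: idf is effectively squared
    }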
Re: Natural Sort Order
On Mon, Oct 14, 2013 at 9:43 PM, Darren Hoffman dar...@jnamics.com wrote:
> Can anyone tell me if a search based on a ConstantScoreQuery should
> return the results in the order that the documents were added to the
> index?

The order will be internal docid, which used to be the order that docs
were added to the index. Non-contiguous segments can now be merged, so
that is no longer the case.

-Yonik
Re: sorting with lucene 4.3
On Wed, Jul 31, 2013 at 2:51 PM, Nicolas Guyot sfni...@gmail.com wrote:
> I have written a quick test to reproduce the slower sorting with
> numeric DV. In this test case, it happens only when reverse sorting.

Right - I bet your numeric field is relatively ordered in the index.
When this happens, there is always one sort order that is less
efficient, because the priority queue is constantly finding more
competitive hits as we search through the index. If you index random
numbers (or index in a random order), the discrepancy between the sort
orders should disappear.

-Yonik
http://lucidworks.com
Re: About query result cache.
On Mon, Dec 17, 2012 at 12:58 AM, lukai lukai1...@gmail.com wrote:
> Hi, guys:
> Does a queryplugin implementation impact caching? I have implemented a
> new query parser which just takes the input query string and returns my
> own query object. But the problem is, when I apply this logic to Solr,
> it seems it only works the first time. Then even if I change the query,
> it still returns the same result as the first time. Is it cached? If
> so, what is the cache key based on?

The key is the query object. Implement equals and hashCode so that it
won't match other versions of your query.

-Yonik
http://lucidworks.com
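A minimal sketch of what that looks like on a custom query class (MyQuery and its "input" field are hypothetical; only the caching-relevant parts are shown, assuming the Lucene 4.x Query base class whose equals/hashCode include the boost):

    import org.apache.lucene.search.Query;

    public class MyQuery extends Query {
      private final String input;

      public MyQuery(String input) {
        this.input = input;
      }

      @Override
      public String toString(String field) {
        return "MyQuery(" + input + ")";
      }

      @Override
      public boolean equals(Object o) {
        if (this == o) return true;
        if (!super.equals(o)) return false; // Query.equals checks class and boost
        return input.equals(((MyQuery) o).input);
      }

      @Override
      public int hashCode() {
        return 31 * super.hashCode() + input.hashCode();
      }
    }

With equals/hashCode defined over the query's actual inputs, two requests with different query strings produce unequal query objects and no longer collide in the cache.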
Re: Lucene 4.0: Custom Query Parser newTermQuery(Term term) override
On Wed, Jul 11, 2012 at 9:34 AM, Jamie ja...@stimulussoft.com wrote:
> I am busy attempting to integrate Lucene 4.0 Alpha into my code base. I
> have a custom QueryParser that extends QueryParser and overrides
> newRangeQuery and newTermQuery

Random pointer: for most special-case field handling, one would want to
override getFieldQuery or newFieldQuery rather than the lower-level
newTermQuery.

-Yonik
http://lucidimagination.com
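For illustration, a minimal sketch of overriding getFieldQuery, assuming the Lucene 4.x classic QueryParser (the "sku" field and its special-casing are hypothetical):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.util.Version;

    public class MyQueryParser extends QueryParser {
      public MyQueryParser(Version version, String defaultField, Analyzer analyzer) {
        super(version, defaultField, analyzer);
      }

      @Override
      protected Query getFieldQuery(String field, String queryText, boolean quoted)
          throws ParseException {
        if ("sku".equals(field)) {                           // hypothetical special case
          return new TermQuery(new Term(field, queryText));  // bypass analysis
        }
        return super.getFieldQuery(field, queryText, quoted);
      }
    }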
Re: IndexReader.deleteDocument in Lucene 3.6
On Fri, May 25, 2012 at 5:23 AM, Nikolay Zamosenchuk nikolaz...@gmail.com wrote:
> IndexWriter.deleteDocuments(..) is not final, but doesn't return any
> result.

Deleted terms are buffered for good performance, so at the time of
IndexWriter.deleteDocuments(Term) we don't know how many documents match
the term.

> Can anyone please suggest how to solve this issue? One can simply run a
> term query before, but it seems to be absolutely inefficient.

You could switch to an asynchronous design and use a custom query that
keeps track of how many (or which) documents it matched.

-Yonik
http://lucidimagination.com

> --
> Best regards, Nikolay Zamosenchuk
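For reference, a minimal sketch of the count-then-delete approach the thread mentions (inherently racy, since the index can change between the count and the delete; searcher, writer and term are assumed to be set up elsewhere, and TotalHitCountCollector is available in Lucene 3.1+):

    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TotalHitCountCollector;

    // count how many live documents currently match the term...
    TotalHitCountCollector counter = new TotalHitCountCollector();
    searcher.search(new TermQuery(term), counter);
    int matched = counter.getTotalHits();

    // ...then delete; the count is only an estimate of what gets deleted
    writer.deleteDocuments(term);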
Re: org.apache.lucene.index.MultiFields.getLiveDocs(IndexReader) returning null.
On Mon, Mar 5, 2012 at 1:53 PM, Benson Margulies bimargul...@gmail.com wrote:
> There's no javadoc on here yet, and I am a little puzzled by the fact
> that it is returning null for me. Does that imply that there can't be
> any deleted docs known to the reader?

Right, see AtomicReader:

/** Returns the {@link Bits} representing live (not
 *  deleted) docs. A set bit indicates the doc ID has not
 *  been deleted. If this method returns null it means
 *  there are no deleted documents (all documents are
 *  live).
 *
 *  The returned instance has been safely published for
 *  use by multiple threads without additional
 *  synchronization. */
public abstract Bits getLiveDocs();

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10
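A minimal usage sketch against a composite reader, following the null contract above (assuming the Lucene 4.x-era MultiFields helper):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.util.Bits;

    void visitLiveDocs(IndexReader reader) throws IOException {
      Bits liveDocs = MultiFields.getLiveDocs(reader); // null => no deletions
      for (int docId = 0; docId < reader.maxDoc(); docId++) {
        if (liveDocs != null && !liveDocs.get(docId)) {
          continue; // deleted
        }
        Document doc = reader.document(docId);
        // ... use doc ...
      }
    }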
Re: Spatial Search
On Sat, Dec 31, 2011 at 11:52 AM, Lance Java lance.j...@googlemail.com wrote:
> Hi, I am new to Lucene and I am trying to use spatial search.

The old tier-based stuff in Lucene is broken and considered deprecated.
For Lucene, this may currently be your best hope:
http://code.google.com/p/lucene-spatial-playground/

Solr has also had built-in spatial for a little while too:
http://wiki.apache.org/solr/SpatialSearch

-Yonik
http://www.lucidimagination.com
Re: ElasticSearch
On Thu, Nov 17, 2011 at 2:53 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
> dude, look at this query... its insane isn't it :)

Sorry... what's the equivalent you'd like instead?

Or if you're just unjustifiably bitching about Solr again, maybe I
should take a stroll through Lucene land and bitch about
incomprehensible code, APIs that are increasingly hard to use, APIs that
keep changing on a whim w/o regard to existing users, etc. Your attitude
is getting tiring.

-Yonik
http://www.lucidimagination.com
Re: ElasticSearch
On Thu, Nov 17, 2011 at 3:18 PM, Uwe Schindler u...@thetaphi.de wrote:
> Sorry, this query is really ununderstandable. Those complex queries
> should have a meaningful language, e.g. a JSON object structure

There are upsides and downsides to that. A big JSON object graph would
be easier to *read* but certainly not easier to write (much more
nesting). These main Solr APIs are based around HTTP parameters... the
upside being you can add another parameter w/o worrying about nesting it
correctly. Like simply adding another filter, for example:
fq=instock:true

-Yonik
http://www.lucidimagination.com
Re: ElasticSearch
On Thu, Nov 17, 2011 at 3:40 PM, Mark Harwood markharw...@yahoo.co.uk wrote:
> JSON or XML can reflect more closely the hierarchy in the underlying
> Lucene query objects.

We normally use the Lucene QueryParser syntax itself for that (not HTTP
parameters). Other parameters such as filters, faceting, highlighting,
sorting, etc., don't normally have any hierarchy.

I don't think JSON is always nicer either. How would you write this sort
in JSON, for example?
sort=price desc, score desc

A big plus of Solr's APIs is that it's relatively easy to type them into
a browser to try them out.

As far as alternate query syntaxes (as opposed to alternate request
syntaxes), Solr has good support for that, and it would be simple to add
in support for a JSON query syntax if someone wrote one. AFAIK, there's
an issue open for adding the XML query syntax, but I'm not sure if it's
ever had much traction.

-Yonik
http://www.lucidimagination.com
Re: ElasticSearch
On Thu, Nov 17, 2011 at 3:44 PM, Michael McCandless
luc...@mikemccandless.com wrote:
> Maybe someone can post the equivalent query in ElasticSearch? I don't
> think it's possible.

Hoss threw the kitchen sink into his contrived example.
Here's a super simple example:

JSON:
{
  "sort" : [ { "age" : { "order" : "asc" } } ],
  "query" : { "term" : { "user" : "jack" } }
}

Solr's HTTP:
q=user:jack&sort=age asc

-Yonik
http://www.lucidimagination.com
Re: ElasticSearch
On Wed, Nov 16, 2011 at 10:36 AM, Shashi Kant sk...@sloan.mit.edu wrote:
> I had posted this earlier on this list, hope this provides some answers
> http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/

Except it's an out-of-date comparison. We have NRT (near real time
search) in Solr now: http://wiki.apache.org/solr/NearRealtimeSearch

-Yonik
http://www.lucidimagination.com
Re: Please help me with a basic question...
On Fri, May 20, 2011 at 2:46 PM, Doron Cohen cdor...@gmail.com wrote:
>> I stumbled upon the 'Explain' function yesterday, though it returns a
>> crowded message when using debug in the SOLR admin. Is there another
>> method or interface which returns more or cleaner info?
>
> I am not familiar with the use of Solr for this, I hope someone else
> will answer this...

Most browsers' default XML display doesn't preserve the text
formatting... hence the explain can look messed up. Try viewing the
source or original page (CTRL-U in firefox, CTRL-ALT-U or CMD-ALT-U in
chrome I think)... and make sure indent=true:

http://localhost:8983/solr/select?q=solr&debugQuery=true&indent=true

<lst name="explain">
  <str name="SOLR1000">
0.58961654 = (MATCH) fieldWeight(text:solr in 1), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.3353748 = idf(docFreq=2, maxDocs=31)
  0.125 = fieldNorm(field=text, doc=1)
  </str>
</lst>

If email doesn't mess this up somewhere, you should see a properly
indented block of explain text.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference,
May 25-26, San Francisco
Re: Retrieving the first document in a range
On Tue, Apr 5, 2011 at 10:06 AM, Shai Erera ser...@gmail.com wrote:
> Can we use TermEnum to skip to the first term 'after 3 weeks'? If so,
> we can pull the first doc that appears in the TermDocs of that Term (if
> it's a valid term).

Yep. Try this to get the term you want to use to seek:

BytesRef term = new BytesRef();
NumericUtils.longToPrefixCoded(longval, 0, term);

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference,
May 25-26, San Francisco
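Putting the pieces together, a minimal sketch of seeking to the first term at or after a numeric value and pulling its first live doc. The thread is from the 4.0-dev flex era, where these APIs moved around; this sketch follows them as they ended up in the 4.0 release:

    import java.io.IOException;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    int firstDocAtOrAfter(IndexReader reader, String field, long longval)
        throws IOException {
      BytesRef target = new BytesRef();
      NumericUtils.longToPrefixCoded(longval, 0, target);

      Terms terms = MultiFields.getTerms(reader, field);
      TermsEnum te = terms.iterator(null);
      if (te.seekCeil(target) == TermsEnum.SeekStatus.END) {
        return -1; // no term at or after the target
      }
      DocsEnum docs = te.docs(MultiFields.getLiveDocs(reader), null);
      int doc = docs.nextDoc();
      return doc == DocIdSetIterator.NO_MORE_DOCS ? -1 : doc;
    }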
Re: DocIdSet to represent small numberr of hits in large Document set
On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman a...@thorntothehorn.org wrote:
> Seems like SortedVIntList can be used to store the info, but it has no
> methods to build the list in the first place, requiring an array or
> bitset in the constructor.

It has a constructor that takes a DocIdSetIterator - so you can pass an
iterator obtained from anywhere else (a Scorer actually is a
DocIdSetIterator, and you can get a DocIdSet from a Filter), or
implement your own. It's a simple iterator interface.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference,
May 25-26, San Francisco
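A minimal sketch of that constructor in use, assuming the Lucene 3.x API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.SortedVIntList;

    SortedVIntList compact(Filter filter, IndexReader reader) throws IOException {
      DocIdSet hits = filter.getDocIdSet(reader); // any DocIdSet source works
      return new SortedVIntList(hits.iterator()); // built straight from the iterator
    }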
Re: Undo hyphenation when indexing
Solr has a hyphenated-words filter you could copy:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html

On trunk, this has been folded into the analysis module.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference,
May 25-26, San Francisco

On Fri, Apr 1, 2011 at 11:50 AM, Wulf Berschin bersc...@dosco.de wrote:
> Hi, for indexing PDF files we have to undo word hyphenation. The basic
> idea is simply to remove the hyphen when a new line and a small letter
> follows. Of course this approach isn't 100% foolproof, but checking
> against a dictionary wouldn't be either...
>
> Since we face this problem too when highlighting using HTMLCharStripper
> (yes, we do have hyphenation in our HTML docs...) it seems to me I have
> to adjust the JFlex-generated StandardTokenizerImpl. Is this the right
> approach and how would I have to modify this script?
>
> Thanks Wulf
>
> PS: I see that there are changes made in the brand new 3.1.0 version -
> we are using 3.0.3 - but as far as I understand no relevant changes in
> this respect.
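A minimal sketch of wiring that filter into an analyzer, assuming the Solr 1.4/3.0-era class in org.apache.solr.analysis (a whitespace tokenizer is used because a standard tokenizer would strip the trailing hyphen before the filter can see it):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.solr.analysis.HyphenatedWordsFilter;

    // rejoins words split by a hyphen at a line break,
    // e.g. "hyphen-\nation" -> "hyphenation"
    Analyzer analyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new WhitespaceTokenizer(reader); // keeps trailing '-'
        return new HyphenatedWordsFilter(ts);
      }
    };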
Re: which unicode version is supported with lucene
On Sun, Feb 27, 2011 at 2:15 PM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
> Jepp, it's back online. Just did a short test and reported my results
> to jira, but is the error from the xml output still a jetty problem or
> is it from XMLWriter?

The patch has been committed, so you should just be able to try trunk
(or 3x). I also just committed a char beyond the BMP to
utf8-example.xml, and the indexing and XML output work fine for me.
Index the example docs, then do a query for BMP to bring up that
document.

-Yonik
http://lucidimagination.com
Re: which unicode version is supported with lucene
On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
> So Solr trunk should already handle Unicode above BMP for field type
> string? Strange...

One issue is that jetty doesn't support UTF-8 beyond the BMP:

/opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane

-Yonik
http://lucidimagination.com
Re: which unicode version is supported with lucene
On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
> Hi Yonik, good point, yes we are using Jetty. Do you know if Tomcat has
> this limitation?

Tomcat's defaults are worse - you need to configure it to use UTF-8 by
default for URLs. Once you do, it passes all those tests (last I
checked).

Those tests are really about UTF-8 working in GET/POST query arguments.
Solr may still be able to handle indexing and returning full UTF-8, but
you wouldn't be able to query for it w/o using surrogates if you're
using Jetty. It would be good to test though - does anyone know how to
add a char above the BMP to utf8-example.xml?

-Yonik
http://lucidimagination.com

> Regards, Bernd
>
> On 25.02.2011 14:54, Yonik Seeley wrote:
>> On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
>> bernd.fehl...@uni-bielefeld.de wrote:
>>> So Solr trunk should already handle Unicode above BMP for field type
>>> string? Strange...
>>
>> One issue is that jetty doesn't support UTF-8 beyond the BMP:
>>
>> /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
>> Solr server is up.
>> HTTP GET is accepting UTF-8
>> HTTP POST is accepting UTF-8
>> HTTP POST defaults to UTF-8
>> ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
>> ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
>> ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
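To answer the closing question concretely, here is one way to produce a character above the BMP (U+1D11E, MUSICAL SYMBOL G CLEF, is just a convenient example):

    // In a Java string, a supplementary character must be a surrogate pair:
    String gclef = "\uD834\uDD1E";
    // equivalently, built from the code point itself:
    String same = new String(Character.toChars(0x1D11E));
    // and in an XML file such as utf8-example.xml, a numeric character
    // reference avoids any editor/encoding issues: &#x1D11E;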
Re: Storing an ID alongside a document
That's exactly what the CSF feature is for, right? (docvalues branch)

-Yonik
http://lucidimagination.com

On Wed, Feb 2, 2011 at 1:03 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
> I'm curious if there's a new way (using flex or term states) to store
> IDs alongside a document and retrieve the IDs of the top N results? The
> goal would be to minimize HD seeks, and not use field caches (because
> they consume too much heap space) or the doc stores (which require two
> seeks). One possible way using the pre-flex system is to place the IDs
> into a payload posting that would match all documents, and then
> [somehow] retrieve the payload only when needed.
Re: Storing an ID alongside a document
On Wed, Feb 2, 2011 at 9:23 PM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
> Is it? I thought it would load the values into heap RAM like the field
> cache and in addition save the values to disk? Does it also read the
> values directly from disk?

Loading into memory is a separate, optional part (i.e., loading a
fieldcache entry) that should use the APIs that read directly from the
index.

-Yonik
http://lucidimagination.com
Re: WARNING: re-index all trunk indices!
On Fri, Dec 17, 2010 at 11:18 AM, Michael McCandless
luc...@mikemccandless.com wrote:
> If you are using Lucene's trunk (nightly build) release, read on... I
> just committed a change (for LUCENE-2811) that changes the index format
> on trunk, thus breaking (w/ likely strange exceptions on reading the
> segments_N file) any trunk indices created in the past week or so.

For reference, the exception I got trying to start Solr with an older
index on Windows is below.

-Yonik
http://www.lucidimagination.com

SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:587)
        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412)
        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
        at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
        at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
        at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
        at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
        at org.mortbay.jetty.Server.doStart(Server.java:224)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.mortbay.start.Main.invokeMain(Main.java:194)
        at org.mortbay.start.Main.start(Main.java:534)
        at org.mortbay.start.Main.start(Main.java:441)
        at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.io.IOException: read past EOF
        at org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(MMapDirectory.java:242)
        at org.apache.lucene.store.ChecksumIndexInput.readBytes(ChecksumIndexInput.java:48)
        at org.apache.lucene.store.DataInput.readString(DataInput.java:121)
        at org.apache.lucene.store.DataInput.readStringStringMap(DataInput.java:148)
        at org.apache.lucene.index.SegmentInfo.<init>(SegmentInfo.java:192)
        at org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:57)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:220)
        at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:90)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623)
        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:437)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084)
        ... 31 more
Re: The logic of QueryParser
On Mon, Dec 13, 2010 at 2:51 PM, Robert Muir rcm...@gmail.com wrote:
> On Mon, Dec 13, 2010 at 2:43 PM, Yonik Seeley
> yo...@lucidimagination.com wrote:
>> On Mon, Dec 13, 2010 at 2:10 PM, Brian Hurt bhur...@gmail.com wrote:
>>> I was just wondering what the logic was for defaulting to or instead
>>> of and.
>>
>> Largely historical. I think the original rationale was that it
>> probably fit better with the traditional vector space model. There's
>> also not a good reason to change the default, given that QueryParser
>> isn't meant for end users.
>
> Thats pretty misleading Yonik.
> "In other words, the query parser is designed for human-entered text,
> not for program-generated text."
> http://lucene.apache.org/java/3_0_3/queryparsersyntax.html

*shrugs*, I didn't recall that phrase... but I'm not clear if you
disagree with what I'm saying, or if you just think that it's
inconsistent with the documentation.

I think of the Lucene QueryParser like SQL. SQL is text based and also
meant for human-entered text - but for either very expert users, or
programmatically created queries. You normally don't want to pass text
from a search box directly to an SQL database or to the Lucene
QueryParser.

-Yonik
http://www.lucidimagination.com
Re: The logic of QueryParser
On Mon, Dec 13, 2010 at 3:07 PM, Robert Muir rcm...@gmail.com wrote:
> On Mon, Dec 13, 2010 at 3:04 PM, Yonik Seeley
> yo...@lucidimagination.com wrote:
>> I think of the Lucene QueryParser like SQL. SQL is text based and also
>> meant for human-entered text - but for either very expert users, or
>> programmatically created queries. You normally don't want to pass text
>> from a search box directly to an SQL database or to the Lucene
>> QueryParser.
>
> Then why does solr use it by default?

Because it's a decent default? It was also the only choice when Solr was
first created. I don't see a compelling reason to change that.

Solr fits about the same place a database does in many applications...
it's certainly not meant for users to query directly. There's normally a
web application that handles interaction with the user and
creates/submits queries to Solr.

-Yonik
http://www.lucidimagination.com
Webcast: Better Search Results Faster with Apache Solr and LucidWorks Enterprise
We're holding a free webinar about relevancy enhancements in our
commercial version of Solr. Details below.

-Yonik
http://www.lucidimagination.com

-
Join us for a free technical webcast

Better Search Results Faster with Apache Solr and LucidWorks Enterprise
Thursday, December 16, 2010
11:00 AM PST / 2:00 PM EST / 20:00 CET

Click here to sign up:
http://www.eventsvc.com/lucidimagination/121610?trk=AP

In the key dimensions of search relevancy and query-targeted results,
users have become accustomed to internet-search style facilities like
page-rank, user-driven feedback, auto-suggest and more. Even with the
power of Apache Lucene/Solr, building such features into your own search
application is easier said than done. Now, with LucidWorks Enterprise,
the search solution development platform built on the Solr/Lucene open
source technology, developing killer search apps with these features and
more is faster, simpler, and more powerful than ever before!

Join Andrzej Bialecki, Lucene/Solr Committer and inventor of the Luke
index utility, for a hands-on technical workshop that details how
LucidWorks Enterprise puts powerful search and relevancy at your
fingertips -- at a fraction of the time and effort required to program
them yourself with native Apache Solr. Andrzej will discuss and present
how you can use LucidWorks Enterprise for:

* Click Scoring to automatically configure relevance for most popular
  results
* Simplified implementation of auto-complete and did-you-mean
  functionality
* Unsupervised feedback to automatically provide relevance improvement
  on every query

Click here to sign up:
http://www.eventsvc.com/lucidimagination/121610?trk=AP

--
About the presenter: Andrzej Bialecki is a committer of the Apache
Lucene/Solr project, a Lucene PMC member, and chairman of the Apache
Nutch project. He is also the author of Luke, the Lucene Index Toolbox.
Andrzej participates in many commercial projects that use Lucene/Solr,
Nutch and Hadoop to implement enterprise and vertical search.

--
Presented by Lucid Imagination, the commercial entity exclusively
dedicated to Apache Lucene/Solr open source search technology.
LucidWorks Enterprise, our search solution development platform, helps
you build better search applications more quickly and productively. We
also offer solutions including SLA-based support, professional training,
best practices consulting, free developer downloads and free
documentation. Follow us on Twitter: twitter.com/LucidImagineer.

--
Apache Lucene and Apache Solr are trademarks of the Apache Software
Foundation.
Re: best practice: 1.4 billions documents
On Mon, Nov 22, 2010 at 12:49 PM, Uwe Schindler u...@thetaphi.de wrote:
> (Fuzzy scores on MultiSearcher and Solr are totally wrong because each
> shard uses another rewritten query).

Hmmm, really? I thought that fuzzy scoring should just rely on edit
distance? Oh wait, I think I see - it's because we can use a hard cutoff
for the number of expansions rather than an edit-distance cutoff. If we
used the latter, everything should be fine?

The fuzzy issue I would classify as working as designed. Either that, or
classify FuzzyQuery as broken. A cutoff based on the number of terms
will yield strange results even on a single index. Consider this
scenario: it's possible to add more docs to a single index and have the
same fuzzy query return fewer docs than it did before!

-Yonik
http://www.lucidimagination.com
Re: best practice: 1.4 billions documents
On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler u...@thetaphi.de wrote:
> The latest discussion was more about MultiReader vs. MultiSearcher. But
> you are right, 1.4 B documents is not easy to handle, especially when
> your index grows and you get to the 2.1 B mark - then no MultiSearcher
> or whatever helps. On the other hand, even distributed Solr has the
> same problems as MultiSearcher: scoring MultiTermQueries (Fuzzy)
> doesn't work correctly

Are you referring to the idf being local to the shard instead of global
to the whole collection? Andrzej has a patch in the works, but it's not
committed yet.

> negative MTQ clauses may produce wrong results if the query rewriting
> is done like in MultiSearcher (which is unsolveably broken for some
> queries as Boolean clauses - see DeMorgan laws).

I don't think this is a problem for Solr. Queries are executed on each
shard as normal (no difference from a non-distributed query).

-Yonik
http://www.lucidimagination.com
Re: best practice: 1.4 billions documents
On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
luca.rondan...@gmail.com wrote:
> Hi everybody, I really need some good advice! I need to index in lucene
> something like 1.4 billion documents. I have experience with lucene but
> I've never worked with such a big number of documents. Also, this is
> just the number of docs at start-up: they are going to grow, and fast.
> I don't have to tell you that I need the system to be fast and to
> support real-time updates to the documents.
>
> The first solution that came to my mind was to use
> ParallelMultiSearcher, splitting the index into many sub-indexes (how
> many docs per index? 100,000?), but I don't have experience with it and
> I don't know how well it will scale as the number of documents grows!
> A more solid solution seems to be to build some kind of integration
> with hadoop. But I didn't find much about lucene and hadoop
> integration. Any idea? Which direction should I go (pure lucene or
> hadoop)?

There seems to be a common misconception about hadoop regarding search.
Map-reduce as implemented in hadoop is really for batch-oriented jobs
only (or those types of jobs where you don't need a quick response
time). It's definitely not for normal queries (unless you have unusual
requirements).

-Yonik
http://www.lucidimagination.com
Re: IndexWriter.close() performance issue
On Fri, Nov 19, 2010 at 5:41 PM, Mark Kristensson
mark.kristens...@smartsheet.com wrote:
> Here's the changes I made to org.apache.lucene.util.StringHelper:
> //public static StringInterner interner = new SimpleStringInterner(1024,8);

As Mike said, the real fix for trunk is to get rid of interning. But for
your version, you could try making the string intern cache larger:

StringHelper.interner = new SimpleStringInterner(30,8);

-Yonik
http://www.lucidimagination.com
FAST ESP - Solr migration webinar
We're holding a free webinar on migration from FAST to Solr. Details
below.

-Yonik
http://www.lucidimagination.com

=
Solr To The Rescue: Successful Migration From FAST ESP to Open Source
Search Based on Apache Solr

Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT)
Hosted by SearchDataManagement.com

For anyone concerned about the future of their FAST ESP applications
since the purchase of Fast Search and Transfer by Microsoft in 2008,
this webinar will provide valuable insights on making the switch to
Solr. A three-person roundtable will discuss factors driving the need
for FAST ESP alternatives, differences between FAST and Solr, a typical
migration project lifecycle methodology, complementary open source
tools, best practices, customer examples, and recommended next steps.

The speakers for this webinar are:
Helge Legernes, Founding Partner & CTO of Findwise
Michael McIntosh, VP Search Solutions for TNR Global
Eric Gaumer, Chief Architect for ESR Technology.

For more information and to register, please go to:
http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2
=
Re: IndexWriter.close() performance issue
> It turns out that the prepareCommit() is the slow call here, taking
> several seconds to complete. I've done some reading about it, but have
> not found anything that might be helpful here. The fact that it is slow
> every single time, even when I'm adding exactly one document to the
> index, is perplexing and leads me to think something must be corrupt
> with the index.

prepareCommit() syncs the index files, making sure they are on stable
storage. Some filesystems have issues with syncing individual files and
essentially sync all files with unflushed data, leading to poor
performance.

-Yonik
http://www.lucidimagination.com

On Wed, Nov 3, 2010 at 2:53 PM, Mark Kristensson
mark.kristens...@smartsheet.com wrote:
> I've successfully reproduced the issue in our lab with a copy from
> production and have broken the close() call into parts, as suggested,
> with one addition. Previously, the call was simply:
>
>     ...
>     } finally {
>         // Close
>         if (indexWriter != null) {
>             try {
>                 indexWriter.close();
>     ...
>
> Now, that is broken into the various parts, including a prepareCommit():
>
>     ...
>     } finally {
>         // Close
>         if (indexWriter != null) {
>             try {
>                 indexWriter.prepareCommit();
>                 Logger.debug("prepareCommit() complete");
>                 indexWriter.commit();
>                 Logger.debug("commit() complete");
>                 indexWriter.maybeMerge();
>                 Logger.debug("maybeMerge() complete");
>                 indexWriter.waitForMerges();
>                 Logger.debug("waitForMerges() complete");
>                 indexWriter.close();
>     ...
>
> It turns out that the prepareCommit() is the slow call here, taking
> several seconds to complete. I've done some reading about it, but have
> not found anything that might be helpful here. The fact that it is slow
> every single time, even when I'm adding exactly one document to the
> index, is perplexing and leads me to think something must be corrupt
> with the index.
>
> Furthermore, I tried optimizing the index to see if that would have any
> impact (I wasn't expecting much) and it did not. I'm stumped at this
> point and am thinking I may have to rebuild the index, though I would
> definitely prefer to avoid doing that and would like to know why this
> is happening.
>
> Thanks for your help,
> Mark
>
> On Nov 2, 2010, at 9:26 AM, Mark Kristensson wrote:
>> Wonderful information on what happens during indexWriter.close(),
>> thank you very much! I've got some testing to do as a result. We are
>> on Lucene 3.0.0 right now.
>>
>> One other detail that I neglected to mention is that the batch size
>> does not seem to have any relation to the time it takes to close the
>> index where we are having issues. We've had batches add as few as 3
>> documents and batches add as many as 2500 documents in the last hour,
>> and every single close() call for that index takes 6 to 8 seconds.
>> While I won't know until I am able to individually test the different
>> pieces of the close() operation, I'd be very surprised if a batch that
>> adds just 3 new documents results in very much merge work being done.
>> It seems as if there is some task happening during merge that the
>> indexWriter is never able to successfully complete, and so it tries to
>> complete that task every single time close() is called. So, my working
>> theory until I can dig deeper is that something is mildly corrupt with
>> the index (though not serious enough to affect most operations on the
>> index).
>>
>> Are there any good utilities for examining the health of an index?
>> I've dabbled with the experimental CheckIndex object in the past
>> (before we upgraded to 3.0), but have found it to be incredibly slow
>> and of marginal value. Does anyone have any experience using
>> CheckIndex to track down an issue with a production index?
>>
>> Thanks again!
>> Mark
>>
>> On Nov 2, 2010, at 2:20 AM, Shai Erera wrote:
>>> When you close IndexWriter, it performs several operations that might
>>> have a connection to the problem you describe:
>>>
>>> * Commit all the pending updates -- if your update batch size is more
>>>   or less the same (i.e., comparable # of docs and total # bytes
>>>   indexed), then you should not see a performance difference between
>>>   closes.
>>> * Consults the MergePolicy and runs any merges it returns as
>>>   candidates.
>>> * Waits for the merges to finish.
>>>
>>> Roughly, IndexWriter.close() can be substituted w/:
>>>
>>> writer.commit(false); // commits the changes, but does not run merges.
>>> writer.maybeMerge(); // runs merges returned by MergePolicy.
>>> writer.waitForMerges(); // if you use ConcurrentMergeScheduler, the
>>>                         // above call returns immediately, not waiting
>>>                         // for the merges to finish.
>>> writer.close(); // at this point, commit +
Re: lucene norms cached twice
On Fri, Oct 29, 2010 at 3:32 PM, Cabansag, Ronald-Alvin R
ronald-alvin.caban...@cengage.com wrote:
> We use QueryWrapperFilter.getDocIdSet(indexReader) to get the DocIdSet
> and compute the hit count using its iterator.

If you want to avoid double-caching of norms, then you should call
getDocIdSet() for each segment reader, not the top-level reader.

Aside: presumably you're actually doing something more advanced than
getting the hit count (and you just simplified your description because
it wasn't pertinent)... since you can get the hit count from TopDocs.

-Yonik
http://www.lucidimagination.com
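A minimal sketch of the per-segment pattern being suggested, assuming the Lucene 2.9/3.x API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Filter;

    long countHits(Filter filter, IndexReader topReader) throws IOException {
      IndexReader[] segments = topReader.getSequentialSubReaders();
      if (segments == null) {                     // already an atomic reader
        segments = new IndexReader[] { topReader };
      }
      long count = 0;
      for (IndexReader segment : segments) {
        DocIdSet set = filter.getDocIdSet(segment); // per-segment, as recommended
        if (set == null) continue;                  // null means no matches
        DocIdSetIterator it = set.iterator();
        while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          count++;
        }
      }
      return count;
    }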
Re: Function Query, Required Clauses, and Matching
On Mon, Oct 25, 2010 at 7:00 PM, Dennis Kubes ku...@apache.org wrote:
> A curiosity. Some of the documentation for function queries says they
> match every document in the index. When running a query that has
> boolean required clauses and an optional ValueSourceQuery or function
> query, is the function query still matched against every document in
> the index, or only against those documents that match the required
> clauses?

It's only those that match the required clauses.

-Yonik
http://www.lucidimagination.com
Re: Checksum and transactional safety for lucene indexes
On Tue, Sep 21, 2010 at 12:53 AM, Lance Norskog goks...@gmail.com wrote:
> If an index file is not completely written to disk, it never becomes
> available. Lucene has a file describing the current active index
> segments. It writes all new files to the disk, and changes the
> description file (segments.gen) only after that.

Right - but it's segments_N. segments.gen is actually optional (IIRC,
solr doesn't even replicate it to slaves).

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: Filters do not work with MultiSearcher?
This is working as designed. Note this method:

public DocIdSet getDocIdSet(IndexReader indexReader) throws IOException {
    return openBitSet;
}

You must pay attention to the IndexReader passed - the DocIdSet returned
must always be based on that reader (and the first document of every
reader is always 0). So returning the same DocIdSet each time is not
valid and will result in errors.

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8

On Fri, Sep 10, 2010 at 12:23 PM, Nader, John P john.na...@cengage.com wrote:
> We are attempting to perform a filtered search on two indices joined by
> a MultiSearcher. Unfortunately, it appears there is an issue in the
> lucene code that is causing the filter to be simply reused at the
> starting ordinal for each individual index instead of being augmented
> by the starting document identifier. We are hoping there is an
> alternate API that will allow us to perform a filtered search on
> multiple indices.
>
> For example, we have two indices with three documents each, and a
> filter containing only doc ID 1. When we perform a filtered search on a
> MultiSearcher that joins these two indices, we get two documents back
> (1, 4), where we were expecting only the one. This is because the
> MultiSearcher, instead of starting at doc ID 3 for the second index, is
> interpreting the filter individually for each index. We are using
> Lucene 3.0.2. The API we see this behavior with is
> MultiSearcher.search(Query, Filter, nDocs) with a MatchAllDocsQuery and
> the filter code pasted below:
>
> public class OpenBitSetFilter extends Filter {
>     private OpenBitSet openBitSet;
>
>     public OpenBitSetFilter(OpenBitSet openBitSet) {
>         this.openBitSet = openBitSet;
>     }
>
>     public DocIdSet getDocIdSet(IndexReader indexReader) throws IOException {
>         return openBitSet;
>     }
> }
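A hypothetical corrected version of that filter, holding one bitset per reader with doc ids local to that reader (a sketch against the Lucene 3.0 API; null from getDocIdSet is treated as "no documents match"):

    import java.io.IOException;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DocIdSet;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.util.OpenBitSet;

    public class PerReaderBitSetFilter extends Filter {
        private final Map<IndexReader, OpenBitSet> bitsByReader;

        public PerReaderBitSetFilter(Map<IndexReader, OpenBitSet> bitsByReader) {
            this.bitsByReader = bitsByReader;
        }

        @Override
        public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
            // each reader gets its own set, built for that reader's doc ids
            return bitsByReader.get(reader);
        }
    }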
Re: API to retrieve search results without scoring or sorting
On Mon, Jul 19, 2010 at 6:14 AM, Naveen Kumar id.n...@gmail.com wrote:
> Is there any API using which I can retrieve search results such that
> they are neither scored nor sorted (for performance reasons)? I just
> need the results, without any extra computation.

Use your own custom Collector class.

-Yonik
http://www.lucidimagination.com
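For illustration, a minimal Collector that just gathers doc ids, assuming the Lucene 3.x Collector API:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class DocIdCollector extends Collector {
        private final List<Integer> docs = new ArrayList<Integer>();
        private int docBase;

        @Override
        public void setScorer(Scorer scorer) {
            // ignored: we never ask for a score
        }

        @Override
        public void collect(int doc) throws IOException {
            docs.add(docBase + doc); // store the global docid
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) {
            this.docBase = docBase;
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // no ordering requirement either
        }

        public List<Integer> getDocs() {
            return docs;
        }
    }

Invoke it with searcher.search(query, new DocIdCollector()) and read the ids back from getDocs().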
Re: Get lengthNorm of a field
On Mon, Jul 19, 2010 at 9:53 AM, Philippe mailer.tho...@gmail.com wrote:
> Is there a possibility to retrieve the lengthNorm for all (or a
> specific) fields in a specific document?

See IndexReader:

public abstract byte[] norms(String field) throws IOException;

And Similarity:

public float decodeNormValue(byte b)

The byte[] is indexed by document id, and you can decode that into a
float value with a Similarity.

-Yonik
http://www.lucidimagination.com

> Regards, Philippe
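Putting those two calls together, a minimal sketch assuming the Lucene 3.0-era API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Similarity;

    float lengthNorm(IndexReader reader, String field, int docId) throws IOException {
      byte[] norms = reader.norms(field);       // one byte per document
      Similarity sim = new DefaultSimilarity();
      return sim.decodeNormValue(norms[docId]); // lossy: norms are quantized to a byte
    }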
Re: Could multiple indexers change same collections at the same time?
Yes, all of that still applies to Lucene 3x and 4x, and is unlikely to
change any time soon.

-Yonik
http://www.lucidimagination.com

On Thu, Jun 24, 2010 at 1:51 PM, Zhang, Lisheng
lisheng.zh...@broadvision.com wrote:
> Hi, I remember I tested earlier lucene 1.4 and 2.4, and found the
> following:
> # it is OK for multiple searchers to search the same collection.
> # it is OK for one IndexWriter to edit and multiple searchers to search
>   at the same time.
> # it is generally NOT OK for multiple IndexWriters to change the same
>   collection at the same time.
> Could you confirm briefly if the above are true, and give me a Yes/No
> answer whether in the latest lucene 3x the above conclusions still
> hold? Thanks very much for your help, Lisheng
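The "NOT OK" case is actually enforced by Lucene itself via the write.lock file; a minimal sketch assuming the Lucene 3.0 constructors (the index path is hypothetical):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.LockObtainFailedException;
    import org.apache.lucene.util.Version;

    Directory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical

    IndexWriter first = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);     // acquires write.lock
    try {
      IndexWriter second = new IndexWriter(dir,
          new StandardAnalyzer(Version.LUCENE_30),
          IndexWriter.MaxFieldLength.UNLIMITED);   // fails while 'first' is open
    } catch (LockObtainFailedException expected) {
      // a second concurrent writer on the same index is rejected
    }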
Re: segment_N file is missed
On Tue, Jun 15, 2010 at 5:23 AM, Michael McCandless
luc...@mikemccandless.com wrote:
> CheckIndex is not able to recover from this corruption (missing
> segments_N file); this would be a nice addition... But it sounds like
> you've worked out a way to write your own segments_N? Use
> oal.store.ChecksumIndexOutput (wraps any other IndexOutput) to properly
> write the checksum. BTW how did you lose your segments_N file...?

Can this also be caused by the new behavior introduced here?
https://issues.apache.org/jira/browse/LUCENE-2386
If you open a writer, add docs, and then crash before calling commit?

-Yonik
http://www.lucidimagination.com
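A minimal sketch of the wrapping Mike describes, assuming the Lucene 3.x API (the file name and payload are placeholders; SegmentInfos itself writes segments_N this way):

    import org.apache.lucene.store.ChecksumIndexOutput;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexOutput;

    IndexOutput raw = dir.createOutput("segments_2"); // hypothetical generation
    ChecksumIndexOutput out = new ChecksumIndexOutput(raw);
    try {
      // ... write the segments_N payload here ...
      out.finishCommit(); // appends the running checksum to the file
    } finally {
      out.close();
    }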
Re: Docs with any score are collected in the Collector implementations
On Wed, Jun 2, 2010 at 1:10 PM, jan.kure...@nokia.com wrote:
> that's probably because I move from lucene to solr. We will need to
> filter them from the result manually then first.

Solr has a function range query that can filter out any values outside
of the given range:
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/

And of course, a function query can consist of a normal relevancy
query... so here is a lucene query of text:solr with a lower bound of 0
(exclusive):

http://localhost:8983/solr/select?q={!frange l=0 incl=false}query($qq)&qq=text:solr

-Yonik
http://www.lucidimagination.com
Re: Using JSON for index input and search output
On Sun, May 30, 2010 at 1:33 PM, Visual Logic visual.lo...@gmail.com wrote:
> JSON is the format used for all the configuration and property files in
> the RIA application we are developing. Is Lucene able to create a
> document from a given JSON file and index it? Is Lucene able to provide
> a JSON output response from a query made to an index? Does the Tika
> package provide this?

No, and no. XML, JSON, etc. are out of scope for lucene, which is a core
search library. Tika extracts text from documents like Word and PDF.

> Local indexing and searching is needed on the local client, so Solr is
> not a solution even though it does provide a search response in JSON
> format.

Solr is embeddable as well, so you can directly index/search. But why
can't you run a separate server?

-Yonik
http://www.lucidimagination.com
Re: Using JSON for index input and search output
On Sun, May 30, 2010 at 2:27 PM, Visual Logic visual.lo...@gmail.com wrote:
> Solr is embeddable, but does that not just mean that SolrJ only
> provides the ability to call Solr running on some server?

Nope - embeddable as in running in the same JVM as your application.

> For some of my use cases using Solr on a remote server would work fine.
> For other cases it will not be quick enough.

Running as a separate server can be on the same host and be very quick.
Was it too slow when you tried it? It's a common misconception that HTTP
is slow... it's really just a TCP socket (which can be reused with
persistent connections) with some standardized headers. Solr also has a
binary protocol that works just fine over HTTP, so it's really not more
overhead than doing something like talking to a database.

But the right solution probably depends on the details of your specific
usecases - if you elaborate on them, people may be able to provide more
specific recommendations.

-Yonik
http://www.lucidimagination.com
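For illustration, a minimal same-JVM sketch, assuming the Solr 1.4-era SolrJ bootstrap (the solr home path and query are hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
      public static void main(String[] args) throws Exception {
        System.setProperty("solr.solr.home", "/path/to/solr/home"); // hypothetical
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "");

        // index/search directly, no HTTP involved
        QueryResponse rsp = server.query(new SolrQuery("title:lucene"));
        System.out.println(rsp.getResults().getNumFound());

        container.shutdown();
      }
    }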
Re: How to get the number of unique terms in the inverted index
It seems like there should be a formula for estimating the total number of unique terms given that you know the unique term counts for each segment, and make certain assumptions like random document distribution across segments. -Yonik http://www.lucidimagination.com On Thu, May 27, 2010 at 9:17 PM, kannan chandrasekaran ckanna...@yahoo.com wrote: I am just trying out a few experiments to calculate similarity between terms based on their co-occurrences in the dataset... Basically I am trying to build contextual vectors and calculate similarity using a similarity measure (say cosine similarity). I don't think this is an XY problem. The vectors I am trying to build are not the same as the TermVectors option ((term, freq) pairs per document) in lucene (if that's what you meant). Thanks Kannan - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to get the number of unique terms in the inverted index
On Thu, May 27, 2010 at 2:32 PM, kannan chandrasekaran ckanna...@yahoo.com wrote: I was wondering if there is a way to retrieve the number of unique terms in lucene (version 2.4.0)... I am aware of the terms() / terms(Term) methods that return an enumeration (TermEnum), but that involves iterating through the terms and counting them. I'm looking for something similar to numDocs() in the IndexReader class. No there is not. In 4.0-dev, with the new flex APIs, you can retrieve the number of unique terms in a single segment (Terms.getUniqueTermCount()), but not a whole index. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
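For what it's worth, a rough sketch against those 4.0-dev flex APIs (which were still in flux at the time; the field name is made up). Summing per-segment counts only gives an upper bound on the index-wide count, since the same term can appear in more than one segment:

long upperBound = 0;
for (IndexReader sub : reader.getSequentialSubReaders()) {
    Terms terms = sub.fields().terms("body"); // made-up field name
    if (terms != null) {
        // exact for this segment; summing over segments over-counts shared terms
        upperBound += terms.getUniqueTermCount();
    }
}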
Re: NRT and Caching based on IndexReader
On Mon, May 17, 2010 at 5:00 PM, Shay Banon kim...@gmail.com wrote: I wanted to verify if my understanding is correct. Assuming that I use NRT, and refresh, say, every 1 second, caching based on IndexReader, such as what is used in the CachingWrapperFilter, is basically useless No, it's fine. Searching in Lucene is now done per-segment, and so the readers that are passed to Filter.getDocIdSet are the segment readers, not the top-level readers. Caching is now per-segment. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
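Concretely, a minimal sketch (2.9-era APIs; the wrapped query is made up). Because getDocIdSet is handed the per-segment readers, the cache is keyed per segment:

Filter filter = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("state", "CA"))));
// the first search fills the per-segment caches; later searches reuse them
TopDocs results = searcher.search(query, filter, 10);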
Re: NRT and Caching based on IndexReader
Yep, confirmed what you are seeing. I'll check into it and open an issue. -Yonik http://www.lucidimagination.com On Mon, May 17, 2010 at 5:54 PM, Shay Banon kim...@gmail.com wrote: Yea, I noticed that ;). Even so, I think that with NRT, even the lower level readers are cloned, meaning that you always get a new instance... . Here is a sample program that tests this behavior, am I doing something wrong? By the way, if what I say is correct, it affects field cache as well:

public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    IndexWriter indexWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.UNLIMITED);
    IndexReader reader = indexWriter.getReader();
    Set<IndexReader> readers = new HashSet<IndexReader>(); // tracks all readers
    for (int i = 0; i < 100; i++) {
        readers.add(reader);
        Document doc = new Document();
        doc.add(new Field("id", Integer.toString(i), Field.Store.YES, Field.Index.NO));
        indexWriter.addDocument(doc);
        IndexReader newReader = reader.reopen(true);
        if (reader == newReader) {
            System.err.println("Should not get the same reader...");
        } else {
            reader.close();
            reader = newReader;
        }
    }
    reader.close();
    // now, go and check that all are ref == 0
    // and, that all readers, even sub readers, are unique instances (sadly...)
    Set<IndexReader> allReaders = new HashSet<IndexReader>();
    for (IndexReader reader1 : readers) {
        if (reader1.getRefCount() != 0) {
            System.err.println("A reader is not closed");
        }
        if (allReaders.contains(reader1)) {
            System.err.println("Found an existing reader...");
        }
        allReaders.add(reader1);
        if (reader1.getSequentialSubReaders() != null) {
            for (IndexReader reader2 : reader1.getSequentialSubReaders()) {
                if (reader2.getRefCount() != 0) {
                    System.err.println("A reader is not closed...");
                }
                if (allReaders.contains(reader2)) {
                    System.err.println("Found an existing reader...");
                }
                allReaders.add(reader2);
                // there should not be additional readers...
                if (reader2.getSequentialSubReaders() != null) {
                    System.err.println("Should not be more readers...");
                }
            }
        }
    }
    indexWriter.close();
}

On Tue, May 18, 2010 at 12:30 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, May 17, 2010 at 5:00 PM, Shay Banon kim...@gmail.com wrote: I wanted to verify if my understanding is correct. Assuming that I use NRT, and refresh, say, every 1 second, caching based on IndexReader, such as what is used in the CachingWrapperFilter, is basically useless No, it's fine. Searching in Lucene is now done per-segment, and so the readers that are passed to Filter.getDocIdSet are the segment readers, not the top-level readers. Caching is now per-segment. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NRT and Caching based on IndexReader
On Mon, May 17, 2010 at 9:00 PM, Shay Banon kim...@gmail.com wrote: Great, so I am not imagining things this late into the night ... ;), not so great, since using NRT with field cache (like sorting) or caching filters, or anything that caches based on IndexReader is not really an option. This makes NRT very problematic to use in a real application. NRT is still pretty new :-) And I do believe this is a bug, so we'll get it fixed. It's not actually a problem for FieldCache though - it no longer keys on the reader directly (if deleted docs are the only things that have changed, the FieldCache entry can still be shared). -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NRT and Caching based on IndexReader
On Mon, May 17, 2010 at 9:12 PM, Shay Banon kim...@gmail.com wrote: Just saw that you opened a case for that. I think that its important in your test case to also test for object identity, not just equals. This is because the IndexReader (or the FieldCacheKey) are used as keys in weak hash maps, which uses identity (==) equality for keys. Yeah, just me being lazy... I just knew that those objects don't implement equals and hence it ends up the same as ==. But I agree an explicit == would be better. If FieldCacheKey is supposed to represent the key by which index readers should be tested for equality (for example, it will be used in the CachingWrapperFilter), and not the index reader itself, then I think it should be renamed. What do you think? I am just looking now at what it does, its new... I don't think it's general purpose, since it ignores things like a change in deleted documents. I think we should use the same reader when the segment has not been changed. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NRT and Caching based on IndexReader
On Mon, May 17, 2010 at 9:14 PM, Shay Banon kim...@gmail.com wrote: Oh, and one more thing. Deleted docs is a sub case, with NRT, most people will almost always add docs as well... . So it is still not really usable for field cache, right? FieldCache should be fine for the general cases - the same entry will be used if the segment hasn't changed at all, or if the segment has only changed which documents are deleted. Adding new documents adds new segments and does not affect (until a merge) existing segments, so the entries will be reused. -Yonik http://www.lucidimagination.com On Tue, May 18, 2010 at 4:12 AM, Shay Banon kim...@gmail.com wrote: Just saw that you opened a case for that. I think that its important in your test case to also test for object identity, not just equals. This is because the IndexReader (or the FieldCacheKey) are used as keys in weak hash maps, which uses identity (==) equality for keys. If FieldCacheKey is supposed to represent the key by which index readers should be tested for equality (for example, it will be used in the CachingWrapperFilter), and not the index reader itself, then I think it should be renamed. What do you think? I am just looking now at what it does, its new... -shay.banon On Tue, May 18, 2010 at 4:04 AM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, May 17, 2010 at 9:00 PM, Shay Banon kim...@gmail.com wrote: Great, so I am not imagining things this late into the night ... ;), not so great, since using NRT with field cache (like sorting) or caching filters, or anything that caches based on IndexReader is not really an option. This makes NRT very problematic to use in a real application. NRT is still pretty new :-) And I do believe this is a bug, so we'll get it fixed. It's not actually a problem for FieldCache though - it no longer keys on the reader directly (if deleted docs are the only things that have changed, the FieldCache entry can still be shared). -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: FieldCache and 2.9
You are requesting the FieldCache entry from the top-level reader and hence a whole new FieldCache entry must be created. Lucene 2.9 sorting requests FieldCache entries at the segment level and hence reuses entries for those segments that haven't changed. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague On Tue, May 11, 2010 at 9:27 AM, Carl Austin carl.aus...@detica.com wrote: Hi, I have been using the FieldCache in lucene version 2.9 compared to that in 2.4. The load time is massively decreased, however I am not seeing any benefit in getting a field cache after re-open of an index reader when I have only added a few extra documents. A small test class is included below (based off one from Lucid Imagination), that creates the docs, gets a field cache, creates another few docs and gets the field cache again. I thought the second get would be very very fast, as only 1 segment should have changed, however it takes more time for the reopen and cache get than it did the original. Am I doing something wrong here or have I misunderstood the new segment changes? Thanks Carl

import java.io.File;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ContrivedFCTest {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File(args[0]));
        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
        for (int i = 0; i < 500; i++) {
            if (i % 10 == 0) {
                System.out.println(i);
            }
            Document doc = new Document();
            doc.add(new Field("field", "String" + i, Field.Store.NO, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        IndexReader reader = IndexReader.open(dir, true);
        long start = System.currentTimeMillis();
        FieldCache.DEFAULT.getStrings(reader, "field");
        long end = System.currentTimeMillis();
        System.out.println("load time for initial field cache: " + (end - start) / 1000.0f + "s");

        writer = new IndexWriter(dir, new SimpleAnalyzer(), false, IndexWriter.MaxFieldLength.LIMITED);
        for (int i = 501; i < 505; i++) {
            if (i % 10 == 0) {
                System.out.println(i);
            }
            Document doc = new Document();
            doc.add(new Field("field", "String" + i, Field.Store.NO, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        IndexReader reader2 = reader.reopen(true);
        System.out.println("reader size = " + reader2.numDocs());
        long start2 = System.currentTimeMillis();
        FieldCache.DEFAULT.getStrings(reader2, "field");
        long end2 = System.currentTimeMillis();
        System.out.println("load time for re-opened field cache: " + (end2 - start2) / 1000.0f + "s");
    }
}

This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies within the Detica Limited group of companies. Detica Limited is registered in England under No: 1337451. Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
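To see the reuse in a test like the one above, request the entries the way 2.9 sorting does - per segment (sketch, 2.9 APIs):

IndexReader reader2 = reader.reopen(true);
for (IndexReader segment : reader2.getSequentialSubReaders()) {
    // entries for segments that didn't change are served from the cache
    FieldCache.DEFAULT.getStrings(segment, "field");
}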
Re: MatchAllDocsQuery and MatchNoDocsQuery
Yes on all counts. Lucene doesn't modify query objects, so they are safe for reuse among multiple threads. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague 2010/5/10 Mindaugas Žakšauskas min...@gmail.com: Hi, Can anybody confirm whether MatchAllDocsQuery can be used as an immutable singleton? By this I mean creating a single instance and sharing it whenever I need to either use it on its own or in conjunction with other queries put into a BooleanQuery, to return all documents in a search result. Can the same instance even be reused among different threads? What would be the best way of implementing MatchNoDocsQuery? My initial tests show that a new BooleanQuery() without any additional clauses would just do the job, but I just wanted to double check whether this is a reliable assumption. The above questions also apply - could this be reused among different contexts, threads? Thanks in advance. Regards, Mindaugas - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
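A minimal sketch of both ideas (the holder class name is made up):

public final class SharedQueries {
    // safe to share: Lucene doesn't modify query objects during search
    public static final Query MATCH_ALL = new MatchAllDocsQuery();
    // a BooleanQuery with no clauses matches no documents
    public static final Query MATCH_NONE = new BooleanQuery();
}

The usual caveat applies: the sharing is only safe as long as no application code mutates the instances (e.g. calls setBoost or adds clauses).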
Re: problem in Lucene's ranking function
2010/5/5 José Ramón Pérez Agüera jose.agu...@gmail.com: [...] The consequence is that a document matching a single query term over several fields could score much higher than a document matching several query terms in one field only. One partial workaround that people use is DisjunctionMaxQuery (used by the dismax query parser in Solr). http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
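A short sketch of the workaround (field names made up): the best-matching field dominates the score, with only a small tiebreak contribution from the others:

DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.1f); // 0.1 = tiebreaker multiplier
dmq.add(new TermQuery(new Term("title", "lucene")));
dmq.add(new TermQuery(new Term("body", "lucene")));
// score is roughly max(title score, body score) + 0.1 * (the other field scores)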
Fwd: Apache Lucene EuroCon Call For Participation: Prague, Czech Republic May 20 & 21, 2010
Forwarding to lucene only - the big cross-post caused my gmail filters to file it. -Yonik -- Forwarded message -- From: Grant Ingersoll gsing...@apache.org Date: Wed, Mar 24, 2010 at 8:03 PM Subject: Apache Lucene EuroCon Call For Participation: Prague, Czech Republic May 20 & 21, 2010 To: Lucene mailing list gene...@lucene.apache.org, solr-u...@lucene.apache.org, java-user@lucene.apache.org, mahout-u...@lucene.apache.org, nutch-u...@lucene.apache.org, openrelevance-u...@lucene.apache.org, tika-u...@lucene.apache.org, pylucene-u...@lucene.apache.org, connectors-...@incubator.apache.org, lucene-net-...@lucene.apache.org Apache Lucene EuroCon Call For Participation - Prague, Czech Republic May 20 & 21, 2010 All submissions must be received by Tuesday, April 13, 2010, 12 Midnight CET/6 PM US EDT The first European conference dedicated to Lucene and Solr is coming to Prague from May 18-21, 2010. Apache Lucene EuroCon is running on a not-for-profit basis, with net proceeds donated back to the Apache Software Foundation. The conference is sponsored by Lucid Imagination with additional support from community and other commercial co-sponsors. Key Dates: 24 March 2010: Call For Participation Open 13 April 2010: Call For Participation Closes 16 April 2010: Speaker Acceptance/Rejection Notification 18-19 May 2010: Lucene and Solr Pre-conference Training Sessions 20-21 May 2010: Apache Lucene EuroCon This conference creates a new opportunity for the Apache Lucene/Solr community and marketplace, providing the chance to gather, learn and collaborate on the latest in Apache Lucene and Solr search technologies and what's happening in the community and ecosystem. There will be two days of Lucene and Solr training offered May 18 & 19, followed by two days packed with leading edge Lucene and Solr Open Source Search content and talks by search and open source thought leaders. We are soliciting 45-minute presentations for the conference, 20-21 May 2010 in Prague. The conference and all presentations will be in English. Topics of interest include: - Lucene and Solr in the Enterprise (case studies, implementation, return on investment, etc.) - “How We Did It” Development Case Studies - Spatial/Geo search - Lucene and Solr in the Cloud - Scalability and Performance Tuning - Large Scale Search - Real Time Search - Data Integration/Data Management - Tika, Nutch and Mahout - Lucene Connectors Framework - Faceting and Categorization - Relevance in Practice - Lucene & Solr for Mobile Applications - Multi-language Support - Indexing and Analysis Techniques - Advanced Topics in Lucene & Solr Development All accepted speakers will qualify for discounted conference admission. Financial assistance is available for speakers that qualify. To submit a 45-minute presentation proposal, please send an email to c...@lucene-eurocon.org containing the following information in plain text: 1. Your full name, title, and organization 2. Contact information, including your address, email, phone number 3. The name of your proposed session (keep your title simple and relevant to the topic) 4. A 75-200 word overview of your presentation (in English); in addition to the topic, describe whether your presentation is intended as a tutorial, description of an implementation, a theoretical/academic discussion, etc. 5.
A 100-200-word speaker bio that includes prior conference speaking or related experience (in English) To be considered, proposals must be received by 12 Midnight CET Tuesday, 13 April 2010 (Tuesday 13 April 6 PM US Eastern time, 3 PM US Pacific Time). Please email any questions regarding the conference to i...@lucene-eurocon.org. To be added to the conference mailing list, please email sig...@lucene-eurocon.org. If your organization is interested in sponsorship opportunities, email spon...@lucene-eurocon.org Key Dates 24 March 2010: Call For Participation Open 13 April 2010: Call For Participation Closes 16 April 2010: Speaker Acceptance/Rejection Notification 18-19 May 2010 Lucene and Solr Pre-conference Training Sessions 20-21 May 2010: Apache Lucene EuroCon We look forward to seeing you in Prague! Grant Ingersoll Apache Lucene EuroCon Program Chair www.lucene-eurocon.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Combining TopFieldCollector with custom Collector
On Thu, Mar 11, 2010 at 4:10 PM, Peter Keegan peterlkee...@gmail.com wrote: I want the TFC to do all the cool things it does like custom sorting, saving the field values, max score, etc. I suppose the custom Collector could explicitly delegate all TFC's methods, but this doesn't seem right. No need to delegate the TFC specific methods... just wrap the TFC in your own collector, do the search, and then directly access the TFC to get what you need. This is what Solr does. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
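A minimal sketch of the wrapping approach (2.9-era Collector API; the hit counting is just a stand-in for whatever custom work the collector does):

class WrappingCollector extends Collector {
    private final TopFieldCollector tfc;
    int hitCount; // custom state

    WrappingCollector(TopFieldCollector tfc) { this.tfc = tfc; }

    public void setScorer(Scorer scorer) throws IOException { tfc.setScorer(scorer); }
    public void setNextReader(IndexReader reader, int docBase) throws IOException { tfc.setNextReader(reader, docBase); }
    public boolean acceptsDocsOutOfOrder() { return tfc.acceptsDocsOutOfOrder(); }

    public void collect(int doc) throws IOException {
        hitCount++;       // custom per-hit work goes here
        tfc.collect(doc); // let the TFC do sorting, field values, max score, ...
    }
}

// after the search, read results straight off the TFC:
// TopFieldCollector tfc = TopFieldCollector.create(sort, 10, true, true, true, false);
// searcher.search(query, new WrappingCollector(tfc));
// TopDocs docs = tfc.topDocs();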
Re: NumericField exact match
On Fri, Feb 26, 2010 at 3:33 PM, Ivan Vasilev ivasi...@sirma.bg wrote: Does the precision step matter when I use NumericRangeQuery for exact matches? No. There is a full-precision version of the value indexed regardless of the precision step, and that's used for an exact match query. I mean if I use the default precision step when indexing the field, is it guaranteed that: 1. With this query I will always hit the docs that contain val for the field; 2. I will never hit docs that have a different val for the field; Correct. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
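i.e. something like (field name made up; this assumes the field was indexed with the same default precisionStep):

// both bounds equal and inclusive: matches exactly 24.75
Query exact = NumericRangeQuery.newFloatRange("price", 24.75f, 24.75f, true, true);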
Re: Sort and Collector
On Wed, Feb 3, 2010 at 1:40 PM, tsuraan tsur...@gmail.com wrote: Is there any way to run a search where I provide a Query, a Sort, and a Collector? I have a case where it is sometimes, but rarely, necessary to get all the results from a query, but usually I'm satisfied with a smaller amount. That part I can do with just a query and a collector, but I'd like the results to be sorted as they are submitted to the collector's collect method. Is that possible? It's not really possible. Lucene must iterate over all of the hits before it knows for sure that it has the top hits sorted by any criteria (other than docid). A Collector is called for every hit as it happens, and thus one can't specify a sort order (sorting itself is actually implemented with a sorting Collector). -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
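So the usual pattern is to let the sorting collector do the work, with a cap large enough for the rare get-everything case (sketch, 2.9 APIs; the sort field and cap are made up):

Sort sort = new Sort(new SortField("date", SortField.LONG, true)); // newest first
int maxNeeded = 1000; // made-up upper bound on results ever consumed
TopFieldCollector tfc = TopFieldCollector.create(sort, maxNeeded, true, false, false, true);
searcher.search(query, tfc);
ScoreDoc[] sorted = tfc.topDocs().scoreDocs; // fully sorted; consume as much as needed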
Re: NumericRangeQuery performance with 1/2 billion documents in the index
Perhaps this is just a huge index, and not enough of it can be cached in RAM. Adding additional clauses to a boolean query incrementally destroys locality. 104GB of index and 4GB of RAM means you're going to be hitting the disk constantly. You need more hardware - if your requirements are low (low query volume, high query latency of a few seconds OK) then you can probably get away with a single box... just either get an SSD or get more RAM (like 32G or more). If you want higher query volumes or consistent sub-second search, you're going to have to go distributed. Roll your own or look at Solr. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NumericRangeQuery performance with 1/2 billion documents in the index
On Sun, Jan 3, 2010 at 10:42 AM, Karl Wettin karl.wet...@gmail.com wrote: 3 jan 2010 kl. 16.32 skrev Yonik Seeley: Perhaps this is just a huge index, and not enough of it can be cached in RAM. Adding additional clauses to a boolean query incrementally destroys locality. 104GB of index and 4GB of RAM means you're going to be hitting the disk constantly. You need more hardware - if your requirements are low (low query volume, high query latency of a few seconds OK) then you can probably get away with a single box... just either get an SSD or get more RAM (like 32G or more). If you want higher query volumes or consistent sub-second search, you're going to have to go distributed. Roll your own or look at Solr. I'm not sure I agree. A 104GB index says nothing about the date field. And it says nothing about the range of the query. Given that there are 500M docs, one can make an educated guess that much of this 104GB is index and not just stored fields. IMO, it's simply too many docs and too big of a ratio between RAM and index size for good query performance. But I don't think we've heard what the requirements for this index are. A quick ls -l of the index directory would be revealing though. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Finding the highest term in a field
On Thu, Nov 19, 2009 at 1:04 AM, Daniel Noll dan...@nuix.com wrote: I take it the existing numeric fields can't already do stuff like this? Nope, it's a fundamental limitation of the current TermEnums. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Finding the highest term in a field
On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll dan...@nuix.com wrote: But what if I want to find the highest? TermEnum can't step backwards. I've also wanted to do the same. It's coming with the new flexible indexing patch: https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764020#action_12764020 -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
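In the meantime, a common workaround on the old API is to walk the TermEnum to the last term of the field (a sketch; the field name is made up, and note that "highest" here means highest in lexicographic term order, which matches numeric order only if the values were encoded accordingly):

String field = "price"; // made-up field name
String highest = null;
TermEnum te = reader.terms(new Term(field, ""));
try {
    do {
        Term t = te.term();
        if (t == null || !t.field().equals(field)) {
            break; // ran off the end of this field's terms
        }
        highest = t.text();
    } while (te.next());
} finally {
    te.close();
}
// 'highest' is now the last term for the field, or null if the field has none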
Re: Sort fields shouldn't be tokenized
On Mon, Nov 16, 2009 at 11:38 AM, Jeff Plater jpla...@healthmarketscience.com wrote: Thanks - so if my sort field is a single term then I should be ok with using an analyzer (to lowercase it for example). Correct - the key is that there is not more than one token per document for the field being sorted on. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: share some numbers for range queries
On Mon, Nov 16, 2009 at 1:02 AM, John Wang john.w...@gmail.com wrote: I did some performance analysis for different ways of doing numeric ranging with lucene. Thought I'd share: FYI, the second approach is already implemented in both Lucene and Solr. http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/FieldCacheRangeFilter.html -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
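For reference, usage looks roughly like this (2.9 APIs; the field name and bounds are made up):

// range filtering backed by the FieldCache instead of walking the term index
Filter ageFilter = FieldCacheRangeFilter.newIntRange("age", 18, 35, true, true);
TopDocs results = searcher.search(query, ageFilter, 10);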
Re: Equality Numeric Query
On Wed, Nov 11, 2009 at 8:54 AM, Shai Erera ser...@gmail.com wrote: I index documents with numeric fields using the new Numeric package. I execute two types of queries: range queries (for example, [1 TO 20}) and equality queries (for example 24.75). Don't mind the syntax. Currently, to execute the equality query, I create a NumericRangeQuery with the lower/upper value being 24.75 and both limits are set to inclusive. Two questions: 1) Is there a better approach? For example, if I had indexed the values as separate terms, I could create a TermQuery. Create a term query on NumericUtils.floatToPrefixCoded(24.75f) 2) Can I run into precision issues such that 24.751 will be matched as well? Nope... every numeric indexed value has its precision indexed along with it as a prefix, so there will be no false matches. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
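i.e. roughly (the field name is made up):

// matches the full-precision (shift=0) term for exactly 24.75
Query exact = new TermQuery(new Term("price", NumericUtils.floatToPrefixCoded(24.75f)));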
Re: Lucene index write performance optimization
On Tue, Nov 10, 2009 at 11:43 AM, Jamie Band ja...@stimulussoft.com wrote: As an aside note, is there any way for Lucene to support simultaneous writes to an index? The indexing process is highly parallelized... just use multiple threads to add documents to the same IndexWriter. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
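A minimal sketch of that (assume it runs inside a method declared to throw Exception; the thread count and the nextDocument() source are made up):

final IndexWriter writer = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
ExecutorService pool = Executors.newFixedThreadPool(4);
for (int t = 0; t < 4; t++) {
    pool.submit(new Runnable() {
        public void run() {
            try {
                Document doc;
                while ((doc = nextDocument()) != null) { // nextDocument() = made-up doc source
                    writer.addDocument(doc); // safe to call from many threads
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    });
}
pool.shutdown();
pool.awaitTermination(1, TimeUnit.HOURS);
writer.close();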
Re: Proposal for changing Lucene's backwards-compatibility policy
On Tue, Oct 27, 2009 at 9:07 PM, Luis Alves lafa...@gmail.com wrote: But there needs to be some forced push for these shorter major release cycles, to allow for code clean cycles to also be sorter. Maybe... or maybe not. There's also value in a more stable API over a longer period of time. Different people will pick a different balance, and it's not as simple as declaring that we need to be able to remove older APIs faster. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: help needed improving lucene concurret search performance
How many processors do you have on this system? If you are CPU bound, 100 threads is going to be 10 times slower (at a minimum) than 10 threads (unless you have more than 10 CPUs). -Yonik http://www.lucidimagination.com On Fri, Oct 23, 2009 at 2:18 AM, Wilson Wu songzi0...@gmail.com wrote: Dear Friend, I have encountered some performance problems recently in lucene search 2.9. I use a single IndexSearcher in the whole system. It seems perfect when there are fewer than 10 threads doing search concurrently. But if there are more than 100 threads doing concurrent search, the average response time becomes bigger (>1s), and the max response time reaches 299s. I really don't know how to improve this, can you help me? Thanks a lot! Wilson 2009.10.23 The profiling result for about 400 concurrent searches is at: http://i3.6.cn/cvbnm/aa/f5/00/63521d982a469f5063b82268eee91d08.gif it seems a lot of time is consumed by TermScorer.score. Following is my servlet class which responds to search requests:

public final class DispatchServlet extends javax.servlet.http.HttpServlet implements javax.servlet.Servlet {
    private static final long serialVersionUID = -5547647006004900451L;
    protected final Log log = LogFactory.getLog(getClass());
    protected Searcher searcher;
    protected Directory dir;
    protected RAMDirectory ram;

    public DispatchServlet() {
        super();
    }

    public void init() throws ServletException {
        super.init();
        try {
            dir = FSDirectory.open(new File("/usr/bestv/search_engin_index/index/program"));
            ram = new RAMDirectory(dir);
            searcher = new IndexSearcher(ram, true);
            // 'tq' was defined elsewhere in the original post
            int h = searcher.search(tq, null, 1).totalHits;
            System.out.println("the searcher has warmed and searched " + h + " docs");
        } catch (IOException e) {
            log.error(e);
        }
    }

    protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        response.setContentType("text/html");
        doExecute(request.getParameter("q"), response);
    }

    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        response.setContentType("text/html");
        try {
            String schCon = URLDecoder.decode(request.getParameter("q"), "UTF-8");
            doExecute(schCon, response);
        } catch (Exception e) {
            response.getWriter().write("Parameter Error, please send param 'q'");
        }
    }

    public void doExecute(String schCon, HttpServletResponse response) throws ServletException, IOException {
        response.getWriter().write(new SearchCommand().search(searcher));
    }
}

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
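One way to act on that advice in a servlet like the one above: keep the single shared IndexSearcher, but funnel the actual searching through a pool sized near the CPU count instead of letting hundreds of request threads compete (a sketch; the pool sizing policy is made up):

// shared, created once in init()
private final ExecutorService searchPool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

public void doExecute(String schCon, HttpServletResponse response) throws ServletException, IOException {
    try {
        Future<String> result = searchPool.submit(new Callable<String>() {
            public String call() throws Exception {
                return new SearchCommand().search(searcher); // SearchCommand from the post above
            }
        });
        // the request thread waits here instead of competing for CPU with other searches
        response.getWriter().write(result.get());
    } catch (Exception e) {
        throw new ServletException(e);
    }
}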
Re: Clarification on TokenStream.close() needed
2009/10/20 Teruhiko Kurosaka k...@basistech.com: My Tokenizer started showing an error when I switched to the Solr 1.4 dev version. I am not too confident, but it seems that Solr 1.4 calls close() on my Tokenizer before calling reset(Reader) in order to reuse the Tokenizer. That is, close() is called more than once. Is this when indexing a document, or querying a document? close() should only be called once. If indexing, it would be closed in Lucene at DocInverterPerField.java:197 -Yonik http://www.lucidimagination.com The API doc of close() reads: Releases resources associated with this stream. So I thought close() should be called only once, and the Tokenizer objects cannot be reused after close() is called. Is my interpretation correct? If my interpretation is wrong and it is legal to call close() more than once, where is the best place to free per-instance resources? T. Kuro Kurosaka - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Hits and TopDoc
On Tue, Oct 20, 2009 at 5:03 PM, Nathan Howard natehowa...@gmail.com wrote: This is sort of related to the above question, but I'm trying to update some (now deprecated) Java/Lucene code that I've become aware of once we started using 2.4.1 (we were previously using 2.3.2): Hits results = MultiSearcher.search(Query); int start = currentPage * resultsPerPage; int stop = (currentPage + 1) * resultsPerPage; for (int x = start; (x < searchResults.length()) && (x < stop); x++) { Document doc = searchResults.doc(x); // do search post-processing with the Document } Results per page is normally small (10ish or so). I'm having difficulty figuring out how to get TopDocs to replicate this paging functionality (which the application must maintain). You do it the same way basically... calculate the biggest doc you need (stop-1 in your code), ask for that many TopDocs, and then iterate over the page you want, calling searcher.doc(topDocs.scoreDocs[x].doc) -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
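Spelled out against the same variables (a sketch; the (Query, Filter, int) signature also works on the 2.4-era API):

int start = currentPage * resultsPerPage;
int stop = (currentPage + 1) * resultsPerPage;
TopDocs topDocs = searcher.search(query, null, stop); // fetch everything up to the end of this page
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int x = start; x < scoreDocs.length && x < stop; x++) {
    Document doc = searcher.doc(scoreDocs[x].doc);
    // do search post-processing with the Document
}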
Re: Hits and TopDoc
Hmm, yes, I should have thought of quoting the javadoc :-) The Hits javadoc has been updated though... we shouldn't be pushing people toward collectors unless they really need them:

 * TopDocs topDocs = searcher.search(query, numHits);
 * ScoreDoc[] hits = topDocs.scoreDocs;
 * for (int i = 0; i < hits.length; i++) {
 *   int docId = hits[i].doc;
 *   Document d = searcher.doc(docId);
 *   // do something with current hit

-Yonik http://www.lucidimagination.com On Tue, Oct 20, 2009 at 5:27 PM, Steven A Rowe sar...@syr.edu wrote: Hi Nathan, On 10/20/2009 at 5:03 PM, Nathan Howard wrote: This is sort of related to the above question, but I'm trying to update some (now deprecated) Java/Lucene code that I've become aware of once we started using 2.4.1 (we were previously using 2.3.2): Hits results = MultiSearcher.search(Query); int start = currentPage * resultsPerPage; int stop = (currentPage + 1) * resultsPerPage; for (int x = start; (x < searchResults.length()) && (x < stop); x++) { Document doc = searchResults.doc(x); // do search post-processing with the Document } Results per page is normally small (10ish or so). I'm having difficulty figuring out how to get TopDocs to replicate this paging functionality (which the application must maintain). From http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Hits.html: = Deprecated. Hits will be removed in Lucene 3.0. Instead e.g. TopDocCollector and TopDocs can be used:

TopDocCollector collector = new TopDocCollector(hitsPerPage);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    // do something with current hit
    ...
}

= Construct the TopDocCollector with your stop variable instead of hitsPerPage, initialize the loop control variable with the value of your start variable instead of 0, and you should be good to go. Steve - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Proposal for changing Lucene's backwards-compatibility policy
On Fri, Oct 16, 2009 at 4:54 AM, Jukka Zitting jukka.zitt...@gmail.com wrote: Hi, On Fri, Oct 16, 2009 at 10:23 AM, Danil ŢORIN torin...@gmail.com wrote: What about creating major version more often? +1 We're not going to run out of version numbers, so I don't see a reason not to upgrade the major version number when making backwards-incompatible changes. +1 (Option A). -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: NPE in NearSpansUnordered
Are you using any custom query types? Anything to help us reproduce (like the actual query this happened on) would be greatly appreciated. -Yonik http://www.lucidimagination.com On Thu, Oct 15, 2009 at 1:17 PM, Peter Keegan peterlkee...@gmail.com wrote: I'm using Lucene 2.9 and sometimes get a NPE in NearSpansUnordered: java.lang.NullPointerException at org.apache.lucene.search.spans.NearSpansUnordered.start(NearSpansUnordered.java:219) at org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.processPayloads(PayloadNearQuery.java:201) at org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.getPayloads(PayloadNearQuery.java:180) at org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.getPayloads(PayloadNearQuery.java:183) at org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.setFreqCurrentDoc(PayloadNearQuery.java:214) at org.apache.lucene.search.spans.SpanScorer.nextDoc(SpanScorer.java:64) at org.apache.lucene.search.Scorer.score(Scorer.java:74) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:247) at org.apache.lucene.search.Searcher.search(Searcher.java:152) The CellQueue pq is empty when this occurs. Are there any conditions in which the queue might be expected to be empty? Peter - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Realtime search best practices
Guys, please - you're not new at this... this is what JavaDoc is for:

/**
 * Returns a readonly reader containing all
 * current updates. Flush is called automatically. This
 * provides near real-time searching, in that changes
 * made during an IndexWriter session can be made
 * available for searching without closing the writer.
 *
 * <p>It's near real-time because there is no hard
 * guarantee on how quickly you can get a new reader after
 * making changes with IndexWriter. You'll have to
 * experiment in your situation to determine if it's
 * fast enough. As this is a new and experimental
 * feature, please report back on your findings so we can
 * learn, improve and iterate.</p>
 *
 * <p>The resulting reader supports {@link
 * IndexReader#reopen}, but that call will simply forward
 * back to this method (though this may change in the
 * future).</p>
 *
 * <p>The very first time this method is called, this
 * writer instance will make every effort to pool the
 * readers that it opens for doing merges, applying
 * deletes, etc. This means additional resources (RAM,
 * file descriptors, CPU time) will be consumed.</p>
 *
 * <p>For lower latency on reopening a reader, you should
 * call {@link #setMergedSegmentWarmer} to
 * pre-warm a newly merged segment before it's committed
 * to the index. This is important for minimizing
 * index-to-search delay after a large merge.</p>
 *
 * <p>If an addIndexes* call is running in another thread,
 * then this reader will only search those segments from
 * the foreign index that have been successfully copied
 * over, so far.</p>
 *
 * <p><b>NOTE</b>: Once the writer is closed, any
 * outstanding readers may continue to be used. However,
 * if you attempt to reopen any of those readers, you'll
 * hit an {@link AlreadyClosedException}.</p>
 *
 * <p><b>NOTE:</b> This API is experimental and might
 * change in incompatible ways in the next release.</p>
 *
 * @return IndexReader that covers entire index plus all
 * changes made so far by this IndexWriter instance
 *
 * @throws IOException
 */
public IndexReader getReader() throws IOException {

-Yonik http://www.lucidimagination.com On Mon, Oct 12, 2009 at 4:18 PM, John Wang john.w...@gmail.com wrote: Oh, that is really good to know! Is this deterministic? e.g. as long as writer.addDocument() is called, next getReader reflects the change? Does it work with deletes? e.g. writer.deleteDocuments()? Thanks Mike for clarifying! -John On Mon, Oct 12, 2009 at 12:11 PM, Michael McCandless luc...@mikemccandless.com wrote: Just to clarify: IndexWriter.getReader returns a reader that searches uncommitted changes as well. Ie, you need not call IndexWriter.commit to make the changes visible. However, if you're opening a reader the normal way (IndexReader.open) then it is necessary to first call IndexWriter.commit. Mike On Mon, Oct 12, 2009 at 5:24 AM, melix cedric.champ...@lingway.com wrote: Hi, I'm going to replace an old reader/writer synchronization mechanism we had implemented with the new near realtime search facilities in Lucene 2.9. However, it's still a bit unclear on how to efficiently do it. Is the following implementation a good way to achieve it? The context is concurrent read/writes on an index : 1. create a Directory instance 2. create a writer on this directory 3. on each write request, add document to the writer 4. on each read request, a. use writer.getReader() to obtain an up-to-date reader b. create an IndexSearcher with that reader c. perform Query d. close IndexSearcher 5. on application close a. close writer b.
close directory While this seems to be ok, I'm really wondering about the performance of opening a searcher for each request. I could introduce some kind of delay and cache a searcher for some seconds, but I'm not sure it's the best thing to do. Thanks, Cedric -- View this message in context: http://www.nabble.com/Realtime-search-best-practices-tp25852756p25852756.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
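In code, the reader lifecycle from that list usually ends up looking something like this (sketch, 2.9 APIs; error handling omitted):

IndexReader reader = writer.getReader(); // initial NRT reader; no commit required

// on each read request (or on a refresh timer):
IndexReader newReader = reader.reopen(); // forwards to writer.getReader(), per the javadoc above
if (newReader != reader) {
    reader.close(); // release the old reader
    reader = newReader;
}
IndexSearcher searcher = new IndexSearcher(reader);
// ... search with 'searcher' ...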
Re: Realtime search best practices
On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix jake.man...@gmail.com wrote: It may be surprising, but in fact I have read that javadoc. It was not your email I responded to. It talks about not needing to close the writer, but doesn't specifically talk about what the relationship between commit() calls and getReader() calls is. Do you have a suggestion of how to update the JavaDoc? I'm not sure I understand the relationship between commit and getReader that you refer to. ...but why is it so obvious that what could be happening is that it only returns all changes since the last commit, but without touching disk because it has docs in memory as well? Sorry, this seems confusing - I'm not sure what you're trying to say. Perhaps we should approach this as proposed javadoc changes? -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Realtime search best practices
Good point on isCurrent - I think it should only be with respect to the latest index commit point? and we should clarify that in the javadoc. [...] // but what does the nrtReader say? // it does not have access to the most recent commit // state, as there's been a commit (with documents) // since it was opened. But the nrtReader *has* those // documents. I think we keep it simple - the nrtReader.isCurrent() would return false after a commit is called. Yes, isCurrent() is no longer such a great name. -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene 2.9.0RC4 slower than 2.4.1?
On Wed, Sep 16, 2009 at 12:33 PM, Uwe Schindler u...@thetaphi.de wrote: How should we proceed? Stop the final artifact build and voting or proceed with the release of 2.9? We waited so long and for most people it is faster than slower! I think we know that 2.9 will not be faster for everyone: - Per-segment searching and the new comparators are a general win, but will be slower for some people. - Query parsing and small document indexing will be somewhat slower due to the new token APIs (the workarounds for back compatibility) if token streams aren't reused. I don't see any indication of any bugs in Lucene in this thread either. -Yonik - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene 2.9.0RC4 slower than 2.4.1?
It's been a while since I wrote that benchmarker... is it OK that the answer is different? Did you use the same test file? -Yonik http://www.lucidimagination.com On Tue, Sep 15, 2009 at 2:18 PM, Mark Miller markrmil...@gmail.com wrote: The results: config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295611, ms=173550, MB/sec=1683.7899579371938 config: impl=ChannelFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=1377768, MB/sec=212.09793463050383 config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=632253, MB/sec=462.19115955163517 config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=774664, MB/sec=377.2238637654518 ClassicFile was heading for the same fate as ChannelFile. I'll have to check what its like on the file system - but it appears just ridiculously slower. Even with SeparateFile, All 4 cores are bouncing from 0-12% independently and really favoring the low end of that. ChannelPread appears no better. There are results from other OS's/setups in the JIRA issue. I'm using ext4. Uwe Schindler wrote: How does a conventional file system compare? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Tuesday, September 15, 2009 7:15 PM To: java-user@lucene.apache.org Subject: Re: lucene 2.9.0RC4 slower than 2.4.1? Mark Miller wrote: Indeed - I just ran the FileReaderTest on a Linux tmpfs ramdisk - with SeparateFile all 4 of my cores are immediately pinned and remain so. With ChannelFile, all 4 cores hover 20-30%. It would appear it may not be a good idea to use NIOFSDirectory on ramdisks. Even still though - it looks like you have a further issue - your Lucene 2.9 old-api results don't use it, and are still not good. The quick results: ramdisk: sudo mount -t tmpfs tmpfs /tmp/space -o size=1G,nr_inodes=200k,mode=01777 config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295611, ms=173550, MB/sec=1683.7899579371938 config: impl=ChannelFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=1377768, MB/sec=212.09793463050383 -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene 2.9.0RC4 slower than 2.4.1?
Remember to disable CPU frequency scaling when benchmarking... some things with IO cause the freq to drop, and when it's CPU bound again it takes a while for Linux to scale up the freq again. For example, on my ubuntu box, ChannelFile went from 100MB/sec to 388MB/sec. This effect probably won't be uniform across implementations either. -Yonik http://www.lucidimagination.com On Tue, Sep 15, 2009 at 3:25 PM, Mark Miller markrmil...@gmail.com wrote: I just really I hadn't sent this one. Here are results from the harddrive: It looks like its closer to the same speed on the hardrive once everything is loaded in the system cache (as you'd expect). SeparateFile was 1200 vs almost 1700 on RAMDISK. ChannelPread looked a lot closer though. - Mark Tests from harddisk (filesystem cache warmed): config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282293977, ms=238230, MB/sec=1226.6370616630988 config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=766340, MB/sec=381.3212767179059 Mark Miller wrote: Michael McCandless wrote: I don't like that the answer is different... but it's really really odd that it's different-yet-almost-the-same. Mark, were these 4 results on a normal (ext4) filesystem, or tmpfs? (Because the top 2 entries of your 4 results match the first set of 2 entries you sent... so I'm thinking these 4 were actually tmpfs not ext4). Those 4 were tmpfs - I mention ext4 at the end because I had just given a feel for the hardrive tests and wanted to note it was from ext4 - the results are def ramdisk though. What JRE/OS, linux, kernel versions, and hardware, are you running on? These are on: Ubuntu Karmic Koala 9.10, currently updated java-1.5.0-sun-1.5.0.20 2.6.31-10-generic RAM: 3.9 Gig 4 core Intel Core2 duo 2.0GHz Slow 5200 rpm laptop drives. The gains of SeparateFile over all else are stunning. And, quite different from the linux tests I had run under LUCENE-753. Maybe we need to revert FSDir.open to return SimpleFSDir again, on non-Windows hosts. But then we don't have good concurrency... Mike On Tue, Sep 15, 2009 at 2:59 PM, Yonik Seeley yonik.see...@lucidimagination.com wrote: It's been a while since I wrote that benchmarker... is it OK that the answer is different? Did you use the same test file? -Yonik http://www.lucidimagination.com On Tue, Sep 15, 2009 at 2:18 PM, Mark Miller markrmil...@gmail.com wrote: The results: config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295611, ms=173550, MB/sec=1683.7899579371938 config: impl=ChannelFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=1377768, MB/sec=212.09793463050383 config: impl=ChannelPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=632253, MB/sec=462.19115955163517 config: impl=PooledPread serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=774664, MB/sec=377.2238637654518 ClassicFile was heading for the same fate as ChannelFile. I'll have to check what its like on the file system - but it appears just ridiculously slower. Even with SeparateFile, All 4 cores are bouncing from 0-12% independently and really favoring the low end of that. ChannelPread appears no better. There are results from other OS's/setups in the JIRA issue. I'm using ext4. 
Uwe Schindler wrote: How does a conventional file system compare? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Tuesday, September 15, 2009 7:15 PM To: java-user@lucene.apache.org Subject: Re: lucene 2.9.0RC4 slower than 2.4.1? Mark Miller wrote: Indeed - I just ran the FileReaderTest on a Linux tmpfs ramdisk - with SeparateFile all 4 of my cores are immediately pinned and remain so. With ChannelFile, all 4 cores hover 20-30%. It would appear it may not be a good idea to use NIOFSDirectory on ramdisks. Even still though - it looks like you have a further issue - your Lucene 2.9 old-api results don't use it, and are still not good. The quick results: ramdisk: sudo mount -t tmpfs tmpfs /tmp/space -o size=1G,nr_inodes=200k,mode=01777 config: impl=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295611, ms=173550, MB/sec=1683.7899579371938 config: impl=ChannelFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=730554368 answer=-282295361, ms=1377768, MB/sec=212.09793463050383 -- - Mark http://www.lucidimagination.com
Re: lucene 2.9.0RC4 slower than 2.4.1?
Here's my results in my quad core phenom, with ondemand CPU freq scaling disabled (clocks locked at 3GHz) Ubuntu 9.04, filesystem=ext4 on 7200RPM IDE drive, testfile=95MB fully cached. Linux odin 2.6.28-15-generic #49-Ubuntu SMP Tue Aug 18 19:25:34 UTC 2009 x86_64 GNU/Linux Java(TM) SE Runtime Environment (build 1.6.0_14-b08) Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode) config: impl=ClassicFile serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=15610, MB/sec=489.99482383087764 config: impl=SeparateFile serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427672, ms=4115, MB/sec=1858.7652976913728 config: impl=PooledPread serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=6352, MB/sec=1204.15919395466 config: impl=ChannelFile serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=20347, MB/sec=375.91876935174713 config: impl=ChannelPread serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=5189, MB/sec=1474.0449412218154 config: impl=ChannelTransfer serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=14794, MB/sec=517.021711504664 -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene 2.9.0RC4 slower than 2.4.1?
On Tue, Sep 15, 2009 at 4:12 PM, Yonik Seeley yo...@lucidimagination.com wrote: Note that when nthreads>1 I sometimes get wrong answers for SimpleFile... s/SimpleFile/SingleFile/g - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene 2.9.0RC4 slower than 2.4.1?
Note that when nthreads>1 I sometimes get wrong answers for SimpleFile... hopefully it's just a bug in the test... I'll look into it a little. -Yonik http://www.lucidimagination.com On Tue, Sep 15, 2009 at 4:00 PM, Mark Miller markrmil...@gmail.com wrote: I'm jealous of your 4 3.0Ghz to my 2.0Ghz. I was on dynamic scaling frequency and switched to 2.0Ghz hard. On ramdisk, my puny 2.0's almost catch you and get a bit over 1800MB/s with SeparateFile. I'm smoked on PooledPread and ChannelPread though. Still sub 500 for both, even on the ramdisk. It's an absurd comparison though - everyone knows a jackalope is faster than a koala. - Mark Yonik Seeley wrote: Here's my results in my quad core phenom, with ondemand CPU freq scaling disabled (clocks locked at 3GHz) Ubuntu 9.04, filesystem=ext4 on 7200RPM IDE drive, testfile=95MB fully cached. Linux odin 2.6.28-15-generic #49-Ubuntu SMP Tue Aug 18 19:25:34 UTC 2009 x86_64 GNU/Linux Java(TM) SE Runtime Environment (build 1.6.0_14-b08) Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode) config: impl=ClassicFile serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=15610, MB/sec=489.99482383087764 config: impl=SeparateFile serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427672, ms=4115, MB/sec=1858.7652976913728 config: impl=PooledPread serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=6352, MB/sec=1204.15919395466 config: impl=ChannelFile serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=20347, MB/sec=375.91876935174713 config: impl=ChannelPread serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=5189, MB/sec=1474.0449412218154 config: impl=ChannelTransfer serial=false nThreads=4 iterations=20 bufsize=1024 poolsize=2 filelen=95610240 answer=1165427971, ms=14794, MB/sec=517.021711504664 -Yonik http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene 2.9.0RC4 slower than 2.4.1?
OK, I see the issue - SingleFile doesn't have its own file pointer. I'll update the original issue. (For large files, this shouldn't change the times any.)

-Yonik
http://www.lucidimagination.com

On Tue, Sep 15, 2009 at 4:13 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> On Tue, Sep 15, 2009 at 4:12 PM, Yonik Seeley yo...@lucidimagination.com wrote:
>> Note that when nthreads > 1 I sometimes get wrong answers for SimpleFile...
>
> s/SimpleFile/SingleFile/g

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
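To make the file-pointer bug concrete, here is a small illustration of my own (not the actual test code). With a single shared RandomAccessFile, seek() and read() are two separate operations against one shared file pointer, so another thread can move the pointer between them and a reader ends up consuming bytes from the wrong offset:

  import java.io.RandomAccessFile;

  public class SharedPointerSketch {
    private final RandomAccessFile file;

    public SharedPointerSketch(RandomAccessFile file) {
      this.file = file;
    }

    // BROKEN under concurrency: seek+read is not atomic, so two
    // threads interleaving here will read from the wrong positions.
    public int readAtBroken(long pos, byte[] buf) throws Exception {
      file.seek(pos);
      return file.read(buf);
    }

    // One fix: make the seek+read pair atomic. The other approaches in
    // the benchmark avoid the shared pointer entirely - one file handle
    // per thread (SeparateFile) or positional reads (ChannelPread).
    public int readAtSafe(long pos, byte[] buf) throws Exception {
      synchronized (file) {
        file.seek(pos);
        return file.read(buf);
      }
    }
  }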
Re: Lucene 2.9 RC2 now available for testing
On Wed, Sep 9, 2009 at 8:57 AM, Peter Keegan peterlkee...@gmail.com wrote:
> Using JProfiler, I observe that the improvement is due to a huge reduction
> in the number of calls to TermDocs.next and TermDocs.skipTo (about 65%
> fewer calls).

Indexes are searched per-segment now (i.e. MultiTermDocs isn't normally used). Off the top of my head, I'm not sure how this can lead to fewer TermDocs.skipTo() calls though. Are you sure you weren't also counting Scorer.skipTo()... which would now be Scorer.advance()?

Have you verified that your custom scorer is working correctly with 2.9 and that you're getting the same number of hits on the overall query as you were with previous versions?

-Yonik
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
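A rough sketch of what "searched per-segment" means in the 2.9-era APIs (a simplification of mine, not the actual IndexSearcher code): one Scorer is created against each segment reader in turn, so TermDocs calls go directly to the segment-level implementation with no multi-reader wrapper in between:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Collector;
  import org.apache.lucene.search.Scorer;
  import org.apache.lucene.search.Weight;

  public class PerSegmentSketch {
    // One scorer per segment; the collector is told the doc-id base of
    // each segment so per-segment ids can be mapped to global ids.
    static void search(IndexReader[] subReaders, int[] docStarts,
                       Weight weight, Collector collector) throws IOException {
      for (int i = 0; i < subReaders.length; i++) {
        collector.setNextReader(subReaders[i], docStarts[i]);
        Scorer scorer = weight.scorer(subReaders[i], true, false);
        if (scorer != null) {
          scorer.score(collector);
        }
      }
    }
  }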
Re: Lucene 2.9 RC2 now available for testing
On Wed, Sep 9, 2009 at 9:17 AM, Yonik Seeley yonik.see...@lucidimagination.com wrote:
> On Wed, Sep 9, 2009 at 8:57 AM, Peter Keegan peterlkee...@gmail.com wrote:
>> Using JProfiler, I observe that the improvement is due to a huge reduction
>> in the number of calls to TermDocs.next and TermDocs.skipTo (about 65%
>> fewer calls).
>
> Indexes are searched per-segment now (i.e. MultiTermDocs isn't normally
> used). Off the top of my head, I'm not sure how this can lead to fewer
> TermDocs.skipTo() calls though.

Wait... perhaps that alone accounts for the skipTo() decrease? Instead of MultiTermDocs.skipTo() delegating to SegmentTermDocs.skipTo() (2 calls, since they both inherit from TermDocs), it's now just SegmentTermDocs.skipTo() directly.

-Yonik
http://www.lucidimagination.com

> Are you sure you weren't also counting Scorer.skipTo()... which would now
> be Scorer.advance()?
>
> Have you verified that your custom scorer is working correctly with 2.9
> and that you're getting the same number of hits on the overall query as
> you were with previous versions?
>
> -Yonik
> http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
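A tiny sketch of the delegation being described (the class shapes are simplified by me, not the real Lucene source): with a multi-reader, each logical skip shows up twice in a method-call profiler - once on the wrapper and once on the segment-level TermDocs it forwards to - so per-segment search halves the count even if the number of logical skips is unchanged:

  import java.io.IOException;

  interface TermDocsSketch {
    boolean skipTo(int target) throws IOException;
  }

  class SegmentTermDocsSketch implements TermDocsSketch {
    public boolean skipTo(int target) throws IOException {
      // the real implementation consults skip lists here
      return false;
    }
  }

  class MultiTermDocsSketch implements TermDocsSketch {
    private final TermDocsSketch current; // TermDocs of the current segment

    MultiTermDocsSketch(TermDocsSketch current) { this.current = current; }

    // call #1, as counted by the profiler...
    public boolean skipTo(int target) throws IOException {
      // ...immediately delegates to call #2 on the segment level
      return current.skipTo(target);
    }
  }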
Re: Extending Sort/FieldCache
On Sun, Sep 6, 2009 at 4:42 AM, Shai Erera ser...@gmail.com wrote:
>> I've resisted using payloads for this purpose in Solr because it felt
>> like an interim hack until CSF is implemented.
>
> I don't see it as a hack, but as a proper use of a great feature in Lucene.

It's proper use for an application perhaps, but not for core Lucene. Applications are pretty much required to work with what's given in Lucene... but Lucene developers can make better choices. Hence, if at all possible, work should be put into implementing CSF rather than sorting by payloads.

> CSF and this are essentially the same.

In which case we wouldn't need CSF?

-Yonik
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Extending Sort/FieldCache
On Fri, Sep 4, 2009 at 12:33 AM, Shai Erera ser...@gmail.com wrote:
> 2) Contribute my payload-based sorting package. Currently it only reads
> from disk during searches, and I'd like to enhance it to use an in-memory
> cache as well. It's a moderate-size package, so this one will need to wait
> until (1) is done, and I get enough time to adapt it to 2.9 and work on
> the issue.

I've resisted using payloads for this purpose in Solr because it felt like an interim hack until CSF is implemented. It feels like payloads are properly used when one actually cares what the term or position is. Thoughts?

Do we think CSF will make it into 3.1?

-Yonik
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
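For readers who haven't used payloads this way, here is a sketch of the general idea under discussion (my own illustration, not Shai's package; the field and term names are invented): store a 4-byte value as the payload of a marker term in every document, then read the payloads back at search time to build a per-document value array for sorting:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  public class PayloadValuesSketch {
    // Reads one float per document from the payload of a marker term.
    // Assumes every document was indexed with a single-token field
    // ("val") whose token "v" carries a 4-byte big-endian payload.
    static float[] readValues(IndexReader reader) throws IOException {
      float[] values = new float[reader.maxDoc()];
      TermPositions tp = reader.termPositions(new Term("val", "v"));
      try {
        byte[] buf = new byte[4];
        while (tp.next()) {
          tp.nextPosition(); // must advance to a position before reading its payload
          if (tp.isPayloadAvailable()) {
            byte[] b = tp.getPayload(buf, 0);
            int bits = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16)
                     | ((b[2] & 0xFF) << 8)  |  (b[3] & 0xFF);
            values[tp.doc()] = Float.intBitsToFloat(bits);
          }
        }
      } finally {
        tp.close();
      }
      return values;
    }
  }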
Re: Is there a way to check for field uniqueness when indexing?
On Wed, Aug 26, 2009 at 12:47 PM, Daniel Shane sha...@lexum.umontreal.ca wrote:
> Hmm... there is something I don't catch. When you open up an index writer,
> you batch up adds and deletes. Now if you create a signature for the
> document, it works as long as you only add - but what happens if you
> delete stuff from the index using a query as well as adding? Does Solr
> also remember the deletions?

It used to - but now it delegates all that to IndexWriter as well (and Lucene buffers them instead).

-Yonik
http://www.lucidimagination.com

> Daniel Shane
>
> Yonik Seeley wrote:
>> On Fri, Aug 21, 2009 at 12:49 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
>>> : But in that case, I assume Solr does a commit per document added.
>>>
>>> not at all ... it computes a signature and then uses that as a unique
>>> key. IndexWriter.updateDocument does all the hard work.
>>
>> Right - Solr used to do that hard work, but we handed that over to Lucene
>> when that capability was added. It involves batching either way (but
>> letting Lucene handle it at a lower level is better since it can prevent
>> inconsistencies from crashes).
>>
>> -Yonik
>> http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Is there a way to check for field uniqueness when indexing?
On Fri, Aug 21, 2009 at 12:49 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
> : But in that case, I assume Solr does a commit per document added.
>
> not at all ... it computes a signature and then uses that as a unique key.
> IndexWriter.updateDocument does all the hard work.

Right - Solr used to do that hard work, but we handed that over to Lucene when that capability was added. It involves batching either way (but letting Lucene handle it at a lower level is better since it can prevent inconsistencies from crashes).

-Yonik
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
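A rough sketch of the signature-as-unique-key idea (my own illustration - the "sig" field name and the MD5 choice are invented here, not Solr's exact implementation). IndexWriter.updateDocument atomically deletes any previously indexed document holding the same term and adds the new one, so duplicates by content collapse to a single document:

  import java.security.MessageDigest;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  public class SignatureDedupSketch {
    static void addDedup(IndexWriter writer, Document doc, String content)
        throws Exception {
      // compute a content signature and hex-encode it as an indexable term
      byte[] digest = MessageDigest.getInstance("MD5").digest(content.getBytes("UTF-8"));
      StringBuilder sig = new StringBuilder();
      for (byte b : digest) {
        sig.append(Integer.toHexString((b >> 4) & 0xF));
        sig.append(Integer.toHexString(b & 0xF));
      }

      doc.add(new Field("sig", sig.toString(), Field.Store.NO, Field.Index.NOT_ANALYZED));

      // delete-any-previous + add, handled atomically inside IndexWriter
      writer.updateDocument(new Term("sig", sig.toString()), doc);
    }
  }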
trie* space-time tradeoff
Anyone have any numbers? I couldn't find complete info in the Trie* JIRA issues, especially relating to the size increase in the index. There was this:

> The indexes each contain 13 numeric, trie-encoded fields (doubles and
> Dates). Index size (including the normal fields) was:
> * 8bit: 4.8 GiB
> * 4bit: 5.1 GiB
> * 2bit: 5.7 GiB

But no info on baselines... for example, what's the index size with
1) those numeric fields not indexed at all
2) those numeric fields indexed with no precision step

-Yonik
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
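For context on what is being measured, here is a minimal sketch using the 2.9-era numeric APIs (the field name and values are invented): a smaller precisionStep indexes each value at more precisions - more terms and a larger index, but range queries need to visit fewer terms:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.search.NumericRangeQuery;

  public class TrieExample {
    public static void main(String[] args) {
      // Indexing side: precisionStep=4 indexes each value at several
      // precisions (the space side of the space-time tradeoff).
      Document doc = new Document();
      doc.add(new NumericField("price", 4, Field.Store.NO, true).setDoubleValue(19.99));

      // Query side: must use the same field and precisionStep.
      NumericRangeQuery q =
          NumericRangeQuery.newDoubleRange("price", 4, 10.0, 100.0, true, true);
      System.out.println(q);
    }
  }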
Re: speed of BooleanQueries on 2.9
Could this perhaps have anything to do with the changes to DocIdSetIterator? Glancing at the default implementation of advance makes me wince a bit:

public int advance(int target) throws IOException {
  while (nextDoc() < target) {}
  return doc;
}

IMO, this is a back-compatibility anti-pattern. It would be better to throw an exception than to quietly slow down some users' queries by an order of magnitude. Actually, I don't think I would count it as back compatible because of that.

-Yonik
http://www.lucidimagination.com

On Wed, Jul 15, 2009 at 2:54 PM, Michael McCandless luc...@mikemccandless.com wrote:
> On Wed, Jul 15, 2009 at 2:30 PM, eks dev eks...@yahoo.co.uk wrote:
>>> Weird. Have you run CheckIndex?
>>
>> nope, I guess it brings nothing: I built the index two times; the bug is
>> provoked by changing one parameter that controls only search = no corrupt
>> index? You think we should give it a try?
>
> Hell, why not :) Yeah, it's quite a long shot, but if it is corrupt, we'll
> be kicking ourselves about 30 emails from now...
>
>> What do you mean by "Can you do a binary search to locate the term(s)
>> that's causing it"? I know exactly which term combination causes it - the
>> last Query.toString() I have sent. If I simplify the Query by dropping
>> one term with its expansions, it runs fine... or if I replace any of
>> these terms it works fine. We tried with higher-freq. terms, lower...
>> everything fine... bizarre
>
> Right, I meant try to whittle down the query that tickles the infinite
> loop. Sounds like any whittling causes the issue to scurry away.
>
> If I make a patch that adds verbosity to what BS is doing, can you run it
> and post the output?
>
> Mike

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
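To illustrate the cost, and one way a custom DocIdSetIterator avoids it, here is a sketch of my own (not Lucene source): an iterator backed by a sorted int[] can override advance() with a binary search instead of inheriting the linear nextDoc() loop quoted above:

  import org.apache.lucene.search.DocIdSetIterator;

  class SortedIntDocIdSetIterator extends DocIdSetIterator {
    private final int[] docs; // doc ids, sorted ascending
    private int idx = -1;

    SortedIntDocIdSetIterator(int[] docs) {
      this.docs = docs;
    }

    public int docID() {
      if (idx < 0) return -1; // iteration not started
      return idx < docs.length ? docs[idx] : NO_MORE_DOCS;
    }

    public int nextDoc() {
      idx++;
      return docID();
    }

    // Binary search to the first doc >= target instead of calling
    // nextDoc() in a loop, turning O(n) skips into O(log n).
    public int advance(int target) {
      int lo = idx + 1, hi = docs.length - 1;
      while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (docs[mid] < target) lo = mid + 1;
        else hi = mid - 1;
      }
      idx = lo;
      return docID();
    }
  }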
Re: speed of BooleanQueries on 2.9
On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindler u...@thetaphi.de wrote:
> And the fix only affects custom DocIdSetIterators.

And custom Queries (via Scorer), since Scorer inherits from DISI. But as Mike says, it shouldn't be the issue in this thread.

-Yonik
http://www.lucidimagination.com

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org