Re: Difference between SortedDocValues and SortedSetDocValues

2017-10-12 Thread Yonik Seeley
On Thu, Oct 12, 2017 at 8:53 AM, Chellasamy G  wrote:
> Could anyone please explain the difference between  SortedDocValues and 
> SortedSetDocValues.

SortedDocValues has at most 1 value per document (single-valued).
SortedSetDocValues supports a set of values per document
(multi-valued).
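
A rough sketch of how the two are read (assuming the Lucene 7.x
iterator-style doc values API; the field names are made up for
illustration):

import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

class DocValuesExample {
    // Print the doc values of one document from each kind of field.
    static void dump(LeafReader leaf, int docId) throws IOException {
        // Single-valued: at most one ord (one term) per document.
        SortedDocValues single = DocValues.getSorted(leaf, "category");
        if (single.advanceExact(docId)) {
            BytesRef value = single.lookupOrd(single.ordValue());
            System.out.println("category = " + value.utf8ToString());
        }

        // Multi-valued: a sorted set of ords per document.
        SortedSetDocValues multi = DocValues.getSortedSet(leaf, "tags");
        if (multi.advanceExact(docId)) {
            for (long ord = multi.nextOrd();
                 ord != SortedSetDocValues.NO_MORE_ORDS;
                 ord = multi.nextOrd()) {
                System.out.println("tag = " + multi.lookupOrd(ord).utf8ToString());
            }
        }
    }
}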

-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: no concurrent merging?

2016-08-09 Thread Yonik Seeley
On Thu, Aug 4, 2016 at 9:35 AM, Michael McCandless
 wrote:
> Lucene's merging is concurrent, but Solr unfortunately uses
> UninvertingReader on each DBQ ... I'm not sure why.

It looks like DeleteByQueryWrapper was added by
https://issues.apache.org/jira/browse/LUCENE-5666

But other than perhaps changing how long a DBQ takes to execute, it
should be unrelated to the question of whether other merges can proceed
in parallel.

A quick look at the lucene IndexWriter code says, no... Lucene DBQ
processing cannot proceed in parallel.
IndexWriter.mergeInit is synchronized (on IW).  The DBQ processing is
called from there and thus anything else that needs the IW monitor
will block.

-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Port of Custom value source from v4.10.3 to v6.1.0

2016-07-08 Thread Yonik Seeley
Use getSortedDocValues for a single-valued field, or
getSortedSetDocValues for multi-valued.

-Yonik


On Fri, Jul 8, 2016 at 12:29 PM, paule_lecuyer  wrote:
> Many Thanks Yonik,  I will try that.
>
> For my understanding, what is the difference between SortedSetDocValues
> getSortedSetDocValues(String field) and SortedDocValues
> getSortedDocValues(String field) ?
>
> Paule.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Upgrade-of-Custom-value-source-code-from-v4-10-3-to-v6-1-0-tp4286236p4286387.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Port of Custom value source from v4.10.3 to v6.1.0

2016-07-08 Thread Yonik Seeley
Use the docValues interface by calling getSortedSetDocValues on the
leaf reader.  That will either
1) use real docValues if you have indexed them, or
2) use the FieldCache to uninvert an indexed field and make it look
like docValues.
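
A minimal sketch of reading those values (assuming the Lucene 6.x
random-access doc values API; the field name and the `leafReader`/`docId`
variables are illustrative):

SortedSetDocValues dv = leafReader.getSortedSetDocValues("myfield");
if (dv != null) {
    dv.setDocument(docId);  // position on the document of interest
    for (long ord = dv.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = dv.nextOrd()) {
        BytesRef term = dv.lookupOrd(ord);   // the value for this ord
        // ... use term.utf8ToString()
    }
}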

-Yonik


On Thu, Jul 7, 2016 at 1:33 PM, paule_lecuyer  wrote:
> Hi all,
> I wrote some time ago a ValueSourceParser + ValueSource to allow using
> results produced by an external system as a facet query :
> - in solrconfig.xml : added my parser :
> http://lucene.472066.n3.nabble.com/Port-of-Custom-value-source-from-v4-10-3-to-v6-1-0-tp4286236.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 5: Mutable/Immutable interface of BitSet

2015-09-13 Thread Yonik Seeley
On Sun, Sep 13, 2015 at 4:23 PM, Selva Kumar
 wrote:
> Mutable, "Immutable" interface of BitSet seems to be defined based on
> specific things like live docs and documents with DocValue etc. Any plan to
> add general purpose readonly interface to BitSet?

We already have the "Bits" interface:

public interface Bits {
  public boolean get(int index);
  public int length();
}

-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 5: Mutable/Immutable interface of BitSet

2015-09-13 Thread Yonik Seeley
On Sun, Sep 13, 2015 at 5:55 PM, Selva Kumar
<selva.kumar.at.w...@gmail.com> wrote:
> BitSet has many more readonly method compared to Bits.

Ah, I see what you're saying now.
If you have a need/usecase for certain methods on Bits, perhaps open a
JIRA issue and propose them.

-Yonik


> Similarly, BitSet
> has many more write methods compared to MutableBits. So, as I said, this
> seems to be based on internal requirement like live docs, documents with
> DocValues etc.
>
> Thanks for your time, Yonik
>
>
> On Sun, Sep 13, 2015 at 4:43 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>
>> On Sun, Sep 13, 2015 at 4:23 PM, Selva Kumar
>> <selva.kumar.at.w...@gmail.com> wrote:
>> > Mutable, "Immutable" interface of BitSet seems to be defined based on
>> > specific things like live docs and documents with DocValue etc. Any plan
>> to
>> > add general purpose readonly interface to BitSet?
>>
>> We already have the "Bits" interface:
>>
>> public interface Bits {
>>   public boolean get(int index);
>>   public int length();
>> }
>>
>> -Yonik
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene nrt

2015-07-20 Thread Yonik Seeley
Yes, if you do a commit with waitSearcher=true (and it succeeds) then
any adds before that point will be visible.

-Yonik


On Mon, Jul 20, 2015 at 8:25 PM, Bhawna Asnani bhawna.asn...@gmail.com wrote:
 Hi,
 I am using solr to update a document and read it back immediately through
 search.


 I do softCommit my changes, which claims to use lucene's indexReader using the
 indexWriter which was used to write the document.
 But there are times when I get a stale document back even with
 waitSearcher=true.

 Does lucene's nrt (i.e. DirectoryReader.open(IndexWriter writer, boolean
 applyAllDeletes)) guarantee that the changes made through the writer
 will be visible to the reader immediately?

 Thanks.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene/Solr Revolution 2015 Voting

2015-06-11 Thread Yonik Seeley
Hey Folks,

If you're interested in going to Lucene/Solr Revolution this year in Austin,
please vote for the sessions you would like to see!

https://lucenerevolution.uservoice.com/

-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Query with many clauses

2014-10-29 Thread Yonik Seeley
For queries with many terms, where each term matches few documents
(actually a single document for ID filters in my tests), I saw
speedups between 4x and 8x
http://heliosearch.org/solr-terms-query/  (the 3rd chart)

-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data
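
A rough sketch of the TermsFilter approach suggested in the quoted thread
below (Lucene 4.x API assumed; the "id" field and helper name are
illustrative, not from the original mail):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;

class IdFilterQueries {
    // Build one constant-scoring query over many ID terms instead of a
    // huge BooleanQuery with one clause per ID.
    static Query buildIdQuery(List<String> ids) {
        List<Term> terms = new ArrayList<Term>(ids.size());
        for (String id : ids) {
            terms.add(new Term("id", id));
        }
        return new ConstantScoreQuery(new TermsFilter(terms));
    }
}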


On Wed, Oct 29, 2014 at 9:42 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 I suggested TermsFilter, not TermFilter :)  Note the sneaky extra s 

 Mike McCandless

 http://blog.mikemccandless.com


 On Wed, Oct 29, 2014 at 8:20 AM, Pawel Rog pawelro...@gmail.com wrote:
 Hi,
 I already tried to transform Queries to filters (TermQuery -> TermFilter)
 but didn't see much speed up. I wrapped this filter into
 ConstantScoreQuery, and in another test I used FilteredQuery with
 MatchAllDocsQuery and BooleanFilter. Both cases seem to perform
 about the same as a simple BooleanQuery.
 But of course I'll also try TermsFilter. Maybe it will speed up
 the filters.

 Michael Sokolov I haven't prepared any statistics about number of
 BooleanClauses used and if there are some repeating sets of terms. I think
 I have to collect some stats for better understanding what can be improved.

 --
 Paweł Róg


 On Wed, Oct 29, 2014 at 12:30 PM, Michael Sokolov 
 msoko...@safaribooksonline.com wrote:

 I'm curious to know more about your use case, because I have an idea for
 something that addresses this, but haven't found the opportunity to develop
 it yet - maybe somebody else wants to :).  The basic idea is to reduce the
 number of terms needed to be looked up by collapsing commonly-occurring
 collections of terms into synthetic tiles.  If your queries have a lot of
 overlap, this could greatly reduce the number of terms in a query rewritten
 to use tiles. It's sort of complex, requires indexing support, or a filter
 cache, and there's no working implementation as yet, so this is probably
 not really going to be helpful for you in the short term, but if you can
 share some information I'd love to know:

 what kind of things are you searching?
 how many terms do your larger queries have?
 do the query terms overlap among your queries?

 -Mike Sokolov


 On 10/28/14 9:40 PM, Pawel Rog wrote:

 Hi,
 I have to run query with a lot of boolean should clauses. Queries like
 these were of course slow so I decided to change query to filter wrapped
 by
 ConstantScoreQuery but it also didn't help.

 Profiler shows that most of the time is spent on seekExact in
 BlockTreeTermsReader$FieldReader$SegmentTermsEnum

 When I go deeper in trace I see that inside seekExact most time is spent
 on
 loadBlock and even deeper ByteBufferIndexInput.clone.

 Do you have any ideas how I can make it work faster or it is not possible
 and I have to live with it?

 --
 Paweł Róg



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Square of Idf

2014-03-07 Thread Yonik Seeley
On Thu, Mar 6, 2014 at 6:28 PM, Furkan KAMACI furkankam...@gmail.com wrote:
 Hi;

 The Tf-Idf explanation says that:

 *idf(t)* appears for *t* in both the query and the document, hence it is
 squared in the equation.

 DefaultSimilarity does not square it. What is the explanation for it?

I think you explained it yourself.
The similarity doesn't square it... what is returned from
Similarity.idf(t) is used twice (and hence ends up effectively
squared).

The code has gotten more complex over time, but look at the class
IDFStats to see the squaring of idf.  There is an idf factor in the
queryWeight, and then in normalize() it's multiplied by the idf factor
again.
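
For reference, the classic practical scoring formula documented for
DefaultSimilarity/TFIDFSimilarity makes the squaring visible (restating
the javadoc form here, not quoting the original mail):

score(q,d) = coord(q,d) * queryNorm(q)
             * SUM over t in q of ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )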

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Natural Sort Order

2013-10-14 Thread Yonik Seeley
On Mon, Oct 14, 2013 at 9:43 PM, Darren Hoffman dar...@jnamics.com wrote:
 Can anyone tell me if a search based on a ConstantScoreQuery should return
 the results in the order that the documents were added to the index?

The order will be internal docid, which used to be the order that docs
were added to the index.
Non-contiguous segments can now be merged, so that is no longer the case.

-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: sorting with lucene 4.3

2013-07-31 Thread Yonik Seeley
On Wed, Jul 31, 2013 at 2:51 PM, Nicolas Guyot sfni...@gmail.com wrote:
 I have written a quick test to reproduce the slower sorting with numeric DV.
 In this test case, it happens only when reverse sorting.

Right - I bet your numeric field is relatively ordered in the index.
When this happens, there is always one sort order that is less
efficient because the priority queue is constantly finding more
competitive hits as we search through the index.  If you index random
numbers (or in a random order), the discrepancy between the sort orders
should disappear.

-Yonik
http://lucidworks.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: About query result cache.

2012-12-16 Thread Yonik Seeley
On Mon, Dec 17, 2012 at 12:58 AM, lukai lukai1...@gmail.com wrote:
 Hi, guys:
   Does the query plugin implementation impact caching? I have implemented a new
 query parser which just takes the input query string and returns my own query
 object. But the problem is, when I apply this logic to solr, it seems it
 only works the first time. Then even if I change the query, it still returns
 the same result as the first time. Is it cached? If so, what is the cache
 key based on?

The key is the query object.  Implement equals and hashCode so that it
won't match other versions of your query.
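
A minimal sketch of what that means for a custom query class (class and
field names are made up; the actual matching/Weight logic is omitted):

import org.apache.lucene.search.Query;

public class MyExternalQuery extends Query {
    private final String externalQueryString;

    public MyExternalQuery(String externalQueryString) {
        this.externalQueryString = externalQueryString;
    }

    @Override
    public String toString(String field) {
        return "MyExternalQuery(" + externalQueryString + ")";
    }

    // Two instances look "the same" to the query result cache only if they
    // were built from the same external query string (and the same boost).
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MyExternalQuery)) return false;
        MyExternalQuery other = (MyExternalQuery) o;
        return getBoost() == other.getBoost()
            && externalQueryString.equals(other.externalQueryString);
    }

    @Override
    public int hashCode() {
        return externalQueryString.hashCode() ^ Float.floatToIntBits(getBoost());
    }
}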

-Yonik
http://lucidworks.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 4.0: Custom Query Parser newTermQuery(Term term) override

2012-07-11 Thread Yonik Seeley
On Wed, Jul 11, 2012 at 9:34 AM, Jamie ja...@stimulussoft.com wrote:
 I am busy attempting to integrate Lucene 4.0 Alpha into my code base. I
 have a custom QueryParser that extends QueryParser and overrides
 newRangeQuery and newTermQuery

Random pointer: for most special case field handling, one would want
to override getFieldQuery or newFieldQuery rather than the lower level
newTermQuery.
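
A minimal sketch of that kind of override (Lucene 4.x classic QueryParser
assumed; the "sku" field and the bypass-analysis handling are purely
illustrative):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

public class MyQueryParser extends QueryParser {
    public MyQueryParser(Version matchVersion, String defaultField, Analyzer analyzer) {
        super(matchVersion, defaultField, analyzer);
    }

    @Override
    protected Query getFieldQuery(String field, String queryText, boolean quoted)
            throws ParseException {
        if ("sku".equals(field)) {
            // Special-case this field: bypass analysis and match the raw text.
            return new TermQuery(new Term(field, queryText));
        }
        return super.getFieldQuery(field, queryText, quoted);
    }
}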

-Yonik
http://lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexReader.deleteDocument in Lucene 3.6

2012-05-25 Thread Yonik Seeley
On Fri, May 25, 2012 at 5:23 AM, Nikolay Zamosenchuk
nikolaz...@gmail.com wrote:
 IndexWriter.deleteDocument(..) is not final,
 but doesn't return any result.

Deleted terms are buffered for good performance, so at the time of
IndexWriter.deleteDocument(Term) we don't know how many documents
match the term.

 Can anyone please suggest how to solve this issue? Can simply run term
 query before, but it seems to be absolutely inefficient.

You could switch to an asynchronous design and use a custom query that
keeps track of how many (or which) documents it matched.

-Yonik
http://lucidimagination.com




 --
 Best regards, Nikolay Zamosenchuk

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: org.apache.lucene.index.MultiFields.getLiveDocs(IndexReader) returning null.

2012-03-05 Thread Yonik Seeley
On Mon, Mar 5, 2012 at 1:53 PM, Benson Margulies bimargul...@gmail.com wrote:
 There's no javadoc on here yet, and I am a little puzzled by the fact
 that it is returning null for me. Does that imply that there can't be
 any deleted docs known to the reader?

Right, see AtomicReader

  /** Returns the {@link Bits} representing live (not
   *  deleted) docs.  A set bit indicates the doc ID has not
   *  been deleted.  If this method returns null it means
   *  there are no deleted documents (all documents are
   *  live).
   *
   *  The returned instance has been safely published for
   *  use by multiple threads without additional
   *  synchronization.
   */
  public abstract Bits getLiveDocs();
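
So a null return is the "no deletions" case; a minimal sketch of the usual
check (assuming `reader` and `docId` are already in scope):

Bits liveDocs = MultiFields.getLiveDocs(reader);
if (liveDocs == null) {
    // no deletions at all: every doc ID in [0, maxDoc) is live
} else if (liveDocs.get(docId)) {
    // this particular document has not been deleted
}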


-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Spatial Search

2011-12-31 Thread Yonik Seeley
On Sat, Dec 31, 2011 at 11:52 AM, Lance Java lance.j...@googlemail.com wrote:
 Hi, I am new to Lucene and I am trying to use spatial search.

The old tier-based stuff in Lucene is broken and considered deprecated.

For Lucene, this may currently be your best hope:
http://code.google.com/p/lucene-spatial-playground/

Solr has also had built-in spatial for a little while too:
http://wiki.apache.org/solr/SpatialSearch


-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 2:53 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 dude, look at this query... its insane isn't it :)

Sorry... what's the equivalent you'd like instead?
Or if you're just unjustifiably bitching about Solr again, maybe I
should take a stroll through Lucene land and bitch about
incomprehensible code, APIs that are increasingly hard to use, APIs
that keep changing on a whim w/o regard to existing users, etc.  Your
attitude is getting tiring.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 3:18 PM, Uwe Schindler u...@thetaphi.de wrote:
 Sorry, this query is really ununderstandable. Those complex queries should
 have a meaningful language, e.g. a JSON object structure

There are upsides and downsides to that.  A big JSON object graph
would be easier to *read* but certainly not easier to write (much more
nesting).
These main Solr APIs are based around HTTP parameters... the upside
being you can add another parameter w/o worrying about nesting it
correctly.

Like simply adding another filter for example:
fq=instock:true

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 3:40 PM, Mark Harwood markharw...@yahoo.co.uk wrote:
 JSON or XML can reflect more closely the hierarchy in the underlying Lucene 
 query objects.

We normally use the Lucene QueryParser syntax itself for that (not
HTTP parameters).

Other parameters such as filters, faceting, highlighting, sorting,
etc, don't normally have any hierarchy.

I don't think JSON is always nicer either.  How would you write this
sort in JSON for example?
sort=price desc, score desc

A big plus to Solr's APIs is that it's relatively easy to type them in
to a browser to try them out.

As far as alternate query syntaxes (as opposed to alternate request
syntaxes), Solr has good support for that and
it would be simple to add in support for a JSON query syntax if
someone wrote one.
AFAIK, there's an issue open for adding the XML query syntax, but I'm
not sure if it's ever had much traction.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-17 Thread Yonik Seeley
On Thu, Nov 17, 2011 at 3:44 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 Maybe someone can post the equivalent query in ElasticSearch?

I don't think it's possible.  Hoss threw the kitchen sink into his
'contrived' example.
Here's a super simple example:

JSON:

{
  "sort" : [
    { "age" : { "order" : "asc" } }
  ],
  "query" : {
    "term" : { "user" : "jack" }
  }
}

Solr's HTTP:

q=user:jack&sort=age asc

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ElasticSearch

2011-11-16 Thread Yonik Seeley
On Wed, Nov 16, 2011 at 10:36 AM, Shashi Kant sk...@sloan.mit.edu wrote:
 I had posted this earlier on this list, hope this provides some answers

 http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/

Except it's an out of date comparison.
We have NRT (near real time search) in Solr now.

http://wiki.apache.org/solr/NearRealtimeSearch

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Please help me with a basic question...

2011-05-20 Thread Yonik Seeley
On Fri, May 20, 2011 at 2:46 PM, Doron Cohen cdor...@gmail.com wrote:
 I stumbled upon the 'Explain' function yesterday though it returns a crowded
 message using debug in SOLR admin. Is there another method or interface
 which returns more or cleaner info?


 I am not familiar with the use of Solr for this, I hope someone else will
 answer this...

Most browsers' default XML display doesn't preserve the text
formatting... hence the explain can look messed up.
Try viewing the source or original page (CTRL-U in firefox,
CTRL-ALT-U or CMD-ALT-U in chrome I think)... and make sure
indent=true


http://localhost:8983/solr/select?q=solr&debugQuery=true&indent=true

  <lst name="explain">
    <str name="SOLR1000">
0.58961654 = (MATCH) fieldWeight(text:solr in 1), product of:
   1.4142135 = tf(termFreq(text:solr)=2)
   3.3353748 = idf(docFreq=2, maxDocs=31)
   0.125 = fieldNorm(field=text, doc=1)
    </str>
  </lst>

If email doesn't mess this up somewhere, you should see a properly
indented block of explain text.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Retrieving the first document in a range

2011-04-05 Thread Yonik Seeley
On Tue, Apr 5, 2011 at 10:06 AM, Shai Erera ser...@gmail.com wrote:
 Can we use TermEnum to skip to the first term 'after 3 weeks'? If so, we can
 pull the first doc that appears in the TermDocs of that Term (if it's a
 valid term).

Yep.  Try this to get the term you want to use to seek:
BytesRef term = new BytesRef();
NumericUtils.longToPrefixCoded(longval, 0, term);


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: DocIdSet to represent small numberr of hits in large Document set

2011-04-05 Thread Yonik Seeley
On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman a...@thorntothehorn.org wrote:
 Seems like SortedVIntList can be used to store the info, but it has no
 methods to build the list in the first place, requiring an array or bitset
 in the constructor.

It has a constructor that takes DocIdSetIterator - so you can pass an
iterator obtained from anywhere else (a Scorer actually is a
DocIdSetIterator, and you can get a DocIdSet from a Filter), or
implement your own.  It's a simple iterator interface.
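
For example, a minimal sketch (Lucene 3.x; `someFilter` and `reader` are
assumed to already be in scope):

DocIdSet docIdSet = someFilter.getDocIdSet(reader);                 // any DocIdSet source works
SortedVIntList compact = new SortedVIntList(docIdSet.iterator());   // builds the compressed list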


-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Undo hyphenation when indexing

2011-04-01 Thread Yonik Seeley
Solr has a hyphenated word filter you could copy.
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html

On trunk, this has been folded into the analysis module.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

On Fri, Apr 1, 2011 at 11:50 AM, Wulf Berschin bersc...@dosco.de wrote:
 Hi,

 for indexing PDF files we have to undo word hyphenation. The basic idea is
 simply to remove the hyphen when a new line and a small letter follows. Of
 course this approach isnt 100%-foolproofed but checking against a dictionary
 wouldnt be as well...

 Since we face this problem too when highlighting using HTMLCharStripper
 (yes, we do have hyphenation in our HTML docs...) it seems to me I have to
 adjust the JFlex generated StandardTokenizerImpl.

 Is this the right approach and how would I have to modify this script?

 Thanks
 Wulf


 PS: I see that there are changes made in the brand new 3.1.0 version we are
 using 3.0.3, but as far I understand no relevant changes in this respect.


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: which unicode version is supported with lucene

2011-02-27 Thread Yonik Seeley
On Sun, Feb 27, 2011 at 2:15 PM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 Jepp, its back online.
 Just did a short test and reported my results to jira, but is the
 error from the xml output still a jetty problem or is it from XMLwriter?

The patch has been committed, so you should just be able to try trunk (or 3x).

I also just committed a char beyond the BMP to utf8-example.xml
and the indexing and XML output works fine for me.

Index the example docs, then do a query for BMP to bring up that document.

-Yonik
http://lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: which unicode version is supported with lucene

2011-02-25 Thread Yonik Seeley
On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 So Solr trunk should already handle Unicode above BMP for field type string?
 Strange...

One issue is that jetty doesn't support UTF-8 beyond the BMP:

/opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
multilingual plane

-Yonik
http://lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: which unicode version is supported with lucene

2011-02-25 Thread Yonik Seeley
On Fri, Feb 25, 2011 at 9:09 AM, Bernd Fehling
bernd.fehl...@uni-bielefeld.de wrote:
 Hi Yonik,

 good point, yes we are using Jetty.
 Do you know if Tomcat has this limitation?

Tomcat's defaults are worse - you need to configure it to use UTF-8 by
default for URLs.
Once you do, it passes all those tests (last I checked).  Those tests
are really about UTF-8 working in GET/POST query arguments.  Solr may
still be able to handle indexing and returning full UTF-8, but you
wouldn't be able to query for it w/o using surrogates if you're using
Jetty.

It would be good to test though - does anyone know how to add a char
above the BMP to utf8-example.xml?

-Yonik
http://lucidimagination.com


 Regards,
 Bernd

 Am 25.02.2011 14:54, schrieb Yonik Seeley:
 On Fri, Feb 25, 2011 at 8:48 AM, Bernd Fehling
 bernd.fehl...@uni-bielefeld.de wrote:
 So Solr trunk should already handle Unicode above BMP for field type string?
 Strange...

 One issue is that jetty doesn't support UTF-8 beyond the BMP:

 /opt/code/lusolr/solr/example/exampledocs$ ./test_utf8.sh
 Solr server is up.
 HTTP GET is accepting UTF-8
 HTTP POST is accepting UTF-8
 HTTP POST defaults to UTF-8
 ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
 ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
 ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic
 multilingual plane

 -Yonik
 http://lucidimagination.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Storing an ID alongside a document

2011-02-02 Thread Yonik Seeley
That's exactly what the CSF feature is for, right?  (docvalues branch)

-Yonik
http://lucidimagination.com


On Wed, Feb 2, 2011 at 1:03 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:

 I'm curious if there's a new way (using flex or term states) to store
 IDs alongside a document and retrieve the IDs of the top N results?
 The goal would be to minimize HD seeks, and not use field caches
 (because they consume too much heap space) or the doc stores (which
 require two seeks).  One possible way using the pre-flex system is to
 place the IDs into a payload posting that would match all documents,
 and then [somehow] retrieve the payload only when needed.

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Storing an ID alongside a document

2011-02-02 Thread Yonik Seeley
On Wed, Feb 2, 2011 at 9:23 PM, Jason Rutherglen jason.rutherg...@gmail.com
 wrote:

 Is it?  I thought it would load the values into heap RAM like the
 field cache and in addition save the values to disk? Does it also
 read the values directly from disk?


Loading into memory is a separate, optional part (i.e. loading a fieldcache
entry) that should use the APIs that read directly from the index.

-Yonik
http://lucidimagination.com


Re: WARNING: re-index all trunk indices!

2010-12-17 Thread Yonik Seeley
On Fri, Dec 17, 2010 at 11:18 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 If you are using Lucene's trunk (nightly build) release, read on...

 I just committed a change (for LUCENE-2811) that changes the index
 format on trunk, thus breaking (w/ likely strange exceptions on
 reading the segments_N file) any trunk indices created in the past
 week or so.

For reference, the exception I got trying to start Solr with an older
index on Windows is below.

-Yonik
http://www.lucidimagination.com


SEVERE: java.lang.RuntimeException: java.io.IOException: read past EOF
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1095)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:587)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:660)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:412)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:294)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:86)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:224)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.mortbay.start.Main.invokeMain(Main.java:194)
at org.mortbay.start.Main.start(Main.java:534)
at org.mortbay.start.Main.start(Main.java:441)
at org.mortbay.start.Main.main(Main.java:119)
Caused by: java.io.IOException: read past EOF
at 
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(MMapDirectory.java:242)
at 
org.apache.lucene.store.ChecksumIndexInput.readBytes(ChecksumIndexInput.java:48)
at org.apache.lucene.store.DataInput.readString(DataInput.java:121)
at 
org.apache.lucene.store.DataInput.readStringStringMap(DataInput.java:148)
at org.apache.lucene.index.SegmentInfo.init(SegmentInfo.java:192)
at 
org.apache.lucene.index.codecs.DefaultSegmentInfosReader.read(DefaultSegmentInfosReader.java:57)
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:220)
at 
org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:90)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:623)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:86)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:437)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
at 
org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1084)
... 31 more

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: The logic of QueryParser

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 2:51 PM, Robert Muir rcm...@gmail.com wrote:
 On Mon, Dec 13, 2010 at 2:43 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 On Mon, Dec 13, 2010 at 2:10 PM, Brian Hurt bhur...@gmail.com wrote:
  I was just wondering what the logic was for defaulting to or instead of 
 and.

 Largely historical.  I think the original rationale was that it
 probably fit better with the traditional vector space model.
 There's also not a good reason to change the default, given that
 QueryParser isn't meant for end users.

 That's pretty misleading, Yonik.

 In other words, the query parser is designed for human-entered text,
 not for program-generated text.
 http://lucene.apache.org/java/3_0_3/queryparsersyntax.html

*shrugs*, I didn't recall that phrase... but I'm not clear if you
disagree with what I'm saying, or if you just think that it's
inconsistent with the documentation.

I think of the Lucene QueryParser like SQL. SQL is text based and also
meant for human entered text - but for either very expert users, or
programmatically created queries.  You normally don't want to pass
text from a search box directly to an SQL database or to the Lucene
QueryParser.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: The logic of QueryParser

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 3:07 PM, Robert Muir rcm...@gmail.com wrote:
 On Mon, Dec 13, 2010 at 3:04 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 I think of the Lucene QueryParser like SQL. SQL is text based and also
 meant for human entered text - but for either very expert users, or
 programmatically created queries.  You normally don't want to pass
 text from a search box directly to an SQL database or to the Lucene
 QueryParser.


 Then why does solr use it by default?

Because it's a decent default?  It was also the only choice when Solr
was first created.  I don't see a compelling reason to change that.

Solr fits about the same place a database does in many applications...
it's certainly not meant for users to query directly.  There's
normally a web application that handles interaction with the user and
creates/submits queries to Solr.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Webcast: Better Search Results Faster with Apache Solr and LucidWorks Enterprise

2010-12-08 Thread Yonik Seeley
We're holding a free webinar about relevancy enhancements in our
commercial version of Solr.  Details below.

-Yonik
http://www.lucidimagination.com

-
Join us for a free technical webcast
Better Search Results Faster with Apache Solr and LucidWorks Enterprise
Thursday, December 16, 2010
11:00 AM PST / 2:00 PM EST / 20:00 CET


Click here to sign up
http://www.eventsvc.com/lucidimagination/121610?trk=AP


In the key dimensions of search relevancy and query-targeted results,
users have become accustomed to internet-search style facilities like
page-rank, user-driven feedback, auto-suggest and more. Even with the
power of Apache Lucene/Solr, building such features into your own
search application is easier said than done.


Now, with LucidWorks Enterprise, the search solution development
platform built on the Solr/Lucene open source technology, developing
killer search apps with these features and more is faster, simpler,
and more powerful than ever before!


Join Andrzej Bialecki, Lucene/Solr Committer and inventor of the Luke
index utility, for a hands-on technical workshop that details how
LucidWorks Enterprise puts powerful search and relevancy at your
fingertips -- at a fraction of the time and effort required to program
them yourself with native Apache Solr. Andrzej will discuss and
present how you can use LucidWorks Enterprise for:
* Click Scoring to automatically configure relevance for most popular results
* Simplified implementation of auto-complete and did-you-mean functionality
* Unsupervised feedback  to automatically provide relevance
improvement on every query


Click here to sign up
http://www.eventsvc.com/lucidimagination/121610?trk=AP


--
About the presenter:
Andrzej Bialecki is a committer of the Apache Lucene/Solr project, a
Lucene PMC member, and chairman of the Apache Nutch project. He is
also the author of Luke, the Lucene Index Toolbox. Andrzej
participates in many commercial projects that use Lucene/Solr, Nutch
and Hadoop to implement enterprise and vertical search.
--
Presented by Lucid Imagination, the commercial entity exclusively
dedicated to Apache Lucene/Solr open source search technology.
LucidWorks Enterprise, our search solution development platform, helps
you build better search applications more quickly and productively.
We also offer solutions including SLA-based support,
professional training, best practices consulting, free developer
downloads, and free documentation.
Follow us on Twitter: twitter.com/LucidImagineer.
--
Apache Lucene and Apache Solr are trademarks of the Apache
Software Foundation.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best practice: 1.4 billions documents

2010-11-26 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 12:49 PM, Uwe Schindler u...@thetaphi.de wrote:
  (Fuzzy scores on
 MultiSearcher and Solr are totally wrong because each shard uses another
 rewritten query).

Hmmm, really?  I thought that fuzzy scoring should just rely on edit distance?
Oh wait, I think I see - it's because we can use a hard cutoff for the
number of expansions rather than an edit distance cutoff.  If we used
the latter, everything should be fine?

The fuzzy issue I would classify as working as designed.  Either
that, or classify FuzzyQuery as broken.  A cuttoff based on number of
terms will yield strange results even on a single index.  Consider
this scenario: it's possible to add more docs to a single index and
have the same fuzzy query return fewer docs than it did before!

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best practice: 1.4 billions documents

2010-11-22 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler u...@thetaphi.de wrote:
 The latest discussion was more about MultiReader vs. MultiSearcher.

 But you are right, 1.4 B documents is not easy to go, especially when you
 index grows and you get to the 2.1 B marker, then no MultiSearcher or
 whatever helps.

 On the other hand even distributed Solr has the same problems like
 MultiSearcher: scoring MultiTermQueries (Fuzzy) doesn't work correctly

Are you referring to the idf being local to the shard instead of
global to the whole collection?
Andrzej has a patch in the works, but it's not committed yet.

 negative MTQ clauses may produce wrong results if the query rewriting is
 done like in MultiSearcher (which is unsolveable broken and broken and
 broken and again broken for some queries as Boolean clauses - see DeMorgan
 laws).

I don't think this is a problem for Solr.  Queries are executed on
each shard as normal (no difference from a non-distributed query).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best practice: 1.4 billions documents

2010-11-21 Thread Yonik Seeley
On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
luca.rondan...@gmail.com wrote:
 Hi everybody,

 I really need some good advice! I need to index in lucene something like 1.4
 billions documents. I had experience in lucene but I've never worked with
 such a big number of documents. Also this is just the number of docs at
 start-up: they are going to grow and fast.

 I don't have to tell you that I need the system to be fast and to support
 real time updates to the documents

 The first solution that came to my mind was to use ParallelMultiSearcher,
 splitting the index into many sub-index (how many docs per index?
 100,000?) but I don't have experience with it and I don't know how well will
 scale while the number of documents grows!

 A more solid solution seems to build some kind of integration with hadoop.
 But I didn't find match about lucene and hadoop integration.

 Any idea? Which direction should I go (pure lucene or hadoop)?

There seems to be a common misconception about hadoop regarding search.
Map-reduce as implemented in hadoop is really for batch oriented jobs
only (or those types of jobs where you don't need a quick response
time).  It's definitely not for normal queries (unless you have
unusual requirements).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter.close() performance issue

2010-11-20 Thread Yonik Seeley
On Fri, Nov 19, 2010 at 5:41 PM, Mark Kristensson
mark.kristens...@smartsheet.com wrote:
 Here's the changes I made to org.apache.lucene.util.StringHelper:

  //public static StringInterner interner = new SimpleStringInterner(1024,8);

As Mike said, the real fix for trunk is to get rid of interning.
But for your version, you could try making the string intern cache larger.

StringHelper.interner = new SimpleStringInterner(30,8);

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



FAST ESP - Solr migration webinar

2010-11-11 Thread Yonik Seeley
We're holding a free webinar on migration from FAST to Solr.  Details below.

-Yonik
http://www.lucidimagination.com

=
Solr To The Rescue: Successful Migration From FAST ESP to Open Source
Search Based on Apache Solr

Thursday, Nov 18, 2010, 14:00 EST (19:00 GMT)
Hosted by SearchDataManagement.com

For anyone concerned about the future of their FAST ESP applications
since the purchase of Fast Search and Transfer by Microsoft in 2008,
this webinar will provide valuable insights on making the switch to
Solr.  A three-person rountable will discuss factors driving the need
for FAST ESP alternatives, differences between FAST and Solr, a
typical migration project lifecycle & methodology, complementary open
source tools, best practices, customer examples, and recommended next
steps.

The speakers for this webinar are:

Helge Legernes, Founding Partner & CTO of Findwise
Michael McIntosh, VP Search Solutions for TNR Global
Eric Gaumer, Chief Architect for ESR Technology.

For more information and to register, please go to:

http://SearchDataManagement.bitpipe.com/detail/RES/1288718603_527.html?asrc=CL_PRM_Lucid2
=

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexWriter.close() performance issue

2010-11-03 Thread Yonik Seeley
 It turns out that the prepareCommit() is the slow call here, taking several 
 seconds to complete.

 I've done some reading about it, but have not found anything that might be 
 helpful here. The fact that it is slow
 every single time, even when I'm adding exactly one document to the index, is 
 perplexing and leads me to
 think something must be corrupt with the index.

prepareCommit() syncs the index files, making sure they are on stable storage.
Some filesystems have issues with syncing individual files and
essentially sync all files with unflushed data, leading to poor
performance.

-Yonik
http://www.lucidimagination.com



On Wed, Nov 3, 2010 at 2:53 PM, Mark Kristensson
mark.kristens...@smartsheet.com wrote:
 I've successfully reproduced the issue in our lab with a copy from production 
 and have broken the close() call into parts, as suggested, with one addition.

 Previously, the call was simply

        ...
        } finally {
                // Close
                if (indexWriter != null) {
                        try {
                                indexWriter.close();
        ...


 Now, that is broken into the various parts, including a prepareCommit();

        ...
        } finally {
                // Close
                if (indexWriter != null) {
                        try {
                                indexWriter.prepareCommit();
                                Logger.debug("prepareCommit() complete");
                                indexWriter.commit();
                                Logger.debug("commit() complete");
                                indexWriter.maybeMerge();
                                Logger.debug("maybeMerge() complete");
                                indexWriter.waitForMerges();
                                Logger.debug("waitForMerges() complete");
                                indexWriter.close();
        ...


 It turns out that the prepareCommit() is the slow call here, taking several 
 seconds to complete.

 I've done some reading about it, but have not found anything that might be 
 helpful here. The fact that it is slow every single time, even when I'm 
 adding exactly one document to the index, is perplexing and leads me to
 think something must be corrupt with the index.

 Furthermore, I tried optimizing the index to see if that would have any 
 impact (I wasn't expecting much) and it did not.

 I'm stumped at this point and am thinking I may have to rebuild the index, 
 though I would definitely prefer to avoid doing that and would like to know 
 why this is happening.

 Thanks for your help,
 Mark


 On Nov 2, 2010, at 9:26 AM, Mark Kristensson wrote:


 Wonderful information on what happens during indexWriter.close(), thank you 
 very much! I've got some testing to do as a result.

 We are on Lucene 3.0.0 right now.

 One other detail that I neglected to mention is that the batch size does not 
 seem to have any relation to the time it takes to close on the index where 
 we are having issues. We've had batches add as few as 3 documents and 
 batches add as many as 2500 documents in the last hour and every single 
 close() call for that index takes 6 to 8 seconds. While I won't know until I 
 am able to individually test the different pieces of the close() operation, 
 I'd be very surprised if a batch that adds just 3 new documents results in 
 very much merge work being done. It seems as if there is some task happening 
 during merge that the indexWriter is never able to successfully complete and 
 so it tries to complete that task every single time close() is called.

 So, my working theory until I can dig deeper is that something is mildly 
 corrupt with the index (though not serious enough to affect most operations 
 on the index). Are there any good utilities for examining the health of an 
 index?

 I've dabbled with the experimental checkIndex object in the past (before we 
 upgraded to 3.0), but have found it to be incredibly slow and of marginal 
 value. Does anyone have any experience using CheckIndex to track down an 
 issue with a production index?

 Thanks again!
 Mark

 On Nov 2, 2010, at 2:20 AM, Shai Erera wrote:

 When you close IndexWriter, it performs several operations that might have a
 connection to the problem you describe:

 * Commit all the pending updates -- if your update batch size is more or
 less the same (i.e., comparable # of docs and total # bytes indexed), then
 you should not see a performance difference between closes.

 * Consults the MergePolicy and runs any merges it returns as candidates.

 * Waits for the merges to finish.

 Roughly, IndexWriter.close() can be substituted w/:
 writer.commit(false); // commits the changes, but does not run merges.
 writer.maybeMerge(); // runs merges returned by MergePolicy.
 writer.waitForMerges(); // if you use ConcurrentMergeScheduler, the above
 call returns immediately, not waiting for the merges to finish.
 writer.close(); // at this point, commit + 

Re: lucene norms cached twice

2010-10-29 Thread Yonik Seeley
On Fri, Oct 29, 2010 at 3:32 PM, Cabansag, Ronald-Alvin R
ronald-alvin.caban...@cengage.com wrote:
 We use a QueryWrapperFilter.getDocIdSet(indexReader) to get the DocIdSet and 
 compute the hit count using its iterator.

If you want to avoid double-caching of norms, then you should call
getDocIdSet() for each segment reader, not the top level reader.

Aside: presumably you're actually doing something more advanced than
getting the hit count (and you just simplified your description
because it wasn't pertinent)... since you can get the hit count from
TopDocs.
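
A minimal sketch of the per-segment approach (Lucene 3.x API assumed;
`topReader` and `queryWrapperFilter` are assumed to exist):

int hits = 0;
for (IndexReader segment : topReader.getSequentialSubReaders()) {
    // per-segment call, so norms/caches stay at the segment level
    DocIdSet set = queryWrapperFilter.getDocIdSet(segment);
    DocIdSetIterator it = (set == null) ? null : set.iterator();
    if (it == null) continue;                         // nothing matched in this segment
    while (it.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        hits++;
    }
}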

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Function Query, Required Clauses, and Matching

2010-10-25 Thread Yonik Seeley
On Mon, Oct 25, 2010 at 7:00 PM, Dennis Kubes ku...@apache.org wrote:
 A curiosity.  Some of the documentation for function queries says they match
 every document in the index.  When running a query that has boolean required
 clauses and an optional ValueSourceQuery or function query is the function
 query still matched against every document in the index or is it only on
 those documents that match required clauses?

It's only those that match the required clauses.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Checksum and transactional safety for lucene indexes

2010-09-24 Thread Yonik Seeley
On Tue, Sep 21, 2010 at 12:53 AM, Lance Norskog goks...@gmail.com wrote:
 If an index file is not completely written to disk, it never become
 available. Lucene has a file describing the current active index segments.
 It writes all new files to the disk, and changes the description file
 (segments.gen) only after that.

Right - but it's segments_n
segments.gen is actually optional (IIRC, solr doesn't even replicate
it to slaves).

-Yonik
http://lucenerevolution.org   Lucene/Solr Conference, Boston Oct 7-8

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Filters do not work with MultiSearcher?

2010-09-10 Thread Yonik Seeley
This is working as designed.

Note this method:
  public DocIdSet getDocIdSet(IndexReader indexReader) throws IOException {
    return openBitSet;
  }

You must pay attention to the IndexReader passed - and the DocIdSet
returned must always be based on that reader (and the first document
of every reader is always 0).  So returning the same DocIdSet each
time is not valid and will result in errors.
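
A minimal sketch of a filter that respects that contract (Lucene 3.x; the
term-based population is only an illustration of recomputing the set for
each reader that gets passed in):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

public class PerReaderTermFilter extends Filter {
    private final Term term;

    public PerReaderTermFilter(Term term) {
        this.term = term;
    }

    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        // Build the bit set against the reader we are given, so the doc IDs
        // are always relative to that reader (its first doc is always 0).
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs td = reader.termDocs(term);
        try {
            while (td.next()) {
                bits.set(td.doc());
            }
        } finally {
            td.close();
        }
        return bits;
    }
}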

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


On Fri, Sep 10, 2010 at 12:23 PM, Nader, John P john.na...@cengage.com wrote:
 We are attempting to perform a filtered search on two indices joined by a 
 MultiSearcher.  Unfortunately, it appears there is an issue in the lucene 
 code that is causing the filter to be simply reused at the starting ordinal 
 for each individual index instead of being augmented by the starting document 
 identifier.  We are hoping there is an alternate API that will allow us to 
 perform a filtered search on multiple indices.

 For example, we have two indices with three documents each, and a filter 
 containing only doc ID 1.  When we perform a filtered search on a 
 MultiSearcher that joins these two indices, we get two documents back (1, 4), 
 where we were expecting only the one.  This is because the MultiSearcher, 
 instead of starting at doc ID 3 for the second index, is interpreting the 
 filter individually for each index.

 We are using Lucene 3.0.2.  The API we see this behavior with is 
 MultiSearcher.search(Query, Filter, nDocs) with a MatchAllDocsQuery and the 
 filter code pasted below:

 public class OpenBitSetFilter extends Filter {

    private OpenBitSet openBitSet;


    public OpenBitSetFilter(OpenBitSet openBitSet) {
        this.openBitSet = openBitSet;
    }

    public DocIdSet getDocIdSet(IndexReader indexReader) throws IOException {
        return openBitSet;
    }

 }




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: API to retrieve search results without scoring or sorting

2010-07-19 Thread Yonik Seeley
On Mon, Jul 19, 2010 at 6:14 AM, Naveen Kumar id.n...@gmail.com wrote:
 Is there any API using which I can retrieve search results, such that they
 are neither scored nor sorted (for performance reasons). I just need the
 results, don't need any extra computation on that.

Use your own custom Collector class.
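
A minimal sketch of such a collector (Lucene 3.x Collector API assumed;
it just gathers matching doc IDs, with no scoring or sorting):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class DocIdCollector extends Collector {
    private final List<Integer> docs = new ArrayList<Integer>();
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) {
        // scores are ignored entirely
    }

    @Override
    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;   // remember the segment offset
    }

    @Override
    public void collect(int doc) {
        docs.add(docBase + doc);  // store the top-level doc ID
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true;              // order doesn't matter, no scoring needed
    }

    public List<Integer> getDocs() {
        return docs;
    }
}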

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Get lengthNorm of a field

2010-07-19 Thread Yonik Seeley
On Mon, Jul 19, 2010 at 9:53 AM, Philippe mailer.tho...@gmail.com wrote:
 is there a possibility to retrieve the lengthNorm for all (or a specific)
 fields in a specific document?

See IndexReader: public abstract byte[] norms(String field) throws IOException;
And Similarity: public float decodeNormValue(byte b) {

The byte[] is indexed by document id, and you can decode that into a
float value with a Similarity.
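
A minimal sketch of putting those two calls together (assuming `reader`,
`similarity`, and `docId` are already in scope, and a hypothetical field
name "body"):

byte[] norms = reader.norms("body");                          // one norm byte per doc id
float lengthNorm = similarity.decodeNormValue(norms[docId]);  // decode with your Similarity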

-Yonik
http://www.lucidimagination.com




 Regards,
    Philippe

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Could multiple indexers change same collections at the same time?

2010-06-24 Thread Yonik Seeley
Yes, all of that still applies to Lucene 3x and 4x, and is unlikely to
change any time soon.

-Yonik
http://www.lucidimagination.com

On Thu, Jun 24, 2010 at 1:51 PM, Zhang, Lisheng
lisheng.zh...@broadvision.com wrote:
 Hi,

 I remembered I tested earlier lucene 1.4 and 2.4, and found the following:

 # it is OK for multiple searchers to search the same collection.

 # it is OK for one IndexerWriter to edit and multiple searchers to search
   at the same time.

 # it is generally NOT OK for multiple IndexerWriter to change same collection
   at the same time.

 Could you confirm briefly if above are true and give me Yes/No answer whether
 in latest lucene 3x above conclusions are still OK?

 Thanks very much for helps, Lisheng

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: segment_N file is missed

2010-06-16 Thread Yonik Seeley
On Tue, Jun 15, 2010 at 5:23 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 CheckIndex is not able to recover from this corruption (missing
 segments_N file); this would be a nice addition...

 But it sounds like you've worked out a way to write your own segments_N?

 Use oal.store.ChecksumIndexOutput (wraps any other IndexOutput) to
 properly write the checksum.

 BTW how did you lose your segments_N file...?

Can this also be caused by the new behavior introduced here?
https://issues.apache.org/jira/browse/LUCENE-2386
If you open a writer, add docs, and then crash before calling commit?


-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Docs with any score are collected in the Collector implementations

2010-06-02 Thread Yonik Seeley
On Wed, Jun 2, 2010 at 1:10 PM,  jan.kure...@nokia.com wrote:
 that's probably because I move from lucene to solr.

 We will need to filter them from the result manually then first.

Solr has a function range query that can filter out any values outside
of the given range.
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/

And of course, a function query can consist of a normal relevancy
query... so here is a lucene query of text:solr with a lower bound
of 0 exclusive:

http://localhost:8983/solr/select?q={!frange l=0 incl=false}query($qq)&qq=text:solr

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using JSON for index input and search output

2010-05-30 Thread Yonik Seeley
On Sun, May 30, 2010 at 1:33 PM, Visual Logic visual.lo...@gmail.com wrote:
 JSON is the format used for all the configuration and property files in the 
 RIA application we are developing. Is Lucene able to create a document from a 
 given JSON file and index it? Is Lucene able to provide a JSON output 
 response from a query made to an index? Does the Tika package provide this?

No, and no.
XML, JSON, etc, are out of scope for lucene, which is a core search library.
Tika extracts text from documents like Word and PDF.

 Local indexing and searching is needed on the local client so Solr is not a 
 solution even though it does provide a search response in JSON format.

Solr is embeddable as well, so you can directly index/search.  But why
can't you run a separate server?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Using JSON for index input and search output

2010-05-30 Thread Yonik Seeley
On Sun, May 30, 2010 at 2:27 PM, Visual Logic visual.lo...@gmail.com wrote:
 Solr is embeddable but does that not just mean that SolrJ only provides the 
 ability to call Solr running on some server?

Nope - embeddable as in running in the same JVM as your application.

 For some of my use cases using Solr on a remote server would work fine. For 
 other cases it will not be quick enough,

Running as a separate server can be on the same host and be very
quick.  Was it too slow when you tried it?
It's a common misconception that HTTP is slow... it's really just a
TCP socket (which can be reused with persistent connections) with some
standardized headers.  Solr also has a binary protocol that works just
fine over HTTP, so it's really not more overhead than doing something
like talking to a database.

But the right solution probably depends on the details of your
specific usecases - if you elaborate on them, people may be able to
provide more specific recommendations.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to get the number of unique terms in the inverted index

2010-05-28 Thread Yonik Seeley
It seems like there should be a formula for estimating the total
number of unique terms given that you know the unique term counts for
each segment, and make certain assumptions like random document
distribution across segments.

-Yonik
http://www.lucidimagination.com

On Thu, May 27, 2010 at 9:17 PM, kannan chandrasekaran
ckanna...@yahoo.com wrote:
 I am just trying out a few experiments to calculate similarity between terms 
 based on their co-occurences in the dataset...  Basically I am trying to 
 build contextual vectors  and calculate similarity using a similarity measure 
 ( say cosine similarity).

 I dont think this is an XY problem . The vectors I am trying to build are not 
 the same as the TermVectors option ((term,freq) pairs per document) in the 
 lucene ( if thats what u meant)

 Thanks
 Kannan

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to get the number of unique terms in the inverted index

2010-05-27 Thread Yonik Seeley
On Thu, May 27, 2010 at 2:32 PM, kannan chandrasekaran
ckanna...@yahoo.com wrote:
 I was wondering if there is a way to retrieve the number of unique terms in
 Lucene (version 2.4.0) ... I am aware of the terms() / terms(Term)
 methods that return an enumeration (TermEnum), but that involves iterating
 through the terms and counting them.  I am looking for something similar to
 numDocs() in the IndexReader class.

No there is not.
In 4.0-dev, with the new flex APIs, you can retrieve the number of
unique terms in a single segment (Terms.getUniqueTermCount()), but not
a whole index.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT and Caching based on IndexReader

2010-05-17 Thread Yonik Seeley
On Mon, May 17, 2010 at 5:00 PM, Shay Banon kim...@gmail.com wrote:
   I wanted to verify if my understanding is correct. Assuming that I use
 NRT, and refresh, say, every 1 second, caching based on IndexReader, such is
 what is used in the CachingWrapperFilter is basically useless

No, it's fine.  Searching in Lucene is now done per-segment, and so
the readers that are passed to Filter.getDocIdSet are the segment
readers, not the top-level readers.  Caching is now per-segment.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT and Caching based on IndexReader

2010-05-17 Thread Yonik Seeley
Yep, confirmed what you are seeing.  I'll check into it and open an issue.

-Yonik
http://www.lucidimagination.com

On Mon, May 17, 2010 at 5:54 PM, Shay Banon kim...@gmail.com wrote:
 Yea, I noticed that ;). Even so, I think that with NRT, even the lower level
 readers are cloned, meaning that you always get a new instance... . Here is
 a sample program that tests this behavior, am I doing something wrong? By
 the way, if what I say is correct, it affects field cache as well

  public static void main(String[] args) throws Exception {
     Directory dir = new RAMDirectory();
     IndexWriter indexWriter = new IndexWriter(dir,
             new StandardAnalyzer(Version.LUCENE_CURRENT), true,
             IndexWriter.MaxFieldLength.UNLIMITED);
     IndexReader reader = indexWriter.getReader();

     Set<IndexReader> readers = new HashSet<IndexReader>(); // tracks all readers
     for (int i = 0; i < 100; i++) {
         readers.add(reader);
         Document doc = new Document();
         doc.add(new Field("id", Integer.toString(i), Field.Store.YES,
                 Field.Index.NO));
         indexWriter.addDocument(doc);

         IndexReader newReader = reader.reopen(true);
         if (reader == newReader) {
             System.err.println("Should not get the same reader...");
         } else {
             reader.close();
             reader = newReader;
         }
     }

     reader.close();

     // now, go and check that all are ref == 0
     // and, that all readers, even sub readers, are unique instances (sadly...)
     Set<IndexReader> allReaders = new HashSet<IndexReader>();
     for (IndexReader reader1 : readers) {
         if (reader1.getRefCount() != 0) {
             System.err.println("A reader is not closed");
         }
         if (allReaders.contains(reader1)) {
             System.err.println("Found an existing reader...");
         }
         allReaders.add(reader1);
         if (reader1.getSequentialSubReaders() != null) {
             for (IndexReader reader2 : reader1.getSequentialSubReaders()) {
                 if (reader2.getRefCount() != 0) {
                     System.err.println("A reader is not closed...");
                 }
                 if (allReaders.contains(reader2)) {
                     System.err.println("Found an existing reader...");
                 }
                 allReaders.add(reader2);

                 // there should not be additional readers...
                 if (reader2.getSequentialSubReaders() != null) {
                     System.err.println("Should not be more readers...");
                 }
             }
         }
     }

     indexWriter.close();
  }




 On Tue, May 18, 2010 at 12:30 AM, Yonik Seeley
 yo...@lucidimagination.comwrote:

 On Mon, May 17, 2010 at 5:00 PM, Shay Banon kim...@gmail.com wrote:
    I wanted to verify if my understanding is correct. Assuming that I use
  NRT, and refresh, say, every 1 second, caching based on IndexReader, such
 is
  what is used in the CachingWrapperFilter is basically useless

 No, it's fine.  Searching in Lucene is now done per-segment, and so
 the readers that are passed to Filter.getDocIdSet are the segment
 readers, not the top-level readers.  Caching is now per-segment.

 -Yonik
 http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT and Caching based on IndexReader

2010-05-17 Thread Yonik Seeley
On Mon, May 17, 2010 at 9:00 PM, Shay Banon kim...@gmail.com wrote:
 Great, so I am not imagining things this late into the night ... ;), not so
 great, since using NRT with field cache (like sorting) or caching filters,
 or anything that caches based on IndexReader not really an option. This
 makes NRT very problematic to use in a real application.

NRT is still pretty new :-)  And I do believe this is a bug, so we'll
get it fixed.
It's not actually a problem for FieldCache though - it no longer keys
on the reader directly (if deleted docs are the only things that have
changed, the FieldCache entry can still be shared).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT and Caching based on IndexReader

2010-05-17 Thread Yonik Seeley
On Mon, May 17, 2010 at 9:12 PM, Shay Banon kim...@gmail.com wrote:
 Just saw that you opened a case for that. I think that its important in your
 test case to also test for object identity, not just equals. This is because
 the IndexReader (or the FieldCacheKey) are used as keys in weak hash maps,
 which uses identity (==) equality for keys.

Yeah, just me being lazy... I just knew that those objects don't
implement equals and hence it ends up the same as ==.  But I agree an
explicit == would be better.

 If FieldCacheKey is supposed to represent the key by which index readers
 should be tested for equality (for example, it will be used in the
 CachingWrapperFilter), and not the index reader itself, then I think it
 should be renamed. What do you think? I am just looking now at what it does,
 its new...

I don't think it's general purpose, since it ignores things like a
change in deleted documents.  I think we should use the same reader
when the segment has not been changed.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT and Caching based on IndexReader

2010-05-17 Thread Yonik Seeley
On Mon, May 17, 2010 at 9:14 PM, Shay Banon kim...@gmail.com wrote:
 Oh, and one more thing. Deleted docs is a sub case, with NRT, most people
 will almost always add docs as well... . So it is still not really usable
 for field cache, right?

FieldCache should be fine for the general cases - the same entry will
be used if the segment hasn't changed at all, or if the segment has
only changed which documents are deleted.  Adding new documents adds
new segments and does affect (until merge) existing segments, so the
entries will be reused.

-Yonik
http://www.lucidimagination.com


 On Tue, May 18, 2010 at 4:12 AM, Shay Banon kim...@gmail.com wrote:

 Just saw that you opened a case for that. I think that its important in
 your test case to also test for object identity, not just equals. This is
 because the IndexReader (or the FieldCacheKey) are used as keys in weak hash
 maps, which uses identity (==) equality for keys.

 If FieldCacheKey is supposed to represent the key by which index readers
 should be tested for equality (for example, it will be used in the
 CachingWrapperFilter), and not the index reader itself, then I think it
 should be renamed. What do you think? I am just looking now at what it does,
 its new...

 -shay.banon


 On Tue, May 18, 2010 at 4:04 AM, Yonik Seeley 
 yo...@lucidimagination.comwrote:

 On Mon, May 17, 2010 at 9:00 PM, Shay Banon kim...@gmail.com wrote:
  Great, so I am not imagining things this late into the night ... ;), not
 so
  great, since using NRT with field cache (like sorting) or caching
 filters,
  or anything that caches based on IndexReader not really an option. This
  makes NRT very problematic to use in a real application.

 NRT is still pretty new :-)  And I do believe this is a bug, so we'll
 get it fixed.
 It's not actually a problem for FieldCache though - it no longer keys
 on the reader directly (if deleted docs are the only things that have
 changed, the FieldCache entry can still be shared).

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: FieldCache and 2.9

2010-05-11 Thread Yonik Seeley
You are requesting the FieldCache entry from the top-level reader and
hence a whole new FieldCache entry must be created.
Lucene 2.9 sorting requests FieldCache entries at the segment level
and hence reuses entries for those segments that haven't changed.
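
For example, a minimal sketch of the per-segment variant (2.9-era API,
using the same "field" name as the test below):

IndexReader reader2 = reader.reopen(true);
for (IndexReader segReader : reader2.getSequentialSubReaders()) {
  // unchanged segments just hit their existing entries; only the new
  // small segment is actually uninverted here
  FieldCache.DEFAULT.getStrings(segReader, "field");
}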

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague



On Tue, May 11, 2010 at 9:27 AM, Carl Austin carl.aus...@detica.com wrote:
 Hi,

 I have been using the FieldCache in lucene version 2.9 compared to that
 in 2.4. The load time is massively decreased, however I am not seeing
 any benefit in getting a field cache after re-open of an index reader
 when I have only added a few extra documents.
 A small test class is included below (based off one from Lucid
 Imagination), that creates 5Mil docs, gets a field cache, creates
 another few docs and gets the field cache again. I though the second get
 would be very very fast, as only 1 segment should have changed, however
 it takes more time for the reopen and cache get than it did the
 original.

 Am I doing something wrong here or have I misunderstood the new segment
 changes?

 Thanks

 Carl


 import java.io.File;

 import org.apache.lucene.analysis.SimpleAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.search.FieldCache;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;

 public class ContrivedFCTest {

        public static void main(String[] args) throws Exception {
                Directory dir = FSDirectory.open(new File(args[0]));
                IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true,
                                IndexWriter.MaxFieldLength.LIMITED);
                for (int i = 0; i < 500; i++) {
                        if (i % 10 == 0) {
                                System.out.println(i);
                        }
                        Document doc = new Document();
                        doc.add(new Field("field", "String" + i, Field.Store.NO,
                                        Field.Index.NOT_ANALYZED));
                        writer.addDocument(doc);
                }
                writer.close();

                IndexReader reader = IndexReader.open(dir, true);
                long start = System.currentTimeMillis();
                FieldCache.DEFAULT.getStrings(reader, "field");
                long end = System.currentTimeMillis();
                System.out.println("load time for initial field cache: "
                                + (end - start) / 1000.0f + "s");

                writer = new IndexWriter(dir, new SimpleAnalyzer(), false,
                                IndexWriter.MaxFieldLength.LIMITED);
                for (int i = 501; i < 505; i++) {
                        if (i % 10 == 0) {
                                System.out.println(i);
                        }
                        Document doc = new Document();
                        doc.add(new Field("field", "String" + i, Field.Store.NO,
                                        Field.Index.NOT_ANALYZED));
                        writer.addDocument(doc);
                }
                writer.close();

                IndexReader reader2 = reader.reopen(true);
                System.out.println("reader size = " + reader2.numDocs());
                long start2 = System.currentTimeMillis();
                FieldCache.DEFAULT.getStrings(reader2, "field");
                long end2 = System.currentTimeMillis();
                System.out.println("load time for re-opened field cache: "
                                + (end2 - start2) / 1000.0f + "s");
        }
 }




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MatchAllDocsQuery and MatchNoDocsQuery

2010-05-10 Thread Yonik Seeley
Yes on all counts.  Lucene doesn't modify query objects, so they are
safe for reuse among multiple threads.
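
For example, a minimal sketch of the shared constants you describe (the
class name is just a placeholder):

public final class Queries {
  public static final Query MATCH_ALL  = new MatchAllDocsQuery();
  public static final Query MATCH_NONE = new BooleanQuery(); // no clauses => matches nothing
}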

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague



2010/5/10 Mindaugas Žakšauskas min...@gmail.com:
 Hi,

 Can anybody confirm whether MatchAllDocsQuery can be used as an
 immutable singleton? By this I mean creating a single instance and
 sharing it whenever I need to either use it on its own or in
 conjunction with other queries put into a BooleanQuery, to return all
 documents in a search result. Can the same instance even be reused
 among different threads?

 What would be the best way of implementing MatchNoDocsQuery? My initial
 tests show that a new BooleanQuery() without any additional clauses
 would just do the job, but I just wanted to double check whether this
 is be a reliable assumption. Above questions also apply - could this
 be reused among different contexts, threads?

 Thanks in advance.

 Regards,
 Mindaugas

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: problem in Lucene's ranking function

2010-05-05 Thread Yonik Seeley
2010/5/5 José Ramón Pérez Agüera jose.agu...@gmail.com:
[...]
 The consequence is that a document
 matching a single query term over several fields could score much
 higher than a document matching several query terms in one field only,

One partial workaround that people use is DisjunctionMaxQuery (used by
dismax query parser in Solr).
http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html
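
For example, a minimal sketch (the field names are just placeholders):

DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(0.0f); // tie-breaker multiplier
dmq.add(new TermQuery(new Term("title", "lucene")));
dmq.add(new TermQuery(new Term("body", "lucene")));
// a doc's score is the max of the per-field scores instead of their sum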

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Fwd: Apache Lucene EuroCon Call For Participation: Prague, Czech Republic May 20 21, 2010

2010-03-24 Thread Yonik Seeley
Forwarding to lucene only - the big cross-post caused my gmail filters
to file it.
-Yonik

-- Forwarded message --
From: Grant Ingersoll gsing...@apache.org
Date: Wed, Mar 24, 2010 at 8:03 PM
Subject: Apache Lucene EuroCon Call For Participation: Prague, Czech
Republic May 20  21, 2010
To: Lucene mailing list gene...@lucene.apache.org,
solr-u...@lucene.apache.org, java-user@lucene.apache.org,
mahout-u...@lucene.apache.org, nutch-u...@lucene.apache.org,
openrelevance-u...@lucene.apache.org, tika-u...@lucene.apache.org,
pylucene-u...@lucene.apache.org, connectors-...@incubator.apache.org,
lucene-net-...@lucene.apache.org


Apache Lucene EuroCon Call For Participation - Prague, Czech Republic
May 20  21, 2010

All submissions must be received by Tuesday, April 13, 2010, 12
Midnight CET/6 PM US EDT

The first European conference dedicated to Lucene and Solr is coming
to Prague from May 18-21, 2010. Apache Lucene EuroCon is running on on
not-for-profit basis, with net proceeds donated back to the Apache
Software Foundation. The conference is sponsored by Lucid Imagination
with additional support from community and other commercial
co-sponsors.

Key Dates:
24 March 2010: Call For Participation Open
13 April 2010: Call For Participation Closes
16 April 2010: Speaker Acceptance/Rejection Notification
18-19 May 2010: Lucene and Solr Pre-conference Training Sessions
20-21 May 2010: Apache Lucene EuroCon

This conference creates a new opportunity for the Apache Lucene/Solr
community and marketplace, providing  the chance to gather, learn and
collaborate on the latest in Apache Lucene and Solr search
technologies and what's happening in the community and ecosystem.
There will be two days of Lucene and Solr training offered May 18 
19, and followed by two days packed with leading edge Lucene and Solr
Open Source Search content and talks by search and open source thought
leaders.

We are soliciting 45-minute presentations for the conference, 20-21
May 2010 in Prague. The conference and all presentations will be in
English.

Topics of interest include:
- Lucene and Solr in the Enterprise (case studies, implementation,
return on investment, etc.)
- “How We Did It”  Development Case Studies
- Spatial/Geo search
- Lucene and Solr in the Cloud
- Scalability and Performance Tuning
- Large Scale Search
- Real Time Search
- Data Integration/Data Management
- Tika, Nutch and Mahout
- Lucene Connectors Framework
- Faceting and Categorization
- Relevance in Practice
- Lucene  Solr for Mobile Applications
- Multi-language Support
- Indexing and Analysis Techniques
- Advanced Topics in Lucene  Solr Development

All accepted speakers will qualify for discounted conference
admission. Financial assistance is available for speakers that
qualify.

To submit a 45-minute presentation proposal, please send an email to
c...@lucene-eurocon.org containing the following information in plain
text:

1. Your full name, title, and organization

2. Contact information, including your address, email, phone number

3. The name of your proposed session (keep your title simple and
relevant to the topic)

4. A 75-200 word overview of your presentation (in English); in
addition to the topic, describe whether your presentation is intended
as a tutorial, description of an implementation, an
theoretical/academic discussion, etc.

5. A 100-200-word speaker bio that includes prior conference speaking
or related experience (in English)

To be considered, proposals must be received by 12 Midnight CET
Tuesday, 13 April 2010 (Tuesday 13 April 6 PM US Eastern time, 3 PM US
Pacific Time).

Please email any questions regarding the conference to
i...@lucene-eurocon.org. To be added to the conference mailing list,
please email sig...@lucene-eurocon.org. If your organization is
interested in sponsorship opportunities, email
spon...@lucene-eurocon.org

Key Dates

24 March 2010: Call For Participation Open
13 April 2010: Call For Participation Closes
16 April 2010: Speaker Acceptance/Rejection Notification
18-19 May 2010  Lucene and Solr Pre-conference Training Sessions
20-21 May 2010: Apache Lucene EuroCon

We look forward to seeing you in Prague!

Grant Ingersoll
Apache Lucene EuroCon Program Chair
www.lucene-eurocon.org

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Combining TopFieldCollector with custom Collector

2010-03-11 Thread Yonik Seeley
On Thu, Mar 11, 2010 at 4:10 PM, Peter Keegan peterlkee...@gmail.com wrote:
 I want the TFC to do all the cool things it does like custom sorting, saving
 the field values, max score, etc. I suppose the custom Collector could
 explicitly delegate all TFC's methods, but this doesn't seem right.

No need to delegate the TFC specific methods... just wrap the TFC in
your own collector, do the search, and then directly access the TFC to
get what you need.  This is what Solr does.
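
For example, a minimal sketch of the wrapping (2.9-era API; sort, numHits
and query are assumed to already exist):

final TopFieldCollector tfc =
    TopFieldCollector.create(sort, numHits, true, true, true, false);
searcher.search(query, new Collector() {
  public void setScorer(Scorer scorer) throws IOException { tfc.setScorer(scorer); }
  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    tfc.setNextReader(reader, docBase);
  }
  public void collect(int doc) throws IOException {
    tfc.collect(doc);
    // ... your own per-hit work goes here ...
  }
  public boolean acceptsDocsOutOfOrder() { return tfc.acceptsDocsOutOfOrder(); }
});
TopDocs top = tfc.topDocs();  // access the TFC directly for the sorted results, max score, etc.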

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NumericField exact match

2010-02-27 Thread Yonik Seeley
On Fri, Feb 26, 2010 at 3:33 PM, Ivan Vasilev ivasi...@sirma.bg wrote:
 Does it matter precision step when I use NumericRangeQuery for exact
 matches?

No.  There is a full-precision version of the value indexed regardless
of the precision step, and that's used for an exact match query.

 I mean if I use the default precision step when indexing that
 field, is it guaranteed that:
 1. With this query I will always hit the docs that contain val for the
 field;
 2. I will never hit docs that have a different val for the
 field;

Correct.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sort and Collector

2010-02-03 Thread Yonik Seeley
On Wed, Feb 3, 2010 at 1:40 PM, tsuraan tsur...@gmail.com wrote:
 Is there any way to run a search where I provide a Query, a Sort, and
 a Collector?  I have a case where it is sometimes, but rarely,
 necessary to get all the results from a query, but usually I'm
 satisfied with a smaller amount.  That part I can do with just a query
 and a collector, but I'd like the results to be sorted as they are
 submitted to the collector's collect method.  Is that possible?

It's not really possible.
Lucene must iterate over all of the hits before it knows for sure that
it has the top sorted by any criteria (other than docid).
A Collector is called for every hit as it happens, and thus one can't
specify a sort order (sorting itself is actually implemented with a
sorting Collector).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NumericRangeQuery performance with 1/2 billion documents in the index

2010-01-03 Thread Yonik Seeley
Perhaps this is just a huge index, and not enough of it can be cached in RAM.
Adding additional clauses to a boolean query incrementally destroys locality.

104GB of index and 4GB of RAM means you're going to be hitting the
disk constantly.  You need more hardware - if your requirements are
low (low query volume, high query latency of a few seconds OK) then
you can probably get away with a single box... just either get an SSD
or get more RAM (like 32G or more).

If you want higher query volumes or consistent sub-second search,
you're going to have to go distributed.
Roll your own or look at Solr.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NumericRangeQuery performance with 1/2 billion documents in the index

2010-01-03 Thread Yonik Seeley
On Sun, Jan 3, 2010 at 10:42 AM, Karl Wettin karl.wet...@gmail.com wrote:

 3 jan 2010 kl. 16.32 skrev Yonik Seeley:

 Perhaps this is just a huge index, and not enough of it can be cached in
 RAM.
 Adding additional clauses to a boolean query incrementally destroys
 locality.

 104GB of index and 4GB of RAM means you're going to be hitting the
 disk constantly.  You need more hardware - if your requirements are
 low (low query volume, high query latency of a few seconds OK) then
 you can probably get away with a single box... just either get an SSD
 or get more RAM (like 32G or more).

 If you want higher query volumes or consistent sub-second search,
 you're going to have to go distributed.
 Roll your own or look at Solr.

 I'm not sure I agree.

 A 104GB index says nothing about the date field. And it says nothing about
 the range of the query.

Given that there are 500M docs, one can make an educated guess that
much of this 104GB is index and not just stored fields.  IMO, it's
simply too many docs and too big of a ratio between RAM and index size
for good query performance.  But I don't think we've heard what the
requirements for this index are.

A quick ls -l of the index directory would be revealing though.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Finding the highest term in a field

2009-11-19 Thread Yonik Seeley
On Thu, Nov 19, 2009 at 1:04 AM, Daniel Noll dan...@nuix.com wrote:
 I take it the existing numeric fields can't already do stuff like
 this?

Nope, it's a fundamental limitation of the current TermEnums.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Finding the highest term in a field

2009-11-18 Thread Yonik Seeley
On Wed, Nov 18, 2009 at 10:48 PM, Daniel Noll dan...@nuix.com wrote:
 But what if I want to find the highest?  TermEnum can't step backwards.

I've also wanted to do the same. It's coming with the new flexible
indexing patch:
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12764020#action_12764020

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Sort fields shouldn't be tokenized

2009-11-16 Thread Yonik Seeley
On Mon, Nov 16, 2009 at 11:38 AM, Jeff Plater
jpla...@healthmarketscience.com wrote:
 Thanks - so if my sort field is a single term then I should be ok with
 using an analyzer (to lowercase it for example).

Correct - the key is that there is not more than one token per
document for the field being sorted on.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: share some numbers for range queries

2009-11-15 Thread Yonik Seeley
On Mon, Nov 16, 2009 at 1:02 AM, John Wang john.w...@gmail.com wrote:
   I did some performance analysis for different ways of doing numeric
 ranging with lucene. Thought I'd share:

FYI, the second approach is already implemented in both Lucene and Solr.
http://lucene.apache.org/java/2_9_1/api/core/org/apache/lucene/search/FieldCacheRangeFilter.html
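
For example (2.9-era API; the field name is just a placeholder):

Filter f = FieldCacheRangeFilter.newIntRange("age", 18, 65, true, true);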

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Equality Numeric Query

2009-11-11 Thread Yonik Seeley
On Wed, Nov 11, 2009 at 8:54 AM, Shai Erera ser...@gmail.com wrote:
 I index documents with numeric fields using the new Numeric package. I
 execute two types of queries: range queries (for example, [1 TO 20}) and
 equality queries (for example 24.75). Don't mind the syntax.

 Currently, to execute the equality query, I create a NumericRangeQuery with
 the lower/upper value being 24.75 and both limits are set to inclusive. Two
 questions:
 1) Is there a better approach? For example, if I had indexed the values as
 separate terms, I could create a TermQuery.

Create a term query on NumericUtils.floatToPrefixCoded(24.75f)
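
For example, a minimal sketch (the field name is just a placeholder):

Query eq = new TermQuery(new Term("price", NumericUtils.floatToPrefixCoded(24.75f)));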

 2) Can I run into precision issues such that 24.751 will be matched as well?

Nope... every numeric indexed value has it's precision indexed along
with it as a prefix, so there will be no false matches.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene index write performance optimization

2009-11-10 Thread Yonik Seeley
On Tue, Nov 10, 2009 at 11:43 AM, Jamie Band ja...@stimulussoft.com wrote:
 As an aside note, is there any way for Lucene to support simultaneous writes
 to an index?

The indexing process is highly parallelized... just use multiple
threads to add documents to the same IndexWriter.
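
For example, a minimal sketch (dir, analyzer and docs are assumed to
already exist) - one shared IndexWriter, several indexing threads:

final IndexWriter writer =
    new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
ExecutorService pool = Executors.newFixedThreadPool(4);
for (final Document doc : docs) {
  pool.submit(new Runnable() {
    public void run() {
      try {
        writer.addDocument(doc);  // IndexWriter is safe to share across threads
      } catch (IOException e) {
        // handle/log
      }
    }
  });
}
pool.shutdown();
pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
writer.close();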

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Proposal for changing Lucene's backwards-compatibility policy

2009-10-27 Thread Yonik Seeley
On Tue, Oct 27, 2009 at 9:07 PM, Luis Alves lafa...@gmail.com wrote:
 But there needs to be some forced push for these shorter major release
 cycles,
 to allow for code cleanup cycles to also be shorter.

Maybe... or maybe not.
There's also value in a more stable API over a longer period of time.
Different people will pick a different balance, and it's not as simple
as declaring that we need to be able to remove older APIs faster.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: help needed improving lucene concurret search performance

2009-10-23 Thread Yonik Seeley
How many processors do you have on this system?
If you are CPU bound, 100 threads is going to be 10 times slower (at a
minimum) than 10 threads (unless you have more than 10 CPUs).

-Yonik
http://www.lucidimagination.com

On Fri, Oct 23, 2009 at 2:18 AM, Wilson Wu songzi0...@gmail.com wrote:
 Dear Friend,
      I have encountered some performance problems recently with Lucene
  search 2.9. I use a single IndexSearcher in the whole system. It seems
  perfect when there are fewer than 10 threads searching concurrently.
  But if there are more than 100 threads doing concurrent searches, the
  average response time becomes bigger (>1s), and the max response
  time reaches 299s. I really don't know how to improve this; can you help me?
   Thanks a lot !

                                                              Wilson
                                                              2009.10.23
    The profiling result about 400 concurret search is at:
 http://i3.6.cn/cvbnm/aa/f5/00/63521d982a469f5063b82268eee91d08.gif
 it seems a lot of time consumed by TermScorer.score.
    Follewing is my servlet class which is reponse to search request:
 public final class DispatchServlet extends
 javax.servlet.http.HttpServlet implements javax.servlet.Servlet
 {
      private static final long serialVersionUID = -5547647006004900451L;
      protected final Log log = LogFactory.getLog(getClass());
      protected Searcher searcher;
      protected Directory dir;
      protected RAMDirectory ram;

    public DispatchServlet() {
        super();
     }

    public void init() throws ServletException {
        super.init();
        try {

                dir = FSDirectory.open(new File("/usr/bestv/search_engin_index/index/program"));
                 ram = new RAMDirectory(dir);
                 searcher = new IndexSearcher(ram, true);
                 int h = searcher.search(tq, null, 1).totalHits;
                 System.out.println("the searcher has warmed and searched " + h + " docs");
            }
        } catch (IOException e) {
            log.error(e);
        }
  }

 protected void doPost(HttpServletRequest request, HttpServletResponse
 response) throws ServletException, IOException {
               response.setContentType("text/html");
               doExecute(request.getParameter("q"), response);
       }

 protected void doGet(HttpServletRequest request, HttpServletResponse
 response) throws ServletException, IOException {
               response.setContentType("text/html");
               try {
                       String schCon = URLDecoder.decode(request.getParameter("q"), "UTF-8");
                       doExecute(schCon, response);
               } catch (Exception e) {
                       response.getWriter().write("Parameter Error, please send param 'q'");
               }
       }

    public void doExecute(String schCon,HttpServletResponse response)
 throws ServletException,IOException{
               response.getWriter().write(new 
 SearchCommand().search(searcher));
  }

 }

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Clarification on TokenStream.close() needed

2009-10-20 Thread Yonik Seeley
2009/10/20 Teruhiko Kurosaka k...@basistech.com:
 My Tokenizer started showing an error when I switched
 to Solr 1.4 dev version.  I am not too confident but
 it seems that Solr 1.4 calls close() on my Tokenizer
 before calling reset(Reader) in order to reuse
 the Tokenizer.  That is, close() is called more than
 once.

Is this when indexing a document, or querying a document.
close() should only be called once.

If indexing, it would be closed in Lucene at DocInverterPerField.java:197

-Yonik
http://www.lucidimagination.com



 The API doc of close() reads:
 Releases resources associated with this stream.

 So I thought close() shold be called only once,
 and the Tokenizer objects cannot be reused after
 close() is called.  Is my interpretation correct?

 If my interpretation is wrong and it is legal to
 call close() more than once, where is the best place
 to free per-instance resources?

 T. Kuro Kurosaka


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Hits and TopDoc

2009-10-20 Thread Yonik Seeley
On Tue, Oct 20, 2009 at 5:03 PM, Nathan Howard natehowa...@gmail.com wrote:
 This is sort of related to the above question, but I'm trying to update some
 (now deprecated) Java/Lucene code that I've become aware of once we started
 using 2.4.1 (we were previously using 2.3.2):

 Hits results = MultiSearcher.search(Query));

 int start = currentPage * resultsPerPage;
 int stop = (currentPage + 1) * resultsPerPage();

 for(int x = start; (x  searchResults.length())  (x  stop); x++)
 {
    Document doc = searchResults.doc(x);
    // do search post-processing with the Document
 }

 Results per page is normally small (10ish or so).

 I'm having difficulty figuring out how to get TopDocs to replicate this
 paging functionality (which the application must maintain).

You do it the same way basically... calculate the biggest doc you
need (stop-1 in your code), ask for that many TopDocs, and then
iterate over the page you want, calling
searcher.doc(topDocs.scoreDocs[x].doc)
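
For example, a minimal sketch of the same paging loop on top of TopDocs
(2.4-era API; searcher/query and the paging variables stand in for yours):

int start = currentPage * resultsPerPage;
int stop = (currentPage + 1) * resultsPerPage;

TopDocs topDocs = searcher.search(query, null, stop);  // ask for enough hits to cover the page
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (int x = start; x < scoreDocs.length && x < stop; x++) {
    Document doc = searcher.doc(scoreDocs[x].doc);
    // do search post-processing with the Document
}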

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Hits and TopDoc

2009-10-20 Thread Yonik Seeley
Hmm, yes, I should have thought of quoting the javadoc :-)
The Hits javadoc has been updated though... we shouldn't be pushing
people toward collectors unless they really need them:

 *   TopDocs topDocs = searcher.search(query, numHits);
 *   ScoreDoc[] hits = topDocs.scoreDocs;
 *   for (int i = 0; i < hits.length; i++) {
 * int docId = hits[i].doc;
 * Document d = searcher.doc(docId);
 * // do something with current hit


-Yonik
http://www.lucidimagination.com



On Tue, Oct 20, 2009 at 5:27 PM, Steven A Rowe sar...@syr.edu wrote:
 Hi Nathan,

 On 10/20/2009 at 5:03 PM, Nathan Howard wrote:
 This is sort of related to the above question, but I'm trying to update
 some (now deprecated) Java/Lucene code that I've become aware of once we
 started using 2.4.1 (we were previously using 2.3.2):

 Hits results = MultiSearcher.search(Query);

 int start = currentPage * resultsPerPage;
 int stop = (currentPage + 1) * resultsPerPage();

 for(int x = start; (x  searchResults.length())  (x  stop); x++)
 {
     Document doc = searchResults.doc(x);
     // do search post-processing with the Document
 }

 Results per page is normally small (10ish or so).

 I'm having difficulty figuring out how to get TopDocs to replicate this
 paging functionality (which the application must maintain).

 From 
 http://lucene.apache.org/java/2_4_1/api/core/org/apache/lucene/search/Hits.html:
 =
 Deprecated. Hits will be removed in Lucene 3.0.

 Instead e. g. TopDocCollector and TopDocs can be used:

   TopDocCollector collector = new TopDocCollector(hitsPerPage);
   searcher.search(query, collector);
   ScoreDoc[] hits = collector.topDocs().scoreDocs;
    for (int i = 0; i < hits.length; i++) {
     int docId = hits[i].doc;
     Document d = searcher.doc(docId);
     // do something with current hit
     ...
 =

 Construct the TopDocCollector with your stop variable instead of 
 hitsPerPage, initialize the loop control variable with the value of your 
 start variable instead of 0, and you should be good to go.

 Steve


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Proposal for changing Lucene's backwards-compatibility policy

2009-10-16 Thread Yonik Seeley
On Fri, Oct 16, 2009 at 4:54 AM, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 On Fri, Oct 16, 2009 at 10:23 AM, Danil ŢORIN torin...@gmail.com wrote:
 What about creating major version more often?

 +1 We're not going to run out of version numbers, so I don't see a
 reason not to upgrade the major version number when making
 backwards-incompatible changes.

+1
(Option A).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NPE in NearSpansUnordered

2009-10-15 Thread Yonik Seeley
Are you using any custom query types?  Anything to help us reproduce
(like the actual query this happened on) would be greatly appreciated.

-Yonik
http://www.lucidimagination.com


On Thu, Oct 15, 2009 at 1:17 PM, Peter Keegan peterlkee...@gmail.com wrote:
 I'm using Lucene 2.9 and sometimes get a NPE in NearSpansUnordered:

  java.lang.NullPointerException
 at
 org.apache.lucene.search.spans.NearSpansUnordered.start(NearSpansUnordered.java:219)
 at
 org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.processPayloads(PayloadNearQuery.java:201)
 at
 org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.getPayloads(PayloadNearQuery.java:180)
 at
 org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.getPayloads(PayloadNearQuery.java:183)
 at
 org.apache.lucene.search.payloads.PayloadNearQuery$PayloadNearSpanScorer.setFreqCurrentDoc(PayloadNearQuery.java:214)
 at org.apache.lucene.search.spans.SpanScorer.nextDoc(SpanScorer.java:64)
 at org.apache.lucene.search.Scorer.score(Scorer.java:74)
 at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:247)
 at org.apache.lucene.search.Searcher.search(Searcher.java:152)

 The CellQueue pq is empty when this occurs. Are there any conditions in
 which the queue might be expected to be empty?

 Peter


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
Guys, please - you're not new at this... this is what JavaDoc is for:

  /**
   * Returns a readonly reader containing all
   * current updates.  Flush is called automatically.  This
   * provides near real-time searching, in that changes
   * made during an IndexWriter session can be made
   * available for searching without closing the writer.
   *
   * <p>It's near real-time because there is no hard
   * guarantee on how quickly you can get a new reader after
   * making changes with IndexWriter.  You'll have to
   * experiment in your situation to determine if it's
   * fast enough.  As this is a new and experimental
   * feature, please report back on your findings so we can
   * learn, improve and iterate.</p>
   *
   * <p>The resulting reader supports {@link
   * IndexReader#reopen}, but that call will simply forward
   * back to this method (though this may change in the
   * future).</p>
   *
   * <p>The very first time this method is called, this
   * writer instance will make every effort to pool the
   * readers that it opens for doing merges, applying
   * deletes, etc.  This means additional resources (RAM,
   * file descriptors, CPU time) will be consumed.</p>
   *
   * <p>For lower latency on reopening a reader, you should
   * call {@link #setMergedSegmentWarmer} to
   * pre-warm a newly merged segment before it's committed
   * to the index.  This is important for minimizing
   * index-to-search delay after a large merge.</p>
   *
   * <p>If an addIndexes* call is running in another thread,
   * then this reader will only search those segments from
   * the foreign index that have been successfully copied
   * over, so far.</p>
   *
   * <p><b>NOTE</b>: Once the writer is closed, any
   * outstanding readers may continue to be used.  However,
   * if you attempt to reopen any of those readers, you'll
   * hit an {@link AlreadyClosedException}.</p>
   *
   * <p><b>NOTE:</b> This API is experimental and might
   * change in incompatible ways in the next release.</p>
   *
   * @return IndexReader that covers entire index plus all
   * changes made so far by this IndexWriter instance
   *
   * @throws IOException
   */
  public IndexReader getReader() throws IOException {
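
For example, a minimal sketch of using it (writer, doc and query are
assumed to already exist):

writer.addDocument(doc);                     // uncommitted changes...
IndexReader nrtReader = writer.getReader();  // ...are visible to this reader
IndexSearcher searcher = new IndexSearcher(nrtReader);
try {
  TopDocs hits = searcher.search(query, 10);
  // use hits ...
} finally {
  searcher.close();
  nrtReader.close();  // release the reader; the writer stays open
}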


-Yonik
http://www.lucidimagination.com


On Mon, Oct 12, 2009 at 4:18 PM, John Wang john.w...@gmail.com wrote:
 Oh, that is really good to know!
 Is this deterministic? e.g. as long as writer.addDocument() is called, next
 getReader reflects the change? Does it work with deletes? e.g.
 writer.deleteDocuments()?
 Thanks Mike for clarifying!

 -John

 On Mon, Oct 12, 2009 at 12:11 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

Just to clarify: IndexWriter.getReader returns a reader that searches
 uncommitted changes as well.  Ie, you need not call IndexWriter.commit
 to make the changes visible.

 However, if you're opening a reader the normal way
 (IndexReader.open) then it is necessary to first call
 IndexWriter.commit.

 Mike

 On Mon, Oct 12, 2009 at 5:24 AM, melix cedric.champ...@lingway.com
 wrote:
 
  Hi,
 
  I'm going to replace an old reader/writer synchronization mechanism we
 had
  implemented with the new near realtime search facilities in Lucene 2.9.
  However, it's still a bit unclear on how to efficiently do it.
 
  Is the following implementation the good way to do achieve it ? The
 context
  is concurrent read/writes on an index :
 
  1. create a Directory instance
  2. create a writer on this directory
  3. on each write request, add document to the writer
  4. on each read request,
   a. use writer.getReader() to obtain an up-to-date reader
   b. create an IndexSearcher with that reader
   c. perform Query
   d. close IndexSearcher
  5. on application close
   a. close writer
   b. close directory
 
  While this seems to be ok, I'm really wondering about the performance of
  opening a searcher for each request. I could introduce some kind of delay
  and cache a searcher for some seconds, but I'm not sure it's the best
 thing
  to do.
 
  Thanks,
 
  Cedric
 
 
  --
  View this message in context:
 http://www.nabble.com/Realtime-search-best-practices-tp25852756p25852756.html
  Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 
 
  -
  To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
On Mon, Oct 12, 2009 at 4:35 PM, Jake Mannix jake.man...@gmail.com wrote:
  It may be surprising, but in fact I have read that
 javadoc.

It was not your email I responded to.

  It talks about not needing to close the
 writer, but doesn't specifically talk about what
 the relationship between commit() calls and
 getReader() calls is.

Do you have a suggestion of how to update the JavaDoc?
I'm not sure I understand the relationship between commit and
getReader that you refer to.

 , but why
 is it so obvious that what could be happening
 is that it only returns all changes since the last
 commit, but without touching disk because it
 has docs in memory as well?

Sorry, this seems confusing - I'm not sure what you're trying to say.
Perhaps we should approach this as proposed javadoc changes?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Realtime search best practices

2009-10-12 Thread Yonik Seeley
Good point on isCurrent - I think it should only be with respect to
the latest index commit point? and we should clarify that in the
javadoc.

[...]
 // but what does the nrtReader say?
 // it does not have access to the most recent commit
 // state, as there's been a commit (with documents)
 // since it was opened.  But the nrtReader *has* those
 // documents.

I think we keep it simple - the nrtReader.isCurrent() would return
false after a commit is called.
Yes, isCurrent() is no longer such a great name.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Yonik Seeley
On Wed, Sep 16, 2009 at 12:33 PM, Uwe Schindler u...@thetaphi.de wrote:
 How should we proceed? Stop the final artifact build and voting or proceed
 with the release of 2.9? We waited so long and for most people it is faster
 than slower!

I think we know that 2.9 will not be faster for everyone:
  - Per-segment searching and the new comparators are a general win,
but will be slower for some people.
  - Query parsing and small document indexing will be somewhat slower
due to the new token APIs (the workarounds for back compatibility) if
token streams aren't reused.

I don't see any indication of any bugs in Lucene in this thread either.

-Yonik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Yonik Seeley
It's been a while since I wrote that benchmarker... is it OK that the
answer is different?  Did you use the same test file?

-Yonik
http://www.lucidimagination.com



On Tue, Sep 15, 2009 at 2:18 PM, Mark Miller markrmil...@gmail.com wrote:
 The results:

 config: impl=SeparateFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295611, ms=173550, MB/sec=1683.7899579371938

 config: impl=ChannelFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=1377768, MB/sec=212.09793463050383

 config: impl=ChannelPread serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=632253, MB/sec=462.19115955163517

 config: impl=PooledPread serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=774664, MB/sec=377.2238637654518

 ClassicFile was heading for the same fate as ChannelFile.


 I'll have to check what its like on the file system - but it appears
 just ridiculously slower. Even with SeparateFile, All 4 cores are bouncing
 from 0-12% independently and really favoring the low end of that.
 ChannelPread appears no better.

 There are results from other OS's/setups in the JIRA issue.

 I'm using ext4.

 Uwe Schindler wrote:
 How does a conventional file system compare?

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Tuesday, September 15, 2009 7:15 PM
 To: java-user@lucene.apache.org
 Subject: Re: lucene 2.9.0RC4 slower than 2.4.1?

 Mark Miller wrote:

 Indeed - I just ran the FileReaderTest on a Linux tmpfs ramdisk - with
 SeparateFile all 4 of my cores are immediately pinned and remain so.
 With ChannelFile, all 4 cores hover 20-30%.

 It would appear it may not be a good idea to use NIOFSDirectory on

 ramdisks.

 Even still though - it looks like you have a further issue - your Lucene
 2.9 old-api results don't use it, and are still not good.



 The quick results:

 ramdisk: sudo mount -t tmpfs tmpfs /tmp/space -o
 size=1G,nr_inodes=200k,mode=01777

 config: impl=SeparateFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295611, ms=173550, MB/sec=1683.7899579371938

 config: impl=ChannelFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=1377768, MB/sec=212.09793463050383


 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Yonik Seeley
Remember to disable CPU frequency scaling when benchmarking... some
things with IO cause the freq to drop, and when it's CPU bound again
it takes a while for Linux to scale up the freq again.

For example, on my ubuntu box, ChannelFile went from 100MB/sec to
388MB/sec.  This effect probably won't be uniform across
implementations either.

-Yonik
http://www.lucidimagination.com

On Tue, Sep 15, 2009 at 3:25 PM, Mark Miller markrmil...@gmail.com wrote:
 I just realized I hadn't sent this one. Here are results from the hard drive:

 It looks like it's closer to the same speed on the hard drive once
 everything is loaded in the system cache (as you'd expect). SeparateFile
 was 1200 vs almost 1700 on RAMDISK. ChannelPread looked a lot closer though.


 - Mark

 Tests from harddisk (filesystem cache warmed):

 config: impl=SeparateFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282293977, ms=238230, MB/sec=1226.6370616630988

 config: impl=ChannelPread serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=766340, MB/sec=381.3212767179059


 Mark Miller wrote:
 Michael McCandless wrote:

 I don't like that the answer is different... but it's really really
 odd that it's different-yet-almost-the-same.

 Mark, were these 4 results on a normal (ext4) filesystem, or tmpfs?
 (Because the top 2 entries of your 4 results match the first set of 2
 entries you sent... so I'm thinking these 4 were actually tmpfs not
 ext4).


 Those 4 were tmpfs - I mention ext4 at the end because I had just given
 a feel for the hard drive tests and wanted to note it was from ext4 - the
 results are def ramdisk though.

 What JRE/OS, linux, kernel versions, and hardware, are you running on?


 These are on:
 Ubuntu Karmic Koala 9.10, currently updated
 java-1.5.0-sun-1.5.0.20
 2.6.31-10-generic

 RAM: 3.9 Gig
 4 core Intel Core2 duo 2.0GHz

 Slow 5200 rpm laptop drives.


 The gains of SeparateFile over all else are stunning.  And, quite
 different from the linux tests I had run under LUCENE-753.  Maybe we
 need to revert FSDir.open to return SimpleFSDir again, on non-Windows
 hosts.  But then we don't have good concurrency...

 Mike

 On Tue, Sep 15, 2009 at 2:59 PM, Yonik Seeley
 yonik.see...@lucidimagination.com wrote:


 It's been a while since I wrote that benchmarker... is it OK that the
 answer is different?  Did you use the same test file?

 -Yonik
 http://www.lucidimagination.com



 On Tue, Sep 15, 2009 at 2:18 PM, Mark Miller markrmil...@gmail.com wrote:


 The results:

 config: impl=SeparateFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295611, ms=173550, MB/sec=1683.7899579371938

 config: impl=ChannelFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=1377768, MB/sec=212.09793463050383

 config: impl=ChannelPread serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=632253, MB/sec=462.19115955163517

 config: impl=PooledPread serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=774664, MB/sec=377.2238637654518

 ClassicFile was heading for the same fate as ChannelFile.


 I'll have to check what it's like on the file system - but it appears
 just ridiculously slower. Even with SeparateFile, All 4 cores are bouncing
 from 0-12% independently and really favoring the low end of that.
 ChannelPread appears no better.

 There are results from other OS's/setups in the JIRA issue.

 I'm using ext4.

 Uwe Schindler wrote:


 How does a conventional file system compare?

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de




 -Original Message-
 From: Mark Miller [mailto:markrmil...@gmail.com]
 Sent: Tuesday, September 15, 2009 7:15 PM
 To: java-user@lucene.apache.org
 Subject: Re: lucene 2.9.0RC4 slower than 2.4.1?

 Mark Miller wrote:



 Indeed - I just ran the FileReaderTest on a Linux tmpfs ramdisk - with
 SeparateFile all 4 of my cores are immediately pinned and remain so.
 With ChannelFile, all 4 cores hover 20-30%.

 It would appear it may not be a good idea to use NIOFSDirectory on



 ramdisks.



 Even still though - it looks like you have a further issue - your 
 Lucene
 2.9 old-api results don't use it, and are still not good.





 The quick results:

 ramdisk: sudo mount -t tmpfs tmpfs /tmp/space -o
 size=1G,nr_inodes=200k,mode=01777

 config: impl=SeparateFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295611, ms=173550, MB/sec=1683.7899579371938

 config: impl=ChannelFile serial=false nThreads=4 iterations=100
 bufsize=1024 poolsize=2 filelen=730554368
 answer=-282295361, ms=1377768, MB/sec=212.09793463050383


 --
 - Mark

 http://www.lucidimagination.com

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Yonik Seeley
Here's my results in my quad core phenom, with ondemand CPU freq
scaling disabled (clocks locked at 3GHz)

Ubuntu 9.04, filesystem=ext4 on 7200RPM IDE drive, testfile=95MB fully cached.

Linux odin 2.6.28-15-generic #49-Ubuntu SMP Tue Aug 18 19:25:34 UTC
2009 x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)


config: impl=ClassicFile serial=false nThreads=4 iterations=20
bufsize=1024 poolsize=2 filelen=95610240
answer=1165427971, ms=15610, MB/sec=489.99482383087764

config: impl=SeparateFile serial=false nThreads=4 iterations=20
bufsize=1024 poolsize=2 filelen=95610240
answer=1165427672, ms=4115, MB/sec=1858.7652976913728

config: impl=PooledPread serial=false nThreads=4 iterations=20
bufsize=1024 poolsize=2 filelen=95610240
answer=1165427971, ms=6352, MB/sec=1204.15919395466

config: impl=ChannelFile serial=false nThreads=4 iterations=20
bufsize=1024 poolsize=2 filelen=95610240
answer=1165427971, ms=20347, MB/sec=375.91876935174713

config: impl=ChannelPread serial=false nThreads=4 iterations=20
bufsize=1024 poolsize=2 filelen=95610240
answer=1165427971, ms=5189, MB/sec=1474.0449412218154

config: impl=ChannelTransfer serial=false nThreads=4 iterations=20
bufsize=1024 poolsize=2 filelen=95610240
answer=1165427971, ms=14794, MB/sec=517.021711504664

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Yonik Seeley
On Tue, Sep 15, 2009 at 4:12 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 Note that when nthreads>1 I sometimes get wrong answers for SimpleFile...

s/SimpleFile/SingleFile/g

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Yonik Seeley
Note that when nthreads>1 I sometimes get wrong answers for SimpleFile...
hopefully it's just a bug in the test... I'll look into it a little.

-Yonik
http://www.lucidimagination.com



On Tue, Sep 15, 2009 at 4:00 PM, Mark Miller markrmil...@gmail.com wrote:
 I'm jealous of your 4 3.0Ghz to my 2.0Ghz.

 I was on dynamic frequency scaling and switched to a hard 2.0Ghz.

 On ramdisk, my puny 2.0's almost catch you and get a bit over 1800MB/s
 with SeparateFile.

 I'm smoked on PooledPread and ChannelPread though. Still sub 500 for
 both, even
 on the ramdisk.

 It's an absurd comparison though - everyone knows a jackalope is faster
 than a koala.

 - Mark

 Yonik Seeley wrote:
 Here's my results in my quad core phenom, with ondemand CPU freq
 scaling disabled (clocks locked at 3GHz)

 Ubuntu 9.04, filesystem=ext4 on 7200RPM IDE drive, testfile=95MB fully 
 cached.

 Linux odin 2.6.28-15-generic #49-Ubuntu SMP Tue Aug 18 19:25:34 UTC
 2009 x86_64 GNU/Linux
 Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
 Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)


 config: impl=ClassicFile serial=false nThreads=4 iterations=20
 bufsize=1024 poolsize=2 filelen=95610240
 answer=1165427971, ms=15610, MB/sec=489.99482383087764

 config: impl=SeparateFile serial=false nThreads=4 iterations=20
 bufsize=1024 poolsize=2 filelen=95610240
 answer=1165427672, ms=4115, MB/sec=1858.7652976913728

 config: impl=PooledPread serial=false nThreads=4 iterations=20
 bufsize=1024 poolsize=2 filelen=95610240
 answer=1165427971, ms=6352, MB/sec=1204.15919395466

 config: impl=ChannelFile serial=false nThreads=4 iterations=20
 bufsize=1024 poolsize=2 filelen=95610240
 answer=1165427971, ms=20347, MB/sec=375.91876935174713

 config: impl=ChannelPread serial=false nThreads=4 iterations=20
 bufsize=1024 poolsize=2 filelen=95610240
 answer=1165427971, ms=5189, MB/sec=1474.0449412218154

 config: impl=ChannelTransfer serial=false nThreads=4 iterations=20
 bufsize=1024 poolsize=2 filelen=95610240
 answer=1165427971, ms=14794, MB/sec=517.021711504664

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 --
 - Mark

 http://www.lucidimagination.com




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Yonik Seeley
OK, I see the issue - SingleFile doesn't have its own file pointer.
I'll update the original issue.  (For large files, this shouldn't
change the times any.)
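
For anyone following along, the failure mode is roughly the sketch below
(illustrative code, not the benchmark's actual SingleFile/ChannelPread
classes): a shared RandomAccessFile has one implicit file pointer, so
unsynchronized seek+read pairs from different threads can interleave, while
FileChannel's positional read carries the offset with each call and needs no
shared state.

  import java.io.IOException;
  import java.io.RandomAccessFile;
  import java.nio.ByteBuffer;
  import java.nio.channels.FileChannel;

  class ReadStrategies {
    private final RandomAccessFile raf;   // one handle shared by all threads
    private final FileChannel channel;

    ReadStrategies(RandomAccessFile raf) {
      this.raf = raf;
      this.channel = raf.getChannel();
    }

    // Racy: seek() and read() are two calls against one shared file pointer,
    // so another thread can move it in between -- the "wrong answers" seen
    // with nThreads > 1.
    void racyRead(long pos, byte[] buf) throws IOException {
      raf.seek(pos);
      raf.readFully(buf);
    }

    // Safe but serialized: hold a lock across the seek+read pair.
    synchronized void lockedRead(long pos, byte[] buf) throws IOException {
      raf.seek(pos);
      raf.readFully(buf);
    }

    // Safe and concurrent: a positional read (pread) passes the offset with
    // the call, so no shared pointer is touched.
    void positionalRead(long pos, byte[] buf) throws IOException {
      ByteBuffer bb = ByteBuffer.wrap(buf);
      while (bb.hasRemaining()) {
        int n = channel.read(bb, pos + bb.position());
        if (n < 0) throw new IOException("read past EOF");
      }
    }
  }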

-Yonik
http://www.lucidimagination.com

On Tue, Sep 15, 2009 at 4:13 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, Sep 15, 2009 at 4:12 PM, Yonik Seeley
 yo...@lucidimagination.com wrote:
 Note that when nthreads>1 I sometimes get wrong answers for SimpleFile...

 s/SimpleFile/SingleFile/g


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 2.9 RC2 now available for testing

2009-09-09 Thread Yonik Seeley
On Wed, Sep 9, 2009 at 8:57 AM, Peter Keeganpeterlkee...@gmail.com wrote:
 Using JProfiler, I observe that the improvement
 is due to a huge reduction in the number of calls to TermDocs.next and
 TermDocs.skipTo (about 65% fewer calls).

Indexes are searched per-segment now (i.e. MultiTermDocs isn't normally used).
Off the top of my head, I'm not sure how this can lead to fewer
TermDocs.skipTo() calls though.  Are you sure you weren't also
counting Scorer.skipTo()... which would now be Scorer.advance()?
Have you verified that your custom scorer is working correctly with
2.9 and that you're getting the same number of hits on the overall
query as you were with previous versions?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 2.9 RC2 now available for testing

2009-09-09 Thread Yonik Seeley
On Wed, Sep 9, 2009 at 9:17 AM, Yonik
Seeleyyonik.see...@lucidimagination.com wrote:
 On Wed, Sep 9, 2009 at 8:57 AM, Peter Keeganpeterlkee...@gmail.com wrote:
 Using JProfiler, I observe that the improvement
 is due to a huge reduction in the number of calls to TermDocs.next and
 TermDocs.skipTo (about 65% fewer calls).

 Indexes are searched per-segment now (i.e. MultiTermDocs isn't normally used).
 Off the top of my head, I'm not sure how this can lead to fewer
 TermDocs.skipTo() calls though.

Wait... perhaps it's just that  accounting for the skipTo() decrease?
Instead of MultiTermDocs.skipTo() delegating to
SegmentTermDocs.skipTo() (2 calls since they both inherit from
TermDocs), it's now just SegmentTermDocs.skipTo() directly.
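
In code, the difference is roughly the sketch below (untested, against the
2.9 API; not the actual IndexSearcher internals): rather than driving one
TermDocs obtained from the top-level multi reader, each segment's TermDocs is
walked directly, so there is no per-call delegation layer.

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  class PerSegmentSketch {
    // 2.4-style: every next()/skipTo() goes through the multi-reader wrapper
    // before it reaches the segment's own TermDocs.
    static int countMulti(IndexReader topReader, Term t) throws IOException {
      TermDocs td = topReader.termDocs(t);
      int count = 0;
      while (td.next()) count++;
      td.close();
      return count;
    }

    // 2.9-style: iterate the sub-readers and hit each segment's TermDocs
    // directly (doc ids are segment-relative here).
    static int countPerSegment(IndexReader topReader, Term t) throws IOException {
      IndexReader[] segments = topReader.getSequentialSubReaders();
      if (segments == null) segments = new IndexReader[] { topReader };
      int count = 0;
      for (int i = 0; i < segments.length; i++) {
        TermDocs td = segments[i].termDocs(t);
        while (td.next()) count++;
        td.close();
      }
      return count;
    }
  }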

-Yonik
http://www.lucidimagination.com


  Are you sure you weren't also
 counting Scorer.skipTo()... which would now be Scorer.advance()?
 Have you verified that your custom scorer is working correctly with
 2.9 and that you're getting the same number of hits on the overall
 query as you were with previous versions?

 -Yonik
 http://www.lucidimagination.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extending Sort/FieldCache

2009-09-08 Thread Yonik Seeley
On Sun, Sep 6, 2009 at 4:42 AM, Shai Ereraser...@gmail.com wrote:
 I've resisted using payloads for this purpose in Solr because it felt
 like an interim hack until CSF is implemented.

 I don't see it as a hack, but as a proper use of a great feature in Lucene.

It's proper use for an application perhaps, but not for core Lucene.
Applications are pretty much required to work with what's given in
Lucene... but Lucene developers can make better choices.  Hence if at
all possible, work should be put into implementing CSF rather than
sorting by payloads.

 CSF and this are essentially the same.

In which case we wouldn't need CSF?

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Extending Sort/FieldCache

2009-09-05 Thread Yonik Seeley
On Fri, Sep 4, 2009 at 12:33 AM, Shai Ereraser...@gmail.com wrote:
 2) Contribute my payload-based sorting package. Currently it only reads from
 disk during searches, and I'd like to enhance it to use in-memory cache as
 well. It's a moderate-size package, so this one will need to wait until (1)
 is done, and I get enough time to adapt it to 2.9 and work on the issue.

I've resisted using payloads for this purpose in Solr because it felt
like an interim hack until CSF is implemented.  It feels like payloads
are properly used when one actually cares what the term or position
is.  Thoughts?  Do we think CSF will make it in 3.1?
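
For context, the payload approach under discussion amounts to something like
the sketch below (illustrative only, not Shai's package): each document gets a
marker term whose single posting carries the sort value as a payload, and the
values are read back through TermPositions.

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  class PayloadValueReader {
    // Assumes every doc indexed exactly one posting for the marker term, with
    // a 4-byte payload holding that doc's sort value.
    static int[] readValues(IndexReader reader, Term marker) throws IOException {
      int[] values = new int[reader.maxDoc()];
      TermPositions tp = reader.termPositions(marker);
      byte[] buf = new byte[4];
      while (tp.next()) {
        tp.nextPosition();               // must be positioned before getPayload
        if (tp.isPayloadAvailable()) {
          byte[] b = tp.getPayload(buf, 0);
          values[tp.doc()] = ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
                           | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
        }
      }
      tp.close();
      return values;
    }
  }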

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to check for field uniqueness when indexing?

2009-08-26 Thread Yonik Seeley
On Wed, Aug 26, 2009 at 12:47 PM, Daniel Shanesha...@lexum.umontreal.ca wrote:
 Hmm... there is something I don't catch...

 When you open up an index writer, you batch up adds and deletes. Now if you
 create a signature for the document, it works as long as you only add, but what
 happens if you delete stuff from the index using a query as well as adding?

 Does Solr remember the deletions as well?

It used to - but now it delegates all that to IndexWriter as well (and
lucene buffers them instead).

-Yonik
http://www.lucidimagination.com


 Daniel Shane

 Yonik Seeley wrote:

 On Fri, Aug 21, 2009 at 12:49 AM, Chris
 Hostetterhossman_luc...@fucit.org wrote:


 : But in that case, I assume Solr does a commit per document added.

 not at all ... it computes a signature and then uses that as a unique
 key.
 IndexWriter.updateDocument does all the hard work.


 Right - Solr used to do that hard work, but we handed that over to
 Lucene when that capability was added.  It involves batching either
 way (but letting Lucene handle it at a lower level is better since
 it can prevent inconsistencies from crashes).

 -Yonik
 http://www.lucidimagination.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there a way to check for field uniqueness when indexing?

2009-08-20 Thread Yonik Seeley
On Fri, Aug 21, 2009 at 12:49 AM, Chris
Hostetterhossman_luc...@fucit.org wrote:

 : But in that case, I assume Solr does a commit per document added.

 not at all ... it computes a signature and then uses that as a unique key.
 IndexWriter.updateDocument does all the hard work.

Right - Solr used to do that hard work, but we handed that over to
Lucene when that capability was added.  It involves batching either
way (but letting Lucene handle it at a lower level is better since
it can prevent inconsistencies from crashes).
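
In code, that delegation boils down to roughly the following (a hand-wavy
sketch, not Solr's actual signature processor; the field name and hash choice
are only illustrative):

  import java.security.MessageDigest;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  class SignatureAdd {
    // Hash the fields that define "sameness", index the hash as an
    // untokenized signature field, and let updateDocument atomically delete
    // any earlier doc carrying the same signature before adding the new one.
    static void addUnique(IndexWriter writer, Document doc, String content)
        throws Exception {
      byte[] hash = MessageDigest.getInstance("MD5").digest(content.getBytes("UTF-8"));
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < hash.length; i++) {
        sb.append(Integer.toHexString((hash[i] & 0xff) | 0x100).substring(1));
      }
      String sig = sb.toString();
      doc.add(new Field("sig", sig, Field.Store.NO,
                        Field.Index.NOT_ANALYZED_NO_NORMS));
      writer.updateDocument(new Term("sig", sig), doc);  // delete-then-add, buffered
    }
  }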

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



trie* space-time tradeoff

2009-07-20 Thread Yonik Seeley
Anyone have any numbers?  I couldn't find complete info in the Trie*
JIRA issues, esp relating to size increase in the index.

There was this:
 The indexes each contain 13 numeric, tree encoded fields (doubles and Dates). 
 Index size (including the normal fields) was:

* 8bit: 4.8 GiB
* 4bit: 5.1 GiB
* 2bit: 5.7 GiB

But no info on baselines... for example, what's the index size with
1) those numeric fields not indexed at all
2) those numeric fields indexed with no precision step
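
For anyone producing such numbers: the knob behind the tradeoff is the
precisionStep given at index and query time. A rough sketch with the 2.9
numeric API (field name and step value are arbitrary) -- a smaller step means
more terms per value, hence a larger index, but fewer terms to visit per
range query:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.search.NumericRangeQuery;
  import org.apache.lucene.search.Query;

  class TrieSketch {
    static final int PRECISION_STEP = 4;  // 4 bits -> 16 terms per 64-bit value

    static Document makeDoc(double price) {
      Document doc = new Document();
      doc.add(new NumericField("price", PRECISION_STEP, Field.Store.NO, true)
                  .setDoubleValue(price));
      return doc;
    }

    static Query priceRange(double min, double max) {
      // use the same precisionStep as at index time
      return NumericRangeQuery.newDoubleRange("price", PRECISION_STEP,
                                              min, max, true, true);
    }
  }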

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Yonik Seeley
Could this perhaps have anything to do with the changes to DocIdSetIterator?
Glancing at the default implementation of advance makes me wince a bit:

  public int advance(int target) throws IOException {
    while (nextDoc() < target) {}
    return doc;
  }

IMO, this is a back-compatibility anti-pattern.  It would be better to
throw an exception than quietly slow down some users' queries by
an order of magnitude.  Actually, I don't think I would count it as
back compatible because of that.
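
For custom Scorer/DocIdSetIterator authors, the practical fix is to override
advance() with something that really skips. A minimal sketch for an iterator
over a sorted int array (a made-up class, not Lucene code; written against the
new docID()/nextDoc()/advance() methods -- a 2.9 build may also require stubs
for the deprecated next()/skipTo()/doc()):

  import java.io.IOException;
  import java.util.Arrays;
  import org.apache.lucene.search.DocIdSetIterator;

  class SortedIntDocIdSetIterator extends DocIdSetIterator {
    private final int[] docs;   // sorted, distinct doc ids
    private int idx = -1;
    private int doc = -1;

    SortedIntDocIdSetIterator(int[] sortedDocs) { this.docs = sortedDocs; }

    public int docID() { return doc; }

    public int nextDoc() throws IOException {
      idx++;
      return doc = (idx < docs.length) ? docs[idx] : NO_MORE_DOCS;
    }

    public int advance(int target) throws IOException {
      // binary search instead of the default's linear nextDoc() loop
      int from = idx + 1;
      if (from >= docs.length) return doc = NO_MORE_DOCS;
      int i = Arrays.binarySearch(docs, from, docs.length, target);
      if (i < 0) i = -i - 1;        // insertion point = first doc >= target
      idx = i;
      return doc = (idx < docs.length) ? docs[idx] : NO_MORE_DOCS;
    }
  }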

-Yonik
http://www.lucidimagination.com



On Wed, Jul 15, 2009 at 2:54 PM, Michael
McCandlessluc...@mikemccandless.com wrote:
 On Wed, Jul 15, 2009 at 2:30 PM, eks deveks...@yahoo.co.uk wrote:

 Weird.  Have you run CheckIndex?
 nope, I guess it brings nothing: I built the index two times; the bug was
 provoked by changing one parameter that controls only search = no corrupt
 index?

 You think we should give it a try? Hell, why not :)

 Yah it's quite a long shot but if it is corrupt, we'll be kicking
 ourselves about 30 emails from now...

 What do you mean by "Can you do a binary search to locate the term(s) that's
 causing it"?

 I know exactly which term combination causes it (see the last Query.toString() I
 sent): if I simplify the Query by dropping one term with its expansions,
 it runs fine... or if I replace any of these terms it works fine. We tried
 with higher freq. terms, lower... everything fine... bizarre

 Right I meant try to whittle down the query that tickles the infinite
 loop.  Sounds like any whittling causes the issue to scurry away.

 If I make a patch that adds verbosity to what BS is doing, can you run
 it & post the output?

 Mike

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: speed of BooleanQueries on 2.9

2009-07-15 Thread Yonik Seeley
On Wed, Jul 15, 2009 at 4:37 PM, Uwe Schindleru...@thetaphi.de wrote:
 And the fix only affects custom DocIdSetIterators.

And custom Queries (via Scorer) since Scorer inherits from DISI.
But as Mike says, it shouldn't be the issue behind this thread.

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


