Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
You cannot get a string out of automaton by its ordinal without
storing additional data.
The string is stored there not as a single arc, but as a sequence of
them (basically.. err.. as a string),
so referencing them is basically writing the string as-is. Space
savings here come from sharing arcs between strings.

Though, it's possible to do if you associate an additional number with
each node. (I invented some way, shared it with Mike and forgot.. good
grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer>
that accepts a predefined set of N strings and returns an int in
0..N-1 interval.
And it can't do the reverse lookup, by design; that's lossy
compression for all good perfect hashing algos.
So, it's irrelevant here, huh?
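
A minimal Java sketch of that contract (a hypothetical interface, nothing
that exists in Lucene):

// Hypothetical interface, for illustration only.
public interface PerfectHash {
  /** Returns a distinct int in [0, size()) for each of the N predefined
   *  strings; behavior is undefined for strings outside that set. */
  int ord(String key);

  /** N, the number of predefined strings. */
  int size();

  // Deliberately no "String lookup(int ord)": the keys themselves are
  // thrown away, which is exactly the lossy part.
}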

On Thu, May 19, 2011 at 08:53, David Smiley (@MITRE.org)
dsmi...@mitre.org wrote:
 I've been pondering how to reduce the size of FieldCache entries when there
 are a large number of Strings. I'd like to facet on such a field with Solr
 but with less memory.  As I understand it, FSTs are a highly compressed
 representation of a set of Strings (among other possibilities).  The
 fieldCache would need to point to an FST entry (an arc?) using something
 small, say an integer.  Is there a way to point to an FST entry with an
 integer, and then somehow with relative efficiency construct the String from
 the arcs to get there?

 ~ David Smiley

 -
  Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
 I think, if we add ord as an output to the FST, then it builds
 everything we need?  Ie no further data structures should be needed?
 Maybe I'm confused :)

 If you put the ord as an output the common part will be shifted towards the
 front of the tree. This will work if you want to look up a given value
 assigned to some string, but will not work if you need to look up the string
 from its value. The latter case can be solved if you know which branch to
 take while descending from root and the shared prefix alone won't give you
 this information. At least I don't see how it could.

 I am familiar with the basic prefix hashing procedure suggested by Daciuk
 (and other authors), but maybe some progress has been made there, I don't
 know... the one I know is really conceptually simple -- since each arc
 encodes the number of leaves (or input sequences) in the automaton, you know
 which path must lead you to your string. For example if you have a node like
 this and seek for the 12-th term:

 0 -- 10 -- ...
   +- 10 -- ...
   +- 5 -- ..
 you look at the first path, it'd give you terms 1..10, then the next one
 contains terms 11..20 so you add 10 to an internal counter which is added to
 further computations, descend and repeat the procedure until you find a leaf
 node.

 Dawid

There's a possible speedup here. If, instead of storing the count of
all downstream leaves, you store the sum of counts for all previous
siblings, you can do a binary lookup instead of linear scan on each
node.
Taking your example:

0 -- 0 -- ...
  +- 10 -- ...
  +- 15 -- ...

We know that for the 12-th term we should descend along the middle
edge, as it has the biggest tag less than 12.

That's what I invented, and yes, it was invented by countless people before :)
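
For the curious, a toy Java version of both directions, with the usual
caveats: a plain char-labelled trie instead of a real FST, with the per-arc
prefix counts computed up front.

import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Toy count-annotated trie (not Lucene's FST): ord(s) sums the prefix
// counts along the path; lookup(ord) binary-searches them at each node.
class OrdTrie {
  static class Node {
    boolean terminal;                       // some input string ends here
    TreeMap<Character, Node> building = new TreeMap<Character, Node>();
    char[] labels;                          // frozen, sorted arc labels
    Node[] targets;
    int[] prefix;                           // prefix[i] = leaves before arc i
  }

  final Node root = new Node();

  void add(String s) {
    Node n = root;
    for (char c : s.toCharArray()) {
      Node child = n.building.get(c);
      if (child == null) { child = new Node(); n.building.put(c, child); }
      n = child;
    }
    n.terminal = true;
  }

  void freeze() { freeze(root); }

  private int freeze(Node n) {              // returns total leaves under n
    int count = n.terminal ? 1 : 0;         // the node's own string sorts first
    n.labels = new char[n.building.size()];
    n.targets = new Node[n.building.size()];
    n.prefix = new int[n.building.size()];
    int i = 0;
    for (Map.Entry<Character, Node> e : n.building.entrySet()) {
      n.labels[i] = e.getKey();
      n.targets[i] = e.getValue();
      n.prefix[i] = count;                  // leaves via earlier siblings
      count += freeze(e.getValue());
      i++;
    }
    return count;
  }

  int ord(String s) {                       // assumes s was added
    Node n = root;
    int ord = 0;
    for (char c : s.toCharArray()) {
      int i = Arrays.binarySearch(n.labels, c);
      ord += n.prefix[i];
      n = n.targets[i];
    }
    return ord;
  }

  String lookup(int ord) {                  // assumes 0 <= ord < #strings
    StringBuilder sb = new StringBuilder();
    Node n = root;
    while (!(n.terminal && ord == 0)) {
      int i = Arrays.binarySearch(n.prefix, ord);
      if (i < 0) i = -i - 2;                // biggest prefix <= ord
      ord -= n.prefix[i];
      sb.append(n.labels[i]);
      n = n.targets[i];
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    OrdTrie t = new OrdTrie();
    for (String s : new String[] {"ab", "abc", "b", "a"}) t.add(s);
    t.freeze();
    for (int i = 0; i < 4; i++)             // prints a, ab, abc, b, round-trip
      System.out.println(i + " -> " + t.lookup(i) + " -> " + t.ord(t.lookup(i)));
  }
}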

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:

 That's what I invented, and yes, it was invented by countless people
 before :)
 You know I didn't mean to sound rude, right? I'm really admiring your
 ability to come up with these solutions by yourself, I'm merely copying
 other folks' ideas.
I tried to prevent another reference to Mr. Daciuk :)

 Anyway, the optimization you're describing is sure possible. Lucene's FST
 implementation can actually combine both approaches because always expanding
 nodes is inefficient and those already expanded will allow a binary search
 (assuming the automaton structure is known to the implementation).
 Another refinement of this idea creates a detached table (err.. index :) of
 states to start from inside the automaton, so that you don't have to go
 through the initial 2-3 states which are more or less always large and even
 binary search is costly there.
 Dawid

But you have to look up this err.. index somehow. And that's either
a binary or hash lookup. Where's the win?


-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
This is more about compressing strings in the TermsIndex, I think.
And the ability to use said TermsIndex directly in some cases that
required FieldCache before. (Maybe FC is still needed, but it can be
degraded to a docId->ord map, storing the actual strings in the TI.)
This yields fat space savings when we, eg, need to both look up on a
field and build facets out of it.
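
Roughly what that split could look like (hypothetical types; TermOrdLookup
stands in for a TermsIndex that can resolve ord -> term):

/** Sketch: facet counting over a docId->ord map. TermOrdLookup is a
 *  hypothetical stand-in for a terms index that can resolve an ordinal
 *  back to its term (eg via the counted-arcs scheme from the other thread). */
class OrdFacets {
  interface TermOrdLookup {
    int numTerms();
    String term(int ord);
  }

  /** docToOrd holds one small int per document, -1 meaning "no value";
   *  strings are only materialized for the topN buckets we report. */
  static String[] topTerms(int[] docToOrd, TermOrdLookup terms, int topN) {
    int[] counts = new int[terms.numTerms()];
    for (int doc = 0; doc < docToOrd.length; doc++) {
      int ord = docToOrd[doc];
      if (ord >= 0) counts[ord]++;          // counting never touches a String
    }
    String[] top = new String[topN];        // assumes topN <= numTerms()
    for (int k = 0; k < topN; k++) {
      int best = 0;
      for (int ord = 1; ord < counts.length; ord++)
        if (counts[ord] > counts[best]) best = ord;
      top[k] = terms.term(best);            // ord -> string, once per bucket
      counts[best] = -1;                    // exclude from the next round
    }
    return top;
  }
}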

mmap is cool :)  What I want to see is a FST-based TermsDict that is
simply mmaped into memory, without building intermediate indexes, like
Lucene does now.
And docvalues are orthogonal to that, no?

On Thu, May 19, 2011 at 17:22, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 maybe that's because we have one huge monolithic implementation

 Doesn't the DocValues branch solve this?

 Also, instead of trying to implement clever ways of compressing
 strings in the field cache, which probably won't bear fruit, I'd
 prefer to look at [eventually] MMap'ing (using DV) the field caches to
 avoid the loading and heap costs, which are significant.  I'm not sure
 if we can easily MMap packed ints and the shared byte[], though it
 seems fairly doable?

 On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote:
 2011/5/19 Michael McCandless luc...@mikemccandless.com:

 Of course, for
 certain apps that perf hit is justified, so probably we should make
 this an option when populating field cache (ie, in-memory storage
 option of using an FST vs using packed ints/byte[]).


 or should we actually try to have different fieldcacheimpls?

 I see all these missions to refactor the thing, which always fail.

 maybe that's because we have one huge monolithic implementation.









-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: FST and FieldCache?

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 20:43, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 And I do agree there are times when mmap is appropriate, eg if query
 latency is unimportant to you, but it's not a panacea and it comes
 with serious downsides

 Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM?

 I don't know of a straight up comparison...
I did compare MMapDir vs RAMDir variant a couple of years ago.
Searches slowed down a teeny-weeny little bit. GC times went down
noticeably. For me it was a big win.

Whatever Mike might say, mmap is great for latency-conscious applications : )

If someone tries to create an artificial benchmark for byte[] vs
ByteBuffer, I'd recommend going through Lucene's abstraction layer.
If you simply read/write in a loop, JIT will optimize away boundary
checks for byte[] in some cases. This didn't ever happen to *Buffer
family for me.
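
For illustration, the kind of naive loop pair I mean (single-shot timing,
not a rigorous benchmark):

import java.nio.ByteBuffer;

class LoopBench {
  static long sumArray(byte[] data) {
    long sum = 0;
    for (int i = 0; i < data.length; i++) {
      sum += data[i];              // bounds check can be hoisted by the JIT
    }
    return sum;
  }

  static long sumBuffer(ByteBuffer data) {
    long sum = 0;
    for (int i = 0; i < data.limit(); i++) {
      sum += data.get(i);          // per-call range check tends to stay
    }
    return sum;
  }

  public static void main(String[] args) {
    byte[] raw = new byte[1 << 20];
    ByteBuffer buf = ByteBuffer.wrap(raw);
    for (int i = 0; i < 100; i++) { sumArray(raw); sumBuffer(buf); } // warmup
    long t0 = System.nanoTime(); long a = sumArray(raw);
    long t1 = System.nanoTime(); long b = sumBuffer(buf);
    long t2 = System.nanoTime();
    System.out.println("array " + (t1 - t0) + "ns vs buffer " + (t2 - t1)
        + "ns (sums " + a + ", " + b + ")");
  }
}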

 There's also RAM-based SSDs whose performance could be comparable with,
 well, RAM.

 True, though it's through layers of abstraction designed originally
 for serving files off of spinning magnets :)

 Also, with our heap based field caches, the first sorted
 search requires that they be loaded into RAM.  Then we don't unload
 them until the reader is closed?  With MMap the unloading would happen
 automatically?

 True, but really if the app knows it won't need that FC entry for a
 long time (ie, long enough to make it worth unloading/reloading) then
 it should really unload it.  MMap would still have to write all those
 pages to disk...

 DocValues actually makes this a lot cheaper because loading DocValues
 is much (like ~100 X from Simon's testing) faster than populating
 FieldCache since FieldCache must do all the uninverting.

 Mike






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: Moving towards Lucene 4.0

2011-05-19 Thread Earwin Burrfoot
On Thu, May 19, 2011 at 21:44, Chris Hostetter hossman_luc...@fucit.org wrote:

 : I think we should focus on everything that's *infrastructure* in 4.0, so
 : that we can develop additional features in subsequent 4.x releases. If we
 : end up releasing 4.0 just to discover many things will need to wait to 5.0,
 : it'll be a big loss.

 the catch with that approach (i'm speaking generally here, not with any of
 these particular lucene examples in mind) is that it's hard to know that
 the infrastructure really makes sense until you've built a bunch of stuff
 on it -- i think Josh Bloch has a paper where he says that you shouldn't
 publish an API abstraction until you've built at least 3 *real*
 (ie: not just toy or example) implementations of that API.

 it would be really easy to say "the infrastructure for X, Y, and Z is all
 in 4.0, features that leverage this infra will start coming in 4.1" and
 then discover on the way to 4.1 that we botched the APIs.

How do I express my profound love for these words, while remaining chaste? : )

 what does this mean concretely for the specific big ticket changes that
 we've got on trunk? ... i dunno, just my word of caution.

 :  we just started the discussion about Lucene 3.2 and releasing more
 :  often. Yet, I think we should also start planning for Lucene 4.0 soon.
 :  We have tons of stuff in trunk that people want to have and we can't
 :  just keep on talking about it - we need to push this out to our users.

 I agree, but i think the other approach we should take is to be more
 aggressive about reviewing things that would be good candidates for
 backporting.

 If we feel like some feature has a well defined API on trunk, and it's got
 good tests, and people have been using it and filing bugs and helping to
 make it better then we should consider it a candidate for backporting --
 if the merge itself looks like it would be a huge pain in the ass we don't
 *have* to backport, but we should at least look.

 That may not help for any of the big ticket infra changes discussed in
 this thread (where we know it really needs to wait for a major release)
 but it would definitely help with the "get features out to users faster"
 issue.



 -Hoss






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: Fuzzy search always returning docs sorted by the highest match

2011-05-18 Thread Earwin Burrfoot
You aren't likely to encounter strings like "abc company inc" in a
Lucene index, as they will be tokenized into three tokens "abc",
"company", "inc" under most Analyzers.
So, for this exact example you don't even need fuzzy matching.

Also, maybe you should try 'user' mailing list for questions regarding
the use of Lucene.

On Wed, May 18, 2011 at 00:54, Guilherme Aiolfi grad...@gmail.com wrote:
 I'm re-sending my first message because I've just received the mailing-list
 confirmation. If it's a duplicate, forget about this one.

 Hi,
 I want to do a fuzzy search and always return documents no matter what the
 score. So, to do this, I tried sorting by strdist() in solr 3.1. It worked
 great and does ALMOST exactly what I wanted. The problem is that the
 algorithms supported (jw, ngram and edit) are not the best fit for my
 scenario.
 The best results come from StrikeAMatch
 (http://www.devarticles.com/c/a/Development-Cycles/How-to-Strike-a-Match/).
 So, I've found this
 link https://issues.apache.org/jira/browse/LUCENE-2230 that implemented what
 I wanted. But I was told that I should use trunk because there were some
 really great news about fuzzy search there.
 I read this article explaining some
 changes http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html.
 But I still don't think it replaces the StrikeAMatch algo, because that one
 can have better results in searches like "abc" compared to strings like
 "abc company inc" (distance > 2).
 But still, Fuad Efendi told me that StrikeAMatch is toys for kids compared to
 the state of lucene trunk. So here I am; I want to know how 4.0 will help
 achieve what I want.
 Thanks.






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: Lucene/Solr JIRA

2011-05-18 Thread Earwin Burrfoot
+1 to Chris.

Even if the code is partially shared and project is the same, the end
products are completely different.
Merging lists/jira will force niche developers/users to manually sift
through heaps of irrelevant emails/issues.

On Thu, May 19, 2011 at 00:53, Chris Hostetter hossman_luc...@fucit.org wrote:

 : just a few words. I disagree here with you hoss IMO the suggestion to
 : merge JIRA would help to move us closer together and help close the
 : gap between Solr and Lucene. I think we need to start identifying us
 : with what we work on. It feels like we don't do that today and we
 : should work hard to stop that and make hard breaks that might hurt but

 I just don't see how you think that would help anything ... we still need
 to distinguish Jira issues to identify what part of the stack they affect.

 If there is a divide among the developers because of the niches where
 they tend to work, will that divide magically go away because we partition
 all issues using the "component" feature instead of by the Jira
 "project" feature?

 I don't really see how that makes any sense.

 Even if we all thought it did, and even if the cost/effort of
 migrating/converting were totally free, the user bases (who interact with
 the Solr APIs vs directly using the Lucene-Core/Module APIs) are so
 distinct that I genuinely think sticking with distinct Jira Projects
 makes more sense for our users.

 : JIRA. I'd go even further and nuke the name entirely and call
 : everything lucene - I know not many folks like the idea and it might
 : take a while to bake in but I think for us (PMC / Committers) and the

 Everything already is called Lucene ... the Project is "Apache Lucene",
 the community is "Lucene" ... the Lucene project currently releases
 several products, and one of them is called "Apache Solr" ... if you're
 suggesting that we should ultimately eliminate the name Solr, then we'd
 still have to decide what we're going to call that end product, the
 artifact that we ship that provides the abstraction layer that Solr
 currently provides.

 Even if you mean to suggest that we should only have one unified product
 -- one singular release artifact -- that abstraction layer still needs a
 name.  The name we have now is Solr; it has brand awareness and a user
 base who understands what it means to say they are "Installing Solr" or
 that a new feature is available when "Using Solr".

 Eliminating that name doesn't seem like it would benefit the user
 community in any way.



 -Hoss






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: Fuzzy search always returning docs sorted by the highest match

2011-05-18 Thread Earwin Burrfoot
I'm baffled. As probably are you.

If all you want is a fuzzy match against a list of strings, Lucene is
a huge fat overkill, and you need to look elsewhere.

2011/5/19 Guilherme Aiolfi grad...@gmail.com:
 Well, it was about the implementation of an algorithm that was proposed by a
 user and was implemented in another way. And this list, and not the user mailing
 list, was recommended by that developer for asking this question.
 So, not entirely my fault. But I apologize for the inconvenience.
 I just want to clarify that searching for the tokens separately is not what I
 want, since those words can exist but not all in the same doc. I want to
 compare the whole phrase. For that to work I'm not using any Analyzer.
 As I said, I've got it working, but I don't know how to use the right
 algorithm for the job.
 I'm going to redirect my question to the other mailing list.
 Thanks anyway.

 On Wed, May 18, 2011 at 6:32 PM, Earwin Burrfoot ear...@gmail.com wrote:

 You aren't likely to encounter strings like "abc company inc" in a
 Lucene index, as they will be tokenized into three tokens "abc",
 "company", "inc" under most Analyzers.
 So, for this exact example you don't even need fuzzy matching.

 Also, maybe you should try 'user' mailing list for questions regarding
 the use of Lucene.

 On Wed, May 18, 2011 at 00:54, Guilherme Aiolfi grad...@gmail.com wrote:
  I'm re-sending my first message because I've just received the
  mailing-list
  confirmation. If it's a duplicate, forget about this one.
 
  Hi,
  I want to do a fuzzy search and always return documents no matter what
  the
  score. So, to do this, I tried sorting by strdist() in solr 3.1. It
  worked
  great and does ALMOST exactly what I wanted. The problem is that the
  algorithms supported (jw, ngram and edit) are not the best fit for my
  scenario.
  The best results come from StrikeAMatch
 
  (http://www.devarticles.com/c/a/Development-Cycles/How-to-Strike-a-Match/).
  So, I've found this
  link https://issues.apache.org/jira/browse/LUCENE-2230 that implemented
  what
  I wanted. But I was told that I should use trunk because there were some
  really great news about fuzzy search there.
  I read this article explaining some
 
  changes http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html.
  But I still don't think it replaces the StrikeAMatch algo, because that
  one
  can have better results in searches like "abc" compared to strings like
  "abc company inc" (distance > 2).
  But still, Fuad Efendi told me that StrikeAMatch is toys for kids
  compared to
  the state of lucene trunk. So here I am; I want to know how 4.0 will help
  achieve what I want.
  Thanks.
 
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785







-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034639#comment-13034639
 ] 

Earwin Burrfoot commented on LUCENE-3105:
-

StringInterner is in fact faster than CHM. And it is compatible with 
String.intern(), ie it returns the same String instances. It also won't eat 
up memory if spammed with numerous unique strings (which is a strange feature, 
but people requested it).
In Lucene 4.0 all of this is moot anyway, fields there are strongly separated 
and intern() is not used.
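
For reference, the trick is roughly this (a simplified sketch, not the actual 
StringHelper code):

{code}
/** Simplified sketch of the approach: a fixed-size, lossy cache in front of
 *  String.intern(). Hits skip the native intern call but still return the
 *  same canonical instances; streams of unique strings just overwrite slots
 *  instead of growing anything. The race on cache[slot] is benign, since
 *  String is immutable and only interned instances are published. */
class SimpleInterner {
  private final String[] cache;
  private final int mask;

  SimpleInterner(int sizePowerOfTwo) {
    cache = new String[sizePowerOfTwo];
    mask = sizePowerOfTwo - 1;
  }

  public String intern(String s) {
    int slot = s.hashCode() & mask;
    String cached = cache[slot];
    if (cached != null && cached.equals(s)) {
      return cached;                 // same instance String.intern() would give
    }
    String interned = s.intern();    // stay ==-compatible with intern()
    cache[slot] = interned;          // lossy: a collision simply evicts
    return interned;
  }
}
{code}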

 String.intern() calls slow down IndexWriter.close() and IndexReader.open() 
 for index with large number of unique field names
 

 Key: LUCENE-3105
 URL: https://issues.apache.org/jira/browse/LUCENE-3105
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.1
Reporter: Mark Kristensson
 Attachments: LUCENE-3105.patch


 We have one index with several hundred thousand unique field names (we're 
 optimistic that Lucene 4.0 is flexible enough to allow us to change our index 
 design...) and found that opening an index writer and closing an index reader 
 results in horribly slow performance on that one index. I have isolated the 
 problem down to the calls to String.intern() that are used to allow for quick 
 string comparisons of field names throughout Lucene. These String.intern() 
 calls are unnecessary and can be replaced with a hashmap lookup. In fact, 
 StringHelper.java has its own hashmap implementation that it uses in 
 conjunction with String.intern(). Rather than using a one-off hashmap, I've 
 elected to use a ConcurrentHashMap in this patch.




[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names

2011-05-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034640#comment-13034640
 ] 

Earwin Burrfoot commented on LUCENE-3105:
-

Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm?

 String.intern() calls slow down IndexWriter.close() and IndexReader.open() 
 for index with large number of unique field names
 

 Key: LUCENE-3105
 URL: https://issues.apache.org/jira/browse/LUCENE-3105
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/index
Affects Versions: 3.1
Reporter: Mark Kristensson
 Attachments: LUCENE-3105.patch


 We have one index with several hundred thousand unique field names (we're 
 optimistic that Lucene 4.0 is flexible enough to allow us to change our index 
 design...) and found that opening an index writer and closing an index reader 
 results in horribly slow performance on that one index. I have isolated the 
 problem down to the calls to String.intern() that are used to allow for quick 
 string comparisons of field names throughout Lucene. These String.intern() 
 calls are unnecessary and can be replaced with a hashmap lookup. In fact, 
 StringHelper.java has its own hashmap implementation that it uses in 
 conjunction with String.intern(). Rather than using a one-off hashmap, I've 
 elected to use a ConcurrentHashMap in this patch.




[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-13 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032936#comment-13032936
 ] 

Earwin Burrfoot commented on LUCENE-3092:
-

Chris, I don't like the idea of expanding IOContext again and again, but this 
case seems in line with its intended purpose - give the Directory implementation 
hints as to what we're going to do with it.

I don't like events either. They look fragile, and binding them to threads is a 
WTF. With all our pausing/unpausing magic there's no guarantee a merge will end 
on the same thread it started on.

bq. Stuff like FlushPolicy could take information about concurrent merges and 
hold off flushes for a little while if memory allows it etc.
Coordinating access to a shared resource (the IO subsystem) with events is very 
awkward. Ok, your FlushPolicy receives events from MergePolicy and holds 
flushes during merge. _Now, when a flush is in progress, should FlushPolicy 
notify MergePolicy so it can hold its merges?_
It goes downhill from there. What if FP and MP fire events simultaneously? :) 
What should other listeners do?

Try looking at a bigger picture. Merges are not your problem. Neither are 
flushes. Your problem is that several threads try to take their dump on disk 
simultaneously (for whatever reason, you don't really care). So what we need is 
an arbitration mechanism for Directory writes. A mechanism located presumably @ 
Directory level (eg, we don't need to throttle anything when writing to RAMDir).

One possible implementation is that we add a constructor parameter to 
FSDirectory specifying the desired level of IO parallelism, and then it keeps track 
of its IndexOutputs and stalls writes selectively. We can also add 
'expectedWriteSize' to IOContext, so the Directory may favor shorter writes 
over bigger ones. Instead of 'expectedWriteSize' we can use 'priority'.
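
Something along these lines (a toy sketch; a real version would wrap 
Directory/IndexOutput and consult IOContext):

{code}
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.Semaphore;

/** Toy sketch of the arbitration idea: cap concurrent writers to the real
 *  disk, whoever they are (flush or merge), at the Directory level. These
 *  are stand-in types; a real version would live inside FSDirectory and
 *  could prefer small expectedWriteSize / high priority writes. */
class WriteThrottle {
  private final Semaphore ioPermits;

  WriteThrottle(int ioParallelism) {            // the proposed ctor parameter
    ioPermits = new Semaphore(ioParallelism, true); // fair, so nobody starves
  }

  /** Every buffer any IndexOutput pushes to disk funnels through here. */
  void write(OutputStream out, byte[] b, int off, int len) throws IOException {
    ioPermits.acquireUninterruptibly();
    try {
      out.write(b, off, len);                   // at most N threads hit disk
    } finally {
      ioPermits.release();
    }
  }
}
{code}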

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, which can
 possibly stress the IO system, eg if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment, does it then
 write-through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.




[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-13 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032989#comment-13032989
 ] 

Earwin Burrfoot commented on LUCENE-3092:
-

bq. but I couldn't disagree more that this is an issue with an Event model
There are no issues with the event model itself. It's just that this model is badly 
suited to this issue's use case.
Event listeners are good. Using them to emulate what is essentially a mutex 
is ugly and fragile as hell.

bq. We have a series of components in Lucene; Directories, IndexWriter, 
MergeScheduler etc, and we have some crosscutting concerns such as merges 
themselves.
My point is that many of these concerns shouldn't necessarily be crosscutting.
Eg - Directory can support IO priorities/throttling, so it doesn't have to know 
about merges or flushes.
Many OSes have special APIs that allow IO prioritization; do they know 
about merges, or Lucene at all? No.

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, which can
 possibly stress the IO system, eg if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment, does it then
 write-through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.




[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir

2011-05-13 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032997#comment-13032997
 ] 

Earwin Burrfoot commented on LUCENE-3092:
-

bq. The IOCtx should reference the OneMerge (if in fact this file is being 
opened because of a merge)?
IOCtx should have a value 'expectedSize', or 'priority', or something similar.
This does not introduce a transitive dependency of Directory on MergePolicy 
(to please you once more - a true WTF), and it allows applying the same logic 
to flushes. Eg - all small flushes/merges go to cache, all big flushes/merges 
go straight to disk.

 NRTCachingDirectory, to buffer small segments in a RAMDir
 -

 Key: LUCENE-3092
 URL: https://issues.apache.org/jira/browse/LUCENE-3092
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch


 I created this simple Directory impl, whose goal is to reduce IO
 contention in a frequent-reopen NRT use case.
 The idea is, when reopening quickly, but not indexing that much
 content, you wind up with many small files created over time, which can
 possibly stress the IO system, eg if merges and searching are also
 fighting for IO.
 So, NRTCachingDirectory puts these newly created files into a RAMDir,
 and only when they are merged into a too-large segment, does it then
 write-through to the real (delegate) directory.
 This lets you spend some RAM to reduce IO.




[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos

2011-05-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032046#comment-13032046
 ] 

Earwin Burrfoot commented on LUCENE-3084:
-

* Speaking logically, merges operate on Sets of SIs, not Lists?
* Let's stop subclassing random things? : ) SIS can contain a List of SIs (and 
maybe a Set, or whatever we need in the future), and only expose operations its 
clients really need.

 MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
 purposes these fields are unused.
 We should cut over to List<SI> instead.




[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos

2011-05-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032099#comment-13032099
 ] 

Earwin Burrfoot commented on LUCENE-3084:
-

bq. Merges are ordered
Hmm.. Why should they be?

bq. SegmentInfos itself must be list
It may contain a list as a field instead. And have a much cleaner API as a 
consequence.

On another note, I wonder, is the fact that Vector is internally synchronized 
used somewhere within SegmentInfos client code?

 MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
 --

 Key: LUCENE-3084
 URL: https://issues.apache.org/jira/browse/LUCENE-3084
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3084-trunk-only.patch, 
 LUCENE-3084-trunk-only.patch, LUCENE-3084.patch


 SegmentInfos carries a bunch of fields beyond the list of SI, but for merging 
 purposes these fields are unused.
 We should cut over to List<SI> instead.




[jira] [Commented] (LUCENE-3077) DWPT doesn't see changes to DW#infoStream

2011-05-06 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029881#comment-13029881
 ] 

Earwin Burrfoot commented on LUCENE-3077:
-

We should just make it final everywhere ...

 DWPT doesn't see changes to DW#infoStream
 -

 Key: LUCENE-3077
 URL: https://issues.apache.org/jira/browse/LUCENE-3077
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 4.0
Reporter: Simon Willnauer
Priority: Minor
 Fix For: 4.0


 DW does not push infoStream changes to DWPT. Since DWPT#infoStream is final 
 and initialized on DWPTPool initialization (at least for the initial DWPT), we 
 should push changes to infoStream to DWPT too.




Re: I was accepted in GSoC!!!

2011-05-05 Thread Earwin Burrfoot
By the way, guys: the LuSolr SVN repository is mirrored @
git://git.apache.org/lucene-solr.git , which is in turn mirrored @
https://github.com/apache/lucene-solr .
Working with git (maybe with stgit) is easier than juggling patches by hand.

On Wed, May 4, 2011 at 15:00, David Nemeskey nemeskey.da...@sztaki.hu wrote:
 Hi Uwe,

 do you mean one issue per GSoC proposal, or one for every logical unit in
 the project?

 If the second: Robert told me to use the flexscoring branch as a base for my
 project, since preliminary work has already been done in that branch. Should I
 open JIRA issues nevertheless?

 Thanks,
 David

 On 2011 May 04, Wednesday 09:56:02 Uwe Schindler wrote:
 Hi Vinicius,

 Submitting patches via JIRA is fine! We were just thinking about possibly
 providing some SVN to work with (as additional training), but came to the
 conclusion that all students should go the standard Apache Lucene way of
 submitting patches to JIRA issues. You can of course still use SVN / GIT
 locally to organize your code. At the end we just need a patch to be
 committed by one of the core committers.

Uwe






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running

2011-05-05 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029403#comment-13029403
 ] 

Earwin Burrfoot commented on LUCENE-2904:
-

I think we should simply change the API for MergePolicy.
Instead of SegmentInfos it should accept a Set<SegmentInfo> with the SIs eligible 
for merging (eg, completely written & not elected for another merge).
IW.getMergingSegments() is a damn cheat, and an "Expert" notice is not an excuse! 
:)
Why should each and every MP do the set subtraction when IW can do it for them 
once and for all?
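
Ie, roughly (a sketch; names are hypothetical):

{code}
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the proposed API change: IW does the subtraction once and hands
 *  every MergePolicy only the segments it may actually touch. */
class EligibleSegments {
  static <SI> Set<SI> eligible(List<SI> allSegments, Set<SI> merging) {
    Set<SI> eligible = new LinkedHashSet<SI>(allSegments); // keep index order
    eligible.removeAll(merging);  // fully written & not elected for a merge
    return eligible;
  }
}
{code}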

 non-contiguous LogMergePolicy should be careful to not select merges already 
 running
 

 Key: LUCENE-2904
 URL: https://issues.apache.org/jira/browse/LUCENE-2904
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2904.patch


 Now that LogMP can do non-contiguous merges, the fact that it disregards 
 which segments are already being merged is more problematic since it could 
 result in it returning conflicting merges and thus failing to run multiple 
 merges concurrently.




[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running

2011-05-05 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029408#comment-13029408
 ] 

Earwin Burrfoot commented on LUCENE-2904:
-

Ok, I'm wrong. We need both a list of all SIs and the eligible SIs for 
calculations. But that should be handled through an API change, not a new public 
method on IW.

 non-contiguous LogMergePolicy should be careful to not select merges already 
 running
 

 Key: LUCENE-2904
 URL: https://issues.apache.org/jira/browse/LUCENE-2904
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2904.patch


 Now that LogMP can do non-contiguous merges, the fact that it disregards 
 which segments are already being merged is more problematic since it could 
 result in it returning conflicting merges and thus failing to run multiple 
 merges concurrently.




[jira] [Commented] (LUCENE-3065) NumericField should be stored in binary format in index (matching Solr's format)

2011-05-05 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029421#comment-13029421
 ] 

Earwin Burrfoot commented on LUCENE-3065:
-

It's sad NumericFields are hard-baked into the index format.

Eg - I have some fields that are similar to Numeric in that they are 
'stringified' binary structures, and they can't become first-class in the same 
manner as Numeric.

 NumericField should be stored in binary format in index (matching Solr's 
 format)
 

 Key: LUCENE-3065
 URL: https://issues.apache.org/jira/browse/LUCENE-3065
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Uwe Schindler
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, 
 LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch


 (Spinoff of LUCENE-3001)
 Today when writing stored fields we don't record that the field was a 
 NumericField, and so at IndexReader time you get back an ordinary Field and 
 your number has turned into a string.  See 
 https://issues.apache.org/jira/browse/LUCENE-1701?focusedCommentId=12721972&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12721972
 We have spare bits already in stored fields, so, we should use one to record 
 that the field is numeric, and then encode the numeric field in Solr's 
 more-compact binary format.
 A nice side-effect is we fix the long standing issue that you don't get a 
 NumericField back when loading your document.




[jira] [Commented] (LUCENE-3041) Support Query Visiting / Walking

2011-05-02 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612
 ] 

Earwin Burrfoot commented on LUCENE-3041:
-

The static cache is now not threadsafe.
And the original had nice diagnostics for ambiguous dispatches. Why not just take it 
and cut over to JDK reflection and a CHM?
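
A sketch of what I mean (illustrative only; it walks superclasses, and a full 
version would also diagnose ambiguous matches coming from interfaces):

{code}
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: per-class visit() dispatch resolved via JDK reflection
 *  and memoized in a CHM, so concurrent first-time lookups are safe. */
class ReflectiveDispatcher {
  private final ConcurrentHashMap<Class<?>, Method> cache =
      new ConcurrentHashMap<Class<?>, Method>();
  private final Object visitor;

  ReflectiveDispatcher(Object visitor) { this.visitor = visitor; }

  Object dispatch(Object query) throws Exception {
    Method m = cache.get(query.getClass());
    if (m == null) {
      m = resolve(query.getClass());
      cache.putIfAbsent(query.getClass(), m);  // benign race: same value wins
    }
    try {
      return m.invoke(visitor, query);
    } catch (InvocationTargetException e) {
      // rethrow the client's original failure, not the reflection wrapper
      Throwable cause = e.getCause();
      if (cause instanceof RuntimeException) throw (RuntimeException) cause;
      if (cause instanceof Error) throw (Error) cause;
      throw e;
    }
  }

  private Method resolve(Class<?> type) {
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      try {
        return visitor.getClass().getMethod("visit", c);
      } catch (NoSuchMethodException ignored) {
        // keep climbing the hierarchy
      }
    }
    throw new IllegalArgumentException("no visit(...) method for " + type);
  }
}
{code}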

 Support Query Visiting / Walking
 ---

 Key: LUCENE-3041
 URL: https://issues.apache.org/jira/browse/LUCENE-3041
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 4.0
Reporter: Chris Male
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, 
 LUCENE-3041.patch, LUCENE-3041.patch


 Out of the discussion in LUCENE-2868, it could be useful to add a generic 
 Query Visitor / Walker that could be used for more advanced rewriting, 
 optimizations or anything that requires state to be stored as each Query is 
 visited.
 We could keep the interface very simple:
 {code}
 public interface QueryVisitor {
   Query visit(Query query);
 }
 {code}
 and then use a reflection-based visitor like Earwin suggested, which would 
 allow implementors to provide visit methods for just the Querys that they are 
 interested in.




[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visiting / Walking

2011-05-02 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612
 ] 

Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM:
--

The static cache is now not threadsafe.
And the original had nice diagnostics for ambiguous dispatches. Why not just take it 
and cut over to JDK reflection and a CHM?
Same can be said for tests.

What about throwing the original invocation exception instead of the wrapper? Since 
we're emulating a language feature, a simple method call, it's logical to only 
throw custom exceptions in .. well .. exceptional cases, like ambiguity/no 
matching method. If client code throws Errors/RuntimeExceptions, they should be 
transparently rethrown.

  was (Author: earwin):
The static cache is now not threadsafe.
And the original had nice diagnostics for ambiguous dispatches. Why not just take it 
and cut over to JDK reflection and a CHM?
  
 Support Query Visiting / Walking
 ---

 Key: LUCENE-3041
 URL: https://issues.apache.org/jira/browse/LUCENE-3041
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 4.0
Reporter: Chris Male
Assignee: Simon Willnauer
Priority: Minor
 Fix For: 4.0

 Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, 
 LUCENE-3041.patch, LUCENE-3041.patch


 Out of the discussion in LUCENE-2868, it could be useful to add a generic 
 Query Visitor / Walker that could be used for more advanced rewriting, 
 optimizations or anything that requires state to be stored as each Query is 
 visited.
 We could keep the interface very simple:
 {code}
 public interface QueryVisitor {
   Query visit(Query query);
 }
 {code}
 and then use a reflection-based visitor like Earwin suggested, which would 
 allow implementors to provide visit methods for just the Querys that they are 
 interested in.




[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation

2011-05-02 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027626#comment-13027626
 ] 

Earwin Burrfoot commented on LUCENE-3061:
-

Mark these as @experimental?

 Open IndexWriter API to allow custom MergeScheduler implementation
 --

 Key: LUCENE-3061
 URL: https://issues.apache.org/jira/browse/LUCENE-3061
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-3061.patch, LUCENE-3061.patch


 IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which 
 makes it impossible for someone to implement his own MergeScheduler. We 
 should open up these APIs, as well as any others that can be useful for custom 
 MS implementations.




Re: MergePolicy Thresholds

2011-05-02 Thread Earwin Burrfoot
Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
 Hi

 Today, LogMP allows you to set different thresholds for segment sizes,
 thereby allowing you to control the largest segment that will be
 considered for merge + the largest segment your index will hold (=~
 threshold * mergeFactor).

 So, if you want to end up w/ say 20GB segments, you can set
 maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.

 However, this often does not achieve your desired goal -- if the index
 contains 5 and 7 GB segments, they will never be merged b/c they are
 bigger than the threshold. I am willing to spend the CPU and IO resources
 to end up w/ 20 GB segments, whether I'm merging 10 segments together or
 only 2. After I reach a 20GB segment, it can rest peacefully, at least
 until I increase the threshold.

 So I wonder, first, if this threshold (i.e., the largest segment size you
 would like to end up with) is more natural to set than the current
 thresholds, from the application level? I.e., wouldn't it be a simpler
 threshold to set instead of doing weird calculations that depend on
 maxMergeMB(ForOptimize) and mergeFactor?

 Second, should this be an addition to LogMP, or a different
 type of MP? One that adheres to only those two factors (perhaps the
 segSize threshold should be allowed to be set differently for optimize and
 regular merges). It can pick segments for merge such that it maximizes
 the result segment size (i.e., don't necessarily merge in sequential
 order), but not more than mergeFactor.

 I guess, if we think that maxResultSegmentSizeMB is more intuitive than
 the current thresholds, application-wise, then this change should go
 into LogMP. Otherwise, it feels like a different MP is needed, because
 LogMP is already complicated and another threshold would confuse things.

 What do you think of this? Am I trying to optimize too much? :)

 Shai
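
To make the arithmetic above concrete, a sketch against the 3.x-era
LogByteSizeMergePolicy setters:

import org.apache.lucene.index.LogByteSizeMergePolicy;

class LogMPExample {
  static LogByteSizeMergePolicy twentyGigTarget() {
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    mp.setMergeFactor(10);         // merge up to 10 segments at a time
    mp.setMaxMergeMB(2.0 * 1024);  // only segments <= ~2GB are candidates
    // largest produced segment =~ 10 * 2GB = 20GB -- but an existing 5GB
    // segment already exceeds the threshold and will never be merged
    return mp;
  }
}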





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: MergePolicy Thresholds

2011-05-02 Thread Earwin Burrfoot
Dunno, I'm quite happy with numLargeSegments (you critically
misspelled it). It neatly avoids uber-merges, keeps the number of
segments at bay, and does not require to recalculate thresholds when
my expected index size changes.

The problem is - each person needs his own set of knobs (or thinks he
needs them) for MergePolicy, and I can't call any of these sets
superior to others :/
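
For concreteness, a sketch of the knob in question (assuming contrib/misc's
BalancedSegmentMergePolicy API):

import org.apache.lucene.index.BalancedSegmentMergePolicy; // contrib/misc

class BalancedMPExample {
  static BalancedSegmentMergePolicy nTopTier(int n) {
    BalancedSegmentMergePolicy mp = new BalancedSegmentMergePolicy();
    // keep exactly n top-tier segments of roughly equal size, whatever
    // that size turns out to be as the index grows
    mp.setNumLargeSegments(n);
    return mp;
  }
}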

2011/5/2 Shai Erera ser...@gmail.com:
 I did look at it, but I didn't find that it answers this particular need
 (ending with a segment no bigger than X). Perhaps by tweaking several
 parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve
 something, but it's not very clear what is the right combination.

 Which is related to one of the points -- is it not more intuitive for an app
 to set this threshold (if it needs any thresholds), than tweaking all of
 those parameters? If so, then we only need two thresholds (size +
 mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
 (perhaps w/ some adaptations) to derive a merge plan.

 Shai

 On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Have you checked BalancedSegmentMergePolicy? It has some more knobs :)

 On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
  Hi
 
  Today, LogMP allows you to set different thresholds for segment sizes,
  thereby allowing you to control the largest segment that will be
  considered for merge + the largest segment your index will hold (=~
  threshold * mergeFactor).
 
  So, if you want to end up w/ say 20GB segments, you can set
  maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
 
  However, this often does not achieve your desired goal -- if the index
  contains 5 and 7 GB segments, they will never be merged b/c they are
  bigger than the threshold. I am willing to spend the CPU and IO
  resources
  to end up w/ 20 GB segments, whether I'm merging 10 segments together or
  only 2. After I reach a 20GB segment, it can rest peacefully, at least
  until I increase the threshold.
 
  So I wonder, first, if this threshold (i.e., largest segment size you
  would like to end up with) is more natural to set than the current
  thresholds,
  from the application level? I.e., wouldn't it be a simpler threshold to
  set
  instead of doing weird calculations that depend on maxMergeMB(ForOptimize)
  and mergeFactor?
 
  Second, should this be an addition to LogMP, or a different
  type of MP. One that adheres to only those two factors (perhaps the
  segSize threshold should be allowed to be set differently for optimize and
  regular merges). It can pick segments for merge such that it maximizes
  the result segment size (i.e., don't necessarily merge in sequential
  order), but not more than mergeFactor.
 
  I guess, if we think that maxResultSegmentSizeMB is more intuitive than
  the current thresholds, application-wise, then this change should go
  into LogMP. Otherwise, it feels like a different MP is needed, because
  LogMP is already complicated and another threshold would confuse things.
 
  What do you think of this? Am I trying to optimize too much? :)
 
  Shai
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785







-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785




Re: MergePolicy Thresholds

2011-05-02 Thread Earwin Burrfoot
 The problem is - each person needs his own set of knobs (or thinks he
 needs them) for MergePolicy, and I can't call any of these sets
 superior to others :/

 I agree. I wonder though if the knobs we give on LogMP are intuitive enough.

 It neatly avoids uber-merges

 I didn't see that I can define what an uber-merge is, right? Can I tell it to
 stop merging segments of some size? E.g., if my index grew to 100 segments,
 40GB each, I don't think that merging 10 40GB segments (to create a 400GB
 segment) is going to speed up my search, for instance. A 40GB segment
 (probably much less) is already big enough to not be touched anymore.
No, you can't. But you can tell it to have exactly (not 'at most') N
top-tier segments and try to keep their sizes close with merges.
Whatever that size may be.
And this is exactly what I want. And defining a max cap on segment size
is not what I want.

So the same set of knobs can be intuitive and meaningful for one
person, and useless for another. And you can't pick the best one.

 Will BalancedMP stop merging such segments (if all segments are of that
 order of magnitude)?

 Shai

 On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Dunno, I'm quite happy with numLargeSegments (you critically
 misspelled it). It neatly avoids uber-merges, keeps the number of
 segments at bay, and does not require to recalculate thresholds when
 my expected index size changes.

 The problem is - each person needs his own set of knobs (or thinks he
 needs them) for MergePolicy, and I can't call any of these sets
 superior to others :/

 2011/5/2 Shai Erera ser...@gmail.com:
  I did look at it, but I didn't find that it answers this particular need
  (ending with a segment no bigger than X). Perhaps by tweaking several
  parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can
  achieve
  something, but it's not very clear what is the right combination.
 
  Which is related to one of the points -- is it not more intuitive for an
  app
  to set this threshold (if it needs any thresholds), than tweaking all of
  those parameters? If so, then we only need two thresholds (size +
  mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic
  (perhaps w/ some adaptations) to derive a merge plan.
 
  Shai
 
  On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com
  wrote:
 
  Have you checked BalancedSegmentMergePolicy? It has some more knobs :)
 
  On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote:
   Hi
  
    Today, LogMP allows you to set different thresholds for segment
    sizes,
   thereby allowing you to control the largest segment that will be
   considered for merge + the largest segment your index will hold (=~
   threshold * mergeFactor).
  
   So, if you want to end up w/ say 20GB segments, you can set
   maxMergeMB(ForOptimize) to 2GB and mergeFactor=10.
  
   However, this often does not achieve your desired goal -- if the
   index
   contains 5 and 7 GB segments, they will never be merged b/c they are
   bigger than the threshold. I am willing to spend the CPU and IO
   resources
   to end up w/ 20 GB segments, whether I'm merging 10 segments together
   or
   only 2. After I reach a 20GB segment, it can rest peacefully, at
   least
   until I increase the threshold.
  
   So I wonder, first, if this threshold (i.e., the largest segment size you
   would like to end up with) is more natural to set from the application
   level than the current thresholds? I.e., wouldn't it be a simpler threshold
   to set instead of doing weird calculus that depends on
   maxMergeMB(ForOptimize) and mergeFactor?
  
   Second, should this be an addition to LogMP, or a different type of MP?
   One that adheres to only those two factors (perhaps the segSize threshold
   should be allowed to be set differently for optimize and regular merges).
   It can pick segments for merge such that it maximizes the result segment
   size (i.e., don't necessarily merge in sequential order), but not more
   than mergeFactor.
  
   I guess, if we think that maxResultSegmentSizeMB is more intuitive than
   the current thresholds, application-wise, then this change should go
   into LogMP. Otherwise, it feels like a different MP is needed, because
   LogMP is already complicated and another threshold would confuse things.
  
   What do you think of this? Am I trying to optimize too much? :)
  
   Shai
  
  
 
 
 
  --
  Kirill Zakharenko/Кирилл Захаренко
  E-Mail/Jabber: ear...@gmail.com
  Phone: +7 (495) 683-567-4
  ICQ: 104465785
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail

Re: Setting the max number of merge threads across IndexWriters

2011-05-01 Thread Earwin Burrfoot
Almost any design that keeps circular references between components is
broken. Inability to share MergeSchedulers is just another testimonial
to that.

2011/4/16 Shai Erera ser...@gmail.com:
 Hi
 This was raised in LUCENE-2755 (along with other useful refactoring to
 MS-IW-MP interaction). Here is the relevant comment which addresses Jason's
 particular
 issue: https://issues.apache.org/jira/browse/LUCENE-2755?focusedCommentId=12966029&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12966029
 In short, we can refactor CMS to not hold to an IndexWriter member if we
 change a lot of the API. But IMO, an ExecutorServiceMS is the right way to
 go, if you don't mind giving up some CMS features, like controlling thread
 priority and stalling running threads. In fact, even w/ ExecutorServiceMS
 you can still achieve some (e.g., stalling), but some juggling will be
 required.
 Then, instead of trying to factor out IW members from this MS, you could
 share the same ES with all MS instances, each will keep a reference to a
 different IW member. This is just a thought though, I haven't tried it.
 Shai

 On Thu, Apr 14, 2011 at 8:23 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Can't remember. Probably no. I started an experimental MS api rewrite
 (incorporating ability to share MSs between IWs) some time ago, but
 never had the time to finish it.

 On Thu, Apr 14, 2011 at 19:56, Simon Willnauer
 simon.willna...@googlemail.com wrote:
  On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com
  wrote:
  I proposed to decouple MergeScheduler from IW (stop keeping a
  reference to it). Then you can create a single CMS and pass it to all
  your IWs.
  Yep that was it... is there an issue for this?
 
  simon
 
  On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
  I think the proposal involved using a ThreadPoolExecutor, which seemed
  to not quite work as well as what we have.  I think it'll be easier to
  simply pass a global context that keeps a counter of the actively
  running threads, and pass that into each IW's CMS?
 
  On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer
  simon.willna...@googlemail.com wrote:
  On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
  Today the ConcurrentMergeScheduler allows setting the max thread
  count and is bound to a single IndexWriter.
 
  However in the [common] case of multiple IndexWriters running in
  the same process, this disallows one from managing the aggregate
  number of merge threads executing at any given time.
 
  I think this can be fixed, shall I open an issue?
 
  go ahead! I think I have seen this suggestion somewhere maybe you
  need
  to see if there is one already
 
  simon
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 
 
  --
  Kirill Zakharenko/Кирилл Захаренко
  E-Mail/Jabber: ear...@gmail.com
  Phone: +7 (495) 683-567-4
  ICQ: 104465785
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Setting the max number of merge threads across IndexWriters

2011-05-01 Thread Earwin Burrfoot
By 'global' you don't mean a 'static var', do you? I very much hope not.

2011/4/16 Jason Rutherglen jason.rutherg...@gmail.com:
 I'd rather not lose [important] functionality.  I think a global max
 thread count is the least intrusive way to go, however I also need to
 see if that's possible.  If so I'll open an issue and post a patch.

 2011/4/15 Shai Erera ser...@gmail.com:
 Hi
 This was raised in LUCENE-2755 (along with other useful refactoring to
 MS-IW-MP interaction). Here is the relevant comment which addresses Jason's
 particular
 issue: https://issues.apache.org/jira/browse/LUCENE-2755?focusedCommentId=12966029&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12966029
 In short, we can refactor CMS to not hold to an IndexWriter member if we
 change a lot of the API. But IMO, an ExecutorServiceMS is the right way to
 go, if you don't mind giving up some CMS features, like controlling thread
 priority and stalling running threads. In fact, even w/ ExecutorServiceMS
 you can still achieve some (e.g., stalling), but some juggling will be
 required.
 Then, instead of trying to factor out IW members from this MS, you could
 share the same ES with all MS instances, each will keep a reference to a
 different IW member. This is just a thought though, I haven't tried it.
 Shai

 On Thu, Apr 14, 2011 at 8:23 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Can't remember. Probably no. I started an experimental MS api rewrite
 (incorporating ability to share MSs between IWs) some time ago, but
 never had the time to finish it.

 On Thu, Apr 14, 2011 at 19:56, Simon Willnauer
 simon.willna...@googlemail.com wrote:
  On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com
  wrote:
  I proposed to decouple MergeScheduler from IW (stop keeping a
  reference to it). Then you can create a single CMS and pass it to all
  your IWs.
  Yep that was it... is there an issue for this?
 
  simon
 
  On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
  I think the proposal involved using a ThreadPoolExecutor, which seemed
  to not quite work as well as what we have.  I think it'll be easier to
  simply pass a global context that keeps a counter of the actively
  running threads, and pass that into each IW's CMS?
 
  On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer
  simon.willna...@googlemail.com wrote:
  On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen
  jason.rutherg...@gmail.com wrote:
  Today the ConcurrentMergeScheduler allows setting the max thread
  count and is bound to a single IndexWriter.
 
  However in the [common] case of multiple IndexWriters running in
  the same process, this disallows one from managing the aggregate
  number of merge threads executing at any given time.
 
  I think this can be fixed, shall I open an issue?
 
  go ahead! I think I have seen this suggestion somewhere maybe you
  need
  to see if there is one already
 
  simon
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 
 
  --
  Kirill Zakharenko/Кирилл Захаренко
  E-Mail/Jabber: ear...@gmail.com
  Phone: +7 (495) 683-567-4
  ICQ: 104465785
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers

2011-04-30 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027361#comment-13027361
 ] 

Earwin Burrfoot commented on LUCENE-3055:
-

Could anyone remind me why the hell we still have Analyzer.tokenStream AND 
reusableTokenStream rampaging around and confusing minds? We always recommend 
using the latter, and Robert just fixed some of the core classes to use it.

Also, if reusableTokenStream is the only method left standing, isn't it wise to 
hide actual reuse somewhere in Lucene internals and turn Analyzer into a plain 
and dumb factory interface?
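
For illustration only, a sketch of how that could look (hypothetical names,
not Lucene's actual API): reuse is handled once, internally, via a per-thread
cache, and subclasses only implement a dumb factory method:

    import java.io.Reader;

    // Stand-in for TokenStream; assumes the stream can be rewound onto new input.
    interface ReusableStream {
        void reuse(Reader input);
    }

    abstract class ReusingAnalyzer {
        private final ThreadLocal<ReusableStream> cached =
            new ThreadLocal<ReusableStream>();

        // The only method subclasses implement: a plain, dumb factory.
        protected abstract ReusableStream create(String field, Reader input);

        public final ReusableStream stream(String field, Reader input) {
            ReusableStream ts = cached.get();
            if (ts == null) {
                ts = create(field, input);
                cached.set(ts);   // reuse hidden here, once, for everybody
            } else {
                ts.reuse(input);
            }
            return ts;
            // (a real version would key the cache per field; simplified here)
        }
    }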

 LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
 --

 Key: LUCENE-3055
 URL: https://issues.apache.org/jira/browse/LUCENE-3055
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 3.1
Reporter: Ian Soboroff

 LUCENE-2372 and LUCENE-2389 marked all analyzers as final.  This makes 
 ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. 
 StandardAnalyzer to make a small modification e.g. to tokenStream().  These 
 issues don't indicate a new method of doing this.  The issues don't give a 
 reason except for design considerations, which seems a poor reason to make a 
 backward-incompatible change.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2571) Indexing performance tests with realtime branch

2011-04-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020217#comment-13020217
 ] 

Earwin Burrfoot commented on LUCENE-2571:
-

bq. Merges are NOT blocking indexing on trunk no matter which MP you use.
Well.. merges tie up IO (especially if not on fancy SSDs/RAIDs), which in turn 
lags flushes - bigger delays for stop-the-world flushes / a lower bandwidth cap 
(after which they are forced to stop the world) for parallel flushes.

So Lance's point is partially valid.

 Indexing performance tests with realtime branch
 ---

 Key: LUCENE-2571
 URL: https://issues.apache.org/jira/browse/LUCENE-2571
 Project: Lucene - Java
  Issue Type: Task
  Components: Index
Reporter: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: wikimedium.realtime.Standard.nd10M_dps.png, 
 wikimedium.realtime.Standard.nd10M_dps_addDocuments.png, 
 wikimedium.realtime.Standard.nd10M_dps_addDocuments_flush.png, 
 wikimedium.trunk.Standard.nd10M_dps.png, 
 wikimedium.trunk.Standard.nd10M_dps_addDocuments.png


 We should run indexing performance tests with the DWPT changes and compare to 
 trunk.
 We need to test both single-threaded and multi-threaded performance.
 NOTE:  flush by RAM isn't implemented just yet, so either we wait with the 
 tests or flush by doc count.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Setting the max number of merge threads across IndexWriters

2011-04-14 Thread Earwin Burrfoot
I proposed to decouple MergeScheduler from IW (stop keeping a
reference to it). Then you can create a single CMS and pass it to all
your IWs.
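
A rough sketch of what that enables (hypothetical names, not CMS's real API):
once the scheduler holds no IW reference, a single shared limiter can cap the
aggregate merge threads across all writers:

    import java.util.concurrent.Semaphore;

    class SharedMergeLimiter {
        private final Semaphore permits;

        SharedMergeLimiter(int maxAggregateMergeThreads) {
            permits = new Semaphore(maxAggregateMergeThreads);
        }

        // called by any writer's scheduler; blocks at the global cap
        void runMerge(final Runnable merge) throws InterruptedException {
            permits.acquire();
            new Thread(new Runnable() {
                public void run() {
                    try { merge.run(); } finally { permits.release(); }
                }
            }).start();
        }
    }

Shai's ExecutorServiceMS variant achieves the same cap differently: every
scheduler submits its merges to one shared fixed-size pool.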

On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 I think the proposal involved using a ThreadPoolExecutor, which seemed
 to not quite work as well as what we have.  I think it'll be easier to
 simply pass a global context that keeps a counter of the actively
 running threads, and pass that into each IW's CMS?

 On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 Today the ConcurrentMergeScheduler allows setting the max thread
 count and is bound to a single IndexWriter.

 However in the [common] case of multiple IndexWriters running in
 the same process, this disallows one from managing the aggregate
 number of merge threads executing at any given time.

 I think this can be fixed, shall I open an issue?

 go ahead! I think I have seen this suggestion somewhere maybe you need
 to see if there is one already

 simon

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Setting the max number of merge threads across IndexWriters

2011-04-14 Thread Earwin Burrfoot
Can't remember. Probably no. I started an experimental MS api rewrite
(incorporating ability to share MSs between IWs) some time ago, but
never had the time to finish it.

On Thu, Apr 14, 2011 at 19:56, Simon Willnauer
simon.willna...@googlemail.com wrote:
 On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote:
 I proposed to decouple MergeScheduler from IW (stop keeping a
 reference to it). Then you can create a single CMS and pass it to all
 your IWs.
 Yep that was it... is there an issue for this?

 simon

 On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 I think the proposal involved using a ThreadPoolExecutor, which seemed
 to not quite work as well as what we have.  I think it'll be easier to
 simply pass a global context that keeps a counter of the actively
 running threads, and pass that into each IW's CMS?

 On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:
 On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen
 jason.rutherg...@gmail.com wrote:
 Today the ConcurrentMergeScheduler allows setting the max thread
 count and is bound to a single IndexWriter.

 However in the [common] case of multiple IndexWriters running in
 the same process, this disallows one from managing the aggregate
 number of merge threads executing at any given time.

 I think this can be fixed, shall I open an issue?

 go ahead! I think I have seen this suggestion somewhere maybe you need
 to see if there is one already

 simon

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Numerical ids for terms?

2011-04-12 Thread Earwin Burrfoot
On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich gre...@arbylon.net wrote:
 Hi -- has there been any effort to create a numerical representation of
 Lucene indices? That is, to use the Lucene Directory backend as a large
 term-document matrix at index level. As this would require a bijective mapping
 between terms (per-field, as customary in Lucene) and a numerical index
 (integer, monotonic from 0 to numTerms()-1), I guess this requires some
 special modifications to the Lucene core.
A Lucene index already provides a term -> id mapping in some form.
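
Conceptually (a sketch, not the flex API): a sorted term dictionary already
defines a bijective term <-> ordinal mapping via binary search, which is
roughly the form in which Lucene provides it:

    import java.util.Arrays;

    class TermOrdinals {
        private final String[] sortedTerms; // a field's terms, sorted

        TermOrdinals(String[] sortedTerms) { this.sortedTerms = sortedTerms; }

        int ordOf(String term) {            // term -> id, or -1 if absent
            int pos = Arrays.binarySearch(sortedTerms, term);
            return pos >= 0 ? pos : -1;
        }

        String termOf(int ord) {            // id -> term, ord in 0..numTerms()-1
            return sortedTerms[ord];
        }
    }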

 Another interesting feature would be to use Lucene's Directory backend for
 storage of large dense matrices, for instance to data-mining tasks from
 within Lucene.
Lucene's Directory is a dumb abstraction for random-access named
write-once byte streams.
It doesn't add /any/ value over mmap.

 Any suggestions?
*troll mode on* Use numpy/scipy? :)

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



An IDF variation with penalty for very rare terms

2011-04-12 Thread Earwin Burrfoot
Excuse me for going somewhat off-topic, but has anybody ever seen/used -subj-?
Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png
Traditional log(N/x) tail, but when nearing zero freq, instead of
going to +inf you do a nice round bump (with controlled
height/location/sharpness) and drop down to -inf (or zero).

Should be cool when doing cosine-measure (or something comparable) based
document comparisons (e.g. in a "more like this" query, to mention Lucene
at least once :) ), over dirty data.
The rationale is that most good, discriminating terms are found in at
least a certain percentage of your documents, but there are lots of
mostly unique crap terms, which at some collection sizes stop being
strictly unique and, with IDF's help, explode your scores.
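
One possible closed form with that shape (my own illustrative guess, not
anything any engine ships): take the classic log(N/df) tail and multiply it
by a damping factor df/(df + k), so rare terms bump and then fall to zero
instead of exploding:

    class DampedIdf {
        // k controls where the bump sits and how sharp it is.
        // For df >> k this tracks plain log(N/df); as df -> 0 it goes to 0.
        static double idf(double docFreq, double numDocs, double k) {
            return Math.log(numDocs / docFreq) * (docFreq / (docFreq + k));
        }
    }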

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant

2011-04-08 Thread Earwin Burrfoot
On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote:
 On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter
 hossman_luc...@fucit.org wrote:

 : -1. These files should be readable, for maintaining, debugging and
 : knowing whats going on.

 Readability is my main concern ... i don't know (and frequently can't
 tell) the difference between a lot of non-ascii characters -- and i'm
 guessing i'm not alone.  when it's spelled out explicitly using the
 character name or escape code, there is no ambiguity about what character
 was intended, or whether it got screwed up by some tool along the way (ie:
 the svn server, an svn client, the patch command, a text editor, an IDE,
 ant's fixcrlf task, etc...)

 Please take the time, just 5 or 10 minutes, to look thru some of this
 source code and tests.

 Imagine if you couldn't just look at the code to see what it does, but
 had to decode from some crazy numeric encoding scheme.
 Imagine if it were this way for things like stopword lists too.

 It would be basically impossible for you to look at the code and
 figure out what it does!
 For example, try looking at thai analyzer tests, if these were all
 numbers, how would you know wtf is going on?

 Although this comes up from time to time, I stand firm on my -1
 because its important to me for the source code to be readable.
 I'm not willing to give this up just because some people cannot read
 writing system XYZ.

 I have said before, i'm willing to change my -1 vote on this, if *ALL*
 string constants (including english ones) are changed to be character
 escapes.
 If you imagine what the code would look like if english string
 constants were instead codes, then I think you will understand my
 point of view!

 Its really really important to source code readability to be able to
 open a file and understand what it does, not to have to use some
 decoder because it uses characters other people dont understand.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



I think having both the raw characters /and/ the encoded representation is
best (one of them in comments)?
I'm all for unicode sources, but at least two things hit me repeatedly:
1. Tools do screw up, and you have to recover somehow.
E.g. IntelliJ IDEA's 'shelve' function uses the platform default encoding
(MacRoman in my case) and I've lost some text on things I shelved but never
committed anywhere.
2. There are characters that look exactly the same.
E.g. different whitespace/dashes. Or, (if you have cyrillic in your
fonts) I dare you to discern between a/а, c/с, e/е, o/о.
These are different characters from the latin and cyrillic charsets (left
latin/right cyrillic), but in 99% of fonts they are visually identical.
I had a filter that folded up similar-looking characters, and it
was documented in exactly this way - raw char + code.
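
A toy example of that documentation style (folding rules are illustrative,
the filter itself is hypothetical) -- each entry carries the escape code AND
the raw character, so the intent survives even if a tool mangles the file:

    import java.util.HashMap;
    import java.util.Map;

    class LookalikeFolder {
        static final Map<Character, Character> FOLD =
            new HashMap<Character, Character>();
        static {
            FOLD.put('\u0430', 'a'); // а CYRILLIC SMALL LETTER A  -> latin a
            FOLD.put('\u0441', 'c'); // с CYRILLIC SMALL LETTER ES -> latin c
            FOLD.put('\u0435', 'e'); // е CYRILLIC SMALL LETTER IE -> latin e
            FOLD.put('\u043E', 'o'); // о CYRILLIC SMALL LETTER O  -> latin o
        }
        static char fold(char c) {
            Character folded = FOLD.get(c);
            return folded == null ? c : folded;
        }
    }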

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [POLL] JTS compile/test dependency

2011-04-06 Thread Earwin Burrfoot
On Wed, Apr 6, 2011 at 22:43, Robert Muir rcm...@gmail.com wrote:
 On Wed, Apr 6, 2011 at 2:12 PM, Ryan McKinley ryan...@gmail.com wrote:
 Some may be following the thread on spatial development...  here is a
 quick summary, and a poll to help decide what may be the best next
 move.

 I'm hoping to introduce a high level spatial API that can be used for
 a variety of indexing strategies and computational needs.  For simple
 point in BBox and point in WGS84 radius, this does not require any
 external libraries.  To support more complex queries -- point in
 polygon, complex geometry intersections, etc -- we need an LGPL
 library JTS.  The LGPL dependency is only needed to compile/test,
 there is no runtime requirement for JTS.  To enable the more
 complicated options you would need to add JTS to the classpath and
 perhaps set a environment variable.  This is essentially what we are
 now doing with the (soon to be removed) bdb contrib.

 I am trying to figure out the best home for this code and development
 to live.  I think it is essential for the JTS support to be part of
 the core build/test -- splitting it into a separate module that is
 tested elsewhere is not an option.  This raises the basic question of
 if people are willing to have the LGPL build dependency as part of the
 main lucene build.  I think it is, but am sympathetic to the idea that
 it might not be.

 I'm sorta confused about this (i'll probably offend someone here, but so be
 it)
 We have a contrib module for spatial that is experimental, people want to
 deprecate, and say has problems.
 Why must the super-expert-polygon stuff sit with the basic capability that
 probably most users want: the ability to do basic searches (probably in
 combination with text too) in their app?
 Its hard for me to tell, i hope the reason isn't elegance, but why aren't
 we working on making a simple,supported,80-20 case in lucene that
 non-spatial-gurus (and users) understand and can maintain... then it would
 seem ideal for the complex stuff to be outside of this project with any
 dependencies it wants?
 Users are probably really confused about the spatial situation: is it
 because we are floundering around this expert stuff

Handling Unicode code points outside of the BMP is highly expert stuff as
well. And it is totally unneeded by 80% of the users for any other reason
than elegance. I think you two guys can really understand each
other here : )

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [POLL] JTS compile/test dependency

2011-04-06 Thread Earwin Burrfoot
On Thu, Apr 7, 2011 at 01:11, Robert Muir rcm...@gmail.com wrote:
 On Wed, Apr 6, 2011 at 5:07 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Handling Unicode code points outside of BMP is highly expert stuff as
 well. And is totally unneeded by 80% of the users for any other reason
 except elegance. I think you two guys can really understand each
 other here : )


 you are wrong: you either support unicode, or your application is
 buggy. Its not an optional feature, its the text standard used by the
 java programming language.

You either handle the Earth as a proper somewhat-ellipsoid, or
your application is buggy. It's not an optional feature, it's even
stronger than a standard - it is a physical fact experienced by all of
us earthlings.

Though 80% of the users can throw geoids and unicode planes out of the
window and live happily with some stupid local coordinate system and
two-byte characters (some even manage with one-byte!). Yeah, they
don't really care about being buggy in any geo/unicode-zealot's eyes.

Having said that, it's cool that people like you two exist :) Because
earth is round, maps are ugly, there are lots of different writing
systems and someone has to deal with that.

-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs

2011-03-31 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014108#comment-13014108
 ] 

Earwin Burrfoot commented on LUCENE-2981:
-

Bye-bye, DB. Few things can compete with it in pointlessness.

 Review and potentially remove unused/unsupported Contribs
 -

 Key: LUCENE-2981
 URL: https://issues.apache.org/jira/browse/LUCENE-2981
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2981.patch


 Some of our contribs appear to be lacking for development/support or are 
 missing tests.  We should review whether they are even pertinent these days 
 and potentially deprecate and remove them.
 One of the things we did in Mahout when bringing in Colt code was to mark all 
 code that didn't have tests as @deprecated and then we removed the 
 deprecation once tests were added.  Those that didn't get tests added over 
 about a 6 mos. period of time were removed.
 I would suggest taking a hard look at:
 ant
 db
 lucli
 swing
 (spatial should be gutted to some extent and moved to modules)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Urgent! Forgot to close IndexWriter after adding Documents to the index.

2011-03-22 Thread Earwin Burrfoot
On Tue, Mar 22, 2011 at 06:21, Chris Hostetter hossman_luc...@fucit.org wrote:

 (replying to the dev list, see context below)

 : Unfortunately, you can't easily recover from this (except by
 : reindexing your docs again).
 :
 : Failing to call IW.commit() or IW.close() means no segments file was 
 written...


 I know there were good reasons for eliminating the autoCommit
 functionality from IndexWriter, but threads like this make me think that
 even though autoCommit on flush/merge/whatever was bad, having an option
 for some sort of autoClose using a finalizer might be a good idea to
 give new/novice users a safety net.

 In the case of totally successful normal operation, this would result in
 one commit at GC (assuming the JVM calls the finalizer) and if there were
 any errors it should (if i understand correctly) do an implicit rollback.

 Anyone see a downside?
Yes. Totally unexpected magical behaviour.
What if I didn't commit something on purpose?

        ...

 :  I had a program running for 2 days to build an index for around 160 
 million
 :  text files, and after program ended, I tried searching the index and found
 :  the index was not correctly built, *indexReader.numDocs()* returns 0. I
 :  checked the index directory, it looked good, all the index data seemed to 
 be
 :  there, the directory is 1.5 Gigabytes in size.
 : 
 :  I checked my code and found that I forgot to call 
 *indexWriter.optimize()*and
 :  *indexWriter.close()*, I want to know if it is possible to
 :  *re-optimize()*the index so I don't need to rebuild the whole index
 :  from scratch? I don't
 :  really want the program to take another 2 days.


 -Hoss

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IndexReader.indexExists declares throwing IOE, but never does

2011-03-21 Thread Earwin Burrfoot
Technically, there's a big difference between "I checked, and there
was no index" and "I was unable to check the disk because the file system
went BANG!".
So the proper behaviour is to return false & IOE (on proper occasion)?
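
A sketch of that behaviour (assuming SegmentInfos.read(Directory) as in 3.x;
a hypothetical wrapper, not the method's actual implementation):

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;

    class IndexPresence {
        static boolean indexExists(Directory dir) throws IOException {
            try {
                new SegmentInfos().read(dir); // try to load the segments_N file
                return true;
            } catch (FileNotFoundException fnfe) {
                return false;                 // checked: no index there
            }
            // any other IOException ("file system went BANG!") propagates
        }
    }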

On Mon, Mar 21, 2011 at 13:53, Michael McCandless
luc...@mikemccandless.com wrote:
 On Mon, Mar 21, 2011 at 12:52 AM, Shai Erera ser...@gmail.com wrote:
 Can we remove the declaration? The method never throws IOE, but instead
 catches it and returns false. I think it's reasonable that such a method
 will not throw exceptions.

 +1

 --
 Mike

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IndexReader.indexExists declares throwing IOE, but never does

2011-03-21 Thread Earwin Burrfoot
2011/3/21 Shai Erera ser...@gmail.com:
 So the proper behaviour is to return false & IOE (on proper occasion)?

 I don't object to it, as I think it's reasonable (as today we may be hiding
 some info from the app). However, given that today we never throw IOE, and
 that if we start doing so, we'll change runtime behavior, I lean towards
 keeping the method simple and remove the throws declaration. Well, it's
 either we change the impl to throw IOE, or remove the declaration
 altogether.

 Changing the impl to throw IOE on proper occasion might be problematic --
 IndexNotFoundException is thrown when an empty index directory was given,
 however by its Javadocs, it can also indicate the index is corrupted.
 Perhaps the jdocs are wrong and it's thrown only if the index directory is
 empty, or no segments files are found. If that's the case, then we should
 change its javadocs. Otherwise, it will be difficult to know whether the
 INFE indicates an empty directory, for which you'll want to return false, or
 a corrupt index, for which you'll want to throw the exception.

 Besides, I consider this method almost like File.exists() which doesn't
 throw an exception. If indexExists() returns false, the app can decide to
 investigate further by trying to open IndexReader or read the SegmentInfos.
 But the API as-is needs to be simple IMO.
File.exists() parallel is a good one.
So, maybe, it's ok )

 Otherwise please keep the throws declaration so that you won't break
 public APIs if this changes implementation.

 Removing the throws declaration doesn't break apps. In the worst case,
 they'll have a catch block which is redundant?

 Shai

 On Mon, Mar 21, 2011 at 4:12 PM, Sanne Grinovero sanne.grinov...@gmail.com
 wrote:

 2011/3/21 Earwin Burrfoot ear...@gmail.com:
  Technically, there's a big difference between I checked, and there
  was no index, and I was unable to check the disk because file system
  went BANG!.
  So the proper behaviour is to return false  IOE (on proper occasion)?

 +1 to throw the exception when proper to do so

 Otherwise please keep the throws declaration so that you won't break
 public APIs if this changes implementation.

 
  On Mon, Mar 21, 2011 at 13:53, Michael McCandless
  luc...@mikemccandless.com wrote:
  On Mon, Mar 21, 2011 at 12:52 AM, Shai Erera ser...@gmail.com wrote:
  Can we remove the declaration? The method never throws IOE, but
  instead
  catches it and returns false. I think it's reasonable that such a
  method
  will not throw exceptions.
 
  +1
 
  --
  Mike
 
  http://blog.mikemccandless.com
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
 
 
 
  --
  Kirill Zakharenko/Кирилл Захаренко
  E-Mail/Jabber: ear...@gmail.com
  Phone: +7 (495) 683-567-4
  ICQ: 104465785
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007048#comment-13007048
 ] 

Earwin Burrfoot commented on LUCENE-2960:
-

bq. Oh yeah. But then we'd clone the full IWC on every set... this seems like 
overkill in the name of purity.
So what? What exactly is overkill? A few wasted bytes and CPU ns for an object 
that's created a couple of times during the application lifetime?
There are also builders, which are very similar to what Steven is proposing.

bq. Another thought is to offer all settings on the IWC for init convenience 
and exposure and then add javadoc about updaters on IW for those settings that 
can be changed on the fly
That's exactly how I'd like to see it.

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2960.patch


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. 
 It would be great to be able to control that on a live IndexWriter. Other 
 possible two methods that would be great to bring back are 
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other 
 setters can actually be set on the MergePolicy itself, so no need for setters 
 for those (I think).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007136#comment-13007136
 ] 

Earwin Burrfoot commented on LUCENE-2960:
-

You avoid deprecation/undeprecation and binary incompatibility, while 
incompatibly changing semantics. What do you win?

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2960.patch


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. 
 It would be great to be able to control that on a live IndexWriter. Other 
 possible two methods that would be great to bring back are 
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other 
 setters can actually be set on the MergePolicy itself, so no need for setters 
 for those (I think).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-14 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006759#comment-13006759
 ] 

Earwin Burrfoot commented on LUCENE-2960:
-

bq. infoStream is a PrintStream, which synchronizes anyway, so it should be 
safe to omit the volatile
You're absolutely right here.

bq. Yet, no real Java impl out there will ever do this since doing so will 
simply make that Java impl appear buggy.
Sorry, but real Java impls do this. The case with endless get() happened on a 
map that was never modified after being created and set. Just one of the many 
JVM instances on many machines got unlucky after restart.

bq. Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache 
levels, should cache only frequently used things
Java impls don't cache things. They do reorderings, and they also keep final fields 
in registers, omitting the reloads that happen for non-final ones, but there is no 
caching in JMM-related cases. Caching here is done by the CPU, and it caches all 
data read from memory.

bq. IWC cannot be made immutable – you build it up incrementally (new 
IWC(...).setThis(...).setThat(...)). Its fields cannot be final.
Setters can return a modified immutable copy of 'this'. So you get both 
incremental building and immutability.

bq. How about this as a compromise: IW continues cloning the incoming IWC on 
init, as it does today. This means any changes to the IWC instance you passed 
to IW will have no effect on IW.
What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 
'live' properties on IW.
This clearly draws the line, and you don't have to consult Javadocs for each 
and every setting to know if you can change it live or not.
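
A minimal sketch of the copy-on-set idea (a hypothetical config, not the real
IWC): every setter returns a fresh immutable instance, so incremental building
still works but nothing can change behind a writer's back:

    final class Config {
        final double ramBufferMB;
        final int termIndexInterval;

        Config(double ramBufferMB, int termIndexInterval) {
            this.ramBufferMB = ramBufferMB;
            this.termIndexInterval = termIndexInterval;
        }

        Config setRamBufferMB(double mb) {          // copy, don't mutate
            return new Config(mb, termIndexInterval);
        }

        Config setTermIndexInterval(int interval) {
            return new Config(ramBufferMB, interval);
        }
    }

    // usage: Config c = new Config(16.0, 128).setRamBufferMB(32.0);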

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2960.patch


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. 
 It would be great to be able to control that on a live IndexWriter. Other 
 possible two methods that would be great to bring back are 
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other 
 setters can actually be set on the MergePolicy itself, so no need for setters 
 for those (I think).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: GPU acceleration

2011-03-13 Thread Earwin Burrfoot
On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote:
 To clarify, I've not yet written any code. I aim to bring a large speedup to
 any functionality that is computationally expensive. I'm wondering which
 components are candidates for this.

 I'll be looking through the code but if anyone is aware of parallelizable
 code, I'll start with that.
More like 'vectorizable' code, huh?

Guys from Yandex use a modified group varint encoding plus handcrafted
SSE magic to decode/intersect posting lists, and claim tremendous
speedups over the original group varint.
They also use SSE to run the decision trees used in ranking.

There were experiments with moving both pieces of code to the GPU, and
the GPU did well in terms of speed, but they say getting data in and out
of the GPU made the approach unfeasible.
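
For reference, the plain scalar form of group varint decoding (one common
layout; the SSE trick replaces the inner loop with a table-driven shuffle):
one prefix byte carries four 2-bit length descriptors, followed by four
little-endian ints of 1-4 bytes each:

    class GroupVarint {
        // decodes 4 ints from buf starting at pos; returns the new position
        static int decode4(byte[] buf, int pos, int[] out) {
            int sel = buf[pos++] & 0xFF;
            for (int i = 0; i < 4; i++) {
                int len = ((sel >>> (i * 2)) & 3) + 1;   // 1..4 bytes
                int v = 0;
                for (int b = 0; b < len; b++) {
                    v |= (buf[pos++] & 0xFF) << (8 * b); // little-endian
                }
                out[i] = v;
            }
            return pos;
        }
    }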

 I'll basically replicate existing functionality to run on the GPU.

 On 12/03/11 21:08, Simon Willnauer wrote:

 On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brien k...@kenobrien.org wrote:

 Hi,

 Is anyone looking at GPU acceleration for Solr? If not, I'd like to
 contribute code which adds this functionality.

 As I'm not familiar with the codebase, does anyone know which areas of
 functionality could benefit from high degrees of parallelism.

 Very interesting can you elaborate a little more what kind of
 functionality you exposed / try to expose to the GPU?

 simon

 Regards,

 Ken



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-13 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006227#comment-13006227
 ] 

Earwin Burrfoot commented on LUCENE-2960:
-

{quote}
Why such purity? What do we gain?

I'm all for purity, but only if it doesn't interfere w/ functionality.
Here, it's taking away freedom...
{quote}
We gain consistency and predictability. And there are a lot of freedoms that are 
dangerous for developers.

{quote}
In fact it should be fine to share an IWC across multiple writers; you
can change the RAM buffer for all of them at once.
{quote}

You've brought up a purrfect example of how NOT to do things.
This is called 'action at a distance' and is a damn bug. Very annoying one.
I've thoroughly experienced it with a previous major version of Apache HTTPClient 
- they had an API that suggested you could set per-request timeouts, while these 
were actually global for a single Client instance.
I fried my brain trying to understand why the hell random user requests timed out 
at a hundred times their intended duration.
Oh! It was an occasional admin request changing the global.

<irony>You know, you can actually instantiate some DateRangeFilter with a 
couple of Dates, and then change these Dates (they are writeable) before each 
request. Isn't it an exciting kind of programming freedom?
Or, back to our current discussion - we can pass RAMBufferSizeMB as an 
AtomicDouble instead of the current double, then we can use .set() on the instance 
we passed, and have our live reconfigurability. What's more - AtomicDouble 
protects us from word tearing!</irony>

bq. I doubt there's any JVM out there where our lack-of-volatile infoStream 
causes any problems.
Er.. While I have never personally witnessed unsynchronized long/double tearing,
I've seen the consequence of unsafely publishing a HashMap - an endless loop on 
get().
It happened on your run-of-the-mill Sun 1.6 JVM.
So the bug is there, lying in wait. Maybe nobody ever actually used the freedom 
to change infoStream in-flight, or the guy was lucky, or in his particular 
situation the field was guarded by some unrelated sync.




While I see banishing live reconfiguration from IW as a lost cause, I ask that 
IWC at least be made immutable. As Shai said - this will provide a clear 
barrier between mutable and immutable properties.

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. 
 It would be great to be able to control that on a live IndexWriter. Other 
 possible two methods that would be great to bring back are 
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other 
 setters can actually be set on the MergePolicy itself, so no need for setters 
 for those (I think).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IndexWriter#setRAMBufferSizeMB removed in trunk

2011-03-11 Thread Earwin Burrfoot
Is it really that hard to recreate IndexWriter if you have to change
the settings??

Yeah, yeah, you lose all your precious reused buffers, and maybe
there's a small indexing latency spike when switching from the old IW to
the new one, but people aren't changing their IW configs several times a
second, are they?

I suggest banning as many runtime-mutable settings as humanly
possible, and asking people to recreate objects for reconfiguration, be
it IW, IR, Analyzers, whatnot.

On Thu, Mar 10, 2011 at 23:07, Michael McCandless
luc...@mikemccandless.com wrote:
 On Thu, Mar 10, 2011 at 7:28 AM, Robert Muir rcm...@gmail.com wrote:

 This should block the release: if IndexWriterConfig is a broken design
 then we need to revert this now before its released, not make users
 switch over and then undeprecate/revert in a future release.

 +1

 I think we have to sort this out, one way or another, before releasing 3.1.

 I really don't like splitting setters across IWC vs IW.  That'll just
 cause confusion, and noise over time as we change our minds about
 where things belong.

 Looking through IWC, it seems that most setters can be done live.
 In fact, setRAMBufferSizeMB is *almost* live: all places in IW that
 use this pull it from the config, except for DocumentsWriter.  We
 could just push the config down to DW and have it pull live too...

 Other settings are not pulled live but for no good reason, eg
 termsIndexInterval is copied to a private field in IW but could just
 as easily be pulled when it's time to write a new segment...

 Maybe we should simply document which settings are live vs only take
 effect at init time?

 Mike

 --
 Mike

 http://blog.mikemccandless.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005617#comment-13005617
 ] 

Earwin Burrfoot commented on LUCENE-2960:
-

As I said on the list - if one needs to change IW config, he can always 
recreate IW with new settings.
Such changes cannot happen often enough for recreation to affect indexing 
performance.

The fact that you can change IW's behaviour post-construction by modifying 
unrelated IWC instance is frightening. IW should either make a private copy of 
IWC when constructing, or IWC should be made immutable.

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. 
 It would be great to be able to control that on a live IndexWriter. Other 
 possible two methods that would be great to bring back are 
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other 
 setters can actually be set on the MergePolicy itself, so no need for setters 
 for those (I think).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: IndexWriter#setRAMBufferSizeMB removed in trunk

2011-03-11 Thread Earwin Burrfoot
Thanks for your support, but I don't think setInfoStream makes any
sense either : )

Do we /change/ infoStreams for IW @runtime? Why can't we pass it as a
constructor argument/IWC field?
Ok, just maybe, I can imagine a case where a certain app runs
happily, then misbehaves, and then you, with some clever trickery,
supply it a fresh infoStream to capture the problem live, without
restarting.
So, just maybe, we should leave setInfoStream as-is.

2011/3/11 Shai Erera ser...@gmail.com:
 I agree. After IWC, the only setter left in IW is setInfoStream which makes
 sense. But the rest ... assuming these config change don't happen very
 often, recreating IW doesn't sound like a big thing to me. The alternative
 of complicating IWC to support runtime changes -- we need to be absolutely
 sure it's worth it.

 Also, if the solution is to allow changing IWC (runtime) settings, then I
 don't think this issue should block 3.1? We can anyway add other runtime
 settings following 3.1, and we won't undeprecate anything. So maybe mark
 that issue as a non-blocker?

 Shai

 On Fri, Mar 11, 2011 at 2:20 PM, Earwin Burrfoot ear...@gmail.com wrote:

 Is it really that hard to recreate IndexWriter if you have to change
 the settings??

 Yeah, yeah, you lose all your precious reused buffers, and maybe
 there's a small indexing latency spike when switching from the old IW to
 the new one, but people aren't changing their IW configs several times a
 second, are they?

 I suggest banning as many runtime-mutable settings as humanly
 possible, and asking people to recreate objects for reconfiguration, be
 it IW, IR, Analyzers, whatnot.

 On Thu, Mar 10, 2011 at 23:07, Michael McCandless
 luc...@mikemccandless.com wrote:
  On Thu, Mar 10, 2011 at 7:28 AM, Robert Muir rcm...@gmail.com wrote:
 
  This should block the release: if IndexWriterConfig is a broken design
  then we need to revert this now before its released, not make users
  switch over and then undeprecate/revert in a future release.
 
  +1
 
  I think we have to sort this out, one way or another, before releasing
  3.1.
 
  I really don't like splitting setters across IWC vs IW.  That'll just
  cause confusion, and noise over time as we change our minds about
  where things belong.
 
  Looking through IWC, it seems that most setters can be done live.
  In fact, setRAMBufferSizeMB is *almost* live: all places in IW that
  use this pull it from the config, except for DocumentsWriter.  We
  could just push the config down to DW and have it pull live too...
 
  Other settings are not pulled live but for no good reason, eg
  termsIndexInterval is copied to a private field in IW but could just
  as easily be pulled when it's time to write a new segment...
 
  Maybe we should simply document which settings are live vs only take
  effect at init time?
 
  Mike
 
  --
  Mike
 
  http://blog.mikemccandless.com
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 



 --
 Kirill Zakharenko/Кирилл Захаренко
 E-Mail/Jabber: ear...@gmail.com
 Phone: +7 (495) 683-567-4
 ICQ: 104465785

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






-- 
Kirill Zakharenko/Кирилл Захаренко
E-Mail/Jabber: ear...@gmail.com
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter

2011-03-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005891#comment-13005891
 ] 

Earwin Burrfoot commented on LUCENE-2960:
-

bq. Furthermore, closing the IW also forces you to commit, and I don't like 
tying changing of configuration to forcing a commit.
Like I said, one isn't going to change his configuration five times a second. 
It's ok to commit from time to time?

bq. So why should we force it to be unchangeable? That can only remove freedom, 
freedom that is perhaps valuable to an app somewhere.
Each and every live reconfigurable setting adds to complexity.
At the very least it requires proper synchronization. Take your SegmentWarmer 
example - you should make the field volatile.
While it's possible to chicken out on primitive fields ([except 
long/double|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]),
as Yonik mentioned earlier, leaving mutable references non-volatile introduces 
you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is 
currently broken!).
For more complex cases, certain on-change logic is required. And then you have 
to support this logic across all possible code rewrites and refactorings.
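
The SegmentWarmer case in one small sketch (hypothetical names): without
volatile, a thread may forever observe a stale or half-published object
through the field; with it, publication is safe:

    class Warmer { void warm() { /* warm the new segment */ } }

    class WriterLike {
        // volatile = safe publication: readers see a fully constructed Warmer
        private volatile Warmer mergedSegmentWarmer;

        void setMergedSegmentWarmer(Warmer w) { mergedSegmentWarmer = w; }

        void onMergeFinished() {
            Warmer w = mergedSegmentWarmer; // read once into a local
            if (w != null) w.warm();
        }
    }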

 Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
 --

 Key: LUCENE-2960
 URL: https://issues.apache.org/jira/browse/LUCENE-2960
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shay Banon
Priority: Blocker
 Fix For: 3.1, 4.0


 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and it is removed in 
 trunk. It would be great to be able to control that on a live IndexWriter. 
 Two other methods that it would be great to bring back are 
 setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other 
 setters can actually be set on the MergePolicy itself, so no need for setters 
 for those (I think).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2908) clean up serialization in the codebase

2011-02-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994769#comment-12994769
 ] 

Earwin Burrfoot commented on LUCENE-2908:
-

Oh, damn :)
On my project, we specifically use Java serialization to pass configured 
Queries/Filters between cluster nodes, as it saves us HEAPS of 
wrapping/unwrapping them into parallel serializable classes.
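
Roughly this pattern (a sketch assuming a Lucene version where Query still 
implements Serializable, as in 3.x):

{code}
import java.io.*;
import org.apache.lucene.search.Query;

class QueryWire {
  // Sender side: flatten the configured Query into bytes for the wire.
  static byte[] toBytes(Query query) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(query);
    out.close();
    return bytes.toByteArray();
  }

  // Receiver side: rebuild the Query on another cluster node.
  static Query fromBytes(byte[] data) throws IOException, ClassNotFoundException {
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data));
    return (Query) in.readObject();
  }
}
{code}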

 clean up serialization in the codebase
 --

 Key: LUCENE-2908
 URL: https://issues.apache.org/jira/browse/LUCENE-2908
 Project: Lucene - Java
  Issue Type: Task
Reporter: Robert Muir
Assignee: Robert Muir
 Fix For: 4.0

 Attachments: LUCENE-2908.patch


 We removed contrib/remote, but forgot to clean up serialization hell 
 everywhere.
 This is no longer needed, never really worked (e.g. across versions), and 
 slows development (e.g. I wasted a long time debugging stupid serialization 
 of Similarity.idfExplain when trying to make a patch for the scoring system).

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [REINDEX] Note: re-indexing required !

2011-02-07 Thread Earwin Burrfoot
Lucene maintains compatibility with earlier stable release index
versions, and to some extent transparently upgrades them.
But there is no guaranteed compatibility between different
in-development indexes.

E.g. 3.2 reads 3.1 indexes and upgrades them, but 3.2-dev-snapshot-10
(while happily handling 3.1) may fail reading a 3.2-dev-snapshot-8
index, as the two have the same version tag yet different formats.

On Sun, Jan 23, 2011 at 19:18, Earl Hood e...@earlhood.com wrote:
 On Sat, Jan 22, 2011 at 11:14 PM, Shai Erera ser...@gmail.com wrote:
 Under LUCENE-2720 the index format of both trunk and 3x has changed. You
 should re-index any indexes created with either of these code streams.

 Does the 3x refer to the 3.x development branch?

 I.e. for those of us using the stable 3.x release of Lucene, will
 a future 3.x release require rebuilding indexes?

 --ewh

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2871) Use FileChannel in FSDirectory

2011-01-20 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12984222#action_12984222
 ] 

Earwin Burrfoot commented on LUCENE-2871:
-

Before arguing where to put this new IndexOutput, I think it's wise to have a 
benchmark proving we need it at all.
I have serious doubts FileChannel's going to outperform RAF.write(). Why should 
it?
And for the purposes of a benchmark, it can live anywhere.
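
A rough sketch of the kind of benchmark meant here (hypothetical, 
single-threaded, sequential writes only):

{code}
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class WriteBench {
  static final int BUF = 8192;
  static final long TOTAL = 256L * 1024 * 1024; // 256 MB per run

  public static void main(String[] args) throws IOException {
    // Plain RandomAccessFile.write(), as FSDirectory effectively does today.
    RandomAccessFile raf = new RandomAccessFile("raf.bin", "rw");
    byte[] raw = new byte[BUF];
    long t0 = System.nanoTime();
    for (long done = 0; done < TOTAL; done += BUF) raf.write(raw);
    raf.close();
    System.out.println("RAF.write():         " + (System.nanoTime() - t0) / 1000000 + " ms");

    // Same volume of data through FileChannel.write().
    FileChannel ch = new RandomAccessFile("chan.bin", "rw").getChannel();
    ByteBuffer buf = ByteBuffer.allocate(BUF);
    t0 = System.nanoTime();
    for (long done = 0; done < TOTAL; done += BUF) {
      buf.clear();     // position = 0, limit = capacity: write a full buffer
      ch.write(buf);
    }
    ch.close();
    System.out.println("FileChannel.write(): " + (System.nanoTime() - t0) / 1000000 + " ms");
  }
}
{code}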

 Use FileChannel in FSDirectory
 --

 Key: LUCENE-2871
 URL: https://issues.apache.org/jira/browse/LUCENE-2871
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Store
Reporter: Shay Banon
 Attachments: LUCENE-2871.patch, LUCENE-2871.patch


 Explore using FileChannel in FSDirectory to see if it improves write 
 operation performance

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Let's drop Maven Artifacts !

2011-01-18 Thread Earwin Burrfoot
Somehow, they have been made available since 2.0
- http://repo2.maven.org/maven2/org/apache/lucene/lucene-core/

The POMs are minimal, sans dependencies, so e.g. if your project
depends on lucene-spellchecker, lucene-core won't be transitively
included and your build will fail (you therefore had to add a
dependency on core to your project yourself, as sketched below).
But they were enough to download and link jars/sources/javadocs.
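
For illustration, the workaround in a consuming project's pom.xml (the version 
number here is just an example):

{code}
<dependencies>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-spellchecker</artifactId>
    <version>3.0.3</version>
  </dependency>
  <!-- Not pulled in transitively by the minimal POMs: declare it yourself. -->
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>3.0.3</version>
  </dependency>
</dependencies>
{code}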

On Tue, Jan 18, 2011 at 12:40, Shai Erera ser...@gmail.com wrote:
 Out of curiosity, how did the Maven people integrate Lucene before we had
 Maven artifacts? To the best of my understanding, we never had proper Maven
 artifacts (Steve is working on that in LUCENE-2657).

 Shai

 On Tue, Jan 18, 2011 at 11:03 AM, Simon Willnauer
 simon.willna...@googlemail.com wrote:

 On Tue, Jan 18, 2011 at 9:33 AM, Thomas Koch tho...@koch.ro wrote:
  Hi,
 
  the developers list may not be the right place to find strong maven
  supporters. All developers know lucene from the inside out and are
  perfectly fine installing lucene from whatever artifact.
  The people using maven are your end users, who probably don't even
  subscribe to users@.

 big +1 for this comment! I have to admit that I am not a big maven fan
 and each time I have to use it it's a pain in the ass, but it is the
 de-facto standard for the majority of java projects on this planet, so
 really there is not much of an option in my opinion. A project like
 lucene has to release maven artifacts even if it's a pain.

 Simon
 
  Thomas Koch, http://www.koch.ro
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: dev-h...@lucene.apache.org
 
 





-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly

2011-01-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983160#action_12983160
 ] 

Earwin Burrfoot commented on LUCENE-2657:
-

bq. we need to be very clear  and it has no effect on artifacts
I feel something was missed in the heat of debate. Eg:
bq. The latest patch on this release uses the Ant artifacts directly.
bq. This patch uses the Ant-produced artifacts to prepare for Maven artifact 
publishing. 
bq. Maven itself is not invoked in the process. An Ant plugin handles the 
artifact deployment.
I will now try to decipher these quotes.
It seems the patch takes the artifacts produced by Ant, as part of our usual 
(and only) build process, and shoves them down the Maven repository's throat 
along with a bunch of POM descriptors.
Nothing else is happening.

Also, after everything that has been said, I think nobody in their right mind 
will *force* anyone to actually use the Ant target in question as part of a 
release. But it's nice to have it around, in case some user-friendly committer 
would like to push (I'd like to reiterate - Ant-generated) artifacts into Maven.

 Replace Maven POM templates with full POMs, and change documentation 
 accordingly
 

 Key: LUCENE-2657
 URL: https://issues.apache.org/jira/browse/LUCENE-2657
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1, 4.0
Reporter: Steven Rowe
Assignee: Steven Rowe
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, 
 LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, 
 LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, 
 LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch


 The current Maven POM templates only contain dependency information, the bare 
 bones necessary for uploading artifacts to the Maven repository.
 The full Maven POMs in the attached patch include the information necessary 
 to run a multi-module Maven build, in addition to serving the same purpose as 
 the current POM templates.
 Several dependencies are not available through public maven repositories.  A 
 profile in the top-level POM can be activated to install these dependencies 
 from the various {{lib/}} directories into your local repository.  From the 
 top-level directory:
 {code}
 mvn -N -Pbootstrap install
 {code}
 Once these non-Maven dependencies have been installed, to run all Lucene/Solr 
 tests via Maven's surefire plugin, and populate your local repository with 
 all artifacts, from the top level directory, run:
 {code}
 mvn install
 {code}
 When one Lucene/Solr module depends on another, the dependency is declared on 
 the *artifact(s)* produced by the other module and deposited in your local 
 repository, rather than on the other module's un-jarred compiler output in 
 the {{build/}} directory, so you must run {{mvn install}} on the other module 
 before its changes are visible to the module that depends on it.
 To create all the artifacts without running tests:
 {code}
 mvn -DskipTests install
 {code}
 I almost always include the {{clean}} phase when I do a build, e.g.:
 {code}
 mvn -DskipTests clean install
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly

2011-01-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983162#action_12983162
 ] 

Earwin Burrfoot commented on LUCENE-2657:
-

Thanks, but I'm not the one confused here. : )


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Let's drop Maven Artifacts !

2011-01-18 Thread Earwin Burrfoot
On Tue, Jan 18, 2011 at 17:00, Robert Muir rcm...@gmail.com wrote:
 On Tue, Jan 18, 2011 at 8:54 AM, Grant Ingersoll gsing...@apache.org wrote:
 It seems to me that if we have a fix for the things that ail our Maven 
 support (Steve's work), that it isn't then the reason for holding up a 
 release and we should just keep them as there are a significant number of 
 users who consume Lucene that way (via the central repository).  I agree 
 that we should not switch our build system,  but supporting the POMs is no 
 different than supporting the IntelliJ/Eclipse generation tools (they are 
 both problematic since they are not automated)


 it's totally different in every way! we don't release the
 intellij/eclipse stuff, it's for internal use only.
 additionally, there are no release artifacts generated by these
Latest code from LUCENE-2657 does not generate any new artifacts. It
uploads those you already have (built via ant) to the repo.


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Let's drop Maven Artifacts !

2011-01-18 Thread Earwin Burrfoot
On Tue, Jan 18, 2011 at 20:13, Robert Muir rcm...@gmail.com wrote:
 Unfortunately there is a very loud minority that care about maven

 I would wager that there is a sizable silent *majority* of users who 
 literally depend on Lucene's Maven artifacts.

 I can't help but remind myself, this is the same argument Oracle
 offered up for the whole reason hudson debacle
 (http://hudson-labs.org/content/whos-driving-thing)

 Declaring that I have a secret pocket of users that want XYZ isn't
 open source consensus.

There is proof of existence for at least some part of this secret pool.

http://www.google.com/search?q=%22artifactid+lucene-core%22
Please, don't look at the About NNN results figure, those counts are known
to be veeery approximate.
Just page through. Some of the pages are Lucene POMs themselves. Many
of them are POMs for projects depending on lucene.


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2755) Some improvements to CMS

2011-01-17 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982564#action_12982564
 ] 

Earwin Burrfoot commented on LUCENE-2755:
-

bq. if you still want to work on it, then I can keep the issue open and mark it 
3.2 (unless you want to give it a try in 3.1). 
I'll start another issue later, so please, go on.

 Some improvements to CMS
 

 Key: LUCENE-2755
 URL: https://issues.apache.org/jira/browse/LUCENE-2755
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Assignee: Shai Erera
Priority: Minor
 Fix For: 3.1, 4.0

 Attachments: LUCENE-2755.patch


 While running optimize on a large index, I've noticed several things that got 
 me to read CMS code more carefully, and find these issues:
 * CMS may hold onto a merge if maxMergeCount is hit. That results in the 
 MergeThreads taking merges from the IndexWriter until they are exhausted, and 
 only then will that blocked merge run. I think it's unnecessary for that 
 merge to be blocked.
 * CMS sorts merges by segment size, doc-based and not bytes-based. Since the 
 default MP is LogByteSizeMP, and I hardly believe people care about doc-based 
 segment sizes anymore, I think we should switch the default impl. There are 
 two ways to make it extensible, if we want:
 ** Have an overridable member/method in CMS that you can extend and override 
 - easy.
 ** Have OneMerge be comparable and let the MP determine the order (e.g. by 
 bytes, docs, calibrate deletes etc.). Better, but will need to tap into 
 several places in the code, so more risky and complicated.
 Along the way, I'd like to add some documentation to CMS - it's not very easy 
 to read and follow.
 I'll work on a patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Let's drop Maven Artifacts !

2011-01-17 Thread Earwin Burrfoot
You're not alone. :)
But, I bet, many more people would like to skip that step and have
their artifacts downloaded from central.

On Mon, Jan 17, 2011 at 19:06, Steven A Rowe sar...@syr.edu wrote:
 On 1/17/2011 at 1:53 AM, Michael Busch wrote:
 I don't think any user needs the ability to run an ant target on
 Lucene's sources to produce maven artifacts

 I want to be able to make modifications to the Lucene source, install Maven 
 snapshot artifacts in my local repository, then depend on those snapshots 
 from other projects.  I doubt I'm alone.

 Steve




-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Let's drop Maven Artifacts !

2011-01-16 Thread Earwin Burrfoot
Maven is the de-facto package/dependency manager for Java. Like it or not.
All the better tools out there, like Ant+Ivy or SBT, support Maven repositories.
Lots of people rely on Maven or better tools for their builds, and once
you're on the declarative dependency-management train, it's a
bother to just take a bunch of jars and stuff 'em into your project.
Development tools (Eclipse/IDEA) support auto-downloading and
attaching sources/javadocs for declared dependencies, and people use
this.

 Well ... you raise interesting points. So if a committer would be willing to
 support GIT, RTC, and whatever (just making up scenarios), would we allow
 all of those to exist within Lucene?
So, while having a wild contributor supporting .. dunno .. a MacPorts
package for Lucene is a bit crazy and, in the end, nobody will ever notice,
supporting Maven broadens your audience and makes it happy (even those
guys who are not into Maven itself).

 I think the reasonable solution is to have a modules/maven package, with
 build.xml that generates whatever needs to be generated. Whoever cares about
 maven should run the proper Ant targets, just like whoever cares about
 Eclipse/IDEA can now run "ant eclipse"/"ant idea". We'd have an "ant maven". If
 that's what you intend doing in 2657 then fine.
That should be someone amongst the committers, whether it's part of
the default release process or not.
I believe publishing Maven artifacts is somewhat nontrivial for a
person not related to the project in question.

 The release manager need not be concerned w/ Maven (or whatever) artifacts,
 they are not officially published anywhere, and everyone's happy. As long as
 all tests pass, the release is good to go.

 Is that better?

 Shai

 On Sun, Jan 16, 2011 at 8:05 PM, Steven A Rowe sar...@syr.edu wrote:

 -1 from me on dropping Maven artifacts.

 I find it curious that on the verge of fixing the broken Maven artifacts
 situation (LUCENE-2657), there is a big push for a divorce.

 Robert, I agree we should have a way to test the magic artifacts.  I'm
 working on it.  Your other objection is the work involved - you don't want
 to do it.  I will do the work.

 We should not drop Maven support when there are committers willing to
 support it.  I obviously count myself in that camp.

 Steve

 Robert Muir rcm...@gmail.com wrote:


 On Sun, Jan 16, 2011 at 12:03 PM, Shai Erera ser...@gmail.com wrote:
  Hey
 
  Wearing on my rebel hat today, I'd like to propose we drop maven support
  from our release process / build system. I've always read about the
  maven
  artifacts never being produced right, and never working (or maybe never
  is a
  too harsh word).
 
  I personally don't understand why we struggle to support Maven. I'm
  perfectly fine if we say that Lucene/Solr uses SVN, Ant and release a
  bunch
  of .jar files you can embed in your project. Who says we need to
  support
  Maven? And if so, why only Maven (I'm kidding !)?
 
  Are you with me? :)
 

 I am; the last time I suggested releasing 3.1, a 99-email thread about
 maven ensued that basically left me frustrated and not wanting to work
 towards a release.

 We still don't have a test-maven target that does even trivial
 verification of these magical artifacts that most of us don't
 understand... like any other functionality we have, we should have
 tests so that the release manager can verify things are working before
 the release.  If we have a contrib that's unmaintained with no tests,
 would we let it block a release?

 I don't think we should let the maven problems hold lucene releases
 hostage.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org






-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl

2011-01-16 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982437#action_12982437
 ] 

Earwin Burrfoot commented on LUCENE-2374:
-

Nice. Except maybe introduce a simple interface instead of the Map<String, ?>?

{code}
interface AttributeReflector { // Name is crap, should be changed
  void reflect(String key, Object value);
}

void reflectWith(AttributeReflector reflector);
{code}

You have no need for fake maps then, both in toString(), and in user code.
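
For example, an attribute impl could satisfy the callback like this (a sketch 
against the interface above, not actual Lucene code):

{code}
class CharTermAttributeImpl /* extends AttributeImpl */ {
  private char[] buffer = "foobar".toCharArray();
  private int length = 6;

  // Push each key-value pair straight to the callback; no intermediate
  // Map or fake Iterator needs to be constructed.
  public void reflectWith(AttributeReflector reflector) {
    reflector.reflect("term", new String(buffer, 0, length));
  }
}
{code}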


 Add introspection API to AttributeSource/AttributeImpl
 --

 Key: LUCENE-2374
 URL: https://issues.apache.org/jira/browse/LUCENE-2374
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: 3.1, 4.0


 AttributeSource/TokenStream inspection in Solr needs to have some insight 
 into the contents of AttributeImpls. As LUCENE-2302 has some problems with 
 toString() [which is not structured and conflicts with CharSequence's 
 definition for CharTermAttribute], I propose a simple API that gets a default 
 implementation in AttributeImpl (just like toString() currently):
 - Iterator<Map.Entry<String,?>> AttributeImpl.contentsIterator() returns an 
 iterator (for most attributes it's a singleton) of key-value pairs, e.g. 
 term->foobar, startOffset->Integer.valueOf(0), ...
 - AttributeSource gets the same method, it just concatenates the iterators of 
 each getAttributeImplsIterator() AttributeImpl
 No backwards problems occur, as the default toString() method will work like 
 before (it just gets the iterator and lists it), but we simply remove the 
 documentation for the format. (Char)TermAttribute gets a special impl of 
 toString() according to CharSequence and a corresponding iterator.
 I also want to remove the abstract hashCode() and equals() methods from 
 AttributeImpl, as they are not needed and just create work for the 
 implementor.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders

2011-01-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982126#action_12982126
 ] 

Earwin Burrfoot commented on LUCENE-2858:
-

bq. Any comments about removing write access from IndexReaders? I think 
setNorms() will be removed soon, but how about the others like 
deleteDocument()? I would propose to also make all IndexReaders simply readers 
not writers? 

Voting with all my extremities - yes!!

 Separate SegmentReaders (and other atomic readers) from composite IndexReaders
 --

 Key: LUCENE-2858
 URL: https://issues.apache.org/jira/browse/LUCENE-2858
 Project: Lucene - Java
  Issue Type: Task
Reporter: Uwe Schindler
 Fix For: 4.0


 With current trunk, whenever you open an IndexReader on a directory you get 
 back a DirectoryReader, which is a composite reader. The interface of 
 IndexReader now has lots of methods that simply throw UOE (in fact more than 
 50% of the commonly used methods are unusable now). This 
 confuses users and makes the API hard to understand.
 This issue should split atomic readers from reader collections with a 
 separate API. After that, you are no longer able to get a TermsEnum without 
 wrapping from those composite readers. We currently have helper classes for 
 wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or 
 Multi*), those should be retrofitted to implement the correct classes 
 (SlowMultiReaderWrapper would be an atomic reader but takes a composite 
 reader as ctor param, maybe it could also simply take a List<AtomicReader>). 
 In my opinion, maybe composite readers could implement some collection APIs 
 and also have the ReaderUtil methods directly built in (possibly as a view 
 in the util.Collection sense). In general composite readers do not really 
 need to look like the previous IndexReaders, they could simply be a 
 collection of SegmentReaders with some functionality like reopen.
 On the other side, atomic readers do not need reopen logic anymore? When a 
 segment changes, you need a new atomic reader? - maybe because of deletions 
 that's not the best idea, but we should investigate. Maybe make the whole 
 reopen logic simpler to use (at least on the collection reader level).
 We should decide about good names, I have no preference at the moment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders

2011-01-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982132#action_12982132
 ] 

Earwin Burrfoot commented on LUCENE-2858:
-

bq. Still, i think we would need this method (somewhere) even with CSF, so that 
people can change the norms and they instantly take effect for searches.
This still puzzles me. If I strain my imagination, I can picture people who 
just need to change norms without reindexing.
But doing that and *requiring* instant turnaround? Kid me not :)



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders

2011-01-15 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982166#action_12982166
 ] 

Earwin Burrfoot commented on LUCENE-2858:
-

APIs have to be there still. All that commity, segment-deletery, mutabley stuff 
(that spans both atomic and composite readers).
So, while your plan is viable, it won't remove that much cruft.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically

2011-01-14 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981774#action_12981774
 ] 

Earwin Burrfoot commented on LUCENE-2868:
-

Here we use an intermediate query AST, with a number of walkers that do synonym 
substitution, optimization, caching, rewriting for multiple fields, and finally 
generate a tree of Lucene Queries.

I can share a generic reflection-based visitor that's somewhat handier than 
the default visitor pattern in Java.
Usage looks roughly like: 
{code}
class ToStringWalker extends DispatchingVisitor<String> { // String is the type
                                                          // of the walk result
  String visit(TermQuery q) {
    return "{term: " + q.getTerm() + "}";
  }

  String visit(BooleanQuery q) {
    StringBuffer buf = new StringBuffer();
    buf.append("{boolean: ");
    for (BooleanQuery.Clause clause : q.clauses()) {
      buf.append(dispatch(clause.getQuery())).append(", "); // Here we recurse
    }
    buf.append("}");
    return buf.toString();
  }

  String visit(SpanQuery q) { // Runs for all SpanQueries
    // ...
  }

  String visit(Query q) { // Runs for all Queries not covered by a more exact
                          // visit() method
    // ...
  }
}

Query query = ...;
String stringRepresentation = new ToStringWalker().dispatch(query);
{code}

dispatch() checks its parameter's runtime type, picks the closest matching 
visit() overload (following Java's rules for compile-time overload 
resolution), and invokes it.
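
A much-simplified sketch of what such a dispatcher could look like (the real 
class was never posted here; this version walks superclasses only, ignores 
interfaces, and requires the visit() overloads to be public):

{code}
import java.lang.reflect.Method;

public abstract class DispatchingVisitor<R> {
  @SuppressWarnings("unchecked")
  public R dispatch(Object arg) {
    // Try the argument's exact class first, then walk up the hierarchy
    // until some visit() overload accepts it.
    for (Class<?> c = arg.getClass(); c != null; c = c.getSuperclass()) {
      try {
        Method m = getClass().getMethod("visit", c);
        return (R) m.invoke(this, arg);
      } catch (NoSuchMethodException e) {
        // No overload for this class, try its superclass next.
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    throw new IllegalArgumentException("No visit() overload for " + arg.getClass());
  }
}
{code}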

 It should be easy to make use of TermState; rewritten queries should be 
 shared automatically
 

 Key: LUCENE-2868
 URL: https://issues.apache.org/jira/browse/LUCENE-2868
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Reporter: Karl Wright
 Attachments: query-rewriter.patch


 When you have the same query in a query hierarchy multiple times, tremendous 
 savings can now be had if the user knows enough to share the rewritten 
 queries in the hierarchy, due to the TermState addition.  But this is clumsy 
 and requires a lot of coding by the user to take advantage of.  Lucene should 
 be smart enough to share the rewritten queries automatically.
 This can be most readily (and powerfully) done by introducing a new method to 
 Query.java:
 Query rewriteUsingCache(IndexReader indexReader)
 ... and including a caching implementation right in Query.java which would 
 then work for all.  Of course, all callers would want to use this new method 
 rather than the current rewrite().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2011-01-13 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981388#action_12981388
 ] 

Earwin Burrfoot commented on LUCENE-2324:
-

Maan, this comment list is infinite.
How do I currently get the ..er.. current version? Latest branch + Jason's 
latest patch?

Regardless of everything else, I'd ask you not to extend random things :) at 
least if you can't say is-a about them.
DocumentsWriterPerThreadPool.ThreadState IS A ReentrantLock? No. So you're 
better off encapsulating it rather than extending it (see the sketch below).
The same applies to SegmentInfos, which extends Vector :/
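
A minimal sketch of the composition alternative (hypothetical, not the actual 
patch):

{code}
import java.util.concurrent.locks.ReentrantLock;

final class ThreadState {
  // ThreadState HAS a lock instead of BEING one: the lock stays an
  // implementation detail and stops leaking into the public type.
  private final ReentrantLock lock = new ReentrantLock();

  void lock() { lock.lock(); }
  void unlock() { lock.unlock(); }
  boolean isHeldByCurrentThread() { return lock.isHeldByCurrentThread(); }

  // ...per-thread DocumentsWriter state goes here...
}
{code}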

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: Realtime Branch

 Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, 
 lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO  CPU.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2011-01-12 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980649#action_12980649
 ] 

Earwin Burrfoot commented on LUCENE-2793:
-

What's with the ongoing craziness? :)

bq. DirectIOLinuxDirectory
First you introduce a kind of directory that is utterly useless except in 
certain special situations. Then, instead of fixing the directory/folding its 
code somewhere normal, you try to work around it by switching between 
directories. What's the point of using abstract classes or interfaces if you 
leak their implementations' logic all over the place?
Or making DIOLD wrap something. Yeah! Wrap my RAMDir!

bq. bufferSize
This value is only meaningful to a certain subset of Directory implementations. 
So the only logical place to set this value is in those very impls.
Sample code:
{code}
Directory ramDir = new RAMDirectory();
ramDir.createIndexInput(name, context);
// See, ma? No bufferSizes, they are pointless for RAMDir

Directory fsDir = new NIOFSDirectory();
fsDir.setBufferSize(IOContext.NORMAL_READ, 1024);
fsDir.setBufferSize(IOContext.MERGE, 4096);
fsDir.createIndexInput(name, context)
// See, ma? The only one who's really concerned with 'actual' buffer size is 
this concrete Directory impl
// All client code is only concerned with the context.
// It's NIOFSDirectory's business to give meaningful interpretation for 
IOContext and assign the buffer sizes.
{code}

You don't need custom Directory impls to make DIOLD work, you should freakin' 
fix it.
The proper way is to test things out, and then move the DirectIO code to the 
only place it makes sense in - FSDir? Probably make it switchable on/off, 
maybe not.

You don't need custom Directory impls to set buffer sizes (nor casts to 
BufferedIndexInput!), you should add the setting to those Directories that 
can make sense of it.

 Directory createOutput and openInput should take an IOContext
 -

 Key: LUCENE-2793
 URL: https://issues.apache.org/jira/browse/LUCENE-2793
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Store
Reporter: Michael McCandless
 Attachments: LUCENE-2793.patch


 Today for merging we pass down a larger readBufferSize than for searching 
 because we get better performance.
 I think we should generalize this to a class (IOContext), which would hold 
 the buffer size, but then could hold other flags like DIRECT (bypass OS's 
 buffer cache), SEQUENTIAL, etc.
 Then, we can make the DirectIOLinuxDirectory fully usable because we would 
 only use DIRECT/SEQUENTIAL during merging.
 This will require fixing how IW pools readers, so that a reader opened for 
 merging is not then used for searching, and vice versa.  Really, it's only 
 all the open file handles that need to be different -- we could in theory 
 share del docs, norms, etc, if that were somehow possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2011-01-12 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980732#action_12980732
 ] 

Earwin Burrfoot commented on LUCENE-2793:
-

bq. Because in your example code above, it looks like it's added to Directory 
itself.
bq. My problem with your sample code is that it appears that the .setBufferSize 
method is on Directory itself. 

Ohoho. My fault, sorry. It should look like:
{code}
RAMDirectory ramDir = new RAMDirectory();
ramDir.setBufferSize(whatever) // Compilation error!
ramDir.createIndexInput(name, context);

NIOFSDirectory fsDir = new NIOFSDirectory();
fsDir.setBufferSize(IOContext.NORMAL_READ, 1024);
fsDir.setBufferSize(IOContext.MERGE, 4096);
fsDir.createIndexInput(name, context)
{code}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2011-01-12 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980736#action_12980736
 ] 

Earwin Burrfoot commented on LUCENE-2793:
-

{quote}
As I said before though, i wouldn't mind if we had something more like a 
'modules/native' and FSDirectory checked, if this was available and 
automagically used it...
but I can't see myself thinking that we should put this logic into fsdir 
itself, sorry. 
{quote}
I'm perfectly OK with that approach (having some module that FSDir checks 
for). I also feel uneasy having JNI in core.
What I don't want to see is Directory impls that you can't use on their own. 
If you can only use it for merging, then it's not a Directory, it breaks the 
contract! - move the code elsewhere.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders

2011-01-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980388#action_12980388
 ] 

Earwin Burrfoot commented on LUCENE-2858:
-

bq. On the other side, atomic readers do not need reopen logic anymore? When a 
segment changes, you need a new atomic reader?
There is a freakload of places that upgrade SegmentReader in various ways, 
with deletions guilty for only part of the cases. I'll try getting back to 
LUCENE-2355 at the end of the week.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges

2011-01-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980390#action_12980390
 ] 

Earwin Burrfoot commented on LUCENE-2856:
-

A CompositeSegmentListener niftily removes the need for a collection.

 Create IndexWriter event listener, specifically for merges
 --

 Key: LUCENE-2856
 URL: https://issues.apache.org/jira/browse/LUCENE-2856
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 4.0
Reporter: Jason Rutherglen
 Attachments: LUCENE-2856.patch


 The issue will allow users to monitor merges occurring within IndexWriter 
 using a callback notifier event listener.  This can be used by external 
 applications such as Solr to monitor large segment merges.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2011-01-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980400#action_12980400
 ] 

Earwin Burrfoot commented on LUCENE-2793:
-

Looks crazy. In a -bad- tangled way.
You get an IOFactory from the Directory, put it into an IOContext, and then 
invoke it, passing it (wow!) an IOContext and a Directory. What if you pass a 
totally different Directory? A different IOContext? It blows up eerily.

And there's no justification for this - we already have an IOFactory, it's 
called Directory! It just needs an extra parameter on its factory methods 
(createInput/Output), that's all.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges

2011-01-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980448#action_12980448
 ] 

Earwin Burrfoot commented on LUCENE-2856:
-

A SegmentListener that holds a number of child SegmentListeners and delegates 
eventHappened() calls to them.
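
A minimal sketch (hypothetical names; no such classes exist in the patch):

{code}
import java.util.List;

interface SegmentListener {
  void eventHappened(String event);
}

final class CompositeSegmentListener implements SegmentListener {
  private final List<SegmentListener> children;

  CompositeSegmentListener(List<SegmentListener> children) {
    this.children = children;
  }

  public void eventHappened(String event) {
    for (SegmentListener child : children) {
      child.eventHappened(event); // fan the event out to every child
    }
  }
}
{code}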


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2011-01-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980454#action_12980454
 ] 

Earwin Burrfoot commented on LUCENE-2793:
-

{quote}
bq. You get IOFactory from Directory
That's for the default, the main use is the static IOFactory class.
{quote}
You lost me here. If you got A from B, you don't have to pass B again to 
invoke A; if you do, that's 99% a design mistake.
But still, my point was that you don't need IOFactory at all.

bq. Right, however we're basically trying to intermix Directory's, which 
doesn't work when pointed at the same underlying File. I thought about a 
meta-Directory that routes based on the IOContext, however we'd still need a 
way to create an IndexInput and IndexOutput, from different Directory 
implementations. 
What Directories are you trying to intermix? What for?

I thought the only thing done in this issue is an attempt to give the 
Directory hints as to why we're going to open its streams.
A simple enum IOContext and an extra parameter on createOutput/Input would 
suffice. But with Lucene's micromanagement attitude, an enum turns into a 
slightly more complex thing, with bufferSizes and whatnot.
Still - no need for mixing Directories.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext

2011-01-11 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980458#action_12980458
 ] 

Earwin Burrfoot commented on LUCENE-2793:
-

In fact, I suggest dropping bufferSize altogether. As far as I can recall, it 
was introduced as a precursor to IOContext and can now be safely replaced.

Even if we want to give the user control over buffer size for all streams, or 
only those opened in a specific IOContext, they can pass those numbers as 
config parameters to their Directory impl.
That makes total sense, as:
1. IndexWriter/IndexReader couldn't care less about buffer sizes, they just 
pass them to the Directory. It's not their concern.
2. A bunch of Directories don't use said bufferSize at all, making this 
parameter not just a private Directory affair, but even further - 
implementation-specific.

So my bet is - introduce IOContext as a simple enum, change the bufferSize 
parameter on createInput/Output to IOContext, done.
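
A minimal sketch of the enum-only proposal (hypothetical signatures, not the 
committed API):

{code}
public enum IOContext { READ, MERGE, FLUSH }

// Directory factory methods would then take the context instead of a
// buffer size; each impl interprets it however it sees fit:
//   public abstract IndexInput openInput(String name, IOContext context);
//   public abstract IndexOutput createOutput(String name, IOContext context);
{code}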


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer

2011-01-10 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979522#action_12979522
 ] 

Earwin Burrfoot commented on LUCENE-2312:
-

Some questions to align myself with impending reality.

Is it right that future RT readers are no longer immutable snapshots (in the 
sense that they have a variable maxDoc)?
If so, are you keeping the current NRT mode, with fast turnaround yet 
immutable readers?

 Search on IndexWriter's RAM Buffer
 --

 Key: LUCENE-2312
 URL: https://issues.apache.org/jira/browse/LUCENE-2312
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: Realtime Branch
Reporter: Jason Rutherglen
Assignee: Michael Busch
 Fix For: Realtime Branch

 Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch


 In order to offer users near-realtime search, without incurring
 an indexing performance penalty, we can implement search on
 IndexWriter's RAM buffer. This is the buffer that is filled in
 RAM as documents are indexed. Currently the RAM buffer is
 flushed to the underlying directory (usually disk) before being
 made searchable. 
 Today's Lucene-based NRT systems must incur the cost of merging
 segments, which can slow indexing. 
 Michael Busch has good suggestions regarding how to handle deletes using max 
 doc ids.  
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923
 The area that isn't fully fleshed out is the terms dictionary,
 which needs to be sorted prior to queries executing. Currently
 IW implements a specialized hash table. Michael B has a
 suggestion here: 
 https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915




[jira] Commented: (LUCENE-2474) Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)

2011-01-10 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979888#action_12979888
 ] 

Earwin Burrfoot commented on LUCENE-2474:
-

bq. Earwin's working on improving this, I think, under LUCENE-2355
I stalled, and then there were just so many changes in trunk that I have to 
restart now :) Thanks for another kick.

 Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean 
 custom caches that use the IndexReader (getFieldCacheKey)
 

 Key: LUCENE-2474
 URL: https://issues.apache.org/jira/browse/LUCENE-2474
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shay Banon
 Attachments: LUCENE-2474.patch, LUCENE-2474.patch


 Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean 
 custom caches that use the IndexReader (getFieldCacheKey).
 A spin-off of: https://issues.apache.org/jira/browse/LUCENE-2468. Basically, it 
 makes a lot of sense to cache things based on IndexReader#getFieldCacheKey; 
 even Lucene itself uses it, for example, with the CachingWrapperFilter. 
 FieldCache enjoys being called explicitly to purge its cache when possible 
 (which is tricky to know from the outside, especially when using NRT - 
 reader attack of the clones).
 The provided patch allows to plug a CacheEvictionListener which will be 
 called when the cache should be purged for an IndexReader.




[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979276#action_12979276
 ] 

Earwin Burrfoot commented on LUCENE-2840:
-

bq. But doesn't that mean that an app w/ rare queries but each query is massive 
fails to use all available concurrency?
Yes. But that's not my case. And likely not anyone else's.

I think that if you want to be super-generic, it's better to defer the exact 
threading to the user instead of doing a one-size-fits-all solution. Else you 
risk conjuring another ConcurrentMergeScheduler.
While we're at it, we can throw in a sample implementation, which can satisfy 
some of the users, but not everyone.

 Multi-Threading in IndexSearcher (after removal of MultiSearcher and 
 ParallelMultiSearcher)
 ---

 Key: LUCENE-2840
 URL: https://issues.apache.org/jira/browse/LUCENE-2840
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: Search
Reporter: Uwe Schindler
Priority: Minor
 Fix For: 4.0


 Spin-off from parent issue:
 {quote}
 We should discuss about how many threads should be spawned. If you have an 
 index with many segments, even small ones, I think only the larger segments 
 should be separate threads, all others should be handled sequentially. So 
 maybe add a maxThreads count, then sort the IndexReaders by maxDoc and then 
 only spawn maxThreads-1 threads for the bigger readers and then one 
 additional thread for the rest?
 {quote}




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979277#action_12979277
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

And we're nearing the day when we keep the whole term dictionary in memory (as 
Sphinx does, for instance).
At that point a gazillion term-lookup-related hacks (like the lookup cache) 
become obsolete :)
The term dictionary itself can also be memory-mapped after this, instead of 
being read and built from disk, which makes opening a new segment 
near-instantaneous.

 Add variable-gap terms index impl.
 --

 Key: LUCENE-2843
 URL: https://issues.apache.org/jira/browse/LUCENE-2843
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 4.0

 Attachments: LUCENE-2843.patch, LUCENE-2843.patch


 PrefixCodedTermsReader/Writer (used by all real core codecs) already
 supports pluggable terms index impls.
 The only impl we have now is FixedGapTermsIndexReader/Writer, which
 picks every Nth (default 32) term and holds it in efficient packed
 int/byte arrays in RAM.  This is already an enormous improvement (RAM
 reduction, init time) over 3.x.
 This patch adds another impl, VariableGapTermsIndexReader/Writer,
 which lets you specify an arbitrary IndexTermSelector to pick which
 terms are indexed, and then uses an FST to hold the indexed terms.
 This is typically even more memory efficient than packed int/byte
 arrays, though it does not support ord(), so it's not quite a fair
 comparison.
 I had to relax the terms index plugin api for
 PrefixCodedTermsReader/Writer to not assume that the terms index impl
 supports ord.
 I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
 out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
 when the FST is used as a terms index but seekCeil when it's holding
 all terms in the index (ie which SimpleText uses FSTs for).




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979305#action_12979305
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

As I said, there's already a search server with a strictly in-memory (in the 
mmap sense; it can theoretically be paged out) terms dict AND widespread 
adoption. Their users somehow manage.

My guess is that's because people with an insane number of terms store various 
crap like unique timestamps as terms. With CSF (attributes in Sphinx lingo), 
and some nice filters that can work over CSF, there's no longer any need to 
stuff your timestamps in the same place you stuff your texts. That can be 
reflected in the documentation, and then, suddenly, we can drop on-disk-only 
support.




[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979306#action_12979306
 ] 

Earwin Burrfoot commented on LUCENE-2840:
-

A lot of fork-join type frameworks don't even care, even though scheduling 
threads is something people supposedly use them for.
Why? I guess that's due to a low yield/cost ratio.
You frequently quote progress, not perfection in relation to the code, so 
why don't we apply the same principle to our threading guarantees?
I don't want to use the allowed concurrency fully. That's not realistic. I want 
85% of it. That's already a huge leap ahead of single-threaded searches.





[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979346#action_12979346
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

bq. I don't like the reasoning that, just because sphinx does it and their 
'users manage', that makes it ok.
I'm in no way advocating it as an all-round better solution. It has its 
wrinkles, just like anything else.
My reasoning is merely that an alternative exists, and it is viable, as proven 
by some pretty high-profile users.
They have a memory-resident term dictionary, and it works; I've heard no 
complaints regarding this, ever.

bq. sphinx also requires mysql
Have you read anything at all? It has an integration ready for the layman user 
who just wants to stick fulltext search into their little app, but it is in no 
way reliant on it.
Sphinx is a direct alternative to Solr.

{quote}
But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good 
decisions on what gets swapped out is risky - Lucene is better informed than 
the OS on which data structures are worth spending RAM on (norms, terms index, 
field cache, del docs).
If indeed the terms dict (thanks to FSTs) becomes small enough to fit in RAM, 
then we should load it into RAM (and do away w/ the terms index).
{quote}
That's a bit delusional. If a system is forced to swap, it'll swap your 
explicitly managed RAM just as readily as memory-mapped files. I've seen this 
countless times.
But then you get a number of benefits - like sharing the filesystem cache when 
opening the same file multiple times, offloading things from the Java heap 
(which is almost always a good thing), and the fastest load-into-memory times 
possible.


Sorry if I sound offensive at times, but, damn, there's a whole world of 
simple and efficient code lying ahead in that direction :)




[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

2011-01-09 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979366#action_12979366
 ] 

Earwin Burrfoot commented on LUCENE-2843:
-

bq. Nope, havent looked at their code... i think i stopped at the documentation 
when i saw how they analyzed text!
All my points are contained within their documentation. No need to look at the 
code (it's as shady as Lucene's).
In the same manner, Lucene had crappy analysis for years, until you took hold 
of the (unicode) police baton.
So let's not allow color differences between our analyzers to affect our 
judgement of their other parts : )

bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows 
machine with whatever we default to.
I'm questioning whether there's any legal, adequate reason to have that many 
terms.
I agree on the mmap+32bit/mmap+windows point for a reasonable number of terms, 
though :/

A hybrid solution, with the term dict being loaded completely into memory 
(either via mmap, or into arrays) on a per-field basis, is probably best in 
the end, however sad that may be.




Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets

2011-01-08 Thread Earwin Burrfoot
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley yo...@lucidimagination.com wrote:
 On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / 
 Cominventjan@cominvent.com wrote:
 The problem with large start is probably worse when sharding is involved. 
 Anyone know how the shard component goes about fetching 
 start=1000000&rows=10 from say 10 shards? Does it have to merge sorted lists 
 of 1mill+10 docids from each shard, which is the worst case?

 Yep, that's how it works today.


Technically, if your docs have an unbiased (in regards to their sort value)
distribution across shards, you can fetch much less than the top N docs from
each shard.
I played with the idea, and it worked for me. Though I later dropped the
opto, as it complicated things somewhat and my users aren't querying
gazillions of docs often.
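
The sketch of the idea is roughly this (assuming a uniform distribution of
sort values across shards; perShardFetch and the safety factor are made-up
illustrations, not what Solr does):

    // How many docs to request from each of numShards shards to cover the
    // global top k (k = start + rows), if docs are spread evenly by sort value.
    static int perShardFetch(int k, int numShards, double safetyFactor) {
      int expected = (k + numShards - 1) / numShards;  // ceil(k / numShards)
      long padded = (long) Math.ceil(expected * safetyFactor);
      return (int) Math.min((long) k, padded);         // never worse than today
    }

If the merge exhausts some shard's list, you re-fetch from that shard with the
full k, so correctness is preserved; in the common case each shard returns
only a small fraction of k.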





[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)

2010-12-30 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12976027#action_12976027
 ] 

Earwin Burrfoot commented on LUCENE-2840:
-

I use the following scheme:
* There is a fixed pool of threads shared by all searches, which limits total 
concurrency.
* Each new search grabs at most a fixed number of threads from this pool 
(say, 2-3 out of 8 in my setup),
* and these threads churn through segments as through a queue (in maxDoc 
order, but I think even that is unnecessary).

No special smart binding between threads and segments (e.g. one thread for each 
biggie, one thread for all of the small ones) -
this means simpler code, and zero possibility of stalling when there are 
threads to run and segments to search but the binding policy does not connect 
them.
Using fewer threads per search than the total available is a precaution 
against biggie searches blocking fast ones.
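
In code the scheme is roughly this (a sketch with invented names, not from any 
patch; segmentTasks holds one Runnable per segment):

{code}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PooledSearch {
  // shared by all searches - caps total concurrency
  private final ExecutorService pool = Executors.newFixedThreadPool(8);
  private static final int THREADS_PER_SEARCH = 3;

  void search(final ConcurrentLinkedQueue<Runnable> segmentTasks)
      throws InterruptedException {
    final CountDownLatch done = new CountDownLatch(THREADS_PER_SEARCH);
    for (int i = 0; i < THREADS_PER_SEARCH; i++) {
      pool.execute(new Runnable() {
        public void run() {
          // workers churn through segments as through a queue:
          // no thread<->segment binding, hence no stalls
          for (Runnable task; (task = segmentTasks.poll()) != null; ) {
            task.run();
          }
          done.countDown();
        }
      });
    }
    done.await(); // all segments searched; merge per-segment results here
  }
}
{code}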




Re: strange problem of PForDelta decoder

2010-12-30 Thread Earwin Burrfoot
until we fix Lucene to run a single search concurrently (which we
badly need to do).
 I am interested in this idea (I have posted about it before). Do you have some
 resources, such as papers or tech articles, about it?
 I have tried it, but it needs the index format to be modified dramatically,
 and we use solr distributed search to relieve the problem of response time,
 so I finally gave it up.
 Lucene 4's index format is more flexible in that it supports custom codecs,
 and since it's now under development, I think it's a good time to consider
 letting it support multithreaded searching for a single query.
 I have a naive solution: dividing the docList into many groups,
 e.g. grouping docIds by whether they're even or odd
 term1 df1=4  docList =  0  4  8  10
 term1 df2=4  docList = 1  3  9  11

 term2 df1=4  docList = 0  6  8  12
 term2 df2=4  docList = 3  9  11 15
   then we can use 2 threads to search the topN docs on the even group and the odd group,
 and finally merge their results into a single one, just like solr
 distributed search.
 But it's better than solr distributed search.
   First, it's in a single process, and data communication between
 threads is much
 faster than the network.
   Second, each thread processes the same number of documents. For solr
 distributed
 search, one shard may process 7 documents and another shard may process 1 document.
 Even if we can make each shard have the same number of documents, we cannot
 make it uniform for each term.
    e.g. shard1 has doc1 doc2
           shard2 has doc3 doc4
    but term1 may only occur in doc1 and doc2
    while term2 may only occur in doc3 and doc4
    we may modify it
           shard1 doc1 doc3
           shard2 doc2 doc4
    it's good for term1 and term2
    but term3 may occur in doc1 and doc3...
    So I think it's fine-grained distribution in the index, while solr
 distributed search is coarse-
 grained.
This is just crazy :)

The simple way is just to search different segments in parallel.
BalancedSegmentMergePolicy makes sure you have roughly even-sized
large segments (and small ones don't count, they're small!).
If you're bent on squeezing out that extra millisecond (and making
your life miserable along the way), you can search a single segment
with multiple threads (by dividing it into even chunks, and then doing
skipTo to position your iterators at the beginning of each chunk).

First approach is really easy to implement. Second one is harder, but
still doesn't require you to cook the number of CPU cores available
into your index!
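
The per-chunk loop for the second approach looks roughly like this (a sketch;
each thread needs its own iterator clone, and advance() is skipTo() in older
versions):

    // Each thread takes an even slice [chunkStart, chunkEnd) of the segment's
    // docID space, positions its iterator at the slice start, and stops at the
    // boundary. NO_MORE_DOCS is Integer.MAX_VALUE, so an exhausted iterator
    // also exits the loop.
    void searchChunk(DocIdSetIterator it, int chunkStart, int chunkEnd,
                     Collector collector) throws IOException {
      int doc = it.advance(chunkStart);
      while (doc < chunkEnd) {
        collector.collect(doc);
        doc = it.nextDoc();
      }
    }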

It's the law of diminishing returns at play here. You're most likely
to search in parallel over a mostly memory-resident index
(RAMDir/mmap/filesys cache - doesn't matter), as most IO subsystems
tend to slow down considerably on parallel sequential reads, so you
already have pretty decent speed.
Searching different segments in parallel (with BSMP) makes you several
times faster.
Searching in parallel within a segment requires some weird hacks, but
has maybe a few percent advantage over the previous solution.
Sharding posting lists requires a great deal of weird hacks, makes the
index machine-bound, and boosts speed by another couple of percent.
Sounds worthless.




Re: is the classes ended with PerThread(*PerThread) multithread

2010-12-28 Thread Earwin Burrfoot
There is a single indexchain, with a single instance of each chain
component, except those ending in -PerThread.

Though that's gonna change with
https://issues.apache.org/jira/browse/LUCENE-2324
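
Stripped down, the pattern looks like this (simplified, hypothetical names;
the real chain wires it through DocumentsWriter.getThreadState):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Not a Thread subclass - just private state handed out per caller thread.
    class DocConsumer {
      private final Map<Thread, DocConsumerPerThread> states =
          new ConcurrentHashMap<Thread, DocConsumerPerThread>();

      DocConsumerPerThread perThread() { // called from the indexing thread
        Thread t = Thread.currentThread();
        DocConsumerPerThread state = states.get(t);
        if (state == null) {
          state = new DocConsumerPerThread();
          states.put(t, state); // safe: only t ever inserts for key t
        }
        return state;
      }
    }

    class DocConsumerPerThread {
      // consumes documents using only this thread's private state
    }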

On Tue, Dec 28, 2010 at 13:10, Simon Willnauer
simon.willna...@googlemail.com wrote:
 On Tue, Dec 28, 2010 at 10:57 AM, xu cheng xcheng@gmail.com wrote:
 hi simon
 thanks for replying very much.
 after reading the source code with your suggestion, here's my understanding,
 and I don't know whether it's right:
 the DocumentsWriter actually doesn't create threads, but the code that uses
 DocumentsWriter can do the multithreading (say, several threads call
 updateDocument), and each thread has its DocumentsWriterThreadState; in the
 meanwhile, each DocumentsWriterThreadState has its own objects (the
 *PerThread such as DocFieldProcessorPerThread, DocInverterPerThread and so
 on).
 as the methods of DocumentsWriter are called by multiple threads, for
 example 4 threads, there are 4 DocumentsWriterThreadState objects and 4
 index chains (each index chain has its own *PerThread objects to
 process the document).
 am I right??

 that sounds about right

 simon
 thanks for replying again!


 2010/12/28 Simon Willnauer simon.willna...@googlemail.com

 Hey there,

 so what you are looking at are classes that are created per thread
 rather than shared with other threads. Lucene internally rarely
 creates threads or subclasses Thread, Runnable or Callable
 (ParallelMultiSearcher is an exception, as is some of the merging code).
 Yet, inside the indexer, when you add (update) a document Lucene
 utilizes the caller's thread rather than spawning a new one. When you
 look at DocumentsWriter.java there should be a method called
 getThreadState. Each indexing thread, let's say in updateDocument, gets
 its thread-private DocumentsWriterThreadState. This thread state holds
 a DocConsumerPerThread obtained from the DocumentsWriter's DocConsumer
 (see the indexing chain). DocConsumerPerThread in that case is some
 kind of decorator that holds other DocConsumerPerThread instances like
 TermsHashPerThread etc.

 The general pattern is for each DocConsumer you can get a
 DocConsumerPerThread for your indexing thread which then consumes the
 document you are processing right now.

 I hope that helps

 simon


 On Tue, Dec 28, 2010 at 4:19 AM, xu cheng xcheng@gmail.com wrote:
  hi all:
  I'm new to dev.
  these days I've been reading the source code in the index package,
  and I was confused.
  there are classes with the suffix PerThread, such as
  DocFieldProcessorPerThread,
  DocInverterPerThread, TermsHashPerThread, FreqProxTermWriterPerThread.
  in this mailing-list, I was told that they are multithreaded.
  however, there are some difficulties for me in understanding this!
  I see no sign that they inherit from Thread, implement
  Runnable, or anything else??
  how do they map to OS threads??
  thanks ^_^













[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974274#action_12974274
 ] 

Earwin Burrfoot commented on LUCENE-2829:
-

Term lookup misses can be alleviated by a simple Bloom Filter.
No caching of misses required, and it helps both PK and near-PK queries.
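
Something along these lines (a sketch; BloomFilter here is a hypothetical 
class, not existing Lucene code):

{code}
// Consult a per-segment Bloom filter before paying for a term-dict seek.
// A negative answer is exact, so segments that can't contain the key are
// skipped entirely; false positives merely degrade to today's wasted seek.
TermDocs lookup(IndexReader reader, Term term, BloomFilter bloom)
    throws IOException {
  if (!bloom.mayContain(term.text())) {
    return null;                  // definite miss - no seek, no IO
  }
  return reader.termDocs(term);   // possible hit (or a rare false positive)
}
{code}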

 improve termquery pk lookup performance
 -

 Key: LUCENE-2829
 URL: https://issues.apache.org/jira/browse/LUCENE-2829
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Robert Muir
 Attachments: LUCENE-2829.patch


 For things that are like primary keys and don't exist in some segments (worst 
 case is primary/unique key that only exists in 1)
 we do wasted seeks.
 While LUCENE-2694 tries to solve some of this issue with TermState, I'm 
 concerned whether we could ever backport that to 3.1, for example.
 This is a simpler solution here just to solve this one problem in 
 termquery... we could just revert it in trunk when we resolve LUCENE-2694,
 but I don't think we should leave things as they are in 3.x




[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance

2010-12-22 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974350#action_12974350
 ] 

Earwin Burrfoot commented on LUCENE-2829:
-

Nobody halts your progress, we're merely discussing.

I, on the other hand, have a feeling that Lucene is overflowing with single 
incremental improvements aka hacks, as they are easier and faster to 
implement than trying to get the bigger picture and, yes, rebuilding 
everything :)
For example, better term dict code will make this issue (somewhat hackish, 
admit it?) irrelevant - whether we implement bloom filters, or just guarantee 
to keep the whole term dict in memory with a reasonable lookup routine (e.g. 
as an FST).

Having said that, I reiterate, I'm not here to stop you or turn this issue into 
something else.




Re: RT branch status

2010-12-22 Thread Earwin Burrfoot
Cool! I'm getting to this on a weekend.

On Tue, Dec 21, 2010 at 11:44, Michael Busch busch...@gmail.com wrote:
 After merging trunk into the RT branch it's finally compiling again and
 up-to-date.

 Several tests are failing now after the merge (43 out of 1427 are failing),
 which is not too surprising, because so many things have changed
 (segment-deletes, flush control, termsHash refactoring, removal of doc
 stores, etc).

 Especially IndexWriter and DocumentsWriter are in a somewhat messy state,
 but I wanted to share my current state, so I committed the merge.  I'll try
 this week to understand the new changes (especially deletes) and make them
 work with the DWPT.  The following areas need work:
  * deletes
  * thread-safety
  * error handling and aborting
  * flush-by-ram (LUCENE-2573)

 Also, some tests deadlock.  Not surprising either, 'cause flushcontrol etc.
 introduce new synchronized blocks.

 Before the merge all tests were passing, except the ones testing
 flush-by-ram functionality.  I'll keep working on getting the branch back
 into that state again soon.

 Help is definitely welcome!  I'd love to get this branch ready so that we
 can merge it into trunk as soon as possible.  As Mike's experiments show,
 having DWPTs will not only be beneficial for RT search, but also increase
 indexing performance in general.

  Michael

 PS: Thanks for the patience!









Re: Do we want 'nocommit' to fail the commit?

2010-12-18 Thread Earwin Burrfoot
But. Er. What if we happen to have nocommit in a string, or in some
docs, or as a variable name?

On Sat, Dec 18, 2010 at 12:47, Michael McCandless
luc...@mikemccandless.com wrote:
 +1 this would be great :)

 Mike

 On Fri, Dec 17, 2010 at 10:45 PM, Shai Erera ser...@gmail.com wrote:
 Hi
 Out of curiosity, I checked whether we can have a nocommit comment in the code
 fail the commit. As far as I can see, we try to avoid accidental commits (of say
 debug messages) by putting in a nocommit comment, but I don't know if svn ci
 would fail in the presence of such a comment - I guess not, because we've seen
 some accidental nocommits checked in in the past.
 So I Googled around and found that if we have control of the svn repo, we
 can add a pre-commit hook that will check and fail the commit. Here is a
 nice article that explains how to add pre-commit hooks in general
 (http://wordaligned.org/articles/a-subversion-pre-commit-hook). I didn't try
 it yet (on our local svn instance), so I cannot say how well it works, but
 perhaps someone has experience with it ...
 So if this is interesting, and is doable for Lucene (say, open a JIRA issue
 for Infra?), I don't mind investigating it further and writing the script
 (which can be as simple as 'grep the changed files and fail on the presence
 of the nocommit string').
 Shai









[jira] Commented: (LUCENE-2818) abort() method for IndexOutput

2010-12-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972764#action_12972764
 ] 

Earwin Burrfoot commented on LUCENE-2818:
-

bq. Can abort() have a default impl in IndexOutput, such as close() followed by 
deleteFile() maybe? If so, then it won't break anything.
It can't. To call deleteFile you need both a reference to the papa-Directory 
and the name of the file this IO writes to. The abstract IO class has neither. 
If we add them, they have to be passed to a new constructor, and that's an API 
break ;)

bq. Would abort() on Directory fit better? E.g., it can abort all currently 
open and modified files, instead of the caller calling abort() on each 
IndexOutput? Are you thinking of a case where a write failed, and the caller 
would call abort() immediately, instead of some higher-level code? If so, would 
rollback() be a better name?
Oh, no, no. No way. I don't want to push someone else's responsibility onto 
Directory. This abort() is merely a shortcut.

Let's go with a usage example:
Here's FieldsWriter.java with LUCENE-2814 applied (skipping irrelevant parts) - 
https://gist.github.com/746358
Now, the same, with abort() - https://gist.github.com/746367
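
The gist of the difference, roughly (illustrative shape only, not the actual 
gists' contents):

{code}
// without abort(): the writer lugs around directory + fileName
try { fieldsStream.close(); } catch (IOException ignored) {}
try { directory.deleteFile(fieldsName); } catch (IOException ignored) {}

// with abort(): one self-contained call, no extra references needed
fieldsStream.abort(); // silently closes and deletes its own file
{code}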

 abort() method for IndexOutput
 --

 Key: LUCENE-2818
 URL: https://issues.apache.org/jira/browse/LUCENE-2818
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Earwin Burrfoot

 I'd like to see an abort() method on IndexOutput that silently (no exceptions) 
 closes the IO and then does a silent papaDir.deleteFile(this.fileName()).
 This will simplify a bunch of error recovery code for IndexWriter and 
 friends, but constitutes an API backcompat break.
 What do you think?




[jira] Commented: (LUCENE-2818) abort() method for IndexOutput

2010-12-18 Thread Earwin Burrfoot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972765#action_12972765
 ] 

Earwin Burrfoot commented on LUCENE-2818:
-

bq. I think we can make a default impl that simply closes & suppresses 
exceptions? (We can't .deleteFile since an abstract IO doesn't know its Dir). 
Our concrete impls can override w/ versions that do delete the file...
I don't think we need a default impl? For some directory impls close() is a 
noop. And, what is more important, having an abstract method forces you to 
implement it - you can't forget it - so we're not going to see broken 
directories that don't do abort() properly.




[jira] Updated: (LUCENE-2818) abort() method for IndexOutput

2010-12-18 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-2818:


Priority: Minor  (was: Major)

This change is really minor but, I think, convenient.

You don't have to lug a reference to the Directory along, and recalculate the 
file name, if the only thing you want to say is that the write was a failure 
and you no longer need this file.




[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments

2010-12-18 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-2814:


Attachment: LUCENE-2814.patch

Synced to trunk.

bq. Also, on the nocommit on exc in DW.addDocument, yes I think that 
(IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the 
store files on abort?
Orphaned files are deleted directly in StoredFieldsWriter.abort() and 
TermVectorsTermsWriter.abort(). As I said, all the open-files tracking is now 
gone.
Turns out checkpoint() is also no longer needed.

I have no other lingering cleanup urges; this is ready to be committed. I 
think.

 stop writing shared doc stores across segments
 --

 Key: LUCENE-2814
 URL: https://issues.apache.org/jira/browse/LUCENE-2814
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: 3.1, 4.0
Reporter: Michael McCandless
Assignee: Michael McCandless
 Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, 
 LUCENE-2814.patch, LUCENE-2814.patch


 Shared doc stores enables the files for stored fields and term vectors to be 
 shared across multiple segments.  We've had this optimization since 2.1 I 
 think.
 It works best against a new index, where you open an IW, add lots of docs, 
 and then close it.  In that case all of the written segments will reference 
 slices of a single shared doc store segment.
 This was a good optimization because it means we never need to merge these 
 files.  But, when you open another IW on that index, it writes a new set of 
 doc stores, and then whenever merges take place across doc stores, they must 
 now be merged.
 However, since we switched to shared doc stores, there have been two 
 optimizations for merging the stores.  First, we now bulk-copy the bytes in 
 these files if the field name/number assignment is congruent.  Second, we 
 now force congruent field name/number mapping in IndexWriter.  This means 
 this optimization is much less potent than it used to be.
 Furthermore, the optimization adds *a lot* of hair to 
 IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over 
 time, and causes odd behavior like a merge possibly forcing a flush when it 
 starts.  Finally, with DWPT (LUCENE-2324), which gets us truly concurrent 
 flushing, we can no longer share doc stores.
 So, I think we should turn off the write-side of shared doc stores to pave 
 the path for DWPT to land on trunk and simplify IW/DW.  We still must support 
 reading them (until 5.0), but the read side is far less hairy.




[jira] Created: (LUCENE-2818) abort() method for IndexOutput

2010-12-17 Thread Earwin Burrfoot (JIRA)
abort() method for IndexOutput
--

 Key: LUCENE-2818
 URL: https://issues.apache.org/jira/browse/LUCENE-2818
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Earwin Burrfoot


I'd like to see an abort() method on IndexOutput that silently (no exceptions) 
closes the IO and then does a silent papaDir.deleteFile(this.fileName()).
This will simplify a bunch of error recovery code for IndexWriter and friends, 
but constitutes an API backcompat break.

What do you think?




[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments

2010-12-17 Thread Earwin Burrfoot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earwin Burrfoot updated LUCENE-2814:


Attachment: LUCENE-2814.patch

New patch. Now with even more lines removed!

DocStore-related index chain components used to track open/closed files 
through DocumentsWriter.
The closed-files list was unused, and is silently gone.
The open-files list was used to:
* prevent not-yet-flushed shared docstores from being deleted by 
IndexFileDeleter.
** no shared docstores, no need + IFD no longer requires a reference to DW
* delete already-opened docstore files when aborting.
** the index chain now handles this on its own + has cleaner error-handling 
code.




Re: LogMergePolicy.setUseCompoundFile/DocStore

2010-12-16 Thread Earwin Burrfoot
Incoming LUCENE-2814 drops setUseCompoundDocStore()

On Thu, Dec 16, 2010 at 12:04, Shai Erera ser...@gmail.com wrote:
 Hi

 I find it very annoying that I need to set true/false on these methods
 whenever I want to control compound file creation. Is it really necessary
 to allow writing doc stores in non-compound files vs. the other index files
 in a compound file? Does somebody know if this feature is used anywhere?

 If it's crucial to keep the two methods, then how about introducing a
 setCompoundMode(true/false) to turn on/off both at once? IndexWriter used to
 have it, before we switched to IndexWriterConfig and I think it was very
 useful.

 Shai






