Re: FST and FieldCache?
You cannot get a string out of an automaton by its ordinal without storing additional data. The string is stored there not as a single arc, but as a sequence of them (basically... err... as a string), so referencing them is basically writing the string as-is. Space savings here come from sharing arcs between strings. Though, it's possible to do if you associate an additional number with each node. (I invented some way, shared it with Mike and forgot... good grief :/)

Perfect hashing, on the other hand, is like a Map<String, Integer> that accepts a predefined set of N strings and returns an int in the 0..N-1 interval. And it can't do the reverse lookup, by design; that's a lossy compression for all good perfect hashing algos. So, it's irrelevant here, huh?

On Thu, May 19, 2011 at 08:53, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:
> I've been pondering how to reduce the size of FieldCache entries when there are a large number of Strings. I'd like to facet on such a field with Solr, but with less memory. As I understand it, FSTs are a highly compressed representation of a set of Strings (among other possibilities). The FieldCache would need to point to an FST entry (an arc?) using something small, say an integer. Is there a way to point to an FST entry with an integer, and then somehow, with relative efficiency, construct the String from the arcs to get there?
> ~ David Smiley
> Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: FST and FieldCache?
I think, if we add ord as an output to the FST, then it builds everything we need? Ie no further data structures should be needed? Maybe I'm confused :)

If you put the ord as an output, the common part will be shifted towards the front of the tree. This will work if you want to look up a given value assigned to some string, but will not work if you need to look up the string from its value. The latter case can be solved if you know which branch to take while descending from the root, and the shared prefix alone won't give you this information. At least I don't see how it could.

I am familiar with the basic prefix hashing procedure suggested by Daciuk (and other authors), but maybe some progress has been made there, I don't know... The one I know is really conceptually simple -- since each arc encodes the number of leaves (or input sequences) in the automaton, you know which path must lead you to your string. For example, if you have a node like this and seek the 12th term:

0 -- 10 -- ...
  +- 10 -- ...
  +-  5 -- ...

you look at the first path: it'd give you terms 1..10. The next one contains terms 11..20, so you add 10 to an internal counter (which is added to further computations), descend, and repeat the procedure until you find a leaf node.

Dawid

There's a possible speedup here. If, instead of storing the count of all downstream leaves, you store the sum of counts for all previous siblings, you can do a binary lookup instead of a linear scan on each node. Taking your example:

0 --  0 -- ...
  +- 10 -- ...   <-- for the 12th term we descend along this edge, as it has the biggest tag less than 12
  +- 15 -- ...

That's what I invented, and yes, it was invented by countless people before :)

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
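A minimal, self-contained sketch of the lookup Dawid walks through above. The Node structure and per-arc leaf counts are illustrative inventions for this example, not Lucene's actual FST representation:

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical trie node: each child arc carries a label and the number of
// terms (leaves) reachable through it.
class Node {
  final boolean isTerm;                                  // does a term end here?
  final List<Character> labels = new ArrayList<Character>();
  final List<Node> children = new ArrayList<Node>();
  final List<Integer> leafCounts = new ArrayList<Integer>(); // terms under each child
  Node(boolean isTerm) { this.isTerm = isTerm; }
}

class OrdLookup {
  /** Returns the ord-th term (0-based) by descending the count-annotated trie. */
  static String termByOrd(Node root, int ord) {
    StringBuilder sb = new StringBuilder();
    Node node = root;
    while (true) {
      if (node.isTerm) {
        if (ord == 0) return sb.toString();              // the term ending at this node
        ord--;                                           // skip it and keep descending
      }
      int i = 0;
      while (ord >= node.leafCounts.get(i)) {            // linear scan over sibling arcs
        ord -= node.leafCounts.get(i);                   // all their terms precede ours
        i++;
      }
      sb.append(node.labels.get(i));
      node = node.children.get(i);
    }
  }
}
{code}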
Re: FST and FieldCache?
On Thu, May 19, 2011 at 16:45, Dawid Weiss dawid.we...@cs.put.poznan.pl wrote:
> > That's what I invented, and yes, it was invented by countless people before :)
>
> You know I didn't mean to sound rude, right? I'm really admiring your ability to come up with these solutions by yourself; I'm merely copying other folks' ideas. I tried to prevent another reference to Mr. Daciuk :)
>
> Anyway, the optimization you're describing is certainly possible. Lucene's FST implementation can actually combine both approaches, because always expanding nodes is inefficient, and those already expanded will allow a binary search (assuming the automaton structure is known to the implementation). Another refinement of this idea creates a detached table (err... index :) of states to start from inside the automaton, so that you don't have to go through the initial 2-3 states, which are more or less always large, and even binary search is costly there.
>
> Dawid

But you have to look up this err... index somehow. And that's either a binary or hash lookup. Where's the win?

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
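A sketch of the prefix-sum refinement described above: each node stores, per child, the number of terms under all previous siblings, so choosing the child for a given ord becomes a binary search. The structures are again hypothetical, not Lucene's FST arrays:

{code}
import java.util.Arrays;

// offsets[i] = number of terms under children 0..i-1 (so offsets[0] == 0);
// the child to descend into for a given ord is the last offset <= ord.
class PrefixSumNode {
  int[] offsets;               // sorted cumulative counts
  PrefixSumNode[] children;
  char[] labels;

  int childFor(int ord) {
    int idx = Arrays.binarySearch(offsets, ord);
    return idx >= 0 ? idx : -idx - 2;   // insertion point minus one
  }
}
{code}

The caller then subtracts offsets[childFor(ord)] from ord and repeats one level down, exactly as in the linear-scan version.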
Re: FST and FieldCache?
This is more about compressing strings in the TermsIndex, I think. And the ability to use said TermsIndex directly in some cases that required FieldCache before. (Maybe FC is still needed, but it can be degraded to a docId-ord map, storing the actual strings in the TI.) This yields fat space savings when we, eg, need to both look up on a field and build facets out of it.

mmap is cool :) What I want to see is an FST-based TermsDict that is simply mmaped into memory, without building intermediate indexes, like Lucene does now. And docvalues are orthogonal to that, no?

On Thu, May 19, 2011 at 17:22, Jason Rutherglen jason.rutherg...@gmail.com wrote:
> > maybe thats because we have one huge monolithic implementation
>
> Doesn't the DocValues branch solve this? Also, instead of trying to implement clever ways of compressing strings in the field cache, which probably won't bear fruit, I'd prefer to look at [eventually] MMap'ing (using DV) the field caches to avoid the loading and heap costs, which are significant. I'm not sure if we can easily MMap packed ints and the shared byte[], though it seems fairly doable?
>
> On Thu, May 19, 2011 at 6:05 AM, Robert Muir rcm...@gmail.com wrote:
> > 2011/5/19 Michael McCandless luc...@mikemccandless.com:
> > > Of course, for certain apps that perf hit is justified, so probably we should make this an option when populating field cache (ie, in-memory storage option of using an FST vs using packed ints/byte[]).
> >
> > or should we actually try to have different fieldcacheimpls? I see all these missions to refactor the thing, which always fail. maybe thats because we have one huge monolithic implementation.

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
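For reference, memory-mapping a file from Java is a couple of NIO calls. This generic sketch uses a made-up file name and is not how MMapDirectory or any real terms dictionary is structured:

{code}
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
  public static void main(String[] args) throws Exception {
    // Map a hypothetical on-disk terms dictionary read-only. Pages are faulted
    // in lazily by the OS and evicted under memory pressure, which is the
    // "unloading happens automatically" behavior discussed in this thread.
    try (RandomAccessFile raf = new RandomAccessFile("terms.fst", "r");
         FileChannel ch = raf.getChannel()) {
      MappedByteBuffer terms = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      System.out.println("mapped " + terms.capacity() + " bytes");
    }
  }
}
{code}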
Re: FST and FieldCache?
On Thu, May 19, 2011 at 20:43, Michael McCandless luc...@mikemccandless.com wrote: On Thu, May 19, 2011 at 12:35 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: And I do agree there are times when mmap is appropriate, eg if query latency is unimportant to you, but it's not a panacea and it comes with serious downsides Do we have a benchmark of ByteBuffer vs. byte[]'s in RAM? I don't know of a straight up comparison... I did compare MMapDir vs RAMDir variant a couple of years ago. Searches slowed down a teeny-weeny little bit. GC times went down noticeably. For me it was a big win. Whatever Mike might say, mmap is great for latency-conscious applications : ) If someone tries to create artificial benchmark for byte[] VS ByteBuffer, I'd recommend going through Lucene's abstraction layer. If you simply read/write in a loop, JIT will optimize away boundary checks for byte[] in some cases. This didn't ever happen to *Buffer family for me. There's also RAM based SSDs whose performance could be comparable with well, RAM. True, though it's through layers of abstraction designed originally for serving files off of spinning magnets :) Also, with our heap based field caches, the first sorted search requires that they be loaded into RAM. Then we don't unload them until the reader is closed? With MMap the unloading would happen automatically? True, but really if the app knows it won't need that FC entry for a long time (ie, long enough to make it worth unloading/reloading) then it should really unload it. MMap would still have to write all those pages to disk... DocValues actually makes this a lot cheaper because loading DocValues is much (like ~100 X from Simon's testing) faster than populating FieldCache since FieldCache must do all the uninverting. Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
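The JIT remark is easy to probe with a rough micro-benchmark along these lines. This is illustrative only, not a rigorous benchmark (a proper harness with warm-up and multiple iterations is needed for real numbers), and it deliberately bypasses Lucene's abstraction layer:

{code}
import java.nio.ByteBuffer;

public class SumBench {
  // Summing through a plain array: HotSpot can typically hoist the bounds
  // check out of a simple counted loop like this.
  static long sumArray(byte[] data) {
    long sum = 0;
    for (int i = 0; i < data.length; i++) sum += data[i];
    return sum;
  }

  // Same loop through a ByteBuffer: every get(i) goes through the Buffer API,
  // which has historically been harder for the JIT to optimize as aggressively.
  static long sumBuffer(ByteBuffer data) {
    long sum = 0;
    for (int i = 0; i < data.limit(); i++) sum += data.get(i);
    return sum;
  }

  public static void main(String[] args) {
    byte[] raw = new byte[1 << 20];
    ByteBuffer wrapped = ByteBuffer.wrap(raw);
    for (int i = 0; i < 100; i++) { sumArray(raw); sumBuffer(wrapped); }  // warm-up
    long t0 = System.nanoTime(); sumArray(raw);
    long t1 = System.nanoTime(); sumBuffer(wrapped);
    long t2 = System.nanoTime();
    System.out.println("array: " + (t1 - t0) + " ns, buffer: " + (t2 - t1) + " ns");
  }
}
{code}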
Re: Moving towards Lucene 4.0
On Thu, May 19, 2011 at 21:44, Chris Hostetter hossman_luc...@fucit.org wrote:
> : I think we should focus on everything that's *infrastructure* in 4.0, so
> : that we can develop additional features in subsequent 4.x releases. If we
> : end up releasing 4.0 just to discover many things will need to wait to 5.0,
> : it'll be a big loss.
>
> the catch with that approach (i'm speaking generally here, not with any of these particular lucene examples in mind) is that it's hard to know that the infrastructure really makes sense until you've built a bunch of stuff on it -- i think Josh Bloch has a paper where he says that you shouldn't publish an API abstraction until you've built at least 3 *real* (ie: not just toy or example) implementations of that API. it would be really easy to say "the infrastructure for X, Y, and Z is all in 4.0, features that leverage this infra will start coming in 4.1" and then discover on the way to 4.1 that we botched the APIs.

How do I express my profound love for these words, while remaining chaste? : )

> what does this mean concretely for the specific big ticket changes that we've got on trunk? ... i dunno, just my word of caution.
>
> : we just started the discussion about Lucene 3.2 and releasing more
> : often. Yet, I think we should also start planning for Lucene 4.0 soon.
> : We have tons of stuff in trunk that people want to have and we can't
> : just keep on talking about it - we need to push this out to our users.
>
> I agree, but i think the other approach we should take is to be more aggressive about reviewing things that would be good candidates for backporting. If we feel like some feature has a well defined API on trunk, and it's got good tests, and people have been using it and filing bugs and helping to make it better, then we should consider it a candidate for backporting -- if the merge itself looks like it would be a huge pain in the ass we don't *have* to backport, but we should at least look. That may not help for any of the big ticket infra changes discussed in this thread (where we know it really needs to wait for a major release) but it would definitely help with the "get features out to users faster" issue.
>
> -Hoss

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Fuzzy search always returning docs sorted by the highest match
You aren't likely to encounter strings like abc company inc in Lucene index, as it will be tokenized into three tokens abc, company, inc under most Analyzers. So, for this exact example you don't even need fuzzy matching. Also, maybe you should try 'user' mailing list for questions regarding the use of Lucene. On Wed, May 18, 2011 at 00:54, Guilherme Aiolfi grad...@gmail.com wrote: I'm re-sending my first message because I've just received the mailing-list confirmation. If it's a duplicated, forget about this one. Hi, I want to do a fuzzy search and always return documents no matter what the score. So, to do this, I'm tried sorting by strdist() in solr 3.1. It worked great and does ALMOST exactly what I wanted. The problem is that the algorithms supported jw, ngram and edit are not the best fit for my scenario. The best results come from StrikeAMatch (http://www.devarticles.com/c/a/Development-Cycles/How-to-Strike-a-Match/). So, I've found this link https://issues.apache.org/jira/browse/LUCENE-2230 that implemented what I wanted. But I was told that I should use trunk because there were some really great news about fuzzy search there. I read this article explaining some changes http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html. But I still don't think it replaces the StrikeAMatch algo, because that one can have best results in searches like abc comparing to strings like abc company inc (distance 2). But still, Fuad Efendi told me that StrikeAMatch is toys for kids compare to the state of lucene trunk. So here I'm, I want to know how 4.0 will help achieve what I want. Thanks. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
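To illustrate the tokenization point, a minimal sketch assuming a 3.x-era StandardAnalyzer (exact class and method names vary slightly across Lucene versions):

{code}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    TokenStream ts = analyzer.tokenStream("name", new StringReader("ABC Company Inc."));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());   // prints: abc, company, inc
    }
    ts.end();
    ts.close();
  }
}
{code}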
Re: Lucene/Solr JIRA
+1 to Chris. Even if the code is partially shared and the project is the same, the end products are completely different. Merging lists/jira will force niche developers/users to manually sift through heaps of irrelevant emails/issues.

On Thu, May 19, 2011 at 00:53, Chris Hostetter hossman_luc...@fucit.org wrote:

: just a few words. I disagree here with you hoss IMO the suggestion to
: merge JIRA would help to move us closer together and help close the
: gap between Solr and Lucene. I think we need to start identifying us
: with what we work on. It feels like we don't do that today and we
: should work hard to stop that and make hard breaks that might hurt but

I just don't see how you think that would help anything ... we still need to distinguish Jira issues to identify where in the stack they affect. If there is a divide among the developers because of the niches where they tend to work, will that divide magically go away because we partition all issues using the component feature instead of by the Jira project feature? I don't really see how that makes any sense. Even if we all thought it did, and even if the cost/effort of migrating/converting were totally free, the user bases (who interact with the Solr APIs vs directly using the Lucene-Core/Module APIs) are so distinct that I genuinely think sticking with distinct Jira Projects makes more sense for our users.

: JIRA. I'd go even further and nuke the name entirely and call
: everything lucene - I know not many folks like the idea and it might
: take a while to bake in but I think for us (PMC / Committers) and the

Everything already is called Lucene ... the Project is Apache Lucene, the community is Lucene ... the Lucene project currently releases several products, and one of them is called Apache Solr ... if you're suggesting that we should ultimately eliminate the name Solr, then we'd still have to decide what we're going to call that end product, the artifact that we ship that provides the abstraction layer that Solr currently provides. Even if you mean to suggest that we should only have one unified product -- one singular release artifact -- that abstraction layer still needs a name. The name we have now is Solr; it has brand awareness and a user base who understands what it means to say they are "Installing Solr" or that a new feature is available when "Using Solr". Eliminating that name doesn't seem like it would benefit the user community in any way.

-Hoss

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Fuzzy search always returning docs sorted by the highest match
I'm baffled. As probably are you. If all you want is a fuzzy match against a list of strings, Lucene is a huge, fat overkill, and you need to look elsewhere.

2011/5/19 Guilherme Aiolfi grad...@gmail.com:
> Well, it was about the implementation of an algorithm that was proposed by a user and was implemented in another way. And this, and not the user mailing list, was recommended by this developer to ask this question. So, not entirely my fault. But I apologize for the inconvenience.
>
> I just want to clarify that searching for the tokens separately is not what I want, since those words can exist but not all in the same doc. I want to compare the whole phrase. For that to work I'm not using any Analyzer. As I said, I've got it working, but I don't know how to use the right algorithm for the job. I'm going to redirect my question to the other mailing list. Thanks anyway.
>
> On Wed, May 18, 2011 at 6:32 PM, Earwin Burrfoot ear...@gmail.com wrote:
> > [...]

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
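For completeness: the StrikeAMatch measure referenced in this thread is essentially a Dice coefficient over adjacent letter pairs. A compact standalone sketch, independent of Lucene, looks roughly like this:

{code}
import java.util.ArrayList;
import java.util.List;

public class StrikeAMatch {
  // Letter pairs of each whitespace-separated word, e.g. "abc co" -> [ab, bc, co]
  static List<String> letterPairs(String s) {
    List<String> pairs = new ArrayList<String>();
    for (String word : s.toLowerCase().split("\\s+")) {
      for (int i = 0; i < word.length() - 1; i++) {
        pairs.add(word.substring(i, i + 2));
      }
    }
    return pairs;
  }

  // Dice coefficient: 2 * |shared pairs| / (|pairs1| + |pairs2|), in [0, 1].
  static double similarity(String a, String b) {
    List<String> p1 = letterPairs(a);
    List<String> p2 = new ArrayList<String>(letterPairs(b));
    int union = p1.size() + p2.size();
    int shared = 0;
    for (String pair : p1) {
      if (p2.remove(pair)) shared++;   // consume the match so duplicates count once
    }
    return union == 0 ? 0.0 : (2.0 * shared) / union;
  }

  public static void main(String[] args) {
    System.out.println(similarity("abc", "abc company inc"));  // ~0.33
  }
}
{code}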
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034639#comment-13034639 ] Earwin Burrfoot commented on LUCENE-3105: - StringInterner is in fact faster than CHM. And is compatible with String.intern(), ie - it returns the same String instances. It also won't eat up memory if spammed with numerous unique strings (which is a strange feature, but people requested that). In Lucene 4.0 all of this is moot anyway, fields there are strongly separated and intern() is not used. String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unqiue field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
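A bare-bones illustration of what "compatible with String.intern()" can look like with a ConcurrentHashMap in front. This is a sketch, not the attached patch or Lucene's StringHelper/StringInterner; a production version would also bound the map so numerous unique strings can't grow it without limit:

{code}
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interner: returns the same canonical instances as String.intern(),
// but serves repeat lookups from a ConcurrentHashMap instead of crossing into the
// JVM's intern table every time.
final class CachingInterner {
  private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<String, String>();

  public String intern(String s) {
    String cached = cache.get(s);
    if (cached != null) return cached;
    String canonical = s.intern();        // slow path, taken once per distinct string
    cache.putIfAbsent(canonical, canonical);
    return canonical;
  }
}
{code}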
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034640#comment-13034640 ] Earwin Burrfoot commented on LUCENE-3105: - Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm? String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names Key: LUCENE-3105 URL: https://issues.apache.org/jira/browse/LUCENE-3105 Project: Lucene - Java Issue Type: Bug Components: core/index Affects Versions: 3.1 Reporter: Mark Kristensson Attachments: LUCENE-3105.patch We have one index with several hundred thousand unqiue field names (we're optimistic that Lucene 4.0 is flexible enough to allow us to change our index design...) and found that opening an index writer and closing an index reader results in horribly slow performance on that one index. I have isolated the problem down to the calls to String.intern() that are used to allow for quick string comparisons of field names throughout Lucene. These String.intern() calls are unnecessary and can be replaced with a hashmap lookup. In fact, StringHelper.java has its own hashmap implementation that it uses in conjunction with String.intern(). Rather than using a one-off hashmap, I've elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032936#comment-13032936 ] Earwin Burrfoot commented on LUCENE-3092: - Chris, I don't like the idea of expanding IOContext again and again, but this case seems in line with intended purporse - give Directory implementation hints as to what we're going to do with it. I don't like events either. They look fragile and binding them to threads is a WTF. With all our pausing/unpausing magic there's no guarantee merge will end on the same thread it started on. bq. Stuff like FlushPolicy could take information about concurrent merges and hold of flushes for a little while if memory allows it etc. Coordinating access to shared resource (IO subsystem) with events is very awkward. Ok, your FlushPolicy receives events from MergePolicy and holds flushes during merge. _Now, when a flush is in progress, should FlushPolicy notify MergePolicy so it can hold its merges?_ It goes downhill from there. What if FP and MP fire events simultaneously? :) What should other listeners do? Try looking at a bigger picture. Merges are not your problem. Neither are flushes. Your problem is that several threads try to take their dump on disk simultaneously (for whatever reason, you don't really care). So what we need is an arbitration mechanism for Directory writes. A mechanism located presumably @ Directory level (eg, we don't need to throttle anything when writing to RAMDir). One possible implementation is that we add a constructor parameter to FSDirectory specifying desired level of IO parallelism, and then it keeps track of its IndexOutputs and stalls writes selectively. We can also add 'expectedWriteSize' to IOContext, so the Directory may favor shorter writes over bigger ones. Instead of 'expectedWriteSize' we can use 'priority'. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch I created this simply Directory impl, whose goal is reduce IO contention in a frequent reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created with time, that can possibly stress the IO system eg if merges, searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment, does it then write-through to the real (delegate) directory. This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
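As a rough illustration of "arbitration at the Directory level" (purely hypothetical; this is not Lucene's Directory or IOContext API, and the threshold is made up), a gate that lets small writes through and bounds how many large writers hit the disk at once could look like:

{code}
import java.util.concurrent.Semaphore;

// Hypothetical write arbiter: flushes/merges announce their expected size and
// the gate throttles only the big ones.
final class WriteGate {
  private static final long SMALL_WRITE_BYTES = 1L << 20;   // 1 MB passes through freely
  private final Semaphore largeWritePermits;

  WriteGate(int maxParallelLargeWrites) {
    this.largeWritePermits = new Semaphore(maxParallelLargeWrites, true);
  }

  void write(long expectedBytes, Runnable body) throws InterruptedException {
    if (expectedBytes < SMALL_WRITE_BYTES) {
      body.run();                    // small flushes never wait
      return;
    }
    largeWritePermits.acquire();     // big merges/flushes queue up here
    try {
      body.run();
    } finally {
      largeWritePermits.release();
    }
  }
}
{code}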
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032989#comment-13032989 ] Earwin Burrfoot commented on LUCENE-3092: - bq. but I couldn't disagree more that this is an issue with an Event model There are no issues with event model itself. It's just that this model is badly suitable for this issue's usecase. Event listeners are good. Using them to emulate what is essentially a mutex - is ugly and fragile as hell. bq. We have a series of components in Lucene; Directories, IndexWriter, MergeScheduler etc, and we have some crosscutting concerns such as merges themselves. My point is that for many concerns they shouldn't necessarily be crosscutting. Eg - Directory can support IO priorities/throttling, so it doesn't have to know about merges or flushes. Many OSes have have special APIs that allow IO prioritization, do they know about merges, or Lucene at all? No. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch I created this simply Directory impl, whose goal is reduce IO contention in a frequent reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created with time, that can possibly stress the IO system eg if merges, searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment, does it then write-through to the real (delegate) directory. This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032997#comment-13032997 ] Earwin Burrfoot commented on LUCENE-3092: - bq. The IOCtx should reference the OneMerge (if in fact this file is being opened because of a merge)? IOCtx should have a value 'expectedSize', or 'priority', or something similar. This does not introduce a transitive dependency of Directory from MergePolicy (to please you once more - a true WTF), and this allows to apply the same logic to flushes. Eg - all small flushes/merges go to cache, all big flushes/merges go straight to disk. NRTCachingDirectory, to buffer small segments in a RAMDir - Key: LUCENE-3092 URL: https://issues.apache.org/jira/browse/LUCENE-3092 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch I created this simply Directory impl, whose goal is reduce IO contention in a frequent reopen NRT use case. The idea is, when reopening quickly, but not indexing that much content, you wind up with many small files created with time, that can possibly stress the IO system eg if merges, searching are also fighting for IO. So, NRTCachingDirectory puts these newly created files into a RAMDir, and only when they are merged into a too-large segment, does it then write-through to the real (delegate) directory. This lets you spend some RAM to reduce I0. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032046#comment-13032046 ] Earwin Burrfoot commented on LUCENE-3084: - * Speaking logically, merges operate on Sets of SIs, not List? * Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need. MergePolicy.OneMerge.segments should be ListSegmentInfo not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to ListSI instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
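Schematically, the composition-over-subclassing suggestion amounts to something like this (illustrative field and method names, not the real SegmentInfos class):

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SegmentInfo { /* placeholder for the real per-segment metadata */ }

// SegmentInfos *has* a list and exposes just the operations its clients need,
// rather than *being* a Vector/List subclass.
final class SegmentInfosSketch {
  private final List<SegmentInfo> segments = new ArrayList<SegmentInfo>();
  private long version;          // the extra bookkeeping fields stay private
  private long generation;

  void add(SegmentInfo si) { segments.add(si); }
  int size() { return segments.size(); }
  List<SegmentInfo> asList() { return Collections.unmodifiableList(segments); }
}
{code}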
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List<SegmentInfo> not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13032099#comment-13032099 ] Earwin Burrfoot commented on LUCENE-3084: - bq. Merges are ordered Hmm.. Why should they be? bq. SegmentInfos itself must be list It may contain list as a field instead. And have a much cleaner API as a consequence. On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code? MergePolicy.OneMerge.segments should be ListSegmentInfo not SegmentInfos -- Key: LUCENE-3084 URL: https://issues.apache.org/jira/browse/LUCENE-3084 Project: Lucene - Java Issue Type: Improvement Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3084-trunk-only.patch, LUCENE-3084-trunk-only.patch, LUCENE-3084.patch SegmentInfos carries a bunch of fields beyond the list of SI, but for merging purposes these fields are unused. We should cutover to ListSI instead. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3077) DWPT doesn't see changes to DW#infoStream
[ https://issues.apache.org/jira/browse/LUCENE-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13029881#comment-13029881 ] Earwin Burrfoot commented on LUCENE-3077: - We should just make it final everywhere ... DWPT doesn't see changes to DW#infoStream - Key: LUCENE-3077 URL: https://issues.apache.org/jira/browse/LUCENE-3077 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 4.0 Reporter: Simon Willnauer Priority: Minor Fix For: 4.0 DW does not push infostream changes to DWPT since DWPT#infoStream is final and initialized on DWPTPool initialization (at least for initial DWPT) we should push changes to infostream to DWPT too -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: I was accepted in GSoC!!!
By the way, guys. LuSolr SVN repository is mirrored @ git://git.apache.org/lucene-solr.git , which is in turn mirrored @ https://github.com/apache/lucene-solr . Working with git (maybe with stgit) is easier than juggling patches by hand. On Wed, May 4, 2011 at 15:00, David Nemeskey nemeskey.da...@sztaki.hu wrote: Hi Uwe, do you mean one issue per GSoC proposal, or one for every logical unit in the project? If the second: Robert told me to use the flexscoring branch as a base for my project, since preliminary work has already been done in that branch. Should I open JIRA issues nevertheless? Thanks, David On 2011 May 04, Wednesday 09:56:02 Uwe Schindler wrote: Hi Vinicius, Submitting patches via JIRA is fine! We were just thinking about possibly providing some SVN to work with (as additional training), but came to the conclusion, that all students should go the standard Apache Lucene way of submitting patches to JIRA issues. You can of course still use SVN / GIT locally to organize your code. At the end we just need a patch to be committed by one of the core committers. Uwe - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029403#comment-13029403 ]

Earwin Burrfoot commented on LUCENE-2904:
-----------------------------------------

I think we should simply change the API for MergePolicy. Instead of SegmentInfos it should accept a Set<SegmentInfo> with the SIs eligible for merging (eg, completely written, not elected for another merge). IW.getMergingSegments() is a damn cheat, and an "Expert" notice is not an excuse! :) Why should each and every MP do the set subtraction when IW can do it for them once and for all?

non-contiguous LogMergePolicy should be careful to not select merges already running
    Key: LUCENE-2904
    URL: https://issues.apache.org/jira/browse/LUCENE-2904
    Project: Lucene - Java
    Issue Type: Bug
    Reporter: Michael McCandless
    Assignee: Michael McCandless
    Priority: Minor
    Fix For: 3.2, 4.0
    Attachments: LUCENE-2904.patch

Now that LogMP can do non-contiguous merges, the fact that it disregards which segments are already being merged is more problematic since it could result in it returning conflicting merges and thus failing to run multiple merges concurrently.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13029408#comment-13029408 ] Earwin Burrfoot commented on LUCENE-2904: - Ok, I'm wrong. We need both a list of all SIs and eligible SIs for calculations. But that should be handled through API change, not a new public method on IW. non-contiguous LogMergePolicy should be careful to not select merges already running Key: LUCENE-2904 URL: https://issues.apache.org/jira/browse/LUCENE-2904 Project: Lucene - Java Issue Type: Bug Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-2904.patch Now that LogMP can do non-contiguous merges, the fact that it disregards which segments are already being merged is more problematic since it could result in it returning conflicting merges and thus failing to run multiple merges concurrently. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
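Sketched as a signature (hypothetical shape, not what trunk actually has), the suggestion is for IndexWriter to compute the eligible set once and hand the policy both views:

{code}
import java.util.List;
import java.util.Set;

class SegmentInfo { /* placeholder */ }

// IndexWriter, not each policy, subtracts the segments that are already being
// merged; the policy gets the full segment list for context plus the eligible set.
interface MergePolicySketch {
  // returns the selected merges, each expressed as a list of segments
  List<List<SegmentInfo>> findMerges(List<SegmentInfo> allSegments,
                                     Set<SegmentInfo> eligibleSegments);
}
{code}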
[jira] [Commented] (LUCENE-3065) NumericField should be stored in binary format in index (matching Solr's format)
[ https://issues.apache.org/jira/browse/LUCENE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13029421#comment-13029421 ] Earwin Burrfoot commented on LUCENE-3065: - It's sad NumericFields are hardbaked into index format. Eg - I have some fields that are similar to Numeric in that they are 'stringified' binary structures, and they can't become first-class in the same manner as Numeric. NumericField should be stored in binary format in index (matching Solr's format) Key: LUCENE-3065 URL: https://issues.apache.org/jira/browse/LUCENE-3065 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch (Spinoff of LUCENE-3001) Today when writing stored fields we don't record that the field was a NumericField, and so at IndexReader time you get back an ordinary Field and your number has turned into a string. See https://issues.apache.org/jira/browse/LUCENE-1701?focusedCommentId=12721972page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12721972 We have spare bits already in stored fields, so, we should use one to record that the field is numeric, and then encode the numeric field in Solr's more-compact binary format. A nice side-effect is we fix the long standing issue that you don't get a NumericField back when loading your document. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3041) Support Query Visiting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot commented on LUCENE-3041:
-----------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

Support Query Visiting / Walking
    Key: LUCENE-3041
    URL: https://issues.apache.org/jira/browse/LUCENE-3041
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 4.0
    Reporter: Chris Male
    Assignee: Simon Willnauer
    Priority: Minor
    Fix For: 4.0
    Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch

Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:

{code}
public interface QueryVisitor {
  Query visit(Query query);
}
{code}

and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys that they are interested in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visiting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ]

Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM:
------------------------------------------------------------------

The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM? Same can be said for tests.

What about throwing the original invocation exception instead of the wrapper? Since we're emulating a language feature, a simple method call, it's logical to only throw custom exceptions in... well... exceptional cases, like ambiguity/no matching method. If client code throws Errors/RuntimeExceptions, they should be transparently rethrown.

was (Author: earwin):
The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?

Support Query Visiting / Walking
    Key: LUCENE-3041
    URL: https://issues.apache.org/jira/browse/LUCENE-3041
    Project: Lucene - Java
    Issue Type: Improvement
    Components: Search
    Affects Versions: 4.0
    Reporter: Chris Male
    Assignee: Simon Willnauer
    Priority: Minor
    Fix For: 4.0
    Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch

Out of the discussion in LUCENE-2868, it could be useful to add a generic Query Visitor / Walker that could be used for more advanced rewriting, optimizations, or anything that requires state to be stored as each Query is visited. We could keep the interface very simple:

{code}
public interface QueryVisitor {
  Query visit(Query query);
}
{code}

and then use a reflection-based visitor like Earwin suggested, which would allow implementors to provide visit methods for just the Querys that they are interested in.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
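Pulling the two comments together, a minimal sketch of a reflection-dispatched visitor with a thread-safe ConcurrentHashMap cache that rethrows the original exception. This is hypothetical code, not the attached patch:

{code}
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;

// Subclasses declare public visit(SomeQueryType q) methods; dispatch resolves the
// most specific one, caches it keyed by (visitor class, argument class), and
// rethrows original failures as-is.
abstract class ReflectiveVisitor {
  private static final ConcurrentHashMap<String, Method> CACHE =
      new ConcurrentHashMap<String, Method>();

  public Object dispatch(Object arg) {
    String key = getClass().getName() + "#" + arg.getClass().getName();
    Method m = CACHE.get(key);
    if (m == null) {
      m = resolve(arg.getClass());
      CACHE.putIfAbsent(key, m);
    }
    try {
      return m.invoke(this, arg);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InvocationTargetException e) {
      Throwable cause = e.getCause();          // transparently rethrow client failures
      if (cause instanceof RuntimeException) throw (RuntimeException) cause;
      if (cause instanceof Error) throw (Error) cause;
      throw new IllegalStateException(cause);  // checked exceptions get wrapped
    }
  }

  private Method resolve(Class<?> type) {
    for (Class<?> c = type; c != null; c = c.getSuperclass()) {
      try {
        return getClass().getMethod("visit", c);
      } catch (NoSuchMethodException ignored) {
        // keep walking up; a real version would also check interfaces and
        // report ambiguous matches instead of silently picking one
      }
    }
    throw new IllegalArgumentException("no visit(...) method for " + type);
  }
}
{code}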
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027626#comment-13027626 ] Earwin Burrfoot commented on LUCENE-3061: - Mark these as @experimental? Open IndexWriter API to allow custom MergeScheduler implementation -- Key: LUCENE-3061 URL: https://issues.apache.org/jira/browse/LUCENE-3061 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.2, 4.0 Attachments: LUCENE-3061.patch, LUCENE-3061.patch IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which makes it impossible for someone to implement his own MergeScheduler. We should open up these API, as well as any other that can be useful for custom MS implementations. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: MergePolicy Thresholds
Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than thee current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depend on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. One that adheres to only those two factors (perhaps the segSize threshold should be allowed to set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
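To make the quoted proposal concrete, a toy selection routine (not LogMP or BalancedMP code; sizes and thresholds are illustrative) that greedily packs up to mergeFactor segments while keeping the merged result under a target size:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Pick the largest segments first and stop once the merged result would exceed
// maxResultSegmentBytes or mergeFactor inputs.
final class TargetSizeMergeSelector {
  static List<Long> pickOneMerge(List<Long> segmentSizes,
                                 long maxResultSegmentBytes,
                                 int mergeFactor) {
    List<Long> sorted = new ArrayList<Long>(segmentSizes);
    Collections.sort(sorted, Collections.reverseOrder());

    List<Long> picked = new ArrayList<Long>();
    long total = 0;
    for (Long size : sorted) {
      if (size >= maxResultSegmentBytes) continue;        // already "done", leave it alone
      if (picked.size() == mergeFactor) break;
      if (total + size > maxResultSegmentBytes) continue;  // try a smaller segment instead
      picked.add(size);
      total += size;
    }
    return picked.size() >= 2 ? picked : Collections.<Long>emptyList();
  }

  public static void main(String[] args) {
    // 5 GB and 7 GB segments do get merged under a 20 GB cap, unlike with a
    // plain maxMergeMB-style threshold.
    List<Long> sizes = java.util.Arrays.asList(7L, 5L, 3L, 2L, 1L);
    System.out.println(pickOneMerge(sizes, 20L, 10));      // -> [7, 5, 3, 2, 1]
  }
}
{code}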
Re: MergePolicy Thresholds
Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than thee current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depend on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. One that adheres to only those two factors (perhaps the segSize threshold should be allowed to set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: MergePolicy Thresholds
The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ I agree. I wonder tough if the knobs we give on LogMP are intuitive enough. It neatly avoids uber-merges I didn't see that I can define what uber-merge is, right? Can I tell it to stop merging segments of some size? E.g., if my index grew to 100 segments, 40GB each, I don't think that merging 10 40GB segments (to create 400GB segment) is going to speed up my search, for instance. A 40GB segment (probably much less) is already big enough to not be touched anymore. No, you can't. But you can tell it to have exactly (not 'at most') N top-tier segments and try to keep their sizes close with merges. Whatever that size may be. And this is exactly what I want. And defining max cap on segment size is not what I want. So the same set of knobs can be intuitive and meaningful for one person, and useless for another. And you can't pick the best one. Will BalancedMP stop merging such segments (if all segments are of that order of magnitude)? Shai On Mon, May 2, 2011 at 5:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Dunno, I'm quite happy with numLargeSegments (you critically misspelled it). It neatly avoids uber-merges, keeps the number of segments at bay, and does not require to recalculate thresholds when my expected index size changes. The problem is - each person needs his own set of knobs (or thinks he needs them) for MergePolicy, and I can't call any of these sets superior to others :/ 2011/5/2 Shai Erera ser...@gmail.com: I did look at it, but I didn't find that it answers this particular need (ending with a segment no bigger than X). Perhaps by tweaking several parameters (e.g. maxLarge/SmallNumSegments + maxMergeSizeMB) I can achieve something, but it's not very clear what is the right combination. Which is related to one of the points -- is it not more intuitive for an app to set this threshold (if it needs any thresholds), than tweaking all of those parameters? If so, then we only need two thresholds (size + mergeFactor), and we can reuse BalancedMP's findBalancedMerges logic (perhaps w/ some adaptations) to derive a merge plan. Shai On Mon, May 2, 2011 at 4:42 PM, Earwin Burrfoot ear...@gmail.com wrote: Have you checked BalancedSegmentMergePolicy? It has some more knobs :) On Mon, May 2, 2011 at 17:03, Shai Erera ser...@gmail.com wrote: Hi Today, LogMP allows you to set different thresholds for segments sizes, thereby allowing you to control the largest segment that will be considered for merge + the largest segment your index will hold (=~ threshold * mergeFactor). So, if you want to end up w/ say 20GB segments, you can set maxMergeMB(ForOptimize) to 2GB and mergeFactor=10. However, this often does not achieve your desired goal -- if the index contains 5 and 7 GB segments, they will never be merged b/c they are bigger than the threshold. I am willing to spend the CPU and IO resources to end up w/ 20 GB segments, whether I'm merging 10 segments together or only 2. After I reach a 20GB segment, it can rest peacefully, at least until I increase the threshold. So I wonder, first, if this threshold (i.e., largest segment size you would like to end up with) is more natural to set than thee current thresholds, from the application level? I.e., wouldn't it be a simpler threshold to set instead of doing weird calculus that depend on maxMergeMB(ForOptimize) and mergeFactor? Second, should this be an addition to LogMP, or a different type of MP. 
One that adheres to only those two factors (perhaps the segSize threshold should be allowed to set differently for optimize and regular merges). It can pick segments for merge such that it maximizes the result segment size (i.e., don't necessarily merge in sequential order), but not more than mergeFactor. I guess, if we think that maxResultSegmentSizeMB is more intuitive than the current thresholds, application-wise, then this change should go into LogMP. Otherwise, it feels like a different MP is needed, because LogMP is already complicated and another threshold would confuse things. What do you think of this? Am I trying to optimize too much? :) Shai -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail
Re: Setting the max number of merge threads across IndexWriters
Almost any design that keeps circular references between components is broken. Inability to share MergeSchedulers is just another testimonial to that. 2011/4/16 Shai Erera ser...@gmail.com: Hi This was raised in LUCENE-2755 (along with other useful refactoring to MS-IW-MP interaction). Here is the relevant comment which addresses Jason's particular issue: https://issues.apache.org/jira/browse/LUCENE-2755?focusedCommentId=12966029page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12966029 In short, we can refactor CMS to not hold to an IndexWriter member if we change a lot of the API. But IMO, an ExecutorServiceMS is the right way to go, if you don't mind giving up some CMS features, like controlling thread priority and stalling running threads. In fact, even w/ ExecutorServiceMS you can still achieve some (e.g., stalling), but some juggling will be required. Then, instead of trying to factor out IW members from this MS, you could share the same ES with all MS instances, each will keep a reference to a different IW member. This is just a thought though, I haven't tried it. Shai On Thu, Apr 14, 2011 at 8:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Can't remember. Probably no. I started an experimental MS api rewrite (incorporating ability to share MSs between IWs) some time ago, but never had the time to finish it. On Thu, Apr 14, 2011 at 19:56, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote: I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. Yep that was it... is there an issue for this? simon On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! 
I think I have seen this suggestion somewhere maybe you need to see if there is one already simon -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
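As a rough illustration of the ExecutorService-based idea discussed above: one shared, bounded thread pool that every writer's scheduler submits its merges to, so the aggregate merge concurrency is capped process-wide. The MergeScheduler/IndexWriter interaction is deliberately simplified here; `Runnable mergeTask` is a stand-in for whatever a real scheduler would hand off, not the actual Lucene API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedMergePool {
  private final ExecutorService pool;

  public SharedMergePool(int maxMergeThreadsAcrossAllWriters) {
    // One bounded pool shared by every writer caps the total number of merge threads.
    this.pool = Executors.newFixedThreadPool(maxMergeThreadsAcrossAllWriters);
  }

  /** Each writer's scheduler hands its pending merges here instead of spawning its own threads. */
  public void submit(Runnable mergeTask) {
    pool.submit(mergeTask);
  }

  public void shutdown() {
    pool.shutdown();
  }
}
```

Per-writer schedulers would then hold only the shared pool (plus whatever per-writer state they need), rather than a back-reference to their IndexWriter.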
Re: Setting the max number of merge threads across IndexWriters
You don't mean 'static var' under 'global'? I hope, very much. 2011/4/16 Jason Rutherglen jason.rutherg...@gmail.com: I'd rather not lose [important] functionality. I think a global max thread count is the least intrusive way to go, however I also need to see if that's possible. If so I'll open an issue and post a patch. 2011/4/15 Shai Erera ser...@gmail.com: Hi This was raised in LUCENE-2755 (along with other useful refactoring to MS-IW-MP interaction). Here is the relevant comment which addresses Jason's particular issue: https://issues.apache.org/jira/browse/LUCENE-2755?focusedCommentId=12966029page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12966029 In short, we can refactor CMS to not hold to an IndexWriter member if we change a lot of the API. But IMO, an ExecutorServiceMS is the right way to go, if you don't mind giving up some CMS features, like controlling thread priority and stalling running threads. In fact, even w/ ExecutorServiceMS you can still achieve some (e.g., stalling), but some juggling will be required. Then, instead of trying to factor out IW members from this MS, you could share the same ES with all MS instances, each will keep a reference to a different IW member. This is just a thought though, I haven't tried it. Shai On Thu, Apr 14, 2011 at 8:23 PM, Earwin Burrfoot ear...@gmail.com wrote: Can't remember. Probably no. I started an experimental MS api rewrite (incorporating ability to share MSs between IWs) some time ago, but never had the time to finish it. On Thu, Apr 14, 2011 at 19:56, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote: I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. Yep that was it... is there an issue for this? simon On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! 
I think I have seen this suggestion somewhere maybe you need to see if there is one already simon -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027361#comment-13027361 ] Earwin Burrfoot commented on LUCENE-3055: - Could anyone remind me, why the hell do we still have Analyzer.tokenStream AND reusableTokenStream rampaging around and confusing minds? We always recommend to use the latter, Robert just fixed some of the core classes to use the latter. Also, if reusableTokenStream is the only method left standing, isn't it wise to hide actual reuse somewhere in Lucene internals and turn Analyzer into plain and dumb factory interface? LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers -- Key: LUCENE-3055 URL: https://issues.apache.org/jira/browse/LUCENE-3055 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 3.1 Reporter: Ian Soboroff LUCENE-2372 and LUCENE-2389 marked all analyzers as final. This makes ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. StandardAnalyzer to make a small modification e.g. to tokenStream(). These issues don't indicate a new method of doing this. The issues don't give a reason except for design considerations, which seems a poor reason to make a backward-incompatible change -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
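A hedged sketch of the "Analyzer as a plain, dumb factory with reuse hidden in the internals" idea from the comment above. The `TokenSource`/`TokenSourceFactory` names are made up for illustration, not the real Lucene Analyzer/TokenStream API: the factory only knows how to build a stream, and a small caching wrapper keeps one instance per thread and resets it on reuse.

```java
import java.io.Reader;

interface TokenSource {
  void reset(Reader input);                              // re-point the existing stream at new text
}

interface TokenSourceFactory {
  TokenSource create(String fieldName, Reader input);    // plain, stateless factory
}

class ReusingWrapper {
  private final TokenSourceFactory factory;
  // one cached stream per thread, so callers never see each other's state
  private final ThreadLocal<TokenSource> cached = new ThreadLocal<TokenSource>();

  ReusingWrapper(TokenSourceFactory factory) {
    this.factory = factory;
  }

  TokenSource tokenStream(String field, Reader input) {
    TokenSource ts = cached.get();
    if (ts == null) {                    // first use on this thread: build the stream
      ts = factory.create(field, input);
      cached.set(ts);
    } else {
      ts.reset(input);                   // subsequent uses: just re-point it
    }
    return ts;
  }
}
```

A real version would also need per-field caching, but the point is that the factory itself stays trivial while reuse lives in one place.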
[jira] [Commented] (LUCENE-2571) Indexing performance tests with realtime branch
[ https://issues.apache.org/jira/browse/LUCENE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13020217#comment-13020217 ] Earwin Burrfoot commented on LUCENE-2571: - bq. Merges are NOT blocking indexing on trunk no matter which MP you use. Well.. merges tie up IO (especially if not on fancy SSDs/RAIDs), which in turn lags flushes - bigger delays for stop the world flushes / lower bandwith cap (after which they are forced to stop the world) for parallel flushes. So Lance's point is partially valid. Indexing performance tests with realtime branch --- Key: LUCENE-2571 URL: https://issues.apache.org/jira/browse/LUCENE-2571 Project: Lucene - Java Issue Type: Task Components: Index Reporter: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: wikimedium.realtime.Standard.nd10M_dps.png, wikimedium.realtime.Standard.nd10M_dps_addDocuments.png, wikimedium.realtime.Standard.nd10M_dps_addDocuments_flush.png, wikimedium.trunk.Standard.nd10M_dps.png, wikimedium.trunk.Standard.nd10M_dps_addDocuments.png We should run indexing performance tests with the DWPT changes and compare to trunk. We need to test both single-threaded and multi-threaded performance. NOTE: flush by RAM isn't implemented just yet, so either we wait with the tests or flush by doc count. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Setting the max number of merge threads across IndexWriters
I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! I think I have seen this suggestion somewhere maybe you need to see if there is one already simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Setting the max number of merge threads across IndexWriters
Can't remember. Probably no. I started an experimental MS api rewrite (incorporating ability to share MSs between IWs) some time ago, but never had the time to finish it. On Thu, Apr 14, 2011 at 19:56, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:52 PM, Earwin Burrfoot ear...@gmail.com wrote: I proposed to decouple MergeScheduler from IW (stop keeping a reference to it). Then you can create a single CMS and pass it to all your IWs. Yep that was it... is there an issue for this? simon On Thu, Apr 14, 2011 at 19:40, Jason Rutherglen jason.rutherg...@gmail.com wrote: I think the proposal involved using a ThreadPoolExecutor, which seemed to not quite work as well as what we have. I think it'll be easier to simply pass a global context that keeps a counter of the actively running threads, and pass that into each IW's CMS? On Thu, Apr 14, 2011 at 8:25 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Thu, Apr 14, 2011 at 5:20 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Today the ConcurrentMergeScheduler allows setting the max thread count and is bound to a single IndexWriter. However in the [common] case of multiple IndexWriters running in the same process, this disallows one from managing the aggregate number of merge threads executing at any given time. I think this can be fixed, shall I open an issue? go ahead! I think I have seen this suggestion somewhere maybe you need to see if there is one already simon - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Numerical ids for terms?
On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich gre...@arbylon.net wrote: Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level. As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core. Lucene index already provides term -> id mapping in some form. Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene. Lucene's Directory is a dumb abstraction for random-access named write-once byte streams. It doesn't add /any/ value over mmap. Any suggestions? *troll mode on* Use numpy/scipy? :) -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
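For what the bijective term/ordinal mapping amounts to, here is a tiny self-contained sketch: given the (sorted, deduplicated) terms of a field pulled out of an index by whatever means, assign ordinals 0..numTerms()-1 and keep the reverse array for ordinal-to-term lookups. How the terms are actually extracted from a real Lucene index is left out on purpose.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class TermOrdinals {
  private final String[] ordToTerm;                 // ordinal -> term
  private final Map<String, Integer> termToOrd;     // term -> ordinal

  public TermOrdinals(SortedSet<String> terms) {
    ordToTerm = terms.toArray(new String[0]);       // TreeSet keeps terms sorted, so ordinals are stable
    termToOrd = new HashMap<String, Integer>();
    for (int ord = 0; ord < ordToTerm.length; ord++) {
      termToOrd.put(ordToTerm[ord], ord);
    }
  }

  public int ord(String term) { return termToOrd.get(term); }
  public String term(int ord) { return ordToTerm[ord]; }
  public int numTerms()       { return ordToTerm.length; }

  public static void main(String[] args) {
    SortedSet<String> terms = new TreeSet<String>();
    terms.add("apache"); terms.add("lucene"); terms.add("matrix");
    TermOrdinals ords = new TermOrdinals(terms);
    System.out.println(ords.ord("lucene") + " " + ords.term(2));  // prints: 1 matrix
  }
}
```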
An IDF variation with penalty for very rare terms
Excuse me for somewhat of an offtopic, but has anybody ever seen/used -subj- ? Something that looks like http://dl.dropbox.com/u/920413/IDFplusplus.png Traditional log(N/x) tail, but when nearing zero freq, instead of going to +inf you do a nice round bump (with controlled height/location/sharpness) and drop down to -inf (or zero). Should be cool when doing cosine-measure (or something comparable)-based document comparisons (eg. in a more like this query, to mention Lucene at least once :) ), over dirty data. Rationale is that most good, discriminating terms are found in at least a certain percentage of your documents, but there are lots of mostly unique crapterms, which at some collection sizes stop being strictly unique and with IDF's help explode your scores. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
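One way to get roughly the curve described (ordinary log(N/df) for common-enough terms, a controlled bump, then a drop toward zero as the document frequency approaches 1) is to multiply the classic IDF by a logistic gate in df. This is only a guess at a formula matching the description, not the one behind the linked plot; d0 controls where the bump sits and s how sharp it is.

```java
public class IdfWithRarePenalty {
  // "IDF++": log(N/df) damped by a logistic gate, so ultra-rare terms stop
  // exploding the score and instead fall back toward zero.
  public static double idf(long docFreq, long numDocs, double d0, double s) {
    double classic = Math.log((double) numDocs / docFreq);
    double gate = 1.0 / (1.0 + Math.exp(-(docFreq - d0) / s));  // ~0 for df << d0, ~1 for df >> d0
    return classic * gate;
  }

  public static void main(String[] args) {
    long n = 10_000_000L;
    for (long df : new long[] {1, 3, 10, 30, 100, 1000, 100000}) {
      System.out.printf("df=%-7d idf=%.3f%n", df, idf(df, n, 20, 5));
    }
  }
}
```

With these (arbitrary) parameters the value climbs to a peak around df in the tens-to-hundreds range and decays like plain IDF after that, while df=1 terms score close to zero instead of dominating.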
Re: character escapes in source? ... was: Re: Eclipse: Invalid character constant
On Fri, Apr 8, 2011 at 03:01, Robert Muir rcm...@gmail.com wrote: On Thu, Apr 7, 2011 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : -1. These files should be readable, for maintaining, debugging and : knowing what's going on. Readability is my main concern ... I don't know (and frequently can't tell) the difference between a lot of non-ASCII characters -- and I'm guessing I'm not alone. when it's spelled out explicitly using the character name or escape code, there is no ambiguity about what character was intended, or whether it got screwed up by some tool along the way (ie: the svn server, an svn client, the patch command, a text editor, an IDE, ant's fixcrlf task, etc...) Please take the time, just 5 or 10 minutes, to look thru some of this source code and tests. Imagine if you couldn't just look at the code to see what it does, but had to decode from some crazy numeric encoding scheme. Imagine if it were this way for things like stopword lists too. It would be basically impossible for you to look at the code and figure out what it does! For example, try looking at thai analyzer tests, if these were all numbers, how would you know wtf is going on? Although this comes up from time to time, I stand firm on my -1 because it's important to me for the source code to be readable. I'm not willing to give this up just because some people cannot read writing system XYZ. I have said before, I'm willing to change my -1 vote on this, if *ALL* string constants (including English ones) are changed to be character escapes. If you imagine what the code would look like if English string constants were instead codes, then I think you will understand my point of view! It's really, really important to source code readability to be able to open a file and understand what it does, not to have to use some decoder because it uses characters other people don't understand. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org I think having both raw characters /and/ encoded representation is the best? (one of them in comments) I'm all for unicode sources, but at least two things hit me repeatedly: 1. Tools do screw up, and you have to recover somehow. eg. IntelliJ IDEA's 'shelve' function uses platform default (MacRoman in my case) and I've lost some text on things I shelved but never committed anywhere. 2. There are characters that look all the same. E.g. different whitespace/dashes. Or, (if you have cyrillic in your fonts) I dare you to discern between a/а, c/с, e/е, o/о. These are different characters from latin and cyrillic charsets (left latin/right cyrillic), but in 99% fonts they are visually identical. I had a filter that folded up similarly looking characters, and it was documented in exactly this way - raw char+code. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
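The "document the raw char plus its escape code" convention mentioned at the end might look like the snippet below: a tiny folding map collapsing a few Cyrillic letters onto their visually identical Latin twins, with each entry carrying both the raw character and its code point. The particular mapping is illustrative only, nowhere near a complete confusables list.

```java
import java.util.HashMap;
import java.util.Map;

public class HomoglyphFolder {
  // Each entry keeps the raw character and its code point in a comment, so the
  // source stays readable even if some tool mangles the non-ASCII characters.
  private static final Map<Character, Character> FOLD = new HashMap<Character, Character>();
  static {
    FOLD.put('\u0430', 'a'); // 'а' CYRILLIC SMALL LETTER A  -> LATIN 'a'
    FOLD.put('\u0441', 'c'); // 'с' CYRILLIC SMALL LETTER ES -> LATIN 'c'
    FOLD.put('\u0435', 'e'); // 'е' CYRILLIC SMALL LETTER IE -> LATIN 'e'
    FOLD.put('\u043E', 'o'); // 'о' CYRILLIC SMALL LETTER O  -> LATIN 'o'
  }

  public static String fold(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      Character mapped = FOLD.get(s.charAt(i));
      out.append(mapped != null ? mapped : s.charAt(i));
    }
    return out.toString();
  }
}
```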
Re: [POLL] JTS compile/test dependency
On Wed, Apr 6, 2011 at 22:43, Robert Muir rcm...@gmail.com wrote: On Wed, Apr 6, 2011 at 2:12 PM, Ryan McKinley ryan...@gmail.com wrote: Some may be following the thread on spatial development... here is a quick summary, and a poll to help decide what may be the best next move. I'm hoping to introduce a high level spatial API that can be used for a variety of indexing strategies and computational needs. For simple point in BBox and point in WGS84 radius, this does not require any external libraries. To support more complex queries -- point in polygon, complex geometry intersections, etc -- we need an LGPL library JTS. The LGPL dependency is only needed to compile/test, there is no runtime requirement for JTS. To enable the more complicated options you would need to add JTS to the classpath and perhaps set a environment variable. This is essentially what we are now doing with the (soon to be removed) bdb contrib. I am trying to figure out the best home for this code and development to live. I think it is essential for the JTS support to be part of the core build/test -- splitting it into a separate module that is tested elsewhere is not an option. This raises the basic question of if people are willing to have the LGPL build dependency as part of the main lucene build. I think it is, but am sympathetic to the idea that it might not be. I'm sorta confused about this (i'll probably offend someone here, but so be it) We have a contrib module for spatial that is experimental, people want to deprecate, and say has problems. Why must the super-expert-polygon stuff sit with the basic capability that probably most users want: the ability to do basic searches (probably in combination with text too) in their app? Its hard for me to tell, i hope the reason isn't elegance, but why aren't we working on making a simple,supported,80-20 case in lucene that non-spatial-gurus (and users) understand and can maintain... then it would seem ideal for the complex stuff to be outside of this project with any dependencies it wants? Users are probably really confused about the spatial situation: is it because we are floundering around this expert stuff Handling Unicode code points outside of BMP is highly expert stuff as well. And is totally unneeded by 80% of the users for any other reason except elegance. I think you two guys can really understand each other here : ) -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [POLL] JTS compile/test dependency
On Thu, Apr 7, 2011 at 01:11, Robert Muir rcm...@gmail.com wrote: On Wed, Apr 6, 2011 at 5:07 PM, Earwin Burrfoot ear...@gmail.com wrote: Handling Unicode code points outside of BMP is highly expert stuff as well. And is totally unneeded by 80% of the users for any other reason except elegance. I think you two guys can really understand each other here : ) you are wrong: you either support unicode, or your application is buggy. Its not an optional feature, its the text standard used by the java programming language. You either handle the the Earth as a proper somewhat-ellipsoid, or your application is buggy. It's not an optional feature, it's even stronger than a standard - it is a physical fact experienced by all of us, earthlings. Though 80% of the users can throw geoids and unicode planes out of the window and live happily with some stupid local coordinate system and two-byte characters (some even manage with one-byte!). Yeah, they don't really care about being buggy in any geo/unicode-zealot's eyes. Having said that, it's cool that people like you two exist :) Because earth is round, maps are ugly, there are lots of different writing systems and someone has to deal with that. -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs
[ https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014108#comment-13014108 ] Earwin Burrfoot commented on LUCENE-2981: - Bye-bye, DB. Few things can compete with it in pointlessness. Review and potentially remove unused/unsupported Contribs - Key: LUCENE-2981 URL: https://issues.apache.org/jira/browse/LUCENE-2981 Project: Lucene - Java Issue Type: Improvement Reporter: Grant Ingersoll Fix For: 3.2, 4.0 Attachments: LUCENE-2981.patch Some of our contribs appear to be lacking for development/support or are missing tests. We should review whether they are even pertinent these days and potentially deprecate and remove them. One of the things we did in Mahout when bringing in Colt code was to mark all code that didn't have tests as @deprecated and then we removed the deprecation once tests were added. Those that didn't get tests added over about a 6 mos. period of time were removed. I would suggest taking a hard look at: ant db lucli swing (spatial should be gutted to some extent and moved to modules) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Urgent! Forgot to close IndexWriter after adding Documents to the index.
On Tue, Mar 22, 2011 at 06:21, Chris Hostetter hossman_luc...@fucit.org wrote: (replying to the dev list, see context below) : Unfortunately, you can't easily recover from this (except by : reindexing your docs again). : : Failing to call IW.commit() or IW.close() means no segments file was written... I know there were good reasons for eliminating the autoCommit functionality from IndexWriter, but threads like this make me think that even though autoCommit on flush/merge/whatever was bad, having an option for some sort of autoClose using a finalizer might be a good idea to give new/novice users a safety net. In the case of totally successful normal operation, this would result in one commit at GC (assuming the JVM calls the finalizer) and if there were any errors it should (if I understand correctly) do an implicit rollback. Anyone see a downside? Yes. Totally unexpected magical behaviour. What if I didn't commit something on purpose? ... : I had a program running for 2 days to build an index for around 160 million : text files, and after program ended, I tried searching the index and found : the index was not correctly built, *indexReader.numDocs()* returns 0. I : checked the index directory, it looked good, all the index data seemed to be : there, the directory is 1.5 Gigabytes in size. : : I checked my code and found that I forgot to call *indexWriter.optimize()* and : *indexWriter.close()*, I want to know if it is possible to : *re-optimize()* the index so I don't need to rebuild the whole index : from scratch? I don't : really want the program to take another 2 days. -Hoss - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
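Whichever way the finalizer debate goes, the safety net on the application side is just a try/finally around the indexing loop, so a long run always leaves a committed segments file behind. A minimal sketch, assuming an already-configured IndexWriter and some iterable of Documents; in this era of Lucene, close() commits as well, but being explicit makes the intent obvious.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SafeIndexing {
  public static void indexAll(IndexWriter writer, Iterable<Document> docs) throws IOException {
    try {
      long count = 0;
      for (Document doc : docs) {
        writer.addDocument(doc);
        if (++count % 1_000_000 == 0) {
          writer.commit();   // periodic commits: a crash loses at most one batch
        }
      }
      writer.commit();       // make the final state visible to readers
    } finally {
      writer.close();        // close() commits too, but don't rely on remembering that
    }
  }
}
```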
Re: IndexReader.indexExists declares throwing IOE, but never does
Technically, there's a big difference between I checked, and there was no index, and I was unable to check the disk because file system went BANG!. So the proper behaviour is to return false IOE (on proper occasion)? On Mon, Mar 21, 2011 at 13:53, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Mar 21, 2011 at 12:52 AM, Shai Erera ser...@gmail.com wrote: Can we remove the declaration? The method never throws IOE, but instead catches it and returns false. I think it's reasonable that such a method will not throw exceptions. +1 -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: IndexReader.indexExists declares throwing IOE, but never does
2011/3/21 Shai Erera ser...@gmail.com: So the proper behaviour is to return false IOE (on proper occasion)? I don't object to it, as I think it's reasonable (as today we may be hiding some info from the app). However, given that today we never throw IOE, and that if we start doing so, we'll change runtime behavior, I lean towards keeping the method simple and remove the throws declaration. Well, it's either we change the impl to throw IOE, or remove the declaration altogether. Changing the impl to throw IOE on proper occasion might be problematic -- IndexNotFoundException is thrown when an empty index directory was given, however by its Javadocs, it can also indicate the index is corrupted. Perhaps the jdocs are wrong and it's thrown only if the index directory is empty, or no segments files are found. If that's the case, then we should change its javadocs. Otherwise, it will be difficult to know whether the INFE indicates an empty directory, for which you'll want to return false, or a corrupt index, for which you'll want to throw the exception. Besides, I consider this method almost like File.exists() which doesn't throw an exception. If indexExists() returns false, the app can decide to investigate further by trying to open IndexReader or read the SegmentInfos. But the API as-is needs to be simple IMO. File.exists() parallel is a good one. So, maybe, it's ok ) Otherwise please keep the throws declaration so that you won't break public APIs if this changes implementation. Removing the throws declaration doesn't break apps. In the worse case, they'll have a catch block which is redundant? Shai On Mon, Mar 21, 2011 at 4:12 PM, Sanne Grinovero sanne.grinov...@gmail.com wrote: 2011/3/21 Earwin Burrfoot ear...@gmail.com: Technically, there's a big difference between I checked, and there was no index, and I was unable to check the disk because file system went BANG!. So the proper behaviour is to return false IOE (on proper occasion)? +1 to throw the exception when proper to do so Otherwise please keep the throws declaration so that you won't break public APIs if this changes implementation. On Mon, Mar 21, 2011 at 13:53, Michael McCandless luc...@mikemccandless.com wrote: On Mon, Mar 21, 2011 at 12:52 AM, Shai Erera ser...@gmail.com wrote: Can we remove the declaration? The method never throws IOE, but instead catches it and returns false. I think it's reasonable that such a method will not throw exceptions. +1 -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
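The File.exists()-style contract being argued for, written out: swallow "there is no index here" and answer false, but let a genuinely failing filesystem read propagate. `SegmentsReader.readLatestSegmentsFile` is a hypothetical stand-in, not the real SegmentInfos API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class IndexExistsSketch {
  /** Hypothetical stand-in for "read the newest segments_N file". */
  interface SegmentsReader {
    void readLatestSegmentsFile() throws IOException;  // FileNotFoundException when no index exists
  }

  // File.exists()-style contract: absence -> false, real I/O trouble -> exception.
  public static boolean indexExists(SegmentsReader reader) throws IOException {
    try {
      reader.readLatestSegmentsFile();
      return true;                      // "I checked, and there is an index"
    } catch (FileNotFoundException noIndex) {
      return false;                     // "I checked, and there was no index"
    }
    // any other IOException propagates: "couldn't check, the file system went BANG"
  }
}
```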
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007048#comment-13007048 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of purity. So what? What exactly is overkill? Few wasted bytes and CPU ns for an object that's created a couple of times during application lifetime? There are also builders, which are very similar to what Steven is proposing. bq. Another thought is to offer all settings on the IWC for init convenience and exposure and then add javadoc about updaters on IW for those settings that can be changed on the fly That's exactly how I'd like to see it. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 Attachments: LUCENE-2960.patch In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13007136#comment-13007136 ] Earwin Burrfoot commented on LUCENE-2960: - You avoid deprecation/undeprecation and binary incompatibility, while incompatibly changing semantics. What do you win? Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 Attachments: LUCENE-2960.patch In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006759#comment-13006759 ] Earwin Burrfoot commented on LUCENE-2960: - bq. infoStream is a PrintStream, which synchronizes anyway, so it should be safe to omit the volatile You're absolutely right here. bq. Yet, no real Java impl out there will ever do this since doing so will simply make that Java impl appear buggy. Sorry, but real Java impls do this. The case with endless get() happened on a map that was never modified after being created and set. Just one of the many JVM instances on many machines got unlucky after restart. bq. Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache levels, should cache only frequently used things Java impls don't cache things. They do reorderings, they also keep final fields on registers, omitting reloads that happen for non-final ones, but no caching in JMM-related cases. Caching here is done by CPU, and it caches all data read from memory. bq. IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final. Setters can return modified immutable copy of 'this'. So you get both incremental building and immutability. bq. How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW. What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 'live' properties on IW. This clearly draws the line, and you don't have to consult Javadocs for each and every setting to know if you can change it live or not. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 Attachments: LUCENE-2960.patch In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
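The "setters return a modified immutable copy of 'this'" pattern mentioned above, sketched on a made-up two-field config rather than the real IndexWriterConfig:

```java
// Immutable config: every "setter" returns a new instance, so an object that
// captured a config at construction time can never be changed behind its back.
public final class WriterConfig {
  private final double ramBufferMB;
  private final int termIndexInterval;

  public WriterConfig(double ramBufferMB, int termIndexInterval) {
    this.ramBufferMB = ramBufferMB;
    this.termIndexInterval = termIndexInterval;
  }

  public WriterConfig withRamBufferMB(double mb) {
    return new WriterConfig(mb, this.termIndexInterval);
  }

  public WriterConfig withTermIndexInterval(int interval) {
    return new WriterConfig(this.ramBufferMB, interval);
  }

  public double ramBufferMB()    { return ramBufferMB; }
  public int termIndexInterval() { return termIndexInterval; }
}

// Usage still reads like incremental building:
// WriterConfig cfg = new WriterConfig(16.0, 128).withRamBufferMB(64.0).withTermIndexInterval(32);
```

The few extra allocations happen only when a config is built or rebuilt, which is rare over an application's lifetime.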
Re: GPU acceleration
On Sun, Mar 13, 2011 at 00:15, Ken O'Brien k...@kenobrien.org wrote: To clarify, I've not yet written any code. I aim to bring a large speedup to any functionality that is computationally expensive. I'm wondering which components are candidates for this. I'll be looking through the code but if anyone is aware of parallelizable code, I'll start with that. More like 'vectorizable' code, huh? Guys from Yandex use modified group varint encoding plus handcrafted SSE magic to decode/intersect posting lists and claim tremendous speedups over original group varint. They also use SSE to run the decision trees used in ranking. There were experiments with moving both pieces of code to the GPU, and GPU did well in terms of speed, but they say getting data in and out of GPU made the approach unfeasible. I'll basically replicate existing functionality to run on the gpu. On 12/03/11 21:08, Simon Willnauer wrote: On Sat, Mar 12, 2011 at 9:21 PM, Ken O'Brienk...@kenobrien.org wrote: Hi, Is anyone looking at GPU acceleration for Solr? If not, I'd like to contribute code which adds this functionality. As I'm not familiar with the codebase, does anyone know which areas of functionality could benefit from high degrees of parallelism. Very interesting can you elaborate a little more what kind of functionality you exposed / try to expose to the GPU? simon Regards, Ken - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
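Since plain (unmodified) group varint came up: the idea is to batch four integers, spend one prefix byte on four 2-bit length codes, and then write only as many bytes per value as needed, so a decoder branches once per group instead of once per byte. A minimal encoder for a single group of four non-negative ints, without any of the SSE tricks mentioned:

```java
import java.io.ByteArrayOutputStream;

public class GroupVarint {
  // Output layout: 1 prefix byte (length-1 of value i stored in bits 2*i..2*i+1),
  // followed by 1-4 big-endian bytes per value.
  public static byte[] encodeGroup(int a, int b, int c, int d) {
    int[] values = {a, b, c, d};
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prefix = 0;
    byte[][] encoded = new byte[4][];
    for (int i = 0; i < 4; i++) {
      int len = bytesNeeded(values[i]);
      prefix |= (len - 1) << (2 * i);            // two bits per value
      encoded[i] = new byte[len];
      for (int j = 0; j < len; j++) {            // most significant byte first
        encoded[i][j] = (byte) (values[i] >>> (8 * (len - 1 - j)));
      }
    }
    out.write(prefix);
    for (byte[] e : encoded) out.write(e, 0, e.length);
    return out.toByteArray();
  }

  private static int bytesNeeded(int v) {
    if (v < (1 << 8))  return 1;
    if (v < (1 << 16)) return 2;
    if (v < (1 << 24)) return 3;
    return 4;
  }
}
```

Decoding reverses it: read the prefix, extract the four lengths, then read exactly that many bytes per value.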
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13006227#comment-13006227 ] Earwin Burrfoot commented on LUCENE-2960: - {quote} Why such purity? What do we gain? I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom... {quote} We gain consistency and predictability. And there are a lot of freedoms dangerous for developers. {quote} In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once. {quote} You've brought up a purrfect example of how NOT to do things. This is called 'action at a distance' and is a damn bug. Very annoying one. I've thoroughly experienced it with previous major version of Apache HTTPClient - they had an API that suggested you can set per-request timeouts, while these were actually global for a single Client instance. I fried my brain trying to understand why the hell random user requests timeout at hundred times their intended duration. Oh! It was an occasional admin request changing the global. irony You know, you can actually instantiate some DateRangeFilter with a couple of Dates, and then change these Dates (they are writeable) before each request. Isn't it an exciting kind of programming freedom? Or, back to our current discussion - we can pass RAMBufferSizeMB as an AtomicDouble, instead of current double, then we can use .set() on an instance we passed, and have our live reconfigurability. What's more - AtomicDouble protects us from word tearing! /irony bq. I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems. Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get(). It happened on your run off the mill Sun 1.6 JVM. So the bug is there, lying in wait. Maybe nobody ever actually used the freedom to change infoStream in-flight, or the guy was lucky, or in his particular situation the field was guarded by some unrelated sync. While I see banishing live reconfiguration from IW as a lost cause, I ask to make IWC immutable at the very least. As Shay said - this will provide a clear barrier between mutable and immutable properties. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: IndexWriter#setRAMBufferSizeMB removed in trunk
Is it really that hard to recreate IndexWriter if you have to change the settings?? Yeah, yeah, you lose all your precious reused buffers, and maybe there's a small indexing latency spike, when switching from old IW to new one, but people aren't changing their IW configs several times a second? I suggest banning as much runtime-mutable settings as humanely possible, and ask people to recreate objects for reconfiguration, be it IW, IR, Analyzers, whatnot. On Thu, Mar 10, 2011 at 23:07, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Mar 10, 2011 at 7:28 AM, Robert Muir rcm...@gmail.com wrote: This should block the release: if IndexWriterConfig is a broken design then we need to revert this now before its released, not make users switch over and then undeprecate/revert in a future release. +1 I think we have to sort this out, one way or another, before releasing 3.1. I really don't like splitting setters across IWC vs IW. That'll just cause confusion, and noise over time as we change our minds about where things belong. Looking through IWC, it seems that most setters can be done live. In fact, setRAMBufferSizeMB is *almost* live: all places in IW that use this pull it from the config, except for DocumentsWriter. We could just push the config down to DW and have it pull live too... Other settings are not pulled live but for no good reason, eg termsIndexInterval is copied to a private field in IW but could just as easily be pulled when it's time to write a new segment... Maybe we should simply document which settings are live vs only take effect at init time? Mike -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005617#comment-13005617 ] Earwin Burrfoot commented on LUCENE-2960: - As I said on the list - if one needs to change IW config, he can always recreate IW with new settings. Such changes cannot happen often enough for recreation to affect indexing performance. The fact that you can change IW's behaviour post-construction by modifying unrelated IWC instance is frightening. IW should either make a private copy of IWC when constructing, or IWC should be made immutable. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: IndexWriter#setRAMBufferSizeMB removed in trunk
Thanks for your support, but I don't think setInfoStream makes any sense either : ) Do we /change/ infoStreams for IW @runtime? Why can't we pass it as constructor argument/IWC field? Ok, just maybe, I can imagine a case, where a certain app runs happily, then misbehaves, and then you, with some clever trickery supply it a fresh infoStream, to capture the problem live, without restarting. So, just maybe, we should leave setInfoStream asis. 2011/3/11 Shai Erera ser...@gmail.com: I agree. After IWC, the only setter left in IW is setInfoStream which makes sense. But the rest ... assuming these config change don't happen very often, recreating IW doesn't sound like a big thing to me. The alternative of complicating IWC to support runtime changes -- we need to be absolutely sure it's worth it. Also, if the solution is to allow changing IWC (runtime) settings, then I don't think this issue should block 3.1? We can anyway add other runtime settings following 3.1, and we won't undeprecate anything. So maybe mark that issue as a non-blocker? Shai On Fri, Mar 11, 2011 at 2:20 PM, Earwin Burrfoot ear...@gmail.com wrote: Is it really that hard to recreate IndexWriter if you have to change the settings?? Yeah, yeah, you lose all your precious reused buffers, and maybe there's a small indexing latency spike, when switching from old IW to new one, but people aren't changing their IW configs several times a second? I suggest banning as much runtime-mutable settings as humanely possible, and ask people to recreate objects for reconfiguration, be it IW, IR, Analyzers, whatnot. On Thu, Mar 10, 2011 at 23:07, Michael McCandless luc...@mikemccandless.com wrote: On Thu, Mar 10, 2011 at 7:28 AM, Robert Muir rcm...@gmail.com wrote: This should block the release: if IndexWriterConfig is a broken design then we need to revert this now before its released, not make users switch over and then undeprecate/revert in a future release. +1 I think we have to sort this out, one way or another, before releasing 3.1. I really don't like splitting setters across IWC vs IW. That'll just cause confusion, and noise over time as we change our minds about where things belong. Looking through IWC, it seems that most setters can be done live. In fact, setRAMBufferSizeMB is *almost* live: all places in IW that use this pull it from the config, except for DocumentsWriter. We could just push the config down to DW and have it pull live too... Other settings are not pulled live but for no good reason, eg termsIndexInterval is copied to a private field in IW but could just as easily be pulled when it's time to write a new segment... Maybe we should simply document which settings are live vs only take effect at init time? Mike -- Mike http://blog.mikemccandless.com - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко E-Mail/Jabber: ear...@gmail.com Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005891#comment-13005891 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Furthermore, closing the IW also forces you to commit, and I don't like tying changing of configuration to forcing a commit. Like I said, one isn't going to change his configuration five times a second. It's ok to commit from time to time? bq. So why should we force it to be unchangeable? That can only remove freedom, freedom that is perhaps valuable to an app somewhere. Each and every live reconfigurable setting adds to complexity. At the very least it requires proper synchronization. Take your SegmentWarmer example - you should make the field volatile. While it's possible to chicken out on primitive fields ([except long/double|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]), as Yonik mentioned earlier, making nonvolatile mutable references introduces you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is currently broken!). For more complex cases, certain on-change logic is required. And then you have to support this logic across all possible code rewrites and refactorings. Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter -- Key: LUCENE-2960 URL: https://issues.apache.org/jira/browse/LUCENE-2960 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shay Banon Priority: Blocker Fix For: 3.1, 4.0 In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. It would be great to be able to control that on a live IndexWriter. Other possible two methods that would be great to bring back are setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other setters can actually be set on the MergePolicy itself, so no need for setters for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
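The unsafe-publication point in concrete form: a live-changeable reference needs at least volatile (or some other happens-before edge) so a reader thread can't observe a stale or half-published object. A generic sketch of such a "live setting" holder, not the actual infoStream field:

```java
import java.io.PrintStream;

public class LiveInfoStream {
  // Without volatile, a thread calling message() concurrently with setInfoStream()
  // may see a stale or unsafely published reference; volatile makes the write
  // happen-before subsequent reads.
  private volatile PrintStream infoStream;

  public void setInfoStream(PrintStream ps) {
    this.infoStream = ps;                    // safe publication of the new stream
  }

  public void message(String msg) {
    PrintStream local = infoStream;          // read the field once; it may be null
    if (local != null) {
      local.println(msg);
    }
  }
}
```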
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12994769#comment-12994769 ] Earwin Burrfoot commented on LUCENE-2908: - Oh, damn :) On my project, we specifically use java-serialization to pass configured Queries/Filters between cluster nodes, as it saves us HEAPS of wrapping/unwrapping them into some parallel serializable classes. clean up serialization in the codebase -- Key: LUCENE-2908 URL: https://issues.apache.org/jira/browse/LUCENE-2908 Project: Lucene - Java Issue Type: Task Reporter: Robert Muir Assignee: Robert Muir Fix For: 4.0 Attachments: LUCENE-2908.patch We removed contrib/remote, but forgot to cleanup serialization hell everywhere. this is no longer needed, never really worked (e.g. across versions), and slows development (e.g. i wasted a long time debugging stupid serialization of Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [REINDEX] Note: re-indexing required !
Lucene maintains compatibility with earlier stable release index versions, and to some extent transparently upgrades them. But there is no guaranteed compatibility between different in-development indexes. E.g. 3.2 reads 3.1 indexes and upgrades them, but 3.2-dev-snapshot-10 (while happily handling 3.1) may fail reading 3.2-dev-snapshot-8 index, as they have the same version tag, yet different formats. On Sun, Jan 23, 2011 at 19:18, Earl Hood e...@earlhood.com wrote: On Sat, Jan 22, 2011 at 11:14 PM, Shai Erera ser...@gmail.com wrote: Under LUCENE-2720 the index format of both trunk and 3x has changed. You should re-index any indexes created with either of these code streams. Does the 3x refer to the 3.x development branch? I.e. For those of using the stable 3.x release of Lucene, will a future 3.x release require rebuilding indexes? --ewh - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2871) Use FileChannel in FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12984222#action_12984222 ] Earwin Burrfoot commented on LUCENE-2871: - Before arguing where to put this new IndexOutput, I think it's wise to have a benchmark proving we need it at all. I have serious doubts FileChannel's going to outperform RAF.write(). Why should it? And for the purporses of benchmark it can be anywhere. Use FileChannel in FSDirectory -- Key: LUCENE-2871 URL: https://issues.apache.org/jira/browse/LUCENE-2871 Project: Lucene - Java Issue Type: New Feature Components: Store Reporter: Shay Banon Attachments: LUCENE-2871.patch, LUCENE-2871.patch Explore using FileChannel in FSDirectory to see if it improves write operations performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
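A rough way to get the benchmark being asked for, before arguing about placement: push the same amount of data through RandomAccessFile.write() and through FileChannel.write() and compare wall-clock time. Deliberately naive (no warmup discipline, no fsync, OS cache effects ignored), enough only for a first sanity check.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class WriteBench {
  static final int CHUNK = 1 << 16;          // 64 KB per write call
  static final int CHUNKS = 4096;            // ~256 MB total

  public static void main(String[] args) throws IOException {
    byte[] data = new byte[CHUNK];

    File f1 = File.createTempFile("raf", ".bin");
    long t0 = System.nanoTime();
    RandomAccessFile raf = new RandomAccessFile(f1, "rw");
    for (int i = 0; i < CHUNKS; i++) raf.write(data);
    raf.close();
    System.out.println("RandomAccessFile.write: " + (System.nanoTime() - t0) / 1_000_000 + " ms");

    File f2 = File.createTempFile("chan", ".bin");
    long t1 = System.nanoTime();
    RandomAccessFile raf2 = new RandomAccessFile(f2, "rw");
    FileChannel ch = raf2.getChannel();
    ByteBuffer buf = ByteBuffer.wrap(data);
    for (int i = 0; i < CHUNKS; i++) {
      buf.rewind();                          // reuse the same buffer each iteration
      while (buf.hasRemaining()) {
        ch.write(buf);                       // a channel may write less than requested
      }
    }
    ch.close();
    raf2.close();
    System.out.println("FileChannel.write:      " + (System.nanoTime() - t1) / 1_000_000 + " ms");

    f1.delete();
    f2.delete();
  }
}
```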
Re: Let's drop Maven Artifacts !
Somehow, they were made available since 2.0 - http://repo2.maven.org/maven2/org/apache/lucene/lucene-core/ The pom's are minimal, sans dependencies, so eg if your project depends on lucene-spellchecker, lucene-core won't be transitively included and your build is gonna fail (you therefore had to add dependency on the core to your project yourself). But they were enough to download and link jars/sources/javadocs. On Tue, Jan 18, 2011 at 12:40, Shai Erera ser...@gmail.com wrote: Out of curiosity, how did the Maven people integrate Lucene before we had Maven artifacts. To the best of my understanding, we never had proper Maven artifacts (Steve is working on that in LUCENE-2657). Shai On Tue, Jan 18, 2011 at 11:03 AM, Simon Willnauer simon.willna...@googlemail.com wrote: On Tue, Jan 18, 2011 at 9:33 AM, Thomas Koch tho...@koch.ro wrote: Hi, the developers list may not be the right place to find strong maven supporters. All developers know lucene from inside out and are perfectly fine to install lucene from whatever artifact. Those people using maven are your end users, that propably don't even subscribe to users@. big +1 for this comment! I have to admit that I am not a big maven fan and each time I have to use it its a pain in the ass but it is the de-facto standard for the majority of java projects on this planet so really there is not much of an option in my opinion. A project like lucene has to release maven artifacts even if its a pain. Simon Thomas Koch, http://www.koch.ro - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983160#action_12983160 ] Earwin Burrfoot commented on LUCENE-2657: - bq. we need to be very clear and it has no effect on artifacts I feel something was missed in the heat of debate. Eg: bq. The latest patch on this release uses the Ant artifacts directly. bq. This patch uses the Ant-produced artifacts to prepare for Maven artifact publishing. bq. Maven itself is not invoked in the process. An Ant plugin handles the artifact deployment. I will now try to decipher these quotes. It seems the patch takes the artifacts produced by Ant, as a part of our usual (and only) build process, and shoves it down Maven repository's throat along with a bunch of pom-descriptors. Nothing else is happening. Also, after everything that has been said, I think nobody in his right mind will *force* anyone to actually use the Ant target in question as a part of release. But it's nice to have it around, in case some user-friendly commiter would like to push (I'd like to reiterate - ant generated) artifacts into Maven. Replace Maven POM templates with full POMs, and change documentation accordingly Key: LUCENE-2657 URL: https://issues.apache.org/jira/browse/LUCENE-2657 Project: Lucene - Java Issue Type: Improvement Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Assignee: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. The full Maven POMs in the attached patch include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates. Several dependencies are not available through public maven repositories. A profile in the top-level POM can be activated to install these dependencies from the various {{lib/}} directories into your local repository. From the top-level directory: {code} mvn -N -Pbootstrap install {code} Once these non-Maven dependencies have been installed, to run all Lucene/Solr tests via Maven's surefire plugin, and populate your local repository with all artifacts, from the top level directory, run: {code} mvn install {code} When one Lucene/Solr module depends on another, the dependency is declared on the *artifact(s)* produced by the other module and deposited in your local repository, rather than on the other module's un-jarred compiler output in the {{build/}} directory, so you must run {{mvn install}} on the other module before its changes are visible to the module that depends on it. To create all the artifacts without running tests: {code} mvn -DskipTests install {code} I almost always include the {{clean}} phase when I do a build, e.g.: {code} mvn -DskipTests clean install {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12983162#action_12983162 ] Earwin Burrfoot commented on LUCENE-2657: - Thanks, but I'm not the one confused here. : ) Replace Maven POM templates with full POMs, and change documentation accordingly Key: LUCENE-2657 URL: https://issues.apache.org/jira/browse/LUCENE-2657 Project: Lucene - Java Issue Type: Improvement Components: Build Affects Versions: 3.1, 4.0 Reporter: Steven Rowe Assignee: Steven Rowe Fix For: 3.1, 4.0 Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch The current Maven POM templates only contain dependency information, the bare bones necessary for uploading artifacts to the Maven repository. The full Maven POMs in the attached patch include the information necessary to run a multi-module Maven build, in addition to serving the same purpose as the current POM templates. Several dependencies are not available through public maven repositories. A profile in the top-level POM can be activated to install these dependencies from the various {{lib/}} directories into your local repository. From the top-level directory: {code} mvn -N -Pbootstrap install {code} Once these non-Maven dependencies have been installed, to run all Lucene/Solr tests via Maven's surefire plugin, and populate your local repository with all artifacts, from the top level directory, run: {code} mvn install {code} When one Lucene/Solr module depends on another, the dependency is declared on the *artifact(s)* produced by the other module and deposited in your local repository, rather than on the other module's un-jarred compiler output in the {{build/}} directory, so you must run {{mvn install}} on the other module before its changes are visible to the module that depends on it. To create all the artifacts without running tests: {code} mvn -DskipTests install {code} I almost always include the {{clean}} phase when I do a build, e.g.: {code} mvn -DskipTests clean install {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Let's drop Maven Artifacts !
On Tue, Jan 18, 2011 at 17:00, Robert Muir rcm...@gmail.com wrote: On Tue, Jan 18, 2011 at 8:54 AM, Grant Ingersoll gsing...@apache.org wrote: It seems to me that if we have a fix for the things that ail our Maven support (Steve's work), then it isn't a reason for holding up a release and we should just keep them, as there are a significant number of users who consume Lucene that way (via the central repository). I agree that we should not switch our build system, but supporting the POMs is no different than supporting the IntelliJ/Eclipse generation tools (they are both problematic since they are not automated)

it's totally different in every way! we don't release the intellij/eclipse stuff, it's for internal use only. additionally, there are no release artifacts generated by these

Latest code from LUCENE-2657 does not generate any new artifacts. It uploads those you already have (built via ant) to the repo.

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Let's drop Maven Artifacts !
On Tue, Jan 18, 2011 at 20:13, Robert Muir rcm...@gmail.com wrote: Unfortunately there is a very loud minority that cares about maven I would wager that there is a sizable silent *majority* of users who literally depend on Lucene's Maven artifacts. I can't help but remind myself, this is the same argument Oracle offered up for the whole reason hudson debacle (http://hudson-labs.org/content/whos-driving-thing) Declaring that I have a secret pocket of users that want XYZ isn't open source consensus.

There is proof of existence for some unknown part of this secret pool: http://www.google.com/search?q=%22artifactid+lucene-core%22
Please, don't look at "About NNN results", these are known to be veeery approximate. Just page through. Some of the pages are Lucene poms themselves. Many of them are poms for the projects depending on lucene.

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2755) Some improvements to CMS
[ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982564#action_12982564 ] Earwin Burrfoot commented on LUCENE-2755: - bq. if you still want to work on it, the I can keep the issue open and mark it 3.2 (unless you want to give it a try in 3.1). I'll start another later, so please, go on. Some improvements to CMS Key: LUCENE-2755 URL: https://issues.apache.org/jira/browse/LUCENE-2755 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Shai Erera Priority: Minor Fix For: 3.1, 4.0 Attachments: LUCENE-2755.patch While running optimize on a large index, I've noticed several things that got me to read CMS code more carefully, and find these issues: * CMS may hold onto a merge if maxMergeCount is hit. That results in the MergeThreads taking merges from the IndexWriter until they are exhausted, and only then that blocked merge will run. I think it's unnecessary that that merge will be blocked. * CMS sorts merges by segments size, doc-based and not bytes-based. Since the default MP is LogByteSizeMP, and I hardly believe people care about doc-based size segments anymore, I think we should switch the default impl. There are two ways to make it extensible, if we want: ** Have an overridable member/method in CMS that you can extend and override - easy. ** Have OneMerge be comparable and let the MP determine the order (e.g. by bytes, docs, calibrate deletes etc.). Better, but will need to tap into several places in the code, so more risky and complicated. On the go, I'd like to add some documentation to CMS - it's not very easy to read and follow. I'll work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
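[Illustrative aside: the issue above suggests ordering pending merges by segment bytes rather than doc counts, possibly by making OneMerge comparable or letting the MergePolicy supply the order. The sketch below shows one way such an ordering could look; estimatedBytes() and pendingMerges are hypothetical placeholders, not actual Lucene APIs, and this is not the patch that was eventually committed.]
{code}
// Hedged sketch: order merges by an estimated byte size (smallest first), as the
// issue proposes. estimatedBytes(OneMerge) is a made-up helper.
Comparator<MergePolicy.OneMerge> byBytes = new Comparator<MergePolicy.OneMerge>() {
  public int compare(MergePolicy.OneMerge a, MergePolicy.OneMerge b) {
    long diff = estimatedBytes(a) - estimatedBytes(b);
    return diff < 0 ? -1 : diff > 0 ? 1 : 0;
  }
};
Collections.sort(pendingMerges, byBytes); // analogous to today's doc-count ordering, but bytes-based
{code}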
Re: Let's drop Maven Artifacts !
You're not alone. :) But, I bet, many more people would like to skip that step and have their artifacts downloaded from central.

On Mon, Jan 17, 2011 at 19:06, Steven A Rowe sar...@syr.edu wrote: On 1/17/2011 at 1:53 AM, Michael Busch wrote: I don't think any user needs the ability to run an ant target on Lucene's sources to produce maven artifacts

I want to be able to make modifications to the Lucene source, install Maven snapshot artifacts in my local repository, then depend on those snapshots from other projects. I doubt I'm alone. Steve

-- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Let's drop Maven Artifacts !
Maven is a defacto package/dependency manager for Java. Like it or not. All better tools out there, like Ant+Ivy, or SBT - support Maven repositories. Lots of people rely on Maven or better tools for their builds and as soon as you're on declarative dependency management train, it's a bother to just take a bunch of jars and stuff 'em into your project. Development tools (Eclipse/IDEA) support auto-downloading and attaching sources/javadocs for declared dependencies, and people use this. Well ... you raise interesting points. So if a committer would be willing to support GIT, RTC, and whatever (just making up scenarios), would we allow all of those to exist within Lucene? So, while having a wild contributor supporting .. dunno .. MacPorts package for Lucene is a bit crazy, and in the end - nobody will ever notice, supporting Maven broadens your audience and makes it happy (even those guys, who are not into Maven itself). I think the reasonable solution is to have a modules/maven package, with build.xml that generates whatever needs to be generated. Whoever cares about maven should run the proper Ant targets, just like whoever cares about Eclipse/IDEA can now run ant eclipse/idea. We'd have an ant maven. If that's what you intend doing in 2657 then fine. That should be some person amongst the committers, be it a part of default release process or not. I believe publishing Maven artefact is somewhat nontrivial for a person not related to the project in question. The release manager need not be concerned w/ Maven (or whatever) artifacts, they are not officially published anywhere, and everyone's happy. As long as all tests pass, the release is good to go. Is that better? Shai On Sun, Jan 16, 2011 at 8:05 PM, Steven A Rowe sar...@syr.edu wrote: -1 from me on dropping Maven artifacts. I find it curious that on the verge of fixing the broken Maven artifacts situation (LUCENE-2657), there is a big push for a divorce. Robert, I agree we should have a way to test the magic artifacts. I'm working on it. Your other objection is the work involved - you don't want to do it. I will do the work. We should not drop Maven support when there are committers willing to support it. I obviously count myself in that camp. Steve Robert Muir rcm...@gmail.com wrote: On Sun, Jan 16, 2011 at 12:03 PM, Shai Erera ser...@gmail.com wrote: Hey Wearing on my rebel hat today, I'd like to propose we drop maven support from our release process / build system. I've always read about the maven artifacts never being produced right, and never working (or maybe never is a too harsh word). I personally don't understand why we struggle to support Maven. I'm perfectly fine if we say that Lucene/Solr uses SVN, Ant and release a bunch of .jar files you can embed in your project. Who says we need to support Maven? And if so, why only Maven (I'm kidding !)? Are you with me? :) I am, the last time i suggested releasing 3.1, a 99-email thread about maven ensued that basically left me frustrated and not wanting to work towards a release. We still don't have a test-maven target that does even trivial verification of these magical artifacts that most of us don't understand... like any other functionality we have, we should have tests so that the release manager can verify things are working before the release. If we have a contrib thats unmaintained with no tests, would we let it block a release? I don't think we should let the maven problems hold lucene releases hostage. 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982437#action_12982437 ] Earwin Burrfoot commented on LUCENE-2374: - Nice. Except maybe introduce a simple interface instead of the Map<String, ?>?
{code}
interface AttributeReflector { // Name is crap, should be changed
  void reflect(String key, Object value);
}

void reflectWith(AttributeReflector reflector);
{code}
You have no need for fake maps then, both in toString(), and in user code.

Add introspection API to AttributeSource/AttributeImpl -- Key: LUCENE-2374 URL: https://issues.apache.org/jira/browse/LUCENE-2374 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 3.1, 4.0

AttributeSource/TokenStream inspection in Solr needs to have some insight into the contents of AttributeImpls. As LUCENE-2302 has some problems with toString() [which is not structured and conflicts with CharSequence's definition for CharTermAttribute], I propose a simple API that gets a default implementation in AttributeImpl (just like toString() currently): - Iterator<Map.Entry<String,?>> AttributeImpl.contentsIterator() returns an iterator (for most attributes it's a singleton) of key-value pairs, e.g. term -> foobar, startOffset -> Integer.valueOf(0), ... - AttributeSource gets the same method, it just concats the iterators of each getAttributeImplsIterator() AttributeImpl. No backwards problems occur, as the default toString() method will work like before (it just gets the iterator and lists), but we simply remove the documentation for the format. (Char)TermAttribute gets a special impl of toString() according to CharSequence and a corresponding iterator. I also want to remove the abstract hashCode() and equals() methods from AttributeImpl, as they are not needed and just create work for the implementor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
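[Illustrative aside: the comment above proposes a callback-style reflector instead of returning fake maps. The sketch below shows how that callback might be produced and consumed; the attribute class, field names and the AttributeSource.reflectWith() call are illustrative assumptions, not the API that was actually committed.]
{code}
// Hypothetical usage sketch of the proposed callback API (names made up for illustration).
class MyTermAttributeImpl /* extends AttributeImpl */ {
  private String term;

  public void reflectWith(AttributeReflector reflector) {
    reflector.reflect("term", term); // push key/value pairs instead of building a Map
  }
}

// A reflector that simply collects everything, e.g. for toString() or Solr inspection:
final Map<String, Object> contents = new LinkedHashMap<String, Object>();
attributeSource.reflectWith(new AttributeReflector() {
  public void reflect(String key, Object value) {
    contents.put(key, value);
  }
});
{code}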
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982126#action_12982126 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Any comments about removing write access from IndexReaders? I think setNorms() will be removed soon, but how about the others like deleteDocument()? I would propose to also make all IndexReaders simply readers not writers? Voting with all my extremities - yes!! Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
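[Illustrative aside: a rough sketch of the atomic/composite split the issue describes. Class names follow the issue text; the method signatures are illustrative assumptions only, not the API that later shipped.]
{code}
// Hedged sketch of the proposed split - signatures are made up for illustration.
abstract class AtomicReader /* extends IndexReader */ {
  abstract Fields fields();        // postings/terms access lives only on atomic readers
  abstract Bits getDeletedDocs();
}

abstract class CompositeReader /* extends IndexReader */ {
  abstract List<AtomicReader> getSequentialSubReaders(); // essentially a collection of atomic readers
  // no fields()/terms() here - callers must wrap, e.g. with SlowMultiReaderWrapper
}
{code}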
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982132#action_12982132 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Still, i think we would need this method (somewhere) even with CSF, so that people can change the norms and they instantly take effect for searches. This still puzzles me. I can strain my imagination, and get people who just need to change norms without reindexing. But doing this and *requiring* instant turnaround? Kid me not :) Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12982166#action_12982166 ] Earwin Burrfoot commented on LUCENE-2858: - APIs have to be there still. All that commity, segment-deletery, mutabley stuff (that spans both atomic and composite readers). So, while your plan is viable, it won't remove that much cruft. Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981774#action_12981774 ] Earwin Burrfoot commented on LUCENE-2868: - We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than the default visitor pattern in Java. Usage looks roughly like:
{code}
class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of walk result
  String visit(TermQuery q) {
    return "{term: " + q.getTerm() + "}";
  }

  String visit(BooleanQuery q) {
    StringBuffer buf = new StringBuffer();
    buf.append("{boolean: ");
    for (BooleanQuery.Clause clause : q.clauses()) {
      buf.append(dispatch(clause.getQuery())).append(", "); // Here we ...
    }
    buf.append("}");
    return buf.toString();
  }

  String visit(SpanQuery q) {
    // Runs for all SpanQueries ...
  }

  String visit(Query q) {
    // Runs for all Queries not covered by a more exact visit() method ...
  }
}

Query query = ...;
String stringRepresentation = new ToStringWalker().dispatch(query);
{code}
dispatch() checks its parameter's runtime type, picks the closest visit() overload (according to Java's rules for compile-time overloaded method resolution), and invokes it.

It should be easy to make use of TermState; rewritten queries should be shared automatically Key: LUCENE-2868 URL: https://issues.apache.org/jira/browse/LUCENE-2868 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Karl Wright Attachments: query-rewriter.patch

When you have the same query in a query hierarchy multiple times, tremendous savings can now be had if the user knows enough to share the rewritten queries in the hierarchy, due to the TermState addition. But this is clumsy and requires a lot of coding by the user to take advantage of. Lucene should be smart enough to share the rewritten queries automatically. This can be most readily (and powerfully) done by introducing a new method to Query.java: Query rewriteUsingCache(IndexReader indexReader) ... and including a caching implementation right in Query.java which would then work for all. Of course, all callers would want to use this new method rather than the current rewrite(). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
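[Illustrative aside: the comment above describes how dispatch() resolves the closest visit() overload by runtime type. Below is a much-simplified sketch of how such a reflection-based dispatcher could work; it walks only the superclass chain, assumes the visit() overloads are public, and is not the actual implementation being offered in the comment.]
{code}
import java.lang.reflect.Method;

// Simplified, hedged sketch of a reflection-based dispatcher. Interface-based
// resolution and caching of resolved methods are omitted for brevity.
public abstract class DispatchingVisitor<R> {
  @SuppressWarnings("unchecked")
  public R dispatch(Object node) {
    for (Class<?> c = node.getClass(); c != null; c = c.getSuperclass()) {
      try {
        Method m = getClass().getMethod("visit", c); // requires public visit() overloads
        return (R) m.invoke(this, node);
      } catch (NoSuchMethodException e) {
        // no overload for this exact type - try the superclass next
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
    throw new IllegalArgumentException("No visit() overload for " + node.getClass());
  }
}
{code}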
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12981388#action_12981388 ] Earwin Burrfoot commented on LUCENE-2324: - Maan, this comment list is infinite. How do I currently get the ..er.. current version? Latest branch + latest Jason's patch? Regardless of everything else, I'd ask you not to extend random things :) at least if you can't say is-a about them. DocumentsWriterPerThreadPool.ThreadState IS A ReentrantLock? No. So you're better off encapsulating it rather than extending. Same can be applied to SegmentInfos that extends Vector :/ Per thread DocumentsWriters that write their own private segments - Key: LUCENE-2324 URL: https://issues.apache.org/jira/browse/LUCENE-2324 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: Realtime Branch Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out See LUCENE-2293 for motivation and more details. I'm copying here Mike's summary he posted on 2293: Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores and normal segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
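[Illustrative aside: the "encapsulate, don't extend" point above, shown as a small hedged sketch. The field and method names are made up; this is not the actual DocumentsWriterPerThreadPool code.]
{code}
import java.util.concurrent.locks.ReentrantLock;

// Instead of:  class ThreadState extends ReentrantLock { ... }
// prefer something like:
final class ThreadState {
  private final ReentrantLock lock = new ReentrantLock();

  void lock()       { lock.lock(); }
  void unlock()     { lock.unlock(); }
  boolean tryLock() { return lock.tryLock(); }
  // ... per-thread indexing state goes here, without exposing the full Lock API,
  // because a ThreadState is not a lock - it merely uses one.
}
{code}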
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980649#action_12980649 ] Earwin Burrfoot commented on LUCENE-2793: - What's with the ongoing craziness? :)
bq. DirectIOLinuxDirectory
First you introduce a kind of directory that is utterly useless except in certain special situations. Then, instead of fixing the directory/folding its code somewhere normal, you try to work around it by switching between directories. What's the point of using abstract classes or interfaces, if you leak their implementation's logic all over the place? Or making DIOLD wrap something. Yeah! Wrap my RAMDir!
bq. bufferSize
This value is only meaningful to a certain subset of Directory implementations. So the only logical place we want to see this value set - is these very impls. Sample code:
{code}
Directory ramDir = new RAMDirectory();
ramDir.createIndexInput(name, context); // See, ma? No bufferSizes, they are pointless for RAMDir

Directory fsDir = new NIOFSDirectory();
fsDir.setBufferSize(IOContext.NORMAL_READ, 1024);
fsDir.setBufferSize(IOContext.MERGE, 4096);
fsDir.createIndexInput(name, context)
// See, ma? The only one who's really concerned with 'actual' buffer size is this concrete Directory impl
// All client code is only concerned with the context.
// It's NIOFSDirectory's business to give meaningful interpretation for IOContext and assign the buffer sizes.
{code}
You don't need custom Directory impls to make DIOLD work, you should freakin' fix it. The proper way is to test out the things, and then move DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not. You don't need custom Directory impls to set buffer sizes (nor cast to BufferedIndexInput!), you should add the setting to these Directories, which make sense of it.

Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch

Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980732#action_12980732 ] Earwin Burrfoot commented on LUCENE-2793: - bq. Because in your example code above, it looks like it's added to Directory itself. bq. My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself. Ohoho. My fault, sorry. It should look like: {code} RAMDirectory ramDir = new RAMDirectory(); ramDir.setBufferSize(whatever) // Compilation error! ramDir.createIndexInput(name, context); NIOFSDirectory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) {code} Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980736#action_12980736 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it... but I can't see myself thinking that we should put this logic into fsdir itself, sorry. {quote} I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core. What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980388#action_12980388 ] Earwin Burrfoot commented on LUCENE-2858: - bq. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? There is a freakload of places that upgrade SegmentReader in various ways, with deletions guilty only for the part of the cases. I'll try getting back to LUCENE-2355 at the end of the week. Separate SegmentReaders (and other atomic readers) from composite IndexReaders -- Key: LUCENE-2858 URL: https://issues.apache.org/jira/browse/LUCENE-2858 Project: Lucene - Java Issue Type: Task Reporter: Uwe Schindler Fix For: 4.0 With current trunk, whenever you open an IndexReader on a directory you get back a DirectoryReader which is a composite reader. The interface of IndexReader has now lots of methods that simply throw UOE (in fact more than 50% of all methods that are commonly used ones are unuseable now). This confuses users and makes the API hard to understand. This issue should split atomic readers from reader collections with a separate API. After that, you are no longer able, to get TermsEnum without wrapping from those composite readers. We currently have helper classes for wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or Multi*), those should be retrofitted to implement the correct classes (SlowMultiReaderWrapper would be an atomic reader but takes a composite reader as ctor param, maybe it could also simply take a ListAtomicReader). In my opinion, maybe composite readers could implement some collection APIs and also have the ReaderUtil method directly built in (possibly as a view in the util.Collection sense). In general composite readers do not really need to look like the previous IndexReaders, they could simply be a collection of SegmentReaders with some functionality like reopen. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? - maybe because of deletions thats not the best idea, but we should investigate. Maybe make the whole reopen logic simplier to use (ast least on the collection reader level). We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980390#action_12980390 ] Earwin Burrfoot commented on LUCENE-2856: - A CompositeSegmentListener niftily removes the need for collection. Create IndexWriter event listener, specifically for merges -- Key: LUCENE-2856 URL: https://issues.apache.org/jira/browse/LUCENE-2856 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 4.0 Reporter: Jason Rutherglen Attachments: LUCENE-2856.patch The issue will allow users to monitor merges occurring within IndexWriter using a callback notifier event listener. This can be used by external applications such as Solr to monitor large segment merges. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980400#action_12980400 ] Earwin Burrfoot commented on LUCENE-2793: - Looks crazy. In a -bad- tangled way. You get IOFactory from Directory, put into IOContext, and then invoke it, passing it (wow!) an IOContext and a Directory. What if you pass totally different Directory? Different IOContext? It blows up eerily. And there's no justification for this - we already have an IOFactory, it's called Directory! It just needs an extra parameter on its factory methods (createInput/Output), that's all. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980448#action_12980448 ] Earwin Burrfoot commented on LUCENE-2856: - A SegmentListener that has a number of children SLs and delegates eventHappened() calls to them. Create IndexWriter event listener, specifically for merges -- Key: LUCENE-2856 URL: https://issues.apache.org/jira/browse/LUCENE-2856 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 4.0 Reporter: Jason Rutherglen Attachments: LUCENE-2856.patch The issue will allow users to monitor merges occurring within IndexWriter using a callback notifier event listener. This can be used by external applications such as Solr to monitor large segment merges. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
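[Illustrative aside: the composite-listener idea described above, as a hedged sketch. The interface name and eventHappened() method are taken loosely from the comments; they are not the actual patch API.]
{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hedged sketch: a listener that fans events out to its children, so IndexWriter
// only ever needs to hold a single listener reference instead of a collection.
interface SegmentListener {
  void eventHappened(String event);
}

class CompositeSegmentListener implements SegmentListener {
  private final List<SegmentListener> children = new CopyOnWriteArrayList<SegmentListener>();

  void add(SegmentListener l) { children.add(l); }

  public void eventHappened(String event) {
    for (SegmentListener l : children) {
      l.eventHappened(event); // just delegate to every child
    }
  }
}
{code}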
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980454#action_12980454 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} bq. You get IOFactory from Directory That's for the default, the main use is the static IOFactory class. {quote} You lost me here. If you got A from B, you don't have to pass B again to invoke A, if you do - that's 99% a design mistake. But still, my point was that you don't need IOFactory at all. bq. Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations. What Directories are you trying to intermix? What for? I thought the only thing done in that issue is an attempt to give Directory hints as to why we're going to open its streams. A simple enum IOContext and extra parameter on createOutput/Input would suffice. But with Lucene's micromanagement attitude, an enum turns into slightly more complex thing, with bufferSizes and whatnot. Still - no need for mixing Directories. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12980458#action_12980458 ] Earwin Burrfoot commented on LUCENE-2793: - In fact, I suggest dropping bufferSize altogether. As far as I can recall, it was introduced as a precursor to IOContext and can now be safely replaced. Even if we want to give user control over buffer size for all streams, or only those opened in specific IOContext, he can pass these numbers as config parameters to his Directory impl. That makes total sense, as: 1. IndexWriter/IndexReader couldn't care less about buffer sizes, it just passes them to the Directory. It's not their concern. 2. A bunch of Directories doesn't use said bufferSize at all, making this parameter not only private Directory affairs, but even further - implementation-specific. So my bet is - introduce IOContext as a simple Enum, change bufferSize parameter on createInput/Output to IOContext, done. Directory createOutput and openInput should take an IOContext - Key: LUCENE-2793 URL: https://issues.apache.org/jira/browse/LUCENE-2793 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Attachments: LUCENE-2793.patch Today for merging we pass down a larger readBufferSize than for searching because we get better performance. I think we should generalize this to a class (IOContext), which would hold the buffer size, but then could hold other flags like DIRECT (bypass OS's buffer cache), SEQUENTIAL, etc. Then, we can make the DirectIOLinuxDirectory fully usable because we would only use DIRECT/SEQUENTIAL during merging. This will require fixing how IW pools readers, so that a reader opened for merging is not then used for searching, and vice/versa. Really, it's only all the open file handles that need to be different -- we could in theory share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
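[Illustrative aside: the "simple enum" proposal above, sketched out. This is illustrative only - the constant names, the bufferSizeFor() helper, and the signatures are assumptions, not the IOContext class that eventually landed in Lucene.]
{code}
// Hedged sketch of IOContext as a plain enum replacing the bufferSize parameter.
enum IOContext { READ, MERGE, FLUSH }

// Directory factory methods would then take the context instead of a buffer size, e.g.:
//   public abstract IndexInput openInput(String name, IOContext context) throws IOException;
//   public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;

// A concrete impl decides what the context means; e.g. a hypothetical buffered FSDirectory:
int bufferSizeFor(IOContext ctx) {
  return ctx == IOContext.MERGE ? 4096 : 1024; // merges read bigger sequential chunks
}
{code}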
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979522#action_12979522 ] Earwin Burrfoot commented on LUCENE-2312: - Some questions to align myself with impending reality. Is that right that future RT readers are no longer immutable snapshots (in a sense that they have variable maxDoc)? If it is so, are you keeping current NRT mode, with fast turnaround, yet immutable readers? Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: Realtime Branch Reporter: Jason Rutherglen Assignee: Michael Busch Fix For: Realtime Branch Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch In order to offer user's near realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Todays Lucene based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2474) Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)
[ https://issues.apache.org/jira/browse/LUCENE-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979888#action_12979888 ] Earwin Burrfoot commented on LUCENE-2474: - bq. Earwin's working on improving this, I think, under LUCENE-2355 I stalled, and then there were just so many changes under trunk, so I have to restart now :) Thanks for another kick. Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey) Key: LUCENE-2474 URL: https://issues.apache.org/jira/browse/LUCENE-2474 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shay Banon Attachments: LUCENE-2474.patch, LUCENE-2474.patch Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey). A spin of: https://issues.apache.org/jira/browse/LUCENE-2468. Basically, its make a lot of sense to cache things based on IndexReader#getFieldCacheKey, even Lucene itself uses it, for example, with the CachingWrapperFilter. FieldCache enjoys being called explicitly to purge its cache when possible (which is tricky to know from the outside, especially when using NRT - reader attack of the clones). The provided patch allows to plug a CacheEvictionListener which will be called when the cache should be purged for an IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
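[Illustrative aside: the eviction-listener idea from the issue, as a hedged sketch. The CacheEvictionListener interface and MyFilterCache class are made up for illustration; only the cache-key idea (IndexReader.getFieldCacheKey()) comes from the issue text.]
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch: a custom cache keyed on the reader's cache key that purges
// eagerly when notified, instead of waiting for GC / weak references.
interface CacheEvictionListener {
  void onClose(Object coreCacheKey);
}

class MyFilterCache implements CacheEvictionListener {
  private final Map<Object, Object> cache = new ConcurrentHashMap<Object, Object>();

  Object get(IndexReader reader) {
    return cache.get(reader.getFieldCacheKey());
  }

  public void onClose(Object coreCacheKey) {
    cache.remove(coreCacheKey); // eager cleanup when the reader goes away
  }
}
{code}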
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979276#action_12979276 ] Earwin Burrfoot commented on LUCENE-2840: - bq. But doesn't that mean that an app w/ rare queries but each query is massive fails to use all available concurrency? Yes. But that's not my case. And likely not someone else's. I think if you want to be super-generic, it's better to defer exact threading to the user, instead of doing a one-size-fits-all solution. Else you risk conjuring another ConcurrentMergeScheduler. While we're at it, we can throw in some sample implementation, which can satisfy some of the users, but not everyone. Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher) --- Key: LUCENE-2840 URL: https://issues.apache.org/jira/browse/LUCENE-2840 Project: Lucene - Java Issue Type: Sub-task Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 4.0 Spin-off from parent issue: {quote} We should discuss about how many threads should be spawned. If you have an index with many segments, even small ones, I think only the larger segments should be separate threads, all others should be handled sequentially. So maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then only spawn maxThreads-1 threads for the bigger readers and then one additional thread for the rest? {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
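[Illustrative aside: "defer exact threading to the user" could look roughly like the sketch below - the caller owns the ExecutorService and decides how to split the work per segment. searchSegment(), merge(), subReaders, query and numHits are all hypothetical placeholders, not Lucene APIs.]
{code}
// Hedged sketch of a caller-side threading policy: one task per (large) segment.
ExecutorService pool = Executors.newFixedThreadPool(4);
List<Future<TopDocs>> parts = new ArrayList<Future<TopDocs>>();
for (final IndexReader segment : subReaders) {        // e.g. only the larger segments
  parts.add(pool.submit(new Callable<TopDocs>() {
    public TopDocs call() throws Exception {
      return searchSegment(segment, query, numHits);  // hypothetical per-segment search
    }
  }));
}
TopDocs merged = merge(parts);                        // hypothetical combination of partial results
{code}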
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979277#action_12979277 ] Earwin Burrfoot commented on LUCENE-2843: - And we're nearing a day when we keep the whole term dictionary in memory (as Sphinx does for instance). At that point a gazillion of term lookup-related hacks (like lookup cache) become obsolete :) Term dictionary itself can also be memory-mapped after this, instead of being read and built from disk, which makes new segment opening near-instantaneous. Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
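[Illustrative aside: "memory-mapping the term dictionary" in the comment above essentially means establishing a mapping rather than reading and rebuilding structures on open. A minimal hedged illustration with plain java.nio follows; the file name is made up and this is not how any Lucene codec actually loads its terms index.]
{code}
// Hedged sketch: mapping an (imaginary) terms-index file read-only. Opening is
// near-instant because the OS pages the data in lazily and can share the cache.
RandomAccessFile raf = new RandomAccessFile("_1.tiv", "r");
try {
  MappedByteBuffer terms =
      raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
  // ... walk / binary-search the mapped image directly ...
} finally {
  raf.close();
}
{code}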
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979305#action_12979305 ] Earwin Burrfoot commented on LUCENE-2843: - As I said, there's already a search server with strictly in-memory (in mmap sense. it can theoretically be paged out) terms dict AND widespread adoption. Their users somehow manage. My guess is that's because people with insane number of terms store various crap like unique timestamps as terms. With CSF (attributes in Sphinx lingo), and some nice filters that can work over CSF, there's no longer any need to stuff your timestamps in the same place you stuff your texts. That can be reflected in documentation, and then, suddenly, we can drop on-disk only support. Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979306#action_12979306 ] Earwin Burrfoot commented on LUCENE-2840: - A lot of fork-join type frameworks don't even care. Even though scheduling threads is something people supposedly use them for. Why? I guess that's due to low yield/cost ratio. You frequently quote progress, not perfection in relation to the code, but why don't we apply this same principle to our threading guarantees? I don't want to use allowed concurrency fully. That's not realistic. I want 85% of it. That's already a huge leap ahead of single-threaded searches. Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher) --- Key: LUCENE-2840 URL: https://issues.apache.org/jira/browse/LUCENE-2840 Project: Lucene - Java Issue Type: Sub-task Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 4.0 Spin-off from parent issue: {quote} We should discuss about how many threads should be spawned. If you have an index with many segments, even small ones, I think only the larger segments should be separate threads, all others should be handled sequentially. So maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then only spawn maxThreads-1 threads for the bigger readers and then one additional thread for the rest? {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979346#action_12979346 ] Earwin Burrfoot commented on LUCENE-2843: - bq. I don't like the reasoning that, just because sphinx does it and their 'users manage', that makes it ok. I'm in no way advocating it as an all-round better solution. It has its wrinkles, just like anything else. My reasoning is merely that an alternative exists, and it is viable. As proven by pretty high-profile users. They have a memory-resident term dictionary, and it works; I have heard no complaints regarding this, ever. bq. sphinx also requires mysql Have you read anything at all? It has an integration ready, for the layman user who just wants to stick fulltext search into their little app, but it is in no way reliant on it. Sphinx is a direct alternative to Solr. {quote} But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good decisions on what gets swapped out is risky - Lucene is better informed than the OS on which data structures are worth spending RAM on (norms, terms index, field cache, del docs). If indeed the terms dict (thanks to FSTs) becomes small enough to fit in RAM, then we should load it into RAM (and do away w/ the terms index). {quote} That's a bit delusional. If a system is forced to swap out, it'll swap your explicitly managed RAM just as readily as memory-mapped files. I've seen this countless times. But then, you get a number of benefits - like sharing the filesystem cache when opening the same file multiple times, offloading things from the Java heap (which is almost always a good thing), and the fastest load-into-memory times possible. Sorry if I sound offensive at times, but, damn, there's a whole world of simple and efficient code lying ahead in that direction :) Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
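For reference, the mmap-style setup being argued for in the comment above is roughly the following sketch; the index path is invented, and whether the terms dict actually stays resident is entirely up to the OS.

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.MMapDirectory;

class MMapOpen {
  public static void main(String[] args) throws Exception {
    // The index files live in the filesystem cache: multiple readers over the
    // same files share one copy of the cached pages, and nothing beyond the
    // usual Lucene structures is held on the Java heap.
    MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir, true /* readOnly */);
    System.out.println("maxDoc=" + reader.maxDoc());
    reader.close();
  }
}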
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979366#action_12979366 ] Earwin Burrfoot commented on LUCENE-2843: - bq. Nope, havent looked at their code... i think i stopped at the documentation when i saw how they analyzed text! All my points are contained within their documentation. No need to look at the code (it's as shady as Lucene's). In the same manner, Lucene had crappy analysis for years, until you took hold of the (Unicode) police baton. So let's not let color differences between our analyzers affect our judgement on other parts of ours : ) bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows machine with whatever we default to. I'm questioning whether there is any legitimate, adequate reason to have that many terms. I agree on the mmap+32bit/mmap+windows point for a reasonable number of terms though :/ A hybrid solution, with the term dict being loaded completely into memory (either via mmap, or into arrays) on a per-field basis, is probably best in the end, however sad it may be. Add variable-gap terms index impl. -- Key: LUCENE-2843 URL: https://issues.apache.org/jira/browse/LUCENE-2843 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2843.patch, LUCENE-2843.patch PrefixCodedTermsReader/Writer (used by all real core codecs) already supports pluggable terms index impls. The only impl we have now is FixedGapTermsIndexReader/Writer, which picks every Nth (default 32) term and holds it in efficient packed int/byte arrays in RAM. This is already an enormous improvement (RAM reduction, init time) over 3.x. This patch adds another impl, VariableGapTermsIndexReader/Writer, which lets you specify an arbitrary IndexTermSelector to pick which terms are indexed, and then uses an FST to hold the indexed terms. This is typically even more memory efficient than packed int/byte arrays, though, it does not support ord() so it's not quite a fair comparison. I had to relax the terms index plugin api for PrefixCodedTermsReader/Writer to not assume that the terms index impl supports ord. I also did some cleanup of the FST/FSTEnum APIs and impls, and broke out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor when the FST is used as a terms index but seekCeil when it's holding all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: [jira] Commented: (SOLR-2218) Performance of start= and rows= parameters are exponentially slow with large data sets
On Mon, Jan 3, 2011 at 18:18, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Nov 11, 2010 at 3:22 PM, Jan Høydahl / Cominvent jan@cominvent.com wrote: The problem with a large start is probably worse when sharding is involved. Anyone know how the shard component goes about fetching start=100&rows=10 from say 10 shards? Does it have to merge sorted lists of 1mill+10 doc ids from each shard, which is the worst case? Yep, that's how it works today. Technically, if your docs have an unbiased (with regard to their sort value) distribution across shards, you can fetch far fewer than topN docs from each shard. I played with the idea, and it worked for me. Though later I dropped the optimization, as it complicated things somewhat and my users aren't querying gazillions of docs often. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
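A back-of-the-envelope sketch of the per-shard fetch reduction described above; the safety margin here is an arbitrary illustrative choice, not what Solr or any other system actually uses.

final class ShardFetchMath {
  // How many docs to request from each of `shards` shards when the client wants
  // rows [start, start+rows) of the merged result, assuming sort values are
  // spread evenly (unbiased) across shards. safetyFactor adds slack for skew.
  static int perShardFetch(int start, int rows, int shards, double safetyFactor) {
    int topN = start + rows;
    double expectedSharePerShard = (double) topN / shards;
    return (int) Math.ceil(expectedSharePerShard * safetyFactor) + rows;
  }

  public static void main(String[] args) {
    // e.g. start=1000000, rows=10, 10 shards, 1.2x slack:
    // 120012 docs requested per shard instead of 1000010.
    System.out.println(perShardFetch(1000000, 10, 10, 1.2));
  }
}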
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12976027#action_12976027 ] Earwin Burrfoot commented on LUCENE-2840: - I use the following scheme: * There is a fixed pool of threads shared by all searches, that limits total concurrency. * Each new search apprehends at most a fixed number of threads from this pool (say, 2-3 of 8 in my setup), * and these threads churn through segments as through a queue (in maxDoc order, but I think even that is unnecessary). No special smart binding between threads and segments (eg. 1 thread for each biggie, 1 thread for all of the small ones) - means simpler code, and zero possibility of stalling, when there are threads to run, segments to search, but binding policy does not connect them. Using fewer threads per-search than total available is a precaution against biggie searches blocking fast ones. Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher) --- Key: LUCENE-2840 URL: https://issues.apache.org/jira/browse/LUCENE-2840 Project: Lucene - Java Issue Type: Sub-task Components: Search Reporter: Uwe Schindler Priority: Minor Fix For: 4.0 Spin-off from parent issue: {quote} We should discuss about how many threads should be spawned. If you have an index with many segments, even small ones, I think only the larger segments should be separate threads, all others should be handled sequentially. So maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then only spawn maxThreads-1 threads for the bigger readers and then one additional thread for the rest? {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
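A rough Java sketch of the scheme described in the comment above; the pool sizes are invented, and each task stands in for "search one segment for this query".

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class PooledSearcher {
  private static final int POOL_SIZE = 8;           // total concurrency, shared by all searches
  private static final int THREADS_PER_SEARCH = 3;  // cap for any single search
  private final ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);

  <R> List<R> search(final List<Callable<R>> perSegmentWork) throws Exception {
    final AtomicInteger cursor = new AtomicInteger();  // segments act as a simple queue
    List<Future<List<R>>> workers = new ArrayList<Future<List<R>>>();
    for (int i = 0; i < THREADS_PER_SEARCH; i++) {
      workers.add(pool.submit(new Callable<List<R>>() {
        public List<R> call() throws Exception {
          List<R> partial = new ArrayList<R>();
          int s;
          while ((s = cursor.getAndIncrement()) < perSegmentWork.size()) {
            partial.add(perSegmentWork.get(s).call());   // churn through segments, no fixed binding
          }
          return partial;
        }
      }));
    }
    List<R> merged = new ArrayList<R>();
    for (Future<List<R>> w : workers) {
      merged.addAll(w.get());                            // merge per-worker partial results
    }
    return merged;
  }
}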
Re: strange problem of PForDelta decoder
until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea (I have posted about it before). Do you have some resources, such as papers or tech articles, about it? I have tried it, but it would need to modify the index format dramatically, and we use Solr distributed search to relieve the response-time problem, so I finally gave it up. Lucene 4's index format is more flexible in that it supports custom codecs, and it's now under development, so I think it's a good time to consider letting it support multithreaded searching for a single query. I have a naive solution: divide the docList into many groups, e.g. grouping docIds by whether they are even or odd: term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 Then we can use 2 threads to search the topN docs on the even group and the odd group, and finally merge their results into a single one, just like Solr distributed search. But it's better than Solr distributed search. First, it's in a single process, and data communication between threads is much faster than the network. Second, each thread processes the same number of documents. For Solr distributed search, one shard may process 7 documents and another shard may process 1 document. Even if we can make each shard have the same number of documents, we cannot make it uniform for each term. E.g. shard1 has doc1 doc2 and shard2 has doc3 doc4, but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4. We may change it so shard1 has doc1 doc3 and shard2 has doc2 doc4; that's good for term1 and term2, but term3 may occur in doc1 and doc3... So I think this is fine-grained distribution within the index, while Solr distributed search is coarse-grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bent on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it into even chunks, and then doing skipTo to position your iterators at the beginning of each chunk). The first approach is really easy to implement. The second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over a mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over the previous solution. Sharding posting lists requires a great deal of weird hacks, makes the index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
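For the second, hackier option mentioned above (chunking a single segment), the per-chunk loop might look roughly like this; PartialHits is a made-up result sink, and creating one Scorer per worker for the same query over the same segment is assumed to happen elsewhere.

import java.io.IOException;

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Scorer;

// Hypothetical sink for one thread's partial results (not Lucene's Collector).
interface PartialHits {
  void hit(int doc, float score);
}

// One worker scores the doc-id range [chunkStart, chunkEnd) of a single segment.
final class ChunkScorer {
  static void scoreChunk(Scorer scorer, int chunkStart, int chunkEnd, PartialHits out)
      throws IOException {
    int doc = scorer.advance(chunkStart);   // the "skipTo" step: jump to the chunk's first doc
    while (doc != DocIdSetIterator.NO_MORE_DOCS && doc < chunkEnd) {
      out.hit(doc, scorer.score());
      doc = scorer.nextDoc();
    }
  }
}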
Re: is the classes ended with PerThread(*PerThread) multithread
There is a single index chain, with a single instance of each chain component, except those ending in -PerThread. Though that's gonna change with https://issues.apache.org/jira/browse/LUCENE-2324 On Tue, Dec 28, 2010 at 13:10, Simon Willnauer simon.willna...@googlemail.com wrote: On Tue, Dec 28, 2010 at 10:57 AM, xu cheng xcheng@gmail.com wrote: hi simon thanks for replying very much. after reading the source code with your suggestion, here's my understanding, and I don't know whether it's right: the DocumentsWriter actually doesn't create threads, but the code that uses DocumentsWriter can do the multithreading (say, several threads call updateDocument), and each thread has its DocumentsWriterThreadState; meanwhile, each DocumentsWriterThreadState has its own objects (the *PerThread ones such as DocFieldProcessorPerThread, DocInverterPerThread and so on). As the methods of DocumentsWriter are called by multiple threads, for example 4 threads, there are 4 DocumentsWriterThreadState objects and 4 index chains (each index chain has its own *PerThread objects to process the document). am I right?? that sounds about right simon thanks for replying again! 2010/12/28 Simon Willnauer simon.willna...@googlemail.com Hey there, so what you are looking at are classes that are created per thread rather than shared with other threads. Lucene internally rarely creates threads or subclasses Thread, Runnable or Callable (ParallelMultiSearcher is an exception, or some of the merging code). Yet, inside the indexer, when you add (update) a document Lucene utilizes the caller's thread rather than spawning a new one. When you look at DocumentsWriter.java there should be a method called getThreadState. Each indexing thread, let's say in updateDocument, gets its thread-private DocumentsWriterThreadState. This thread state holds a DocConsumerPerThread obtained from the DocumentsWriter's DocConsumer (see the indexing chain). DocConsumerPerThread in that case is some kind of decorator that holds other DocConsumerPerThread instances like TermsHashPerThread etc. The general pattern is: for each DocConsumer you can get a DocConsumerPerThread for your indexing thread, which then consumes the document you are processing right now. I hope that helps simon On Tue, Dec 28, 2010 at 4:19 AM, xu cheng xcheng@gmail.com wrote: hi all: I'm new to dev these days. I'm reading the source code in the index package and I was confused. there are classes with the suffix PerThread, such as DocFieldProcessorPerThread, DocInverterPerThread, TermsHashPerThread, FreqProxTermWriterPerThread. in this mailing-list, I was told that they are multithreaded. however, there are some difficulties for me to understand! I see no sign that they inherit from Thread, or implement Runnable, or something else?? how do they map to OS threads?? thanks ^_^ - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
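A stripped-down illustration of the pattern Simon describes above; the class names are invented, and the real DocumentsWriter bounds and reuses its thread states rather than keeping a plain per-thread map as this sketch does.

import java.util.HashMap;
import java.util.Map;

// Nothing here extends Thread: the indexing thread that calls addDocument()
// simply gets its own per-thread state and does the work itself.
final class SharedConsumer {
  private final Map<Thread, PerThreadConsumer> states = new HashMap<Thread, PerThreadConsumer>();

  synchronized PerThreadConsumer getThreadState() {
    PerThreadConsumer state = states.get(Thread.currentThread());
    if (state == null) {                       // lazily create one state per calling thread
      state = new PerThreadConsumer();
      states.put(Thread.currentThread(), state);
    }
    return state;
  }

  void addDocument(Object doc) {
    getThreadState().processDocument(doc);     // all work happens on the caller's thread
  }
}

final class PerThreadConsumer {
  void processDocument(Object doc) {
    // invert fields, buffer postings, etc. -- private to this thread, so no locking needed
  }
}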
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974274#action_12974274 ] Earwin Burrfoot commented on LUCENE-2829: - Term lookup misses can be alleviated by a simple Bloom Filter. No caching misses required, helps both PK and near-PK queries. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
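A toy sketch of the Bloom-filter idea above; sizing, hash quality, and how it would be wired into the terms dictionary are all left open, and this is not any actual Lucene implementation.

import java.util.BitSet;

// Consulted before seeking the terms dict: if mightContain() returns false,
// the term is definitely absent from the segment and the seek is skipped.
final class TermBloomFilter {
  private final BitSet bits;
  private final int numBits;

  TermBloomFilter(int numBits) {
    this.numBits = numBits;
    this.bits = new BitSet(numBits);
  }

  void add(String term) {
    bits.set(index(term, 0x9E3779B9));
    bits.set(index(term, 0x85EBCA6B));
  }

  // false => definitely not present, skip the term-dict seek;
  // true  => maybe present, fall through to the normal lookup.
  boolean mightContain(String term) {
    return bits.get(index(term, 0x9E3779B9)) && bits.get(index(term, 0x85EBCA6B));
  }

  private int index(String term, int seed) {
    int h = term.hashCode() * seed;
    h ^= h >>> 16;                       // cheap mixing, good enough for a sketch
    return (h & 0x7FFFFFFF) % numBits;
  }
}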
[jira] Commented: (LUCENE-2829) improve termquery pk lookup performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12974350#action_12974350 ] Earwin Burrfoot commented on LUCENE-2829: - Nobody halts your progress, we're merely discussing. I, on the other hand, have a feeling that Lucene is overflowing with single incremental improvements aka hacks, as they are easier and faster to implement than trying to get a bigger picture, and, yes, rebuilding everything :) For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Whether we implement bloom filters, or just guarantee to keep the whole term dict in memory with reasonable lookup routine (eg. as FST). Having said that, I reiterate, I'm not here to stop you or turn this issue into something else. improve termquery pk lookup performance - Key: LUCENE-2829 URL: https://issues.apache.org/jira/browse/LUCENE-2829 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Robert Muir Attachments: LUCENE-2829.patch For things that are like primary keys and don't exist in some segments (worst case is primary/unique key that only exists in 1) we do wasted seeks. While LUCENE-2694 tries to solve some of this issue with TermState, I'm concerned we could every backport that to 3.1 for example. This is a simpler solution here just to solve this one problem in termquery... we could just revert it in trunk when we resolve LUCENE-2694, but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: RT branch status
Cool! I'm getting to this on a weekend. On Tue, Dec 21, 2010 at 11:44, Michael Busch busch...@gmail.com wrote: After merging trunk into the RT branch it's finally compiling again and up-to-date. Several tests are failing now after the merge (43 out of 1427 are failing), which is not too surprising, because so many things have changed (segment-deletes, flush control, termsHash refactoring, removal of doc stores, etc). Especially IndexWriter and DocumentsWriter are in a somewhat messy state, but I wanted to share my current state, so I committed the merge. I'll try this week to understand the new changes (especially deletes) and make them work with the DWPT. The following areas need work: * deletes * thread-safety * error handling and aborting * flush-by-ram (LUCENE-2573) Also, some tests deadlock. Not surprisingly either, cause flushcontrol etc. introduce new synchronized blocks. Before the merge all tests were passing, except the ones testing flush-by-ram functionality. I'll keep working on getting the branch back into that state again soon. Help is definitely welcome! I'd love to get this branch ready so that we can merge it into trunk as soon as possible. As Mike's experiments show having DWPTs will not only be beneficial for RT search, but also increase indexing performance in general. Michael PS: Thanks for the patience! - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Do we want 'nocommit' to fail the commit?
But. Er. What if we happen to have nocommit in a string, or in some docs, or as a name of variable? On Sat, Dec 18, 2010 at 12:47, Michael McCandless luc...@mikemccandless.com wrote: +1 this would be great :) Mike On Fri, Dec 17, 2010 at 10:45 PM, Shai Erera ser...@gmail.com wrote: Hi Out of curiosity, I searched if we can have a nocommit comment in the code fail the commit. As far as I see, we try to avoid accidental commits (of say debug messages) by putting a nocommit comment, but I don't know if svn ci would fail in the presence of such comment - I guess not because we've seen some accidental nocommits checked in already in the past. So I Googled around and found that if we have control of the svn repo, we can add a pre-commit hook that will check and fail the commit. Here is a nice article that explains how to add pre-commit hooks in general (http://wordaligned.org/articles/a-subversion-pre-commit-hook). I didn't try it yet (on our local svn instance), so I cannot say how well it works, but perhaps someone has experience with it ... So if this is interesting, and is doable for Lucene (say, open a JIRA issue for Infra?) I don't mind investigating it further and write the script (which can be as simple as 'grep the changed files and fail on the presence of nocommit string'). Shai - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972764#action_12972764 ] Earwin Burrfoot commented on LUCENE-2818: - bq. Can abort() have a default impl in IndexOutput, such as close() followed by deleteFile() maybe? If so, then it won't break anything. It can't. To call deleteFile you need both a reference to papa-Directory and a name of the file this IO writes to. Abstract IO class has neither. If we add them, they have to be passed to a new constructor, and that's an API break ;) bq. Would abort() on Directory fit better? E.g., it can abort all currently open and modified files, instead of the caller calling abort() on each IndexOutput? Are you thinking of a case where a write failed, and the caller would call abort() immediately, instead of some higher-level code? If so, would rollback() be a better name? Oh, no, no. No way. I don't want to push someone else's responsibility on Directory. This abort() is merely a shortcut. Let's go with a usage example: Here's FieldsWriter.java with LUCENE-2814 applied (skipping irrelevant parts) - https://gist.github.com/746358 Now, the same, with abort() - https://gist.github.com/746367 abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
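To show the flavor of the simplification (illustrative only, not the linked gists): writeFields() below is a made-up stand-in for the real stored-fields writing, and flushWithAbort() assumes the proposed abort() method, so it does not compile against the current IndexOutput.

import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IndexOutput;

abstract class FlushSketch {
  abstract void writeFields(IndexOutput out) throws IOException;

  // Without abort(): the caller must lug the Directory and file name around
  // purely so it can clean up after a failed write.
  void flushWithoutAbort(Directory dir, String fileName) throws IOException {
    IndexOutput out = dir.createOutput(fileName);
    boolean success = false;
    try {
      writeFields(out);
      success = true;
    } finally {
      if (success) {
        out.close();
      } else {
        try { out.close(); } catch (IOException ignored) {}
        try { dir.deleteFile(fileName); } catch (IOException ignored) {}
      }
    }
  }

  // With the proposed abort(): no Directory reference, no file-name bookkeeping.
  void flushWithAbort(IndexOutput out) throws IOException {
    boolean success = false;
    try {
      writeFields(out);
      success = true;
    } finally {
      if (success) {
        out.close();
      } else {
        out.abort();   // proposed: silently close, then delete the half-written file
      }
    }
  }
}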
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12972765#action_12972765 ] Earwin Burrfoot commented on LUCENE-2818: - bq. I think we can make a default impl that simply closes and suppresses exceptions? (We can't .deleteFile since an abstract IO doesn't know its Dir). Our concrete impls can override w/ versions that do delete the file... I don't think we need a default impl? For some directory impls close() is a noop + what is more important, having an abstract method forces you to implement it, you can't forget this, so we're not gonna see broken directories that don't do abort() properly. abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
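One possible shape for the abstract-method approach, as sketch classes rather than the actual patch.

import java.io.File;
import java.io.IOException;

// abort() is abstract, so every Directory's output class must provide a real implementation.
abstract class SketchIndexOutput {
  public abstract void close() throws IOException;
  public abstract void abort();   // never throws: close quietly, then delete the file
  // ... the rest of the real IndexOutput API is elided ...
}

// A filesystem-backed output knows its own File, so no Directory reference is needed.
final class SketchFSIndexOutput extends SketchIndexOutput {
  private final File file;

  SketchFSIndexOutput(File file) {
    this.file = file;
  }

  @Override
  public void close() throws IOException {
    // flush buffers and close the underlying stream/channel here
  }

  @Override
  public void abort() {
    try {
      close();
    } catch (IOException ignored) {
      // swallowed on purpose: abort() is best-effort by design
    }
    file.delete();   // best-effort cleanup of the half-written file
  }
}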
[jira] Updated: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2818: Priority: Minor (was: Major) This change is really minor but, I think, convenient. You don't have to lug a reference to the Directory along, and recalculate the file name, if the only thing you want to say is that the write was a failure and you no longer need this file. abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Priority: Minor I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Synced to trunk. bq. Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort? Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone. Turns out checkpoint() is also no longer needed. I have no other lingering cleanup urges, this is ready to be committed. I think. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Created: (LUCENE-2818) abort() method for IndexOutput
abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see abort() method on IndexOutput that silently (no exceptions) closes IO and then does silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter. Closed files list was unused, and is silently gone. Open files list was used to: * prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter. ** no shared docstores, no need + IFD no longer requires a reference to DW * delete already opened docstore files, when aborting. ** index chain now handles this on its own + has cleaner error handling code. stop writing shared doc stores across segments -- Key: LUCENE-2814 URL: https://issues.apache.org/jira/browse/LUCENE-2814 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 3.1, 4.0 Reporter: Michael McCandless Assignee: Michael McCandless Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch Shared doc stores enables the files for stored fields and term vectors to be shared across multiple segments. We've had this optimization since 2.1 I think. It works best against a new index, where you open an IW, add lots of docs, and then close it. In that case all of the written segments will reference slices a single shared doc store segment. This was a good optimization because it means we never need to merge these files. But, when you open another IW on that index, it writes a new set of doc stores, and then whenever merges take place across doc stores, they must now be merged. However, since we switched to shared doc stores, there have been two optimizations for merging the stores. First, we now bulk-copy the bytes in these files if the field name/number assignment is congruent. Second, we now force congruent field name/number mapping in IndexWriter. This means this optimization is much less potent than it used to be. Furthermore, the optimization adds *a lot* of hair to IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over time, and causes odd behavior like a merge possibly forcing a flush when it starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent flushing, we can no longer share doc stores. So, I think we should turn off the write-side of shared doc stores to pave the path for DWPT to land on trunk and simplify IW/DW. We still must support reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: LogMergePolicy.setUseCompoundFile/DocStore
Incoming LUCENE-2814 drops setUseCompoundDocStore() On Thu, Dec 16, 2010 at 12:04, Shai Erera ser...@gmail.com wrote: Hi I find it very annoying that I need to set true/false on these methods whenever I want to control compound files creation. Is it really necessary to allow writing doc stores in non compound files vs. the other index files in a compound file? Does somebody know if this feature is used somewhere? If it's crucial to keep the two methods, then how about introducing a setCompoundMode(true/false) to turn on/off both at once? IndexWriter used to have it, before we switched to IndexWriterConfig and I think it was very useful. Shai -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org