no, i'm not doing any caching, but as mentioned it did require some work
to become almost completely i/o bound due to the nature of my wacky
queries, for example removing the O(n) term enumeration behavior from the
fuzzy and regexp queries.
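
(for the curious, the simplest piece of the fuzzy part is just the stock
prefixLength knob -- this is a made-up sketch, the field name and numbers
are invented, and my real code goes further than this:)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Query;

public class FuzzyPrefixSketch {
  public static void main(String[] args) {
    // prefixLength 0 (the default) makes FuzzyQuery enumerate every
    // term in the field: O(n) in the size of the term dictionary
    Query slow = new FuzzyQuery(new Term("body", "lucene"), 0.5f, 0);

    // a nonzero prefix restricts enumeration to the slice of the term
    // dictionary starting with "lu", which keeps it i/o friendly
    Query fast = new FuzzyQuery(new Term("body", "lucene"), 0.5f, 2);

    System.out.println(slow + " vs " + fast);
  }
}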

the os cache is probably not helping much because the indexes are very
large. i'm very happy being i/o bound, because now, and especially in the
future, i think it will be cheaper to speed things up with additional ram
and faster storage.

still, even out of the box without any tricks, lucene performs *much*
better than the commercial alternatives i have fought with. i evaluated
lucene a while ago, before 2.3, and that was not the case then, but i
re-evaluated around the 2.3 release and it is now.

On Thu, Dec 4, 2008 at 2:45 AM, John Wang <[EMAIL PROTECTED]> wrote:

> Thanks Robert, definitely interested!
> We too are looking into SSDs for performance.
> 2.4 allows you to extend QueryParser and create your own "leaf"
> queries.
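>
> For example, something roughly like this (the subclass name and the
> prefix choice here are just an illustration):
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.FuzzyQuery;
> import org.apache.lucene.search.Query;
>
> public class CustomQueryParser extends QueryParser {
>   public CustomQueryParser(String field, Analyzer analyzer) {
>     super(field, analyzer);
>   }
>
>   // hook the fuzzy syntax ("term~") and hand back your own "leaf" query
>   protected Query getFuzzyQuery(String field, String termStr,
>       float minSimilarity) {
>     // swap in a custom FuzzyQuery subclass / scorer here
>     return new FuzzyQuery(new Term(field, termStr), minSimilarity, 2);
>   }
> }
>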
> I am surprised you are mostly IO bound; Lucene does a good job of
> caching. Do you do some sort of caching yourself? If your index is not
> changing often, there is a lot you can do without SSDs.
>
> -John
>
>
> On Wed, Dec 3, 2008 at 11:27 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>
>> yeah, i am using read-only readers.
>>
>> i will admit to subclassing QueryParser and having customized
>> query/scorer implementations for several query types. all of my queries
>> contain fuzzy clauses, so this was necessary.
>>
>> "high" throughput i guess is a matter of opinion. in attempting to profile
>> high-throughput, again customized query/scorer made it easy for me to
>> simplify some things, such as some math in termquery that doesn't make sense
>> (redundant) for my Similarity. everything is pretty much i/o bound now so if
>> tehre is some throughput issue i will look into SSD for high volume indexes.
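>>
>> (to give a flavor -- a made-up example, not my actual class; with a
>> Similarity like this, parts of TermQuery's score math collapse to a
>> constant, which is the redundancy i mean:)
>>
>> import org.apache.lucene.search.DefaultSimilarity;
>>
>> public class FlatSimilarity extends DefaultSimilarity {
>>   // ignore how rare the term is across the index
>>   public float idf(int docFreq, int numDocs) {
>>     return 1.0f;
>>   }
>>
>>   // ignore how often the term occurs within a document
>>   public float tf(float freq) {
>>     return freq > 0 ? 1.0f : 0.0f;
>>   }
>> }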
>>
>> i posted on the Use Cases page on the wiki how i made fuzzy and regex
>> fast, if you are curious.
>>
>>
>> On Thu, Dec 4, 2008 at 2:10 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks Robert for sharing.
>>> Good to hear it is working for what you need it to do.
>>>
>>> 3) Especially with read-only IndexReaders, you should not be blocked
>>> while indexing, especially if you have multicore machines (see the
>>> sketch below).
>>> 4) do you stay at sub-second responses under high throughput?
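>>>
>>> (re 3, something along these lines -- the path is invented, and this
>>> assumes 2.4's read-only open flag:)
>>>
>>> import java.io.IOException;
>>> import org.apache.lucene.index.IndexReader;
>>> import org.apache.lucene.search.IndexSearcher;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.FSDirectory;
>>>
>>> public class ReadOnlySearch {
>>>   public static void main(String[] args) throws IOException {
>>>     Directory dir = FSDirectory.getDirectory("/path/to/index");
>>>     // readOnly=true skips write locking and per-access
>>>     // synchronization, so searches aren't blocked by indexing
>>>     IndexReader reader = IndexReader.open(dir, true);
>>>     IndexSearcher searcher = new IndexSearcher(reader);
>>>     System.out.println("docs: " + reader.maxDoc());
>>>     searcher.close();
>>>     reader.close();
>>>   }
>>> }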
>>>
>>> -John
>>>
>>>
>>> On Wed, Dec 3, 2008 at 11:03 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Dec 4, 2008 at 1:24 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Nice!
>>>>> Some questions:
>>>>>
>>>>> 1) one index?
>>>>>
>>>> no, but two of the individual indexes built today were around 100M docs
>>>>
>>>>> 2) how big is your document? e.g. how many terms etc.
>>>>>
>>>> the last one built has over 4M terms
>>>>
>>>>> 3) are you serving(searching) the docs in realtime?
>>>>>
>>>> i don't understand this question, but searching is slower if i am
>>>> indexing on a disk that's also being searched.
>>>>
>>>>>
>>>>> 4) search speed?
>>>>>
>>>> usually sub-second (or close) after some warmup. while this might seem
>>>> slow, it's fast compared to the competition, trust me.
>>>>
>>>>>
>>>>> I'd love to learn more about your architecture.
>>>>>
>>>> i hate to say it, but you would be disappointed; there's nothing fancy.
>>>> that's probably why it works...
>>>>
>>>>>
>>>>> -John
>>>>>
>>>>>
>>>>> On Wed, Dec 3, 2008 at 10:13 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> sorry, gotta speak up on this. i indexed 300M docs today. i'm using an
>>>>>> out-of-the-box jar.
>>>>>>
>>>>>> yeah, i have some special subclasses, but if i thought any of this
>>>>>> stuff was general enough to be useful to others i'd submit it. i'm
>>>>>> just happy to have something scalable that i can customize to my
>>>>>> peculiarities.
>>>>>>
>>>>>> so i think i fit in your 10%, and i'm not stressed on either the
>>>>>> scalability or the api side.
>>>>>>
>>>>>> thanks,
>>>>>> robert
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 4, 2008 at 12:36 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Grant:
>>>>>>>         I am sorry, but I disagree with some points:
>>>>>>>
>>>>>>> 1) "I think it's a sign that Lucene is pretty stable." - Lucene is a
>>>>>>> great project, and especially with the 2.x releases great improvements
>>>>>>> have been made, but do we really have a clear picture of how Lucene is
>>>>>>> being used and deployed? Lucene works great as a vanilla search
>>>>>>> library, but when pushed to its limits, one needs to "hack" into it to
>>>>>>> make certain things work. If 90% of the user base builds small indexes
>>>>>>> with the vanilla api, while the other 10% really stresses both the
>>>>>>> scalability and the api and runs into issues, would you still say
>>>>>>> "it runs well for 90% of the users, therefore it is stable and
>>>>>>> extensible"? I think it is unfair to the project to be measured only
>>>>>>> by the vanilla use case. I have done a couple of large deployments,
>>>>>>> e.g. >30 million documents indexed and searched in realtime, and I
>>>>>>> really had to do some tweaking.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Robert Muir
>>>>>> [EMAIL PROTECTED]
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Robert Muir
>>>> [EMAIL PROTECTED]
>>>>
>>>
>>>
>>
>>
>> --
>> Robert Muir
>> [EMAIL PROTECTED]
>>
>
>


-- 
Robert Muir
[EMAIL PROTECTED]
