Good open source projects should be better than their commercial
counterparts.

I really like 2.4. The DocIdSet/Filter APIs really allowed me to do some
interesting stuff.
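For anyone curious, the core of that Filter contract is small: a Filter hands back a DocIdSet, and the DocIdSet hands back an iterator over matching doc ids. Here is a toy restatement in plain Java; these are simplified stand-ins (my own names), not the real org.apache.lucene.search classes, so it runs without the Lucene jar:

```java
// Toy restatement of the Lucene 2.4 Filter/DocIdSet idea.
// Simplified stand-ins, NOT the real org.apache.lucene.search classes.

import java.util.BitSet;

// Mirrors the DocIdSetIterator idea: step through matching doc ids in order.
interface DocIdIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc(); // next matching doc id, or NO_MORE_DOCS when exhausted
}

// Mirrors the DocIdSet idea: a factory for iterators over matching doc ids.
interface DocIds {
    DocIdIterator iterator();
}

// A filter, in this model, computes a DocIds for an index.
// Here the "index" is just a BitSet marking which documents match.
class BitSetFilter {
    private final BitSet matches;

    BitSetFilter(BitSet matches) {
        this.matches = matches;
    }

    DocIds getDocIdSet() {
        return new DocIds() {
            public DocIdIterator iterator() {
                return new DocIdIterator() {
                    private int doc = -1;
                    public int nextDoc() {
                        doc = matches.nextSetBit(doc + 1);
                        return doc < 0 ? NO_MORE_DOCS : doc;
                    }
                };
            }
        };
    }
}

public class FilterSketch {
    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(3);
        hits.set(7);
        hits.set(42);

        // Walk the filter's doc ids the way a search would consume them.
        DocIdIterator it = new BitSetFilter(hits).getDocIdSet().iterator();
        StringBuilder out = new StringBuilder();
        for (int d = it.nextDoc(); d != DocIdIterator.NO_MORE_DOCS; d = it.nextDoc()) {
            out.append(d).append(' ');
        }
        System.out.println(out.toString().trim()); // prints "3 7 42"
    }
}
```

The nice part of this contract is that anything that can enumerate doc ids in order, whether a bitset, a sorted int array, or something disk-backed, can sit behind a filter this way.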

I feel Lucene has the potential to be more than just a full-text search library.

-John

On Wed, Dec 3, 2008 at 11:58 PM, Robert Muir <[EMAIL PROTECTED]> wrote:

> no, i'm not doing any caching, but as mentioned it did require some work to
> become almost completely i/o bound due to the nature of my wacky queries:
> for example, removing O(n) behavior from fuzzy and regexp.
>
> probably the os cache is not helping much because the indexes are very large.
> I'm very happy being i/o bound, because now and especially in the future i
> think it will be cheaper to speed up with additional ram and faster storage.
>
> still, even out of the box without any tricks, lucene performs *much* better
> than the commercial alternatives i have fought with. lucene was evaluated a
> while ago, before 2.3, and this was not the case, but I re-evaluated around
> the 2.3 release and it is now.
>
>
> On Thu, Dec 4, 2008 at 2:45 AM, John Wang <[EMAIL PROTECTED]> wrote:
>
>> Thanks Robert, definitely interested!
>> We too are looking into SSDs for performance.
>> 2.4 allows you to extend QueryParser and create your own "leaf" queries.
>> I am surprised you are mostly IO bound. Lucene does a good job of caching.
>> Do you do some sort of caching yourself? If your index is not changing
>> often, there is a lot you can do without SSDs.
>>
>> -John
>>
>>
>> On Wed, Dec 3, 2008 at 11:27 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>
>>> yeah i am using read-only.
>>>
>>> i will admit to subclassing queryparser and having customized
>>> query/scorer implementations for several. all queries contain fuzzy
>>> queries, so this was necessary.
>>>
>>> "high" throughput i guess is a matter of opinion. in attempting to
>>> profile high-throughput, again customized query/scorer made it easy for me
>>> to simplify some things, such as some math in termquery that doesn't make
>>> sense (redundant) for my Similarity. everything is pretty much i/o bound now
>>> so if tehre is some throughput issue i will look into SSD for high volume
>>> indexes.
>>>
>>> i posted on the Use Cases page on the wiki how I made fuzzy and regex
>>> fast, if you are curious.
>>>
>>>
>>> On Thu, Dec 4, 2008 at 2:10 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>
>>>> Thanks Robert for sharing.
>>>> Good to hear it is working for what you need it to do.
>>>>
>>>> 3) Especially with read-only IndexReaders, you should not be blocked
>>>> while indexing, particularly if you have multicore machines.
>>>> 4) Do you stay at sub-second responses with high throughput?
>>>>
>>>> -John
>>>>
>>>>
>>>> On Wed, Dec 3, 2008 at 11:03 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 4, 2008 at 1:24 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Nice!
>>>>>> Some questions:
>>>>>>
>>>>>> 1) one index?
>>>>>>
>>>>> no, but two individual ones today were around 100M docs
>>>>>
>>>>>> 2) how big is your document? e.g. how many terms etc.
>>>>>>
>>>>> last one built has over 4M terms
>>>>>
>>>>>> 3) are you serving(searching) the docs in realtime?
>>>>>>
>>>>> i don't understand this question, but searching is slower if i am
>>>>> indexing on a disk that's also being searched.
>>>>>
>>>>>>
>>>>>> 4) search speed?
>>>>>>
>>>>> usually sub-second (or close) after some warmup. while this might seem
>>>>> slow, it's fast compared to the competition, trust me.
>>>>>
>>>>>>
>>>>>> I'd love to learn more about your architecture.
>>>>>>
>>>>> i hate to say it, but you would be disappointed: there's nothing fancy.
>>>>> probably why it works...
>>>>>
>>>>>>
>>>>>> -John
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 3, 2008 at 10:13 PM, Robert Muir <[EMAIL PROTECTED]>wrote:
>>>>>>
>>>>>>> sorry, gotta speak up on this. i indexed 300m docs today. I'm using an
>>>>>>> out-of-the-box jar.
>>>>>>>
>>>>>>> yeah i have some special subclasses but if i thought any of this
>>>>>>> stuff was general enough to be useful to others i'd submit it. I'm just
>>>>>>> happy to have something scalable that i can customize to my 
>>>>>>> peculiarities.
>>>>>>>
>>>>>>> so i think i fit in your 10%, and i'm not stressing on either
>>>>>>> scalability or the api.
>>>>>>>
>>>>>>> thanks,
>>>>>>> robert
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 4, 2008 at 12:36 AM, John Wang <[EMAIL PROTECTED]>wrote:
>>>>>>>
>>>>>>>> Grant:
>>>>>>>>         I am sorry, but I disagree with some points:
>>>>>>>>
>>>>>>>> 1) "I think it's a sign that Lucene is pretty stable." - While Lucene
>>>>>>>> is a great project, and great improvements have been made, especially
>>>>>>>> with the 2.x releases, do we really have a clear picture of how
>>>>>>>> Lucene is being used and deployed? While Lucene works great running
>>>>>>>> as a vanilla search library, when pushed to its limits one needs to
>>>>>>>> "hack" into Lucene to make certain things work. If 90% of the user
>>>>>>>> base uses it to build small indexes with the vanilla api, and the
>>>>>>>> other 10% is really stressing both the scalability and api sides and
>>>>>>>> is running into issues, would you still say "it is running well for
>>>>>>>> 90% of the users, therefore it is stable and extensible"? I think it
>>>>>>>> is unfair to the project to be measured only by the vanilla use case.
>>>>>>>> I have done a couple of large deployments, e.g. >30 million documents
>>>>>>>> indexed and searched in real time, and I really had to do some
>>>>>>>> tweaking.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Robert Muir
>>>>>>> [EMAIL PROTECTED]
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Robert Muir
>>>>> [EMAIL PROTECTED]
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Robert Muir
>>> [EMAIL PROTECTED]
>>>
>>
>>
>
>
> --
> Robert Muir
> [EMAIL PROTECTED]
>
