Good open source projects should be better than their commercial counterparts.
I really like 2.4. The DocIdSet/Filter APIs really let me do some interesting stuff. I feel Lucene has the potential to be more than just a full-text search library.

-John

On Wed, Dec 3, 2008 at 11:58 PM, Robert Muir <[EMAIL PROTECTED]> wrote:

> No, I'm not doing any caching, but as mentioned it did require some work to
> become almost completely I/O bound, due to the nature of my wacky queries —
> for example, removing O(n) behavior from fuzzy and regexp.
>
> The OS cache is probably not helping much because the indexes are very large.
> I'm very happy being I/O bound, because now, and especially in the future, I
> think it will be cheaper to speed things up with additional RAM and faster
> storage.
>
> Still, even out of the box without any tricks, Lucene performs *much* better
> than the commercial alternatives I have fought with. Lucene was evaluated a
> while ago, before 2.3, and this was not the case, but I re-evaluated around
> the 2.3 release and it is now.
>
>
> On Thu, Dec 4, 2008 at 2:45 AM, John Wang <[EMAIL PROTECTED]> wrote:
>
>> Thanks Robert, definitely interested!
>> We too are looking into SSDs for performance.
>> 2.4 allows you to extend QueryParser and create your own "leaf" queries.
>> I am surprised you are mostly I/O bound; Lucene does a good job of caching.
>> Do you do some sort of caching yourself? If your index does not change
>> often, there is a lot you can do without SSDs.
>>
>> -John
>>
>>
>> On Wed, Dec 3, 2008 at 11:27 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>
>>> Yeah, I am using read-only readers.
>>>
>>> I will admit to subclassing QueryParser and having customized
>>> query/scorer implementations for several of them. All queries contain
>>> fuzzy queries, so this was necessary.
>>>
>>> "High" throughput, I guess, is a matter of opinion. In attempting to
>>> profile high throughput, the customized query/scorer again made it easy
>>> for me to simplify some things, such as some math in TermQuery that is
>>> redundant for my Similarity.
>>> Everything is pretty much I/O bound now,
>>> so if there is some throughput issue I will look into SSDs for high-volume
>>> indexes.
>>>
>>> I posted on the Use Cases page on the wiki how I made fuzzy and regex
>>> fast, if you are curious.
>>>
>>>
>>> On Thu, Dec 4, 2008 at 2:10 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>
>>>> Thanks, Robert, for sharing.
>>>> Good to hear it is working for what you need it to do.
>>>>
>>>> 3) Especially with ReadOnlyIndexReaders, you should not be blocked
>>>> while indexing, especially if you have multicore machines.
>>>> 4) Do you stay sub-second on responses at high throughput?
>>>>
>>>> -John
>>>>
>>>>
>>>> On Wed, Dec 3, 2008 at 11:03 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On Thu, Dec 4, 2008 at 1:24 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Nice!
>>>>>> Some questions:
>>>>>>
>>>>>> 1) One index?
>>>>>>
>>>>> No, but two individual ones today were around 100M docs each.
>>>>>
>>>>>> 2) How big is your document? E.g., how many terms, etc.
>>>>>>
>>>>> The last one built has over 4M terms.
>>>>>
>>>>>> 3) Are you serving (searching) the docs in real time?
>>>>>>
>>>>> I don't understand this question, but searching is slower if I am
>>>>> indexing on a disk that's also being searched.
>>>>>
>>>>>> 4) Search speed?
>>>>>>
>>>>> Usually subsecond (or close) after some warmup. While this might seem
>>>>> slow, it's fast compared to the competition, trust me.
>>>>>
>>>>>> I'd love to learn more about your architecture.
>>>>>>
>>>>> I hate to say you would be disappointed, but there's nothing fancy.
>>>>> That's probably why it works...
>>>>>
>>>>>> -John
>>>>>>
>>>>>>
>>>>>> On Wed, Dec 3, 2008 at 10:13 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> Sorry, I've got to speak up on this. I indexed 300M docs today,
>>>>>>> using an out-of-the-box jar.
>>>>>>>
>>>>>>> Yeah, I have some special subclasses, but if I thought any of this
>>>>>>> stuff was general enough to be useful to others, I'd submit it.
>>>>>>> I'm just
>>>>>>> happy to have something scalable that I can customize to my
>>>>>>> peculiarities.
>>>>>>>
>>>>>>> So I think I fit in your 10%, and I'm not stressing on either
>>>>>>> scalability or the API.
>>>>>>>
>>>>>>> thanks,
>>>>>>> robert
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 4, 2008 at 12:36 AM, John Wang <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>> Grant:
>>>>>>>> I am sorry, but I disagree with some points:
>>>>>>>>
>>>>>>>> 1) "I think it's a sign that Lucene is pretty stable." - While
>>>>>>>> Lucene is a great project, and great improvements have been made,
>>>>>>>> especially in the 2.x releases, do we really have a clear picture
>>>>>>>> of how Lucene is being used and deployed? While Lucene works great
>>>>>>>> running as a vanilla search library, when pushed to its limits one
>>>>>>>> needs to "hack" into Lucene to make certain things work. If 90% of
>>>>>>>> the user base uses it to build small indexes with the vanilla API,
>>>>>>>> while the other 10% is really stressing both the scalability and
>>>>>>>> the API side and running into issues, would you still say "it runs
>>>>>>>> well for 90% of the users, therefore it is stable or extensible"?
>>>>>>>> I think it is unfair to the project itself to be measured by the
>>>>>>>> vanilla use case. I have done a couple of large deployments, e.g.
>>>>>>>> >30 million documents indexed and searched in real time, and I
>>>>>>>> really had to do some tweaking.
>>>>>>>
>>>>>>> --
>>>>>>> Robert Muir
>>>>>>> [EMAIL PROTECTED]
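A note on John's mention of the 2.4 DocIdSet/Filter APIs: the core idea is that a filter produces the set of document ids a search is allowed to return, and filtering amounts to intersecting that set with the query's matches. The sketch below illustrates only that general idea with `java.util.BitSet`; it is not the actual Lucene interfaces, and the class and method names are made up.

```java
import java.util.BitSet;

// Stdlib-only sketch of the concept behind a Filter/DocIdSet contract:
// a filter is just a set of document ids, and applying it to a query
// is a set intersection. Not Lucene code; names are hypothetical.
public class DocIdSetSketch {

    // Intersect the query's matching doc ids with the filter's allowed
    // doc ids, without mutating either input.
    static BitSet intersect(BitSet queryMatches, BitSet filter) {
        BitSet out = (BitSet) queryMatches.clone();
        out.and(filter); // keep only docs the filter allows
        return out;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet(); // docs matching the query
        matches.set(1); matches.set(3); matches.set(7);

        BitSet recent = new BitSet(); // e.g. a cached "recent docs" filter
        recent.set(3); recent.set(7); recent.set(9);

        System.out.println(intersect(matches, recent)); // prints {3, 7}
    }
}
```

Because such a filter is independent of any particular query, it can be built once and reused across many searches, which is one reason this style of API enables the "interesting stuff" John alludes to.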
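Robert's remark about removing the O(n) behavior from fuzzy queries refers to avoiding a scan of every term in the dictionary; his actual approach is the one described on the wiki's Use Cases page, not reproduced here. The sketch below is only a stdlib illustration of the general idea — restrict enumeration of a sorted term dictionary to a prefix range before computing edit distances — and all names in it (`FuzzyPrefixSketch`, `fuzzyMatch`) are hypothetical, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch: a naive fuzzy query computes edit distance against every term
// in the dictionary (O(n) in the number of terms). If candidate matches
// are required to share a prefix with the query, a sorted dictionary can
// be enumerated over just that prefix's subrange instead.
public class FuzzyPrefixSketch {

    // Classic two-row dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] t = prev; prev = cur; cur = t;
        }
        return prev[b.length()];
    }

    // Enumerate only terms starting with the query's first prefixLen
    // characters, via the sorted set's subSet view rather than a full scan.
    // Assumes a non-empty query whose prefix's last char is below Character.MAX_VALUE.
    static List<String> fuzzyMatch(NavigableSet<String> dict, String query,
                                   int prefixLen, int maxDist) {
        String prefix = query.substring(0, Math.min(prefixLen, query.length()));
        // Exclusive upper bound: the prefix with its last char incremented.
        String upper = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        List<String> out = new ArrayList<>();
        for (String term : dict.subSet(prefix, true, upper, false)) {
            if (editDistance(term, query) <= maxDist) out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> dict = new TreeSet<>(Arrays.asList(
                "search", "searcher", "seared", "segment", "serach", "zebra"));
        System.out.println(fuzzyMatch(dict, "search", 2, 2));
        // prints [search, searcher, seared, serach]
    }
}
```

The trade-off: requiring a shared prefix of length p misses terms whose edits fall inside those first p characters, which is why production implementations instead enumerate multiple ranges or use n-gram or automaton techniques to stay both correct and sublinear.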