> I don't think we'd do the post-filtering solution, but instead maybe
> resolve the deletes "live" and store them in a transactional data
I think Michael B. aptly described the sequence ID approach for 'live'
deletes?

On Mon, Jun 13, 2011 at 3:00 PM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Yes, adding deletes to Twitter's approach will be a challenge!
>
> I don't think we'd do the post-filtering solution, but instead maybe
> resolve the deletes "live" and store them in a transactional data
> structure of some kind... but even then we will pay a perf hit to
> lookup del docs against it.
>
> So, yeah, there will presumably be a tradeoff with this approach too.
> However, turning around changes from the adds should be faster (no
> segment gets flushed).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko <ita...@code972.com>
> wrote:
>> Thanks Mike, much appreciated.
>>
>> Wouldn't Twitter's approach fall into the exact same pitfall you described
>> Zoie does (or did) once it handles deletes too? I don't think there is any
>> other way of handling deletes other than post-filtering results. But perhaps
>> the IW cache would be smaller than Zoie's RAMDirectory(ies)?
>>
>> I'll give all that a serious dive and report back with results or if more
>> input is required...
>>
>> Itamar.
>>
>> On 13/06/2011 19:01, Michael McCandless wrote:
>>
>>> Here's a blog post describing some details of Twitter's approach:
>>>
>>> http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
>>>
>>> And here's a talk Michael did last October (Lucene Revolution):
>>>
>>> http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter
>>>
>>> Twitter's case is simpler since they never delete ;)  So we have to
>>> fix that to do it in Lucene... there are also various open issues that
>>> begin to explore some of the ideas here.
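The sequence-ID idea being discussed can be sketched roughly like this: every change (add or delete) gets a monotonically increasing sequence number, a delete is recorded "live" in an in-memory structure tagged with that sequence, and a searcher only honors deletes that happened before the sequence at which it was opened. The class and method names below are made up for illustration; this is not Lucene's or Twitter's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "live" deletes resolved against a transactional
// in-memory structure: each delete is recorded with the sequence number at
// which it happened, and visibility is decided per searcher based on the
// sequence the searcher was opened at.
class LiveDeletes {
    // docId -> sequence number at which the doc was deleted
    private final Map<Integer, Long> deleted = new HashMap<>();
    private long nextSeq = 0;

    synchronized long delete(int docId) {
        long seq = nextSeq++;
        deleted.put(docId, seq);
        return seq;
    }

    synchronized long currentSeq() {
        return nextSeq;
    }

    // A doc is visible to a searcher opened at searcherSeq unless it was
    // deleted strictly before that point. This lookup is the per-doc perf
    // hit Mike mentions.
    synchronized boolean isVisible(int docId, long searcherSeq) {
        Long delSeq = deleted.get(docId);
        return delSeq == null || delSeq >= searcherSeq;
    }
}

public class LiveDeletesDemo {
    public static void main(String[] args) {
        LiveDeletes deletes = new LiveDeletes();
        long searcherSeq = deletes.currentSeq(); // searcher opened "now"
        deletes.delete(7);                       // delete arrives afterwards
        // The already-open searcher still sees doc 7...
        System.out.println(deletes.isVisible(7, searcherSeq));
        // ...but a searcher opened after the delete does not.
        System.out.println(deletes.isVisible(7, deletes.currentSeq()));
    }
}
```

The appeal is that no segment has to be flushed to make a delete visible; the cost is that every searcher pays the lookup against this structure.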
>>>
>>> But this ("immediate consistency") would be a deep and complex change,
>>> and I don't see many apps that actually require it.
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>> On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko <ita...@code972.com>
>>> wrote:
>>>>
>>>> Thanks for your detailed answer. We'll have to tackle this and see what's
>>>> more important to us then. I'd definitely love to hear Zoie has overcome
>>>> all that...
>>>>
>>>> Any pointers to Michael Busch's approach? I take it this has something to
>>>> do with the core itself or the index format, probably using the Flex
>>>> version?
>>>>
>>>> Itamar.
>>>>
>>>> On 12/06/2011 23:12, Michael McCandless wrote:
>>>>
>>>>> From what I understand of Zoie (and it's been some time since I last
>>>>> looked... so this could be wrong now), the biggest difference vs NRT
>>>>> is that Zoie aims for "immediate consistency", ie index changes are
>>>>> always made visible to the very next query, vs NRT which is
>>>>> "controlled consistency", a blend between immediate and eventual
>>>>> consistency where your app decides when the changes must become
>>>>> visible.
>>>>>
>>>>> But in exchange for that, Zoie pays a price: each search has a higher
>>>>> cost per collected hit, since it must post-filter for deleted docs.
>>>>> And since Zoie necessarily adds complexity, there's more risk; eg
>>>>> there were some nasty Zoie bugs that took quite some time to track
>>>>> down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>>>>>
>>>>> Anyway, I don't think that's a good tradeoff, in general, for our
>>>>> users, because very few apps truly require immediate consistency from
>>>>> Lucene (can anyone give an example where their app depends on
>>>>> immediate consistency...?). I think it's better to spend time during
>>>>> reopen so that searches aren't slower.
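The per-collected-hit cost Mike describes can be illustrated with a toy collector: every candidate hit pays an extra membership test against an in-memory deleted set on every query, instead of that work being done once at reopen time. This is an illustrative sketch, not Zoie's actual code:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy illustration of Zoie-style post-filtering: every collected hit pays
// an extra lookup against an in-memory deleted set. Hypothetical sketch,
// not Zoie's actual implementation.
public class PostFilterDemo {
    static List<Integer> collect(int[] candidateDocs, BitSet liveDeletes) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : candidateDocs) {
            // The per-hit cost: one membership test per candidate, repeated
            // on every query, rather than resolved once at reopen time.
            if (!liveDeletes.get(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(2);
        deleted.set(5);
        System.out.println(collect(new int[] {1, 2, 3, 5, 8}, deleted));
    }
}
```

The overhead therefore scales with the number of hits each query collects, which is why NRT's approach (paying at reopen) keeps searches themselves fast.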
>>>>>
>>>>> That said, Lucene has already incorporated one big part of Zoie
>>>>> (caching small segments in RAM) via the new NRTCachingDirectory (in
>>>>> contrib/misc). Also, the upcoming NRTManager
>>>>> (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over
>>>>> visibility of specific indexing changes to queries that need to see
>>>>> the changes.
>>>>>
>>>>> Finally, even better would be to not have to make any tradeoff
>>>>> whatsoever ;)  Twitter's approach (created by Michael Busch) seems to
>>>>> bring immediate consistency with no search performance hit, so if we
>>>>> do anything here it'll likely be similar to what Michael has done
>>>>> (though, those changes are not simple either!).
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko <ita...@code972.com>
>>>>> wrote:
>>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT
>>>>>> apparently isn't fast enough if Zoie was needed, and now that Zoie
>>>>>> is around, are there any plans to make it Lucene's default? Or: why
>>>>>> would one still use NRT when Zoie seems to work much better?
>>>>>>
>>>>>> Itamar.
>>>>>>
>>>>>> On 12/06/2011 13:16, Michael McCandless wrote:
>>>>>>
>>>>>>> Remember that memory-mapping is not a panacea: at the end of the day,
>>>>>>> if there just isn't enough RAM on the machine to keep your full
>>>>>>> "working set" hot, then the OS will have to hit the disk, regardless
>>>>>>> of whether the access is through MMap or a "traditional" IO request.
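The "cache small segments in RAM" idea behind NRTCachingDirectory reduces to a routing decision on file size: newly flushed NRT segment files below a threshold stay in RAM (they are likely to be merged away soon anyway), while large files go to stable storage. The following is a plain-Java sketch of that policy only, with a made-up threshold, not NRTCachingDirectory's real API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative policy sketch (not NRTCachingDirectory's real API): newly
// flushed segment files below a size threshold are kept in RAM, since
// small NRT segments tend to be merged away quickly; big files go
// straight to disk.
public class CachingPolicyDemo {
    static final long MAX_CACHED_BYTES = 60L * 1024 * 1024; // made-up: 60 MB

    final Map<String, byte[]> ramFiles = new HashMap<>();
    final Map<String, byte[]> diskFiles = new HashMap<>();

    void writeFile(String name, byte[] content) {
        if (content.length <= MAX_CACHED_BYTES) {
            ramFiles.put(name, content);   // small segment: serve from RAM
        } else {
            diskFiles.put(name, content);  // large segment: persist
        }
    }

    public static void main(String[] args) {
        CachingPolicyDemo dir = new CachingPolicyDemo();
        dir.writeFile("_1.cfs", new byte[1024]);             // tiny NRT segment
        dir.writeFile("_0.cfs", new byte[70 * 1024 * 1024]); // big merged segment
        System.out.println(dir.ramFiles.containsKey("_1.cfs"));
        System.out.println(dir.diskFiles.containsKey("_0.cfs"));
    }
}
```

The real class wraps another Directory and also evicts cached files once they are merged into larger on-disk segments; the sketch only shows the size-based split.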
>>>>>>>
>>>>>>> That said, on Fedora Linux anyway, I generally see better performance
>>>>>>> from MMap than from NIOFSDir; eg see the 2nd chart here:
>>>>>>>
>>>>>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>>>>>>>
>>>>>>> Mike McCandless
>>>>>>>
>>>>>>> http://blog.mikemccandless.com
>>>>>>>
>>>>>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko
>>>>>>> <ita...@code972.com> wrote:
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> The whole point of my question was to find out if and how to do
>>>>>>>> balancing on the SAME machine. Apparently that's not going to help,
>>>>>>>> and at a certain point we will just have to prompt the user to buy
>>>>>>>> more hardware...
>>>>>>>>
>>>>>>>> Out of curiosity, isn't there anything we can do to avoid that? For
>>>>>>>> instance, using memory-mapped files for the indexes? Anything that
>>>>>>>> would help us overcome OS limitations of that sort...
>>>>>>>>
>>>>>>>> Also, you mention a scheduled job to check for performance
>>>>>>>> degradation; any idea how serious such a drop should be for sharding
>>>>>>>> to be really beneficial? Or is it application specific too?
>>>>>>>>
>>>>>>>> Itamar.
>>>>>>>>
>>>>>>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>>>>>>
>>>>>>>>> I agree w/ Erick, there is no cutoff point (index size for that
>>>>>>>>> matter) above which you start sharding.
>>>>>>>>>
>>>>>>>>> What you can do is create a scheduled job in your system that runs a
>>>>>>>>> select list of queries and monitors their performance. Once it
>>>>>>>>> degrades, it shards the index by either splitting it (you can use
>>>>>>>>> IndexSplitter under contrib) or creating a new shard, and directs
>>>>>>>>> new documents to it.
>>>>>>>>>
>>>>>>>>> I think I read somewhere, not sure if it was in Solr or ElasticSearch
>>>>>>>>> documentation, about a Balancer object, which moves shards around in
>>>>>>>>> order to balance the load on the cluster. You can implement something
>>>>>>>>> similar which tries to balance the index sizes, creates new shards
>>>>>>>>> on-the-fly, and even merges shards if suddenly a whole source is
>>>>>>>>> removed from the system, etc.
>>>>>>>>>
>>>>>>>>> Also, note that the 'largest index size' threshold is really a
>>>>>>>>> machine constraint and not Lucene's. So if you decide that 10 GB is
>>>>>>>>> your cutoff, it is pointless to create 10x10GB shards on the same
>>>>>>>>> machine -- searching them is just like searching a 100GB index w/
>>>>>>>>> 10x10GB segments. Perhaps it's even worse, because you consume more
>>>>>>>>> RAM when the indexes are split (e.g., terms index, field infos etc.).
>>>>>>>>>
>>>>>>>>> Shai
>>>>>>>>>
>>>>>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson
>>>>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> <<<We can't assume anything about the machine running it,
>>>>>>>>>> so testing won't really tell us much>>>
>>>>>>>>>>
>>>>>>>>>> Hmmm, then it's pretty hopeless I think. The problem is that
>>>>>>>>>> anything you say about running on a machine with 2G of available
>>>>>>>>>> memory on a single processor is completely incomparable to running
>>>>>>>>>> on a machine with 64G of memory available for Lucene and 16
>>>>>>>>>> processors.
>>>>>>>>>>
>>>>>>>>>> There's really no such thing as an "optimum" Lucene index size; it
>>>>>>>>>> always relates to the characteristics of the underlying hardware.
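Shai's scheduled-job suggestion might look roughly like this: periodically time a fixed list of representative queries, compare against a baseline recorded when the index was known to be healthy, and flag the index for sharding (e.g. via contrib's IndexSplitter) once latency drifts well past it. The 2x factor and the timings below are made-up placeholders; the real job would obtain the latest number by timing actual searches:

```java
// Sketch of a scheduled degradation check: compare measured query latency
// against a recorded baseline and decide when to start sharding. The
// DEGRADATION_FACTOR and all timings are made-up placeholders.
public class DegradationCheckDemo {
    static final double DEGRADATION_FACTOR = 2.0; // flag at 2x baseline (arbitrary)

    // Pure decision step, easy to test in isolation; a real job would feed
    // in latencies measured by running the select query list periodically.
    static boolean shouldShard(double baselineMillis, double latestMillis) {
        return latestMillis > baselineMillis * DEGRADATION_FACTOR;
    }

    public static void main(String[] args) {
        double baseline = 12.0; // ms, recorded when the index was "healthy"
        System.out.println(shouldShard(baseline, 15.0)); // still within budget
        System.out.println(shouldShard(baseline, 40.0)); // degraded: time to shard
    }
}
```

As Itamar asks and Erick's reply suggests, the right factor is application- and hardware-specific, so it should be tuned per deployment rather than hard-coded.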
>>>>>>>>>>
>>>>>>>>>> I think the best you can do is actually test on various
>>>>>>>>>> configurations; then at least you can say "on configuration X this
>>>>>>>>>> is the tipping point".
>>>>>>>>>>
>>>>>>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> Erick
>>>>>>>>>>
>>>>>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko
>>>>>>>>>> <ita...@code972.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I know Lucene indexes to be at their optimum up to a certain size -
>>>>>>>>>>> said to be around several GBs. I haven't found a good discussion of
>>>>>>>>>>> this, but it's my understanding that at some point it's better to
>>>>>>>>>>> split an index into parts (a la sharding) than to continue
>>>>>>>>>>> searching on a huge index. I assume this has to do with OS and IO
>>>>>>>>>>> configurations. Can anyone point me to more info on this?
>>>>>>>>>>>
>>>>>>>>>>> We have a product that is using Lucene for various searches, and at
>>>>>>>>>>> the moment each type of search is using its own Lucene index. We
>>>>>>>>>>> plan on refactoring the way it works and combining all indexes into
>>>>>>>>>>> one -- making the whole system more robust and with a smaller
>>>>>>>>>>> memory footprint, among other things.
>>>>>>>>>>>
>>>>>>>>>>> Assuming the above is true, we are interested in knowing how to do
>>>>>>>>>>> this correctly.
>>>>>>>>>>> Initially all our indexes will be run in one big index, but if at
>>>>>>>>>>> some index size there is a severe performance degradation we would
>>>>>>>>>>> like to handle that correctly, by starting a new FSDirectory index
>>>>>>>>>>> to flush into, or by re-indexing and moving large indexes into
>>>>>>>>>>> their own Lucene index.
>>>>>>>>>>>
>>>>>>>>>>> Are there any guidelines for measuring or estimating this
>>>>>>>>>>> correctly? What should we be aware of while considering all that?
>>>>>>>>>>> We can't assume anything about the machine running it, so testing
>>>>>>>>>>> won't really tell us much...
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance for any input on this,
>>>>>>>>>>>
>>>>>>>>>>> Itamar.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org