Please purchase and read Lucene in Action, 2nd Edition. It explains how all of this works and how to use Lucene efficiently.
http://www.lucidimagination.com/blog/2009/03/11/lia2/
http://www.lucidimagination.com/blog/2010/08/01/lucene-in-action-2d-edition-authors-round-table-podcast/
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Lucene-in-Action

It is available from Manning as a downloadable e-book (PDF), so you don't have to wait for it to be shipped.

2010/12/30 Li Li <fancye...@gmail.com>:
> Is anyone familiar with MG4J (http://mg4j.dsi.unimi.it/)? It says:
> "Multithreading. Indices can be queried and scored concurrently."
> Maybe we can learn something from it.
>
> 2010/12/31 Li Li <fancye...@gmail.com>:
>> Plus: point 2 means that looking up a term requires many seeks in the
>> tis file (if the term is not cached in the tii).
>>
>> 2010/12/31 Li Li <fancye...@gmail.com>:
>>> Searching multiple segments is an alternative solution, but it has
>>> some disadvantages:
>>> 1. idf is not global (I am not familiar with the implementation);
>>> maybe this is easy to solve by sharing a global idf.
>>> 2. Each segment has its own tii and tis files, which may make search
>>> slower (that's why index optimization is necessary).
>>> 3. A term's doc list is distributed across many files rather than
>>> one. More than one frq file means the hard disk must seek across
>>> different tracks, which is time-consuming. If there is only one
>>> segment, the postings are likely stored in a single track.
>>>
>>> 2010/12/31 Earwin Burrfoot <ear...@gmail.com>:
>>>>>>> until we fix Lucene to run a single search concurrently (which we
>>>>>>> badly need to do).
>>>>> I am interested in this idea (I have posted about it before). Do
>>>>> you have any resources, such as papers or tech articles, about it?
>>>>> I tried it, but it required dramatic changes to the index format,
>>>>> and we use Solr distributed search to relieve the response-time
>>>>> problem, so I finally gave it up.
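[Li Li's first disadvantage above, per-segment idf, is usually addressed by summing document frequencies across segments before scoring. A minimal sketch, assuming a hypothetical Segment stand-in (not a Lucene class) that exposes per-segment docFreq and maxDoc:]

```java
import java.util.List;

public class GlobalIdf {
    // Hypothetical stand-in for one segment's statistics for a single term.
    static class Segment {
        final int docFreq;  // docs in this segment containing the term
        final int maxDoc;   // total docs in this segment
        Segment(int docFreq, int maxDoc) { this.docFreq = docFreq; this.maxDoc = maxDoc; }
    }

    // Classic Lucene-style idf = 1 + ln(numDocs / (docFreq + 1)),
    // computed from totals summed over all segments rather than per segment.
    static double globalIdf(List<Segment> segments) {
        long df = 0, numDocs = 0;
        for (Segment s : segments) {
            df += s.docFreq;
            numDocs += s.maxDoc;
        }
        return 1.0 + Math.log((double) numDocs / (df + 1));
    }

    public static void main(String[] args) {
        List<Segment> segs = List.of(new Segment(3, 100), new Segment(7, 400));
        // Globally: df = 10, numDocs = 500
        System.out.printf("%.4f%n", globalIdf(segs));
    }
}
```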
>>>>> Lucene 4's index format is more flexible in that it supports custom
>>>>> codecs, and it is now under development, so I think it's a good time
>>>>> to consider letting it support multithreaded searching for a single
>>>>> query.
>>>>> I have a naive solution: divide each doc list into several groups,
>>>>> e.g. grouping docIds by whether they are even or odd:
>>>>> term1 df1=4 docList = 0 4 8 10
>>>>> term1 df2=4 docList = 1 3 9 11
>>>>>
>>>>> term2 df1=4 docList = 0 6 8 12
>>>>> term2 df2=4 docList = 3 9 11 15
>>>>> Then we can use 2 threads to search the top-N docs on the even group
>>>>> and the odd group, and finally merge their results into a single
>>>>> list, just like Solr distributed search.
>>>>> But it's better than Solr distributed search.
>>>>> First, it runs in a single process, and data communication between
>>>>> threads is much faster than the network.
>>>>> Second, each thread processes the same number of documents. In Solr
>>>>> distributed search, one shard may process 7 documents while another
>>>>> processes 1. Even if we make each shard hold the same number of
>>>>> documents, we cannot make the distribution uniform for every term.
>>>>> E.g. shard1 has doc1, doc2 and shard2 has doc3, doc4, but term1 may
>>>>> occur only in doc1 and doc2, while term2 may occur only in doc3 and
>>>>> doc4.
>>>>> We may modify it so that shard1 has doc1, doc3 and shard2 has doc2,
>>>>> doc4. That's good for term1 and term2, but then term3 may occur only
>>>>> in doc1 and doc3...
>>>>> So I think this is fine-grained distribution inside the index, while
>>>>> Solr distributed search is coarse-grained.
>>>> This is just crazy :)
>>>>
>>>> The simple way is just to search different segments in parallel.
>>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>>> large segments (and small ones don't count, they're small!).
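[The "search different segments in parallel, then merge" approach can be sketched as follows. This is purely illustrative, not Lucene's API: each "segment" is just an array of precomputed scores, scored by its own task, with the per-segment top-N lists merged into a global top-N the same way Solr merges shard results. The Hit type and docBase mapping are assumptions for the sketch.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSegments {
    record Hit(int globalDoc, float score) {}

    // "Score" one segment: docBase maps segment-local ids to global ids.
    static List<Hit> searchSegment(float[] scores, int docBase, int n) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) hits.add(new Hit(docBase + i, scores[i]));
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
        return hits.subList(0, Math.min(n, hits.size()));
    }

    // Submit one task per segment, then merge the partial top-N lists.
    static List<Hit> search(List<float[]> segments, int n) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        try {
            List<Future<List<Hit>>> futures = new ArrayList<>();
            int docBase = 0;
            for (float[] seg : segments) {
                final int base = docBase;
                futures.add(pool.submit(() -> searchSegment(seg, base, n)));
                docBase += seg.length;
            }
            List<Hit> all = new ArrayList<>();
            for (Future<List<Hit>> f : futures) all.addAll(f.get());
            all.sort(Comparator.comparingDouble((Hit h) -> h.score()).reversed());
            return all.subList(0, Math.min(n, all.size()));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<float[]> segments = List.of(
            new float[] {0.2f, 0.9f, 0.1f},  // segment 0: global docs 0-2
            new float[] {0.8f, 0.3f});       // segment 1: global docs 3-4
        for (Hit h : search(segments, 2)) System.out.println(h.globalDoc() + " " + h.score());
    }
}
```

[Note that scoring stays embarrassingly parallel because each segment is self-contained; only the cheap final merge is serial.]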
>>>> If you're bound on squeezing out that extra millisecond (and making
>>>> your life miserable along the way), you can search a single segment
>>>> with multiple threads (by dividing it into even chunks, and then
>>>> doing skipTo to position your iterators at the beginning of each
>>>> chunk).
>>>>
>>>> The first approach is really easy to implement. The second one is
>>>> harder, but still doesn't require you to cook the number of available
>>>> CPU cores into your index!
>>>>
>>>> It's the law of diminishing returns at play here. You're most likely
>>>> to search in parallel over a mostly memory-resident index
>>>> (RAMDir/mmap/filesystem cache - doesn't matter), as most IO
>>>> subsystems tend to slow down considerably on parallel sequential
>>>> reads, so you already have pretty decent speed.
>>>> Searching different segments in parallel (with BSMP) makes you
>>>> several times faster.
>>>> Searching in parallel within a segment requires some weird hacks, but
>>>> has maybe a few percent advantage over the previous solution.
>>>> Sharding posting lists requires a great deal of weird hacks, makes
>>>> the index machine-bound, and boosts speed by another couple of
>>>> percent. Sounds worthless.
>>>>
>>>> --
>>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>>> Phone: +7 (495) 683-567-4
>>>> ICQ: 104465785
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: dev-h...@lucene.apache.org

--
Lance Norskog
goks...@gmail.com
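[The second approach, chunking a single segment's doc-id space and positioning each thread's iterator with skipTo, can be sketched as below. PostingList is an illustrative stand-in for Lucene's postings iterator, not the real class; in a real searcher each chunk would run in its own thread over its own iterator instance.]

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedPostings {
    static class PostingList {
        private final int[] docs;  // sorted docIds containing the term
        private int pos = 0;
        PostingList(int[] docs) { this.docs = docs; }
        // skipTo: move to the first doc >= target (like Lucene's advance()).
        int skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
            return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        }
        int next() {
            pos++;
            return pos < docs.length ? docs[pos] : Integer.MAX_VALUE;
        }
    }

    // Collect the matching docs of one chunk [chunkStart, chunkEnd):
    // skipTo jumps to the chunk start, then we iterate until the chunk end.
    static List<Integer> collectChunk(int[] postings, int chunkStart, int chunkEnd) {
        PostingList it = new PostingList(postings);
        List<Integer> found = new ArrayList<>();
        for (int doc = it.skipTo(chunkStart); doc < chunkEnd; doc = it.next()) {
            found.add(doc);
        }
        return found;
    }

    public static void main(String[] args) {
        int[] postings = {0, 4, 8, 10, 13, 21};
        int maxDoc = 24, chunks = 2, chunkSize = maxDoc / chunks;
        for (int c = 0; c < chunks; c++) {
            System.out.println("chunk " + c + ": "
                + collectChunk(postings, c * chunkSize, (c + 1) * chunkSize));
        }
    }
}
```

[This is why the index needn't know the core count: the chunk boundaries are chosen at query time, and skipTo does the positioning.]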