Is anyone familiar with MG4J (http://mg4j.dsi.unimi.it/)? It says: "Multithreading. Indices can be queried and scored concurrently." Maybe we can learn something from it.
2010/12/31 Li Li <fancye...@gmail.com>:
> Plus:
> Point 2 means that looking up a term needs many seeks in the tis files
> (if the term is not cached in tii).
>
> 2010/12/31 Li Li <fancye...@gmail.com>:
>> Searching multiple segments is an alternative solution, but it has some
>> disadvantages.
>> 1. idf is not global? (I am not familiar with its implementation.) Maybe
>> it's easy to solve by sharing a global idf.
>> 2. Each segment will have its own tii and tis files, which may make
>> search slower (that's why optimization of the index is necessary).
>> 3. One term's docList is distributed across many files rather than one.
>> More than one frq file means the hard disk must seek to different
>> tracks, which is time-consuming. If there is only one segment, they are
>> likely stored in a single track.
>>
>> 2010/12/31 Earwin Burrfoot <ear...@gmail.com>:
>>>>>> until we fix Lucene to run a single search concurrently (which we
>>>>>> badly need to do).
>>>> I am interested in this idea (I have posted about it before). Do you have
>>>> any resources such as papers or tech articles about it?
>>>> I have tried, but it requires modifying the index format dramatically, and
>>>> we use Solr distributed search to relieve the response-time problem, so I
>>>> finally gave it up.
>>>> Lucene 4's index format is more flexible in that it supports custom codecs,
>>>> and it is now under development, so I think it's a good time to consider
>>>> letting it support multithreaded searching for a single query.
>>>> I have a naive solution: dividing the docList into several groups,
>>>> e.g. grouping docIds by whether they are even or odd:
>>>> term1 df1=4 docList = 0 4 8 10
>>>> term1 df2=4 docList = 1 3 9 11
>>>>
>>>> term2 df1=4 docList = 0 6 8 12
>>>> term2 df2=4 docList = 3 9 11 15
>>>> Then we can use 2 threads to search the top N docs in the even group and
>>>> the odd group, and finally merge their results into a single list, just
>>>> like Solr distributed search.
>>>> But it's better than Solr distributed search.
>>>> First, it's in a single process, and data communication between threads
>>>> is much faster than over the network.
>>>> Second, each thread processes the same number of documents. With Solr
>>>> distributed search, one shard may process 7 documents and another shard
>>>> only 1 document.
>>>> Even if we can make each shard hold the same number of documents, we cannot
>>>> make the distribution uniform for each term.
>>>> e.g. shard1 has doc1 doc2
>>>> shard2 has doc3 doc4
>>>> but term1 may only occur in doc1 and doc2
>>>> while term2 may only occur in doc3 and doc4
>>>> We may change it to
>>>> shard1 doc1 doc3
>>>> shard2 doc2 doc4
>>>> That's good for term1 and term2,
>>>> but term3 may occur in doc1 and doc3...
>>>> So I think this is fine-grained distribution within the index, while Solr
>>>> distributed search is coarse-grained.
>>> This is just crazy :)
>>>
>>> The simple way is just to search different segments in parallel.
>>> BalancedSegmentMergePolicy makes sure you have roughly even-sized
>>> large segments (and small ones don't count, they're small!).
>>> If you're bound on squeezing out that extra millisecond (and making
>>> your life miserable along the way), you can search a single segment
>>> with multiple threads (by dividing it in even chunks, and then doing
>>> skipTo to position your iterators to the beginning of each chunk).
>>>
>>> First approach is really easy to implement. Second one is harder, but
>>> still doesn't require you to cook the number of CPU cores available
>>> into your index!
>>>
>>> It's the law of diminishing returns at play here. You're most likely
>>> to search in parallel over a mostly memory-resident index
>>> (RAMDir/mmap/filesys cache - doesn't matter), as most IO subsystems
>>> tend to slow down considerably on parallel sequential reads, so you
>>> already have pretty decent speed.
>>> Searching different segments in parallel (with BSMP) makes you several
>>> times faster.
>>> Searching in parallel within a segment requires some weird hacks, but
>>> has maybe a few percent advantage over the previous solution.
>>> Sharding posting lists requires a great deal of weird hacks, makes the
>>> index machine-bound, and boosts speed by another couple of percent.
>>> Sounds worthless.
>>>
>>> --
>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>> Phone: +7 (495) 683-567-4
>>> ICQ: 104465785
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
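
The two proposals in the quoted thread share one pattern: search each partition on its own thread, keep a per-partition top N, then merge the partial results. The following is a minimal toy sketch of that pattern in Python, not Lucene code. The names PARTITIONS, search_partition and parallel_search are hypothetical, and the scorer (count of matching query terms) is an assumption chosen only to keep the example small. A partition here could stand for a whole segment (the simple approach) or for the even/odd docId groups from the example above.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Toy posting lists taken from the thread's example: each term's docList is
# split into an even-docId group and an odd-docId group.
# (Hypothetical in-memory layout; a real index stores this very differently.)
PARTITIONS = [
    {"term1": [0, 4, 8, 10], "term2": [0, 6, 8, 12]},   # even group
    {"term1": [1, 3, 9, 11], "term2": [3, 9, 11, 15]},  # odd group
]


def search_partition(postings, terms, top_n):
    """Return the top_n hits of one partition as (-score, doc_id) tuples.

    Assumed stand-in scorer: score = number of query terms the doc contains.
    """
    scores = {}
    for term in terms:
        for doc_id in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    # Negated score sorts highest score first; ties go to the lower docId.
    return heapq.nsmallest(top_n, ((-s, d) for d, s in scores.items()))


def parallel_search(partitions, terms, top_n):
    """Search every partition on its own thread, then merge the partial top-N."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial = pool.map(
            lambda p: search_partition(p, terms, top_n), partitions
        )
        merged = heapq.nsmallest(
            top_n, (hit for hits in partial for hit in hits)
        )
    return [(doc_id, -neg_score) for neg_score, doc_id in merged]


if __name__ == "__main__":
    for doc_id, score in parallel_search(PARTITIONS, ["term1", "term2"], 3):
        print(doc_id, score)
```

Because the partitions hold disjoint docIds and each returns its own top N, the merged top N matches what a single pass over the unpartitioned lists would return under this scorer. In CPython the threads here only illustrate the structure; they will not give real CPU parallelism for this kind of work.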