Re: strange problem of PForDelta decoder
our recent experiments show that PFOR is not a good solution for and query we tested it with our dataset and users' queries. for most case, PFOR is slower than vint. we analyzed the reason may be that it's very likely there is a low-frequent term in most queries. So the scoring time is the majority while decoding is not. e.g in our index, term beijing's df is 2557916 and park is 2313201, both them are hight frequent terms. but the count of documents containing both is only 1552 for vint, it only need decode 1552 documents, while PFOR, it may decode many blocks. for most search engines, and query is used. So PFOR is only good for or query and and query whose terms are all high frequent. So we have to give up this in our application. partial decoder for PFOR? for all high frequent terms, using normal PFOR decoder ;for quries with low frequent terms, using partial decoder? partial decoder of PFOR many need many if/else and will be slower. Any one has any solution for this? 2010/12/27 Li Li fancye...@gmail.com: I integrated pfor codec into lucene 2.9.3 and the search time comparsion is as follows: single term and query or query VINT in lucene 2.9.3 11.2 36.5 38.6 PFor in lucene 2.9.3 8.7 27.6 33.4 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lcuene 4 branch 8.1 22.5 30.7 My test terms are high frequncy terms because we are interested in bad case It seems lucene 4 branch's implementation of and query(conjuction query) is well optimized that even for VINT codec, it's faster than PFor in lucene 2.9.3. Could any one tell me what optimization is done? is store docIDs and freqs separately making it faster? or anything else? Another querstion, Is there anyone interested in integrating pfor codec into lucene 2.9.3 as me( we have to use lucene 2.9 and solr 1.4). And how do I contribute this patch? 2010/12/24 Michael McCandless luc...@mikemccandless.com: Well, an early patch somewhere was able to run PFor on trunk, but the performance wasn't great because the trunk bulk-read API is a bottleneck (this is why the bulk postings branch was created). Mike On Wed, Dec 22, 2010 at 9:45 PM, Li Li fancye...@gmail.com wrote: I used the bulkpostings branch(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene) does trunk have PForDelta decoder/encoder ? 2010/12/23 Michael McCandless luc...@mikemccandless.com: Those are nice speedups! Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this test? Mike On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote: great improvement! I did a test in our data set. doc count is about 2M+ and index size after optimization is about 13.3GB(including fdt) it seems lucene4's index format is better than lucene2.9.3. and PFor give good results. Besides BlockEncoder for frq and pos. is there any other modification for lucene 4? decoder \ avg time single word(ms) and query(ms) or query(ms) VINT in lucene 2.9 11.2 36.5 38.6 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lucene 4 branch 8.1 22.5 30.7 2010/12/21 Li Li fancye...@gmail.com: OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! I suggests you not to use this one, I fixed dozens of bugs but it still failed when with random tests. it's codes is hand coded rather than generated by program. But we may learn something from it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
Great :) Also, realize that certain queries (MultiTermQuery) spend lots of time in rewrite and (sometimes, anyway) comparatively small amounts of time in actually running the query. So it'd be nice to have a good way to let multiple threads participate in rewrite too. Thread scheduling/prioritization is also important, so that a big slow query doesn't starve execution for fast queries. Ideally (unfortunately!) I think we'd want something like how the OS schedules processes, ie new processes run with higher priority than long running resource consuming processes. So a new query would get higher priority, but, if it runs for a long time, it's priority is reduced. Not sure how we should solve that one... Mike On Wed, Jan 5, 2011 at 10:09 PM, Li Li fancye...@gmail.com wrote: we recently are interested in this problem. if we come up with a patch, I'd like to share it with everyone. 2011/1/4 Michael McCandless luc...@mikemccandless.com: 2011/1/4 Li Li fancye...@gmail.com: I agree with you that we should not tie concurrency w/in a single search to index segments. That solution is just a hack. will lucene 4 support multithreads search for a single query? I haven't found any patch about this. Well, as things stand now, Lucene 4 will only support the thread per segment hack. The patch on LUCENE-2837 (still needs work) merges ParallelMultiSearcher into IndexSearcher, carrying over that hack. But this discussion seems like it could lead to a nice patch? (If someone has the time/energy/itch to cons one up). Just dividing up the docID space equally seems like a simple solution that'd work well... Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
we recently are interested in this problem. if we come up with a patch, I'd like to share it with everyone. 2011/1/4 Michael McCandless luc...@mikemccandless.com: 2011/1/4 Li Li fancye...@gmail.com: I agree with you that we should not tie concurrency w/in a single search to index segments. That solution is just a hack. will lucene 4 support multithreads search for a single query? I haven't found any patch about this. Well, as things stand now, Lucene 4 will only support the thread per segment hack. The patch on LUCENE-2837 (still needs work) merges ParallelMultiSearcher into IndexSearcher, carrying over that hack. But this discussion seems like it could lead to a nice patch? (If someone has the time/energy/itch to cons one up). Just dividing up the docID space equally seems like a simple solution that'd work well... Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
2011/1/4 Li Li fancye...@gmail.com: I agree with you that we should not tie concurrency w/in a single search to index segments. That solution is just a hack. will lucene 4 support multithreads search for a single query? I haven't found any patch about this. Well, as things stand now, Lucene 4 will only support the thread per segment hack. The patch on LUCENE-2837 (still needs work) merges ParallelMultiSearcher into IndexSearcher, carrying over that hack. But this discussion seems like it could lead to a nice patch? (If someone has the time/energy/itch to cons one up). Just dividing up the docID space equally seems like a simple solution that'd work well... Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
Here's the paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.8091 I haven't read it yet... In general I don't like tying concurrency w/in a single search to index segments; I'd rather they be (relatively?) independent. EG an optimized index would then force single thread queries, which is crazy... In Lucene it should be easy to chop up the maxDoc(), ie if I have 10M docs and I want to use 4 threads I just give each thread a chunk of 2.5M docs. Each thread would first skip to its start doc, and then go from there. I'm not sure we'd want to share a single collector vs private collector per thread and then merge in the end (this is a similar tradeoff as we've discussed on the per-segment collectors). Mike 2011/1/1 Li Li fancye...@gmail.com: I sent a mail to MG4J group and Sebastiano Vigna recommended the paper Reducing query latencies in web search using fine-grained parallelism. World Wide Web, 12(4):441-460, 2009. I read it roughly. But there are some questions it says: it first coalesces all disk reads in a single process, and then distributes the index data among the parallel threads. So when the index server receives a query, it loads from storage (disk or cache) the required posting lists, initializes the query execution, and then spawns lightweight threads, one per core. Each thread receives an equal-sized subset of document IDs to scan, together covering the entire index partition. All threads execute the same code on the same query, but with private data structures. The only writable shared data structure is a heap of the top-scoring hits, protected by a lock. At the end of the threads' execution, the heap contains the highest-scoring hits in the entire index partition, which is then transmitted to the query integrator as usual. Since the index contains skip-lists that permit near-random-access to any document ID, and since hits are generally distributed evenly in the index, our measurements show that all threads complete their work with little variability, within 1-2% of each other. I have some questions 1.Since the index contains skip-lists that permit near-random-access to any document ID skip list can be near random access?(especially when it's in hard disk) 2.For or query, it's easy. e.g. we search search and engine, we using one main thread to get postings the the 2 terms and divide their postings into 2 groups(e.g. even docIds and odd docIds) and using 2 threads to score them and finally merge it(or using locked priority queue) but for and query, we usally don't decode all the docIDs because we can skip many documents. especially when searching low frequent terms with high frequent terms. only a small number of docIDs of high frequent term is decoded. btw. I think or query is often much slower than and query. If we can parallel or query well, it's also very useful. On Dec 31, 2010, at 7:25 AM, Li Li wrote: Which one is used in MG4J to support multithreads searching? Are 2010/12/31 Li Li fancye...@gmail.com: is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/) it says Multithreading. Indices can be queried and scored concurrently. maybe we can learn something from it. 2010/12/31 Li Li fancye...@gmail.com: plus 2 means search a term need seek many times for tis(if it's not cached in tii) 2010/12/31 Li Li fancye...@gmail.com: searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a
Re: strange problem of PForDelta decoder
I agree with you that we should not tie concurrency w/in a single search to index segments. That solution is just a hack. will lucene 4 support multithreads search for a single query? I haven't found any patch about this. 2011/1/4 Michael McCandless luc...@mikemccandless.com: Here's the paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.8091 I haven't read it yet... In general I don't like tying concurrency w/in a single search to index segments; I'd rather they be (relatively?) independent. EG an optimized index would then force single thread queries, which is crazy... In Lucene it should be easy to chop up the maxDoc(), ie if I have 10M docs and I want to use 4 threads I just give each thread a chunk of 2.5M docs. Each thread would first skip to its start doc, and then go from there. I'm not sure we'd want to share a single collector vs private collector per thread and then merge in the end (this is a similar tradeoff as we've discussed on the per-segment collectors). Mike 2011/1/1 Li Li fancye...@gmail.com: I sent a mail to MG4J group and Sebastiano Vigna recommended the paper Reducing query latencies in web search using fine-grained parallelism. World Wide Web, 12(4):441-460, 2009. I read it roughly. But there are some questions it says: it first coalesces all disk reads in a single process, and then distributes the index data among the parallel threads. So when the index server receives a query, it loads from storage (disk or cache) the required posting lists, initializes the query execution, and then spawns lightweight threads, one per core. Each thread receives an equal-sized subset of document IDs to scan, together covering the entire index partition. All threads execute the same code on the same query, but with private data structures. The only writable shared data structure is a heap of the top-scoring hits, protected by a lock. At the end of the threads' execution, the heap contains the highest-scoring hits in the entire index partition, which is then transmitted to the query integrator as usual. Since the index contains skip-lists that permit near-random-access to any document ID, and since hits are generally distributed evenly in the index, our measurements show that all threads complete their work with little variability, within 1-2% of each other. I have some questions 1.Since the index contains skip-lists that permit near-random-access to any document ID skip list can be near random access?(especially when it's in hard disk) 2.For or query, it's easy. e.g. we search search and engine, we using one main thread to get postings the the 2 terms and divide their postings into 2 groups(e.g. even docIds and odd docIds) and using 2 threads to score them and finally merge it(or using locked priority queue) but for and query, we usally don't decode all the docIDs because we can skip many documents. especially when searching low frequent terms with high frequent terms. only a small number of docIDs of high frequent term is decoded. btw. I think or query is often much slower than and query. If we can parallel or query well, it's also very useful. On Dec 31, 2010, at 7:25 AM, Li Li wrote: Which one is used in MG4J to support multithreads searching? Are 2010/12/31 Li Li fancye...@gmail.com: is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/) it says Multithreading. Indices can be queried and scored concurrently. maybe we can learn something from it. 2010/12/31 Li Li fancye...@gmail.com: plus 2 means search a term need seek many times for tis(if it's not cached in tii) 2010/12/31 Li Li fancye...@gmail.com: searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping
Re: strange problem of PForDelta decoder
I sent a mail to MG4J group and Sebastiano Vigna recommended the paper Reducing query latencies in web search using fine-grained parallelism. World Wide Web, 12(4):441-460, 2009. I read it roughly. But there are some questions it says: it first coalesces all disk reads in a single process, and then distributes the index data among the parallel threads. So when the index server receives a query, it loads from storage (disk or cache) the required posting lists, initializes the query execution, and then spawns lightweight threads, one per core. Each thread receives an equal-sized subset of document IDs to scan, together covering the entire index partition. All threads execute the same code on the same query, but with private data structures. The only writable shared data structure is a heap of the top-scoring hits, protected by a lock. At the end of the threads' execution, the heap contains the highest-scoring hits in the entire index partition, which is then transmitted to the query integrator as usual. Since the index contains skip-lists that permit near-random-access to any document ID, and since hits are generally distributed evenly in the index, our measurements show that all threads complete their work with little variability, within 1-2% of each other. I have some questions 1.Since the index contains skip-lists that permit near-random-access to any document ID skip list can be near random access?(especially when it's in hard disk) 2.For or query, it's easy. e.g. we search search and engine, we using one main thread to get postings the the 2 terms and divide their postings into 2 groups(e.g. even docIds and odd docIds) and using 2 threads to score them and finally merge it(or using locked priority queue) but for and query, we usally don't decode all the docIDs because we can skip many documents. especially when searching low frequent terms with high frequent terms. only a small number of docIDs of high frequent term is decoded. btw. I think or query is often much slower than and query. If we can parallel or query well, it's also very useful. On Dec 31, 2010, at 7:25 AM, Li Li wrote: Which one is used in MG4J to support multithreads searching? Are 2010/12/31 Li Li fancye...@gmail.com: is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/) it says Multithreading. Indices can be queried and scored concurrently. maybe we can learn something from it. 2010/12/31 Li Li fancye...@gmail.com: plus 2 means search a term need seek many times for tis(if it's not cached in tii) 2010/12/31 Li Li fancye...@gmail.com: searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained
Re: strange problem of PForDelta decoder
Please purchase and read Lucene In Action 2. It will explain how all of this works and how to used Lucene efficiently. http://www.lucidimagination.com/blog/2009/03/11/lia2/ http://www.lucidimagination.com/blog/2010/08/01/lucene-in-action-2d-edition-authors-round-table-podcast/ http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Lucene-in-Action It is available from Manning as a downloadable E-Book (PDF) so you don't have to wait for it to be shipped. 2010/12/30 Li Li fancye...@gmail.com: is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/) it says Multithreading. Indices can be queried and scored concurrently. maybe we can learn something from it. 2010/12/31 Li Li fancye...@gmail.com: plus 2 means search a term need seek many times for tis(if it's not cached in tii) 2010/12/31 Li Li fancye...@gmail.com: searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained distributed in index while solr distributed search is coarse- grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bound on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it in even chunks, and then doing skipTo to position your iterators to the beginning of each chunk). First approach is really easy to implement. Second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over previous solution. Sharding posting lists requires a great deal of weird hacks, makes index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
On Mon, Dec 27, 2010 at 5:08 AM, Li Li fancye...@gmail.com wrote: I integrated pfor codec into lucene 2.9.3 and the search time comparsion is as follows: single term and query or query VINT in lucene 2.9.3 11.2 36.5 38.6 PFor in lucene 2.9.3 8.7 27.6 33.4 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lcuene 4 branch 8.1 22.5 30.7 My test terms are high frequncy terms because we are interested in bad case I agree it's the bad cases we should focus on in general. If a super fast query gets somewhat slower it's relatively harmless (just a capacity question for high volume sites) but if the bad queries get slower it's awful (requires faster cutover to sharded architecture), until we fix Lucene to run a single search concurrently (which we badly need to do). It seems lucene 4 branch's implementation of and query(conjuction query) is well optimized that even for VINT codec, it's faster than PFor in lucene 2.9.3. Could any one tell me what optimization is done? is store docIDs and freqs separately making it faster? or anything else? Actually vInt on the bulkpostings branch stores freq/doc together. Ie the format is the same as 2.9.x's format. I think it could be the fact that AND query does block reads (64 doc/freqs at once) instead of doc-at-once? Ie, because of this, the query is efficitively scanning the next block of 64 docs instead of skipping to them? Our skipping impl is unfortunately rather costly so if skip will not skip that many docs it's better to scan. Another querstion, Is there anyone interested in integrating pfor codec into lucene 2.9.3 as me( we have to use lucene 2.9 and solr 1.4). And how do I contribute this patch? Realistically I don't think we can commit this to 2.9.x -- that branch is purely bug fixes at this point. Still it's possible others could make use of such a patch so if it's not too much work you may as well post it? It can lead to improvements on the bulk postings branch too :) The more patches the merrier! You only use PFor for the very high freq terms in 2.9.x right? I've wondered if we should do the same on bulkpostings... problem is for eg range queries, that visit all docs for all terms b/w X and Y, you want the bulk decode even for low freq terms... Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
I did another test using lucene 4 trunk with default codecs. it's file is the same as lucene 2.9. the speed is almost the same as lucene 2.9 I think it could be the fact that AND query does block reads (64 doc/freqs at once) instead of doc-at-once? Ie, because of this, the query is efficitively scanning the next block of 64 docs instead of skipping to them? Our skipping impl is unfortunately rather costly so if skip will not skip that many docs it's better to scan. I agree with this explanation. for high frequency terms, the skiplist can not skip over many docs. it seems there are something need optimization. e.g. for high frequent terms, we use scanning; for low frequent terms, we use skiplist. but if we only care bad case, we can just care high frequent terms only. You only use PFor for the very high freq terms in 2.9.x right? I use PFor if df is greater than 128. if not, I use VINT until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained distributed in index while solr distributed search is coarse- grained. 2010/12/30 Michael McCandless luc...@mikemccandless.com: On Mon, Dec 27, 2010 at 5:08 AM, Li Li fancye...@gmail.com wrote: I integrated pfor codec into lucene 2.9.3 and the search time comparsion is as follows: single term and query or query VINT in lucene 2.9.3 11.2 36.5 38.6 PFor in lucene 2.9.3 8.7 27.6 33.4 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lcuene 4 branch 8.1 22.5 30.7 My test terms are high frequncy terms because we are interested in bad case I agree it's the bad cases we should focus on in general. If a super fast query gets somewhat slower it's relatively harmless (just a capacity question for high volume sites) but if the bad queries get slower it's awful (requires faster cutover to sharded architecture), until we fix Lucene to run a single search concurrently (which we badly need to do). It seems lucene 4 branch's implementation of and query(conjuction query) is well optimized that even for VINT codec, it's faster than PFor in lucene 2.9.3. Could any one tell me what optimization is done? is store docIDs and freqs separately making it faster? or anything else? Actually vInt on the bulkpostings branch stores freq/doc together. Ie the format is the same as 2.9.x's format. I think it could be the fact that AND query does block reads (64 doc/freqs at once) instead of doc-at-once? Ie, because of this, the query is efficitively scanning the next block of 64 docs instead of skipping to them? Our skipping impl is unfortunately rather costly so if skip will not skip that many docs it's better to scan. Another querstion, Is there anyone interested in integrating pfor codec into lucene 2.9.3 as me( we have to use lucene 2.9 and solr 1.4). And how do I contribute this patch? Realistically I don't think we can commit this to 2.9.x -- that branch is purely bug fixes at this point. Still it's possible others could make use of such a patch so if it's not too much work you may as well post it? It can lead to improvements on the bulk postings branch too :) The more patches the merrier! You only use PFor for the very high freq terms
Re: strange problem of PForDelta decoder
until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained distributed in index while solr distributed search is coarse- grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bound on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it in even chunks, and then doing skipTo to position your iterators to the beginning of each chunk). First approach is really easy to implement. Second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over previous solution. Sharding posting lists requires a great deal of weird hacks, makes index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained distributed in index while solr distributed search is coarse- grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bound on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it in even chunks, and then doing skipTo to position your iterators to the beginning of each chunk). First approach is really easy to implement. Second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over previous solution. Sharding posting lists requires a great deal of weird hacks, makes index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
plus 2 means search a term need seek many times for tis(if it's not cached in tii) 2010/12/31 Li Li fancye...@gmail.com: searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained distributed in index while solr distributed search is coarse- grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bound on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it in even chunks, and then doing skipTo to position your iterators to the beginning of each chunk). First approach is really easy to implement. Second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over previous solution. Sharding posting lists requires a great deal of weird hacks, makes index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
is there anyone familiar with MG4J(http://mg4j.dsi.unimi.it/) it says Multithreading. Indices can be queried and scored concurrently. maybe we can learn something from it. 2010/12/31 Li Li fancye...@gmail.com: plus 2 means search a term need seek many times for tis(if it's not cached in tii) 2010/12/31 Li Li fancye...@gmail.com: searching multi segments is a alternative solution but it has some disadvantages. 1. idf is not global?(I am not familiar with its implementation) maybe it's easy to solve it by share global idf 2. each segments will has it's own tii and tis files, which may make search slower(that's why optimization of index is neccessary) 3. one term's docList is distributed in many files rather than one. more than one frq files means hard disk must seek different tracks, it's time consuming. if there is only one segment, the are likely stored in a single track. 2010/12/31 Earwin Burrfoot ear...@gmail.com: until we fix Lucene to run a single search concurrently (which we badly need to do). I am interested in this idea.(I have posted it before) do you have some resources such as papers or tech articles about it? I have tried but it need to modify index format dramatically and we use solr distributed search to relieve the problem of response time. so finally give it up. lucene4's index format is more flexible that it supports customed codecs and it's now on development, I think it's good time to take it into consideration that let it support multithread searching for a single query. I have a naive solution. dividing docList into many groups e.g grouping docIds by it's even or odd term1 df1=4 docList = 0 4 8 10 term1 df2=4 docList = 1 3 9 11 term2 df1=4 docList = 0 6 8 12 term2 df2=4 docList = 3 9 11 15 then we can use 2 threads to search topN docs on even group and odd group and finally merge their results into a single on just like solr distributed search. But it's better than solr distributed search. First, it's in a single process and data communication between threads is much faster than network. Second, each threads process the same number of documents.For solr distributed search, one shard may process 7 documents and another shard may 1 document Even if we can make each shard have the same document number. we can not make it uniformly for each term. e.g. shard1 has doc1 doc2 shard2 has doc3 doc4 but term1 may only occur in doc1 and doc2 while term2 may only occur in doc3 and doc4 we may modify it shard1 doc1 doc3 shard2 doc2 doc4 it's good for term1 and term2 but term3 may occur in doc1 and doc3... So I think it's fine-grained distributed in index while solr distributed search is coarse- grained. This is just crazy :) The simple way is just to search different segments in parallel. BalancedSegmentMergePolicy makes sure you have roughly even-sized large segments (and small ones don't count, they're small!). If you're bound on squeezing out that extra millisecond (and making your life miserable along the way), you can search a single segment with multiple threads (by dividing it in even chunks, and then doing skipTo to position your iterators to the beginning of each chunk). First approach is really easy to implement. Second one is harder, but still doesn't require you to cook the number of CPU cores available into your index! It's the law of diminishing returns at play here. You're most likely to search in parallel over mostly memory-resident index (RAMDir/mmap/filesys cache - doesn't matter), as most of IO subsystems tend to slow down considerably on parallel sequential reads, so you already have pretty decent speed. Searching different segments in parallel (with BSMP) makes you several times faster. Searching in parallel within a segment requires some weird hacks, but has maybe a few percent advantage over previous solution. Sharding posting lists requires a great deal of weird hacks, makes index machine-bound, and boosts speed by another couple of percent. Sounds worthless. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Phone: +7 (495) 683-567-4 ICQ: 104465785 - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
I integrated pfor codec into lucene 2.9.3 and the search time comparsion is as follows: single term and query or query VINT in lucene 2.9.3 11.236.5 38.6 PFor in lucene 2.9.3 8.7 27.6 33.4 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lcuene 4 branch8.1 22.5 30.7 My test terms are high frequncy terms because we are interested in bad case It seems lucene 4 branch's implementation of and query(conjuction query) is well optimized that even for VINT codec, it's faster than PFor in lucene 2.9.3. Could any one tell me what optimization is done? is store docIDs and freqs separately making it faster? or anything else? Another querstion, Is there anyone interested in integrating pfor codec into lucene 2.9.3 as me( we have to use lucene 2.9 and solr 1.4). And how do I contribute this patch? 2010/12/24 Michael McCandless luc...@mikemccandless.com: Well, an early patch somewhere was able to run PFor on trunk, but the performance wasn't great because the trunk bulk-read API is a bottleneck (this is why the bulk postings branch was created). Mike On Wed, Dec 22, 2010 at 9:45 PM, Li Li fancye...@gmail.com wrote: I used the bulkpostings branch(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene) does trunk have PForDelta decoder/encoder ? 2010/12/23 Michael McCandless luc...@mikemccandless.com: Those are nice speedups! Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this test? Mike On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote: great improvement! I did a test in our data set. doc count is about 2M+ and index size after optimization is about 13.3GB(including fdt) it seems lucene4's index format is better than lucene2.9.3. and PFor give good results. Besides BlockEncoder for frq and pos. is there any other modification for lucene 4? decoder \ avg time single word(ms) and query(ms) or query(ms) VINT in lucene 2.9 11.2 36.5 38.6 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lucene 4 branch 8.1 22.5 30.7 2010/12/21 Li Li fancye...@gmail.com: OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! I suggests you not to use this one, I fixed dozens of bugs but it still failed when with random tests. it's codes is hand coded rather than generated by program. But we may learn something from it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
Well, an early patch somewhere was able to run PFor on trunk, but the performance wasn't great because the trunk bulk-read API is a bottleneck (this is why the bulk postings branch was created). Mike On Wed, Dec 22, 2010 at 9:45 PM, Li Li fancye...@gmail.com wrote: I used the bulkpostings branch(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene) does trunk have PForDelta decoder/encoder ? 2010/12/23 Michael McCandless luc...@mikemccandless.com: Those are nice speedups! Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this test? Mike On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote: great improvement! I did a test in our data set. doc count is about 2M+ and index size after optimization is about 13.3GB(including fdt) it seems lucene4's index format is better than lucene2.9.3. and PFor give good results. Besides BlockEncoder for frq and pos. is there any other modification for lucene 4? decoder \ avg time single word(ms) and query(ms) or query(ms) VINT in lucene 2.9 11.2 36.5 38.6 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lucene 4 branch 8.1 22.5 30.7 2010/12/21 Li Li fancye...@gmail.com: OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! I suggests you not to use this one, I fixed dozens of bugs but it still failed when with random tests. it's codes is hand coded rather than generated by program. But we may learn something from it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
Those are nice speedups! Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this test? Mike On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote: great improvement! I did a test in our data set. doc count is about 2M+ and index size after optimization is about 13.3GB(including fdt) it seems lucene4's index format is better than lucene2.9.3. and PFor give good results. Besides BlockEncoder for frq and pos. is there any other modification for lucene 4? decoder \ avg time single word(ms) and query(ms) or query(ms) VINT in lucene 2.9 11.2 36.5 38.6 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lucene 4 branch 8.1 22.5 30.7 2010/12/21 Li Li fancye...@gmail.com: OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! I suggests you not to use this one, I fixed dozens of bugs but it still failed when with random tests. it's codes is hand coded rather than generated by program. But we may learn something from it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
I used the bulkpostings branch(https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene) does trunk have PForDelta decoder/encoder ? 2010/12/23 Michael McCandless luc...@mikemccandless.com: Those are nice speedups! Did you use the 4.0 branch (ie trunk) or the bulkpostings branch for this test? Mike On Tue, Dec 21, 2010 at 9:59 PM, Li Li fancye...@gmail.com wrote: great improvement! I did a test in our data set. doc count is about 2M+ and index size after optimization is about 13.3GB(including fdt) it seems lucene4's index format is better than lucene2.9.3. and PFor give good results. Besides BlockEncoder for frq and pos. is there any other modification for lucene 4? decoder \ avg time single word(ms) and query(ms) or query(ms) VINT in lucene 2.9 11.2 36.5 38.6 VINT in lucene 4 branch 10.6 26.5 35.4 PFor in lucene 4 branch 8.1 22.5 30.7 2010/12/21 Li Li fancye...@gmail.com: OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! I suggests you not to use this one, I fixed dozens of bugs but it still failed when with random tests. it's codes is hand coded rather than generated by program. But we may learn something from it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
I think random test is not sufficient. for normal situation, some branches are not executed. I tested http://code.google.com/p/integer-array-compress-kit/ with many random int arrays and it works. But when I use it in real indexing, when in optimize stage, it corrupted. Because PForDelta will choose best numFrameBits and some bit such as 31 is hardly generated by random arrays. So I force the encoder to choose all possible numFrameBits to test all the decode1 ...decode32 and find some bugs in it. what's pfor2? using s9/s16 to encode exception and offset? In http://code.google.com/p/integer-array-compress-kit/ it's s9 for NewPForDelta also have many bugs and also need test each branch to ensure it works well. 2010/12/20 Michael McCandless luc...@mikemccandless.com: It's autogen'd from the Python script in that dir (gendecompress), but, we are actively experimenting with the numerous ways to feed it data from the file, to see what the JVM can most efficiently execute. For testing, we need better coverage here. But I have an initial encode random ints test that I'm about to commit to the bulkpostings branch... the pfor1 impl passes it, but pfor2 doesn't yet (I think maybe because Simple16 can't handle ints = 2^28?). Mike On Sun, Dec 19, 2010 at 10:06 PM, Li Li fancye...@gmail.com wrote: is ForDecompressImpl generated by codes or manully coded? I am frustrated by http://code.google.com/p/integer-array-compress-kit/ which contains too many bugs( I fixed more than 20 but there still existed bugs) Because decoder has too many branches and in normal situation, some branches seldom occurs . these decoder implemented in branch assume blockSize==128, it has less branches than decoder which support arbitrary blockSize. How do you test these decoder to ensure every branch is tested? 2010/12/16 Michael McCandless luc...@mikemccandless.com: On the bulkpostings branch you can do something like this: CodecProvider cp = new CodecProvider(); cp.register(new PatchedFrameOfRefCodec()); cp.setDefaultFieldCodec(PatchedFrameOfRef); Then whenever you create an IW or IR, use the advanced method that accepts a CodecProvider. Then the index will always use PForDelta to write/read. I suspect conjunction queries got faster because we no longer skip if the docID we seek is already in the current buffer (currently sized 64). Ie, skip is very costly when the target isn't far. This was sort of an accidental byproduct of forcing even conjunction queries using Standard (vInt) codec to work on block buffers, but I think it's an important opto that we should more generally apply. Skipping for block codecs and Standard/vInt are done w/ the same class now. It's just that the block codec must store the long filePointer where the block starts *and* the int offset into the block, vs Standard codec that just stores a filePointer. On how do we implement bulk read this is the core change on the bulkpostings branch -- we have a new API to separately bulk-read docDeltas, freqs, positionDeltas. But we are rapidly iterating on improving this (and getting to a clean PFor/For impl) now... Mike On Thu, Dec 16, 2010 at 4:29 AM, Li Li fancye...@gmail.com wrote: hi Michael, lucene 4 has so much changes that I don't know how to index and search with specified codec. could you please give me some code snipplets that using PFor codec so I can trace the codes. in you blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. I am also curious about the result because VINT only need decode part of the doclist while PFor need decode the whole block. But I think with conjuction queries, the main time is used for searching in skiplist. I haven't read your codes yet. But I guess the skiplist for VINT and the skiplist for PFor is different. e.g. lucene 2.9's default skipInterval is 16, so it like level 1 256 level 0 16 32 48 64 80 96 112 128 ... 256 when need skipTo(60) we need read 0 16 32 48 64 in level0 but when use block, e.g. block size is 128, my implementation of skiplist is level 1 256 level 0 128 256 when skipTo(60) we only read 2 item in level0 and decode the first block which contains 128 docIDs How do you implement bulk read? I did like this: I decode a block and cache it in a int array. I think I can buffer up to 100K docIDs and tfs for disjuction queries(it cost less than 1MB memory for each term) SegmentTermDocs.read(final int[] docs, final int[] freqs) ... while (i length count df) { if (curBlockIdx = curBlockSize) { //this condition is often false, we may optimize it. but JVM hotspots will cache hot codes.
Re: strange problem of PForDelta decoder
On Mon, Dec 20, 2010 at 5:49 AM, Li Li fancye...@gmail.com wrote: I think random test is not sufficient. for normal situation, some branches are not executed. I tested http://code.google.com/p/integer-array-compress-kit/ with many random int arrays and it works. But when I use it in real indexing, when in optimize stage, it corrupted. Because PForDelta will choose best numFrameBits and some bit such as 31 is hardly generated by random arrays. So I force the encoder to choose all possible numFrameBits to test all the decode1 ...decode32 and find some bugs in it. Good point -- we need to make sure we cover all numFrameBits. And a series of 128 random ints in a row will heavily bias for the high num bits cases. Maybe if we doing a better job w/ the random source to try to target all numBits, w/ varying numbers of exceptions, etc... I'll put a nocommit for this. what's pfor2? using s9/s16 to encode exception and offset? Yeah I just committed pfor2 this morning on the bulk branch. You can check it out from https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings pfor2 came from the patch attached on https://issues.apache.org/jira/browse/LUCENE-1410 by Hao Yan (thanks!). It uses s16 for the exceptions (though, there's a bug somewhere, because it fails the random test), and it takes a different approachy for encoding exceptions. In http://code.google.com/p/integer-array-compress-kit/ it's s9 for NewPForDelta also have many bugs and also need test each branch to ensure it works well. OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! Mike - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
OK we should have a look at that one still. We need to converge on a good default codec for 4.0. Fortunately it's trivial to take any int block encoder (fixed or variable block) and make a Lucene codec out of it! I suggests you not to use this one, I fixed dozens of bugs but it still failed when with random tests. it's codes is hand coded rather than generated by program. But we may learn something from it. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
is ForDecompressImpl generated by codes or manully coded? I am frustrated by http://code.google.com/p/integer-array-compress-kit/ which contains too many bugs( I fixed more than 20 but there still existed bugs) Because decoder has too many branches and in normal situation, some branches seldom occurs . these decoder implemented in branch assume blockSize==128, it has less branches than decoder which support arbitrary blockSize. How do you test these decoder to ensure every branch is tested? 2010/12/16 Michael McCandless luc...@mikemccandless.com: On the bulkpostings branch you can do something like this: CodecProvider cp = new CodecProvider(); cp.register(new PatchedFrameOfRefCodec()); cp.setDefaultFieldCodec(PatchedFrameOfRef); Then whenever you create an IW or IR, use the advanced method that accepts a CodecProvider. Then the index will always use PForDelta to write/read. I suspect conjunction queries got faster because we no longer skip if the docID we seek is already in the current buffer (currently sized 64). Ie, skip is very costly when the target isn't far. This was sort of an accidental byproduct of forcing even conjunction queries using Standard (vInt) codec to work on block buffers, but I think it's an important opto that we should more generally apply. Skipping for block codecs and Standard/vInt are done w/ the same class now. It's just that the block codec must store the long filePointer where the block starts *and* the int offset into the block, vs Standard codec that just stores a filePointer. On how do we implement bulk read this is the core change on the bulkpostings branch -- we have a new API to separately bulk-read docDeltas, freqs, positionDeltas. But we are rapidly iterating on improving this (and getting to a clean PFor/For impl) now... Mike On Thu, Dec 16, 2010 at 4:29 AM, Li Li fancye...@gmail.com wrote: hi Michael, lucene 4 has so much changes that I don't know how to index and search with specified codec. could you please give me some code snipplets that using PFor codec so I can trace the codes. in you blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. I am also curious about the result because VINT only need decode part of the doclist while PFor need decode the whole block. But I think with conjuction queries, the main time is used for searching in skiplist. I haven't read your codes yet. But I guess the skiplist for VINT and the skiplist for PFor is different. e.g. lucene 2.9's default skipInterval is 16, so it like level 1 256 level 0 16 32 48 64 80 96 112 128 ... 256 when need skipTo(60) we need read 0 16 32 48 64 in level0 but when use block, e.g. block size is 128, my implementation of skiplist is level 1 256 level 0 128 256 when skipTo(60) we only read 2 item in level0 and decode the first block which contains 128 docIDs How do you implement bulk read? I did like this: I decode a block and cache it in a int array. I think I can buffer up to 100K docIDs and tfs for disjuction queries(it cost less than 1MB memory for each term) SegmentTermDocs.read(final int[] docs, final int[] freqs) ... while (i length count df) { if (curBlockIdx = curBlockSize) { //this condition is often false, we may optimize it. but JVM hotspots will cache hot codes. So ... int idBlockBytes = freqStream.readVInt(); curBlockIdx = 0; for (int k = 0; k idBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockIds = code.decode(buffer,idBlockBytes); curBlockSize = blockIds.length; int tfBlockBytes = freqStream.readVInt(); for (int k = 0; k tfBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockTfs = code.decode(buffer, tfBlockBytes); assert curBlockSize == decoded.length; } freq = blockTfs[curBlockIdx]; doc += blockIds[curBlockIdx++]; count++; if (deletedDocs == null || !deletedDocs.get(doc)) { docs[i] = doc;
Re: strange problem of PForDelta decoder
hi Michael, lucene 4 has so much changes that I don't know how to index and search with specified codec. could you please give me some code snipplets that using PFor codec so I can trace the codes. in you blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. I am also curious about the result because VINT only need decode part of the doclist while PFor need decode the whole block. But I think with conjuction queries, the main time is used for searching in skiplist. I haven't read your codes yet. But I guess the skiplist for VINT and the skiplist for PFor is different. e.g. lucene 2.9's default skipInterval is 16, so it like level 1 256 level 0 16 32 48 64 80 96 112 128 ... 256 when need skipTo(60) we need read 0 16 32 48 64 in level0 but when use block, e.g. block size is 128, my implementation of skiplist is level 1 256 level 0 128 256 when skipTo(60) we only read 2 item in level0 and decode the first block which contains 128 docIDs How do you implement bulk read? I did like this: I decode a block and cache it in a int array. I think I can buffer up to 100K docIDs and tfs for disjuction queries(it cost less than 1MB memory for each term) SegmentTermDocs.read(final int[] docs, final int[] freqs) ... while (i length count df) { if (curBlockIdx = curBlockSize) { //this condition is often false, we may optimize it. but JVM hotspots will cache hot codes. So ... int idBlockBytes = freqStream.readVInt(); curBlockIdx = 0; for (int k = 0; k idBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockIds = code.decode(buffer,idBlockBytes); curBlockSize = blockIds.length; int tfBlockBytes = freqStream.readVInt(); for (int k = 0; k tfBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockTfs = code.decode(buffer, tfBlockBytes); assert curBlockSize == decoded.length; } freq = blockTfs[curBlockIdx]; doc += blockIds[curBlockIdx++]; count++; if (deletedDocs == null || !deletedDocs.get(doc)) { docs[i] = doc; freqs[i] = freq; ++i; } } 2010/12/15 Michael McCandless luc...@mikemccandless.com: Hi Li Li, That issue has such a big patch, and enough of us are now iterating on it, that we cut a dedicated branch for it. But note that this branch is off of trunk (to be 4.0). You should be able to do this: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings And then run things in there. I just committed FOR/PFOR prototype codecs from LUCENE-1410 onto that branch, so eg you can run unit tests using those codecs by running ant test -Dtests.codec=PatchedFrameOfRef. Please post patches back if you improve things! We need all the help we can get :) Mike On Wed, Dec 15, 2010 at 5:54 AM, Li Li fancye...@gmail.com wrote: hi Michael you posted a patch here https://issues.apache.org/jira/browse/LUCENE-2723 I am not familiar with patch. do I need download LUCENE-2723.patch(there are many patches after this name, do I need the latest one?) and LUCENE-2723_termscorer.patch and patch them (patch -p1 LUCENE-2723.patch)? I just check out the latest source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene 2010/12/14 Michael McCandless luc...@mikemccandless.com: Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test? I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly
Re: strange problem of PForDelta decoder
On the bulkpostings branch you can do something like this: CodecProvider cp = new CodecProvider(); cp.register(new PatchedFrameOfRefCodec()); cp.setDefaultFieldCodec(PatchedFrameOfRef); Then whenever you create an IW or IR, use the advanced method that accepts a CodecProvider. Then the index will always use PForDelta to write/read. I suspect conjunction queries got faster because we no longer skip if the docID we seek is already in the current buffer (currently sized 64). Ie, skip is very costly when the target isn't far. This was sort of an accidental byproduct of forcing even conjunction queries using Standard (vInt) codec to work on block buffers, but I think it's an important opto that we should more generally apply. Skipping for block codecs and Standard/vInt are done w/ the same class now. It's just that the block codec must store the long filePointer where the block starts *and* the int offset into the block, vs Standard codec that just stores a filePointer. On how do we implement bulk read this is the core change on the bulkpostings branch -- we have a new API to separately bulk-read docDeltas, freqs, positionDeltas. But we are rapidly iterating on improving this (and getting to a clean PFor/For impl) now... Mike On Thu, Dec 16, 2010 at 4:29 AM, Li Li fancye...@gmail.com wrote: hi Michael, lucene 4 has so much changes that I don't know how to index and search with specified codec. could you please give me some code snipplets that using PFor codec so I can trace the codes. in you blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints. I am also curious about the result because VINT only need decode part of the doclist while PFor need decode the whole block. But I think with conjuction queries, the main time is used for searching in skiplist. I haven't read your codes yet. But I guess the skiplist for VINT and the skiplist for PFor is different. e.g. lucene 2.9's default skipInterval is 16, so it like level 1 256 level 0 16 32 48 64 80 96 112 128 ... 256 when need skipTo(60) we need read 0 16 32 48 64 in level0 but when use block, e.g. block size is 128, my implementation of skiplist is level 1 256 level 0 128 256 when skipTo(60) we only read 2 item in level0 and decode the first block which contains 128 docIDs How do you implement bulk read? I did like this: I decode a block and cache it in a int array. I think I can buffer up to 100K docIDs and tfs for disjuction queries(it cost less than 1MB memory for each term) SegmentTermDocs.read(final int[] docs, final int[] freqs) ... while (i length count df) { if (curBlockIdx = curBlockSize) { //this condition is often false, we may optimize it. but JVM hotspots will cache hot codes. So ... int idBlockBytes = freqStream.readVInt(); curBlockIdx = 0; for (int k = 0; k idBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockIds = code.decode(buffer,idBlockBytes); curBlockSize = blockIds.length; int tfBlockBytes = freqStream.readVInt(); for (int k = 0; k tfBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockTfs = code.decode(buffer, tfBlockBytes); assert curBlockSize == decoded.length; } freq = blockTfs[curBlockIdx]; doc += blockIds[curBlockIdx++]; count++; if (deletedDocs == null || !deletedDocs.get(doc)) { docs[i] = doc; freqs[i] = freq; ++i; } } 2010/12/15 Michael McCandless luc...@mikemccandless.com: Hi Li Li, That issue has such a big patch, and enough of us are now iterating on it, that we cut a dedicated branch for it. But note that this branch is off of trunk (to be 4.0). You should be able to do this: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings And then run things in there. I just committed FOR/PFOR prototype
Re: strange problem of PForDelta decoder
hi Michael you posted a patch here https://issues.apache.org/jira/browse/LUCENE-2723 I am not familiar with patch. do I need download LUCENE-2723.patch(there are many patches after this name, do I need the latest one?) and LUCENE-2723_termscorer.patch and patch them (patch -p1 LUCENE-2723.patch)? I just check out the latest source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene 2010/12/14 Michael McCandless luc...@mikemccandless.com: Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test? I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly impossible in Java since Hotspot acts very differently vs the real test. Mike On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote: Hi I tried to integrate PForDelta into lucene 2.9 but confronted a problem. I use the implementation in http://code.google.com/p/integer-array-compress-kit/ it implements a basic PForDelta algorithm and an improved one(which called NewPForDelta, but there are many bugs and I have fixed them), But compare it with VInt and S9, it's speed is very slow when only decode small number of integer arrays. e.g. when I decoded int[256] arrays which values are randomly generated between 0 and 100, if decode just one array. PFor(or NewPFor) is very slow. when it continuously decodes many arrays such as 1, it's faster than s9 and vint. Another strange phenomena is that when call PFor decoder twice, the 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor is faster. reverse the call sequcence, the 2nd called decoder is faster e.g. ct.testNewPFDCodes(list); ct.testPFor(list); ct.testVInt(list); ct.testS9(list); NewPFD decode: 3614705 PForDelta decode: 17320 VINT decode: 16483 S9 decode: 19835 when I call by the following sequence ct.testPFor(list); ct.testNewPFDCodes(list); ct.testVInt(list); ct.testS9(list); PForDelta decode: 3212140 NewPFD decode: 19556 VINT decode: 16762 S9 decode: 16483 My implementation is -- group docIDs and termDocFreqs into block which contains 128 integers. when SegmentTermDocs's next method called(or read readNoTf).it decodes a block and save it to a cache. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
Hi Li Li, That issue has such a big patch, and enough of us are now iterating on it, that we cut a dedicated branch for it. But note that this branch is off of trunk (to be 4.0). You should be able to do this: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings And then run things in there. I just committed FOR/PFOR prototype codecs from LUCENE-1410 onto that branch, so eg you can run unit tests using those codecs by running ant test -Dtests.codec=PatchedFrameOfRef. Please post patches back if you improve things! We need all the help we can get :) Mike On Wed, Dec 15, 2010 at 5:54 AM, Li Li fancye...@gmail.com wrote: hi Michael you posted a patch here https://issues.apache.org/jira/browse/LUCENE-2723 I am not familiar with patch. do I need download LUCENE-2723.patch(there are many patches after this name, do I need the latest one?) and LUCENE-2723_termscorer.patch and patch them (patch -p1 LUCENE-2723.patch)? I just check out the latest source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene 2010/12/14 Michael McCandless luc...@mikemccandless.com: Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test? I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly impossible in Java since Hotspot acts very differently vs the real test. Mike On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote: Hi I tried to integrate PForDelta into lucene 2.9 but confronted a problem. I use the implementation in http://code.google.com/p/integer-array-compress-kit/ it implements a basic PForDelta algorithm and an improved one(which called NewPForDelta, but there are many bugs and I have fixed them), But compare it with VInt and S9, it's speed is very slow when only decode small number of integer arrays. e.g. when I decoded int[256] arrays which values are randomly generated between 0 and 100, if decode just one array. PFor(or NewPFor) is very slow. when it continuously decodes many arrays such as 1, it's faster than s9 and vint. Another strange phenomena is that when call PFor decoder twice, the 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor is faster. reverse the call sequcence, the 2nd called decoder is faster e.g. ct.testNewPFDCodes(list); ct.testPFor(list); ct.testVInt(list); ct.testS9(list); NewPFD decode: 3614705 PForDelta decode: 17320 VINT decode: 16483 S9 decode: 19835 when I call by the following sequence ct.testPFor(list); ct.testNewPFDCodes(list); ct.testVInt(list); ct.testS9(list); PForDelta decode: 3212140 NewPFD decode: 19556 VINT decode: 16762 S9 decode: 16483 My implementation is -- group docIDs and termDocFreqs into block which contains 128 integers. when SegmentTermDocs's next method called(or read readNoTf).it decodes a block and save it to a cache. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
thanks. I'd like trying it and do some experiment on our dataset. 2010/12/15 Michael McCandless luc...@mikemccandless.com: Hi Li Li, That issue has such a big patch, and enough of us are now iterating on it, that we cut a dedicated branch for it. But note that this branch is off of trunk (to be 4.0). You should be able to do this: svn checkout https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings And then run things in there. I just committed FOR/PFOR prototype codecs from LUCENE-1410 onto that branch, so eg you can run unit tests using those codecs by running ant test -Dtests.codec=PatchedFrameOfRef. Please post patches back if you improve things! We need all the help we can get :) Mike On Wed, Dec 15, 2010 at 5:54 AM, Li Li fancye...@gmail.com wrote: hi Michael you posted a patch here https://issues.apache.org/jira/browse/LUCENE-2723 I am not familiar with patch. do I need download LUCENE-2723.patch(there are many patches after this name, do I need the latest one?) and LUCENE-2723_termscorer.patch and patch them (patch -p1 LUCENE-2723.patch)? I just check out the latest source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene 2010/12/14 Michael McCandless luc...@mikemccandless.com: Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test? I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly impossible in Java since Hotspot acts very differently vs the real test. Mike On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote: Hi I tried to integrate PForDelta into lucene 2.9 but confronted a problem. I use the implementation in http://code.google.com/p/integer-array-compress-kit/ it implements a basic PForDelta algorithm and an improved one(which called NewPForDelta, but there are many bugs and I have fixed them), But compare it with VInt and S9, it's speed is very slow when only decode small number of integer arrays. e.g. when I decoded int[256] arrays which values are randomly generated between 0 and 100, if decode just one array. PFor(or NewPFor) is very slow. when it continuously decodes many arrays such as 1, it's faster than s9 and vint. Another strange phenomena is that when call PFor decoder twice, the 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor is faster. reverse the call sequcence, the 2nd called decoder is faster e.g. ct.testNewPFDCodes(list); ct.testPFor(list); ct.testVInt(list); ct.testS9(list); NewPFD decode: 3614705 PForDelta decode: 17320 VINT decode: 16483 S9 decode: 19835 when I call by the following sequence ct.testPFor(list); ct.testNewPFDCodes(list); ct.testVInt(list); ct.testS9(list); PForDelta decode: 3212140 NewPFD decode: 19556 VINT decode: 16762 S9 decode: 16483 My implementation is -- group docIDs and termDocFreqs into block which contains 128 integers. when SegmentTermDocs's next method called(or read readNoTf).it decodes a block and save it to a cache. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test? I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly impossible in Java since Hotspot acts very differently vs the real test. Mike On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote: Hi I tried to integrate PForDelta into lucene 2.9 but confronted a problem. I use the implementation in http://code.google.com/p/integer-array-compress-kit/ it implements a basic PForDelta algorithm and an improved one(which called NewPForDelta, but there are many bugs and I have fixed them), But compare it with VInt and S9, it's speed is very slow when only decode small number of integer arrays. e.g. when I decoded int[256] arrays which values are randomly generated between 0 and 100, if decode just one array. PFor(or NewPFor) is very slow. when it continuously decodes many arrays such as 1, it's faster than s9 and vint. Another strange phenomena is that when call PFor decoder twice, the 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor is faster. reverse the call sequcence, the 2nd called decoder is faster e.g. ct.testNewPFDCodes(list); ct.testPFor(list); ct.testVInt(list); ct.testS9(list); NewPFD decode: 3614705 PForDelta decode: 17320 VINT decode: 16483 S9 decode: 19835 when I call by the following sequence ct.testPFor(list); ct.testNewPFDCodes(list); ct.testVInt(list); ct.testS9(list); PForDelta decode: 3212140 NewPFD decode: 19556 VINT decode: 16762 S9 decode: 16483 My implementation is -- group docIDs and termDocFreqs into block which contains 128 integers. when SegmentTermDocs's next method called(or read readNoTf).it decodes a block and save it to a cache. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: strange problem of PForDelta decoder
but I let VINT and S9 decoder run first, their time is the same as when they are called in the end. it's stable 2010/12/14 Michael McCandless luc...@mikemccandless.com: Likely you are seeing the startup cost of hotspot compiling the PFOR code? Ie, does your test first warmup the JRE and then do the real test? I've also found that running -Xbatch produces more consistent results from run to run, however, those results may not be as fast as running w/o -Xbatch. Also, it's better to test on actual data (ie a Lucene index's postings), and in the full context of searching, because then we get a sense of what speedups a real app will see... micro-benching is nearly impossible in Java since Hotspot acts very differently vs the real test. Mike On Tue, Dec 14, 2010 at 2:50 AM, Li Li fancye...@gmail.com wrote: Hi I tried to integrate PForDelta into lucene 2.9 but confronted a problem. I use the implementation in http://code.google.com/p/integer-array-compress-kit/ it implements a basic PForDelta algorithm and an improved one(which called NewPForDelta, but there are many bugs and I have fixed them), But compare it with VInt and S9, it's speed is very slow when only decode small number of integer arrays. e.g. when I decoded int[256] arrays which values are randomly generated between 0 and 100, if decode just one array. PFor(or NewPFor) is very slow. when it continuously decodes many arrays such as 1, it's faster than s9 and vint. Another strange phenomena is that when call PFor decoder twice, the 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor is faster. reverse the call sequcence, the 2nd called decoder is faster e.g. ct.testNewPFDCodes(list); ct.testPFor(list); ct.testVInt(list); ct.testS9(list); NewPFD decode: 3614705 PForDelta decode: 17320 VINT decode: 16483 S9 decode: 19835 when I call by the following sequence ct.testPFor(list); ct.testNewPFDCodes(list); ct.testVInt(list); ct.testS9(list); PForDelta decode: 3212140 NewPFD decode: 19556 VINT decode: 16762 S9 decode: 16483 My implementation is -- group docIDs and termDocFreqs into block which contains 128 integers. when SegmentTermDocs's next method called(or read readNoTf).it decodes a block and save it to a cache. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org