[jira] [Commented] (LUCENE-1879) Parallel incremental indexing
[ https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058072#comment-13058072 ]

hao yan commented on LUCENE-1879:
---------------------------------

Hi Michael,

Is there any recent progress on this topic? I am very interested in it!

Parallel incremental indexing
-----------------------------
Key: LUCENE-1879
URL: https://issues.apache.org/jira/browse/LUCENE-1879
Project: Lucene - Java
Issue Type: New Feature
Components: core/index
Reporter: Michael Busch
Assignee: Michael Busch
Fix For: 4.0
Attachments: parallel_incremental_indexing.tar

A new feature that allows building parallel indexes and keeping them in sync on a docID level, independent of the choice of the MergePolicy/MergeScheduler.

Find details on the wiki page for this feature: http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing
Discussion on java-dev: http://markmail.org/thread/ql3oxzkob7aqf3jd

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3096) MultiSearcher does not work correctly with Not on NumericRange
[ https://issues.apache.org/jira/browse/LUCENE-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034289#comment-13034289 ]

hao yan commented on LUCENE-3096:
---------------------------------

Thanks, Uwe!

MultiSearcher does not work correctly with Not on NumericRange
--------------------------------------------------------------
Key: LUCENE-3096
URL: https://issues.apache.org/jira/browse/LUCENE-3096
Project: Lucene - Java
Issue Type: Bug
Components: core/search
Affects Versions: 3.0.2
Reporter: John Wang
Fix For: 3.1

Hi Keith,

My colleague xiaoyang and I just confirmed that this is actually due to a Lucene bug in MultiSearcher. In particular, if we search with Not on NumericRange and use MultiSearcher, we get wrong search results (if we use IndexSearcher, the result is correct). Basically, the Not on NumericRange has no effect in MultiSearcher. We suspect the cause is the createWeight() function in MultiSearcher, and we hope you can help us fix this Lucene bug. I attached the code to reproduce this case. Please check it out.

In the attached code, I have two separate functions:

(1) testNumericRangeSingleSearcher(Query query), where I create 6 documents with a field called id = 1, 2, 3, 4, 5, 6 respectively. Then I search with the query +MatchAllDocs -NumericRange(3,3). The expected result is 5 hits, since document 3 is MUST_NOT.

(2) testNumericRangeMultiSearcher(Query query), where I create 2 RAMDirectory instances, each of which has 3 documents: 1, 2, 3 and 4, 5, 6. Then I search with the same query as above using MultiSearcher. The expected result should also be 5 hits.

However, from (1) we get 5 hits, as expected, while from (2) we get 6 hits, which is not the expected result.

We also tried this with our zoie/bobo open source tools and got the same results, because our multi-bobo-browser is built on MultiSearcher in Lucene. I already emailed the Lucene community group. Hopefully we can get some feedback soon. If you have any further concerns, please let me know. Thank you very much!
Code (based on Lucene 3.0.x):

import java.io.IOException;
import java.io.PrintStream;
import java.text.DecimalFormat;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;

import com.convertlucene.ConvertFrom2To3;

public class TestNumericRange {

    public static void main(String[] args) {
        try {
            // +MatchAllDocs -numId:[3 TO 3]: every document except id 3 should match.
            BooleanQuery query = new BooleanQuery();
            query.add(NumericRangeQuery.newIntRange("numId", 3, 3, true, true), Occur.MUST_NOT);
            query.add(new MatchAllDocsQuery(), Occur.MUST);
            testNumericRangeSingleSearcher(query);
            testNumericRangeMultiSearcher(query);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void testNumericRangeSingleSearcher(Query query)
            throws CorruptIndexException, LockObtainFailedException, IOException {
        String[] ids = {"1", "2", "3", "4", "5", "6"};
        Directory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Index each id twice: as a stored string field and as a numeric field.
        for (int i = 0; i < ids.length; i++) {
            Document doc = new Document();
            doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new NumericField("numId").setIntValue(Integer.valueOf(ids[i])));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(query, 10);
        System.out.println("SingleSearcher: testNumericRange: hitNum: " + docs.totalHits);
        for (ScoreDoc doc : docs.scoreDocs) {
            System.out.println(searcher.explain(query, doc.doc));
        }
        searcher.close();
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995436#comment-12995436 ]

hao yan commented on LUCENE-2903:
---------------------------------

Thank you both! Thanks for testing my codec so quickly, Michael!

RE: One question: it looks like this PFOR impl can only handle up to 28 bit wide ints? Which means... could it fail on some cases? Though I suppose you would never see too many of these immense ints in one block, and so they'd always be encoded as exceptions and so it's actually safe...?

Hao: This won't fail. In my PFOR impl, I first call checkBigNumbers() to see if there is any number >= 2^28. If there is, I force encoding of the lower 4 bits using the 128 4-bit slots. Thus, all exceptions left to Simple16 are < 2^28, which can definitely be handled. So there are no failure cases! :)

BTW, my PFOR impl produces a smaller index than VInt and the other PFOR impls. Thus, if the use case is real-time search, which requires loading the index from disk to memory frequently, my PFOR impl may save even more.

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch, LUCENE-2903.patch, for_pfor.patch

There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. FrameOfRef is a very basic one, essentially a binary encoding (which may result in a huge index). PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.) In particular, the changes are:

1. I fixed the bug in my previous version (in Lucene-1410.patch), where the old PForDelta did not support very large exceptions (since Simple16 does not support very large numbers). This has been fixed in the new LCPForDelta.

2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec.

3. The performance test results are:
   1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, and slightly worse than BulkVInt.
   2) My NewPForDelta codec results in the smallest index size among all 4 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
   3) All performance test results were obtained by running with -server instead of -client.
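The big-number handling described in the reply above (check for values of 2^28 or more, then force the lower 4 bits into the fixed-width slots so that only the remaining high part goes to the exception coder) can be illustrated with a minimal sketch. This is not the code from the patch: the class and method names below are hypothetical, and the 2^28 bound for Simple16 is taken from the comment rather than from any Simple16 source.

```java
/**
 * Hypothetical sketch (not the LUCENE-2903 patch): split each 32-bit value
 * into a 4-bit low part and a high part that always fits in 28 bits.
 */
final class BigNumberSplitSketch {
    static final int LOW_BITS = 4;                    // bits forced into the 4-bit slots
    static final int LOW_MASK = (1 << LOW_BITS) - 1;  // 0xF

    /** True if any value, read as unsigned, needs more than 28 bits. */
    static boolean hasBigNumbers(int[] block) {
        for (int v : block) {
            if ((v >>> 28) != 0) {
                return true;
            }
        }
        return false;
    }

    /** Low 4 bits, stored in the fixed-width slot. */
    static int lowPart(int v) {
        return v & LOW_MASK;
    }

    /** Remaining high bits; an unsigned shift by 4 leaves at most 28 bits. */
    static int highPart(int v) {
        return v >>> LOW_BITS;
    }

    /** Inverse of the split. */
    static int join(int high, int low) {
        return (high << LOW_BITS) | low;
    }

    public static void main(String[] args) {
        int[] block = {7, 1 << 28, 0x7FFFFFFF, 12345};
        System.out.println("has big numbers: " + hasBigNumbers(block));
        for (int v : block) {
            int low = lowPart(v);
            int high = highPart(v);
            System.out.println(v + " -> high=" + high + " low=" + low
                    + " roundtrip=" + (join(high, low) == v));
        }
    }
}
```

Because the high part is produced by an unsigned right shift of 4, it is below 2^28 for every 32-bit input, which is the property the comment relies on.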
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:
----------------------------

    Attachment: LUCENE-2903.patch

This new patch provides PForDeltaFixedIntBlockWithIntBufferCodec (PatchedFrameOfRef4), which improves on the performance of its previous counterparts (PatchedFrameOfRef4, 5, 6). Note that this PatchedFrameOfRef4 is different from the previous PatchedFrameOfRef4.

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch, LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:
----------------------------

    Attachment: LUCENE-2903.patch

This patch improves the performance of the previous PatchedFrameOfRef4 and removes PatchedFrameOfRef5 and PatchedFrameOfRef6. The performance of PatchedFrameOfRef4 is now better than BulkVInt and comparable to PatchedFrameOfRef in my tests.

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:
----------------------------

    Attachment: (was: LUCENE_2903.patch)

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:
----------------------------

    Attachment: (was: LUCENE-2903.patch)

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:
----------------------------

    Attachment: (was: LUCENE-2903.patch)

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992687#comment-12992687 ]

hao yan commented on LUCENE-2903:
---------------------------------

Hi Robert and Michael,

In order to test whether ByteBuffer/IntBuffer works better than int[] <-> byte[] conversion, I have now separated them into 3 different codecs. All of them use the same PForDelta implementation, except that they use IndexInput/IndexOutput differently, as follows.

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[] <-> byte[] manually. Its corresponding Java code is: PForDeltaFixedIntBlockCodec.java
2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[] <-> byte[] via ByteBuffer/IntBuffer. Its corresponding Java code is: PForDeltaFixedIntBlockWithByteBufferCodec.java
3. PatchedFrameOfRef5 - uses in.readInt() in a loop and needs no conversion. Its corresponding Java code is: PForDeltaFixedIntBlockWithReadIntCodec.java

I tested them against BulkVInt on MacOS. The detailed results are attached. Here is the conclusion:

1) Yes, Michael and Robert, you guys are right! ByteBuffer/IntBuffer is faster than my manual conversion between byte[] and int[]. I guess the reason I thought it was worse is that I had not separated the codecs before, so the test results were not stable due to JVM/JIT effects.
2) PatchedFrameOfRef4 is still worse than BulkVInt for many kinds of queries. However, it seems to do better for fuzzy queries and wildcard queries.
3) Of course, PatchedFrameOfRef3, 4 and 5 are all better than PatchedFrameOfRef and FrameOfRef for almost all queries.
4) The new patch has just been uploaded; please check it out.

The following are the experimental results for 0.1M data.

(1) bulkVInt VS patchedFrameOfRef4 (with ByteBuffer, in.readBytes(..))

Query                              QPS bulkVInt  QPS patchedFrameOfRef4-withByteBuffer  Pct diff
united states                        389.26    361.79     -7.1%
united states~3                      234.52    228.99     -2.4%
+nebraska +states                   1138.95    992.06    -12.9%
+united +states                      670.69    603.86    -10.0%
doctimesecnum:[1 TO 6]               415.28    447.83      7.8%
doctitle:.*[Uu]nited.*               496.03    522.47      5.3%
spanFirst(unit, 5)                  1176.47   1086.96     -7.6%
spanNear([unit, state], 10, true)    502.26    423.73    -15.6%
states                              1612.90   1453.49     -9.9%
u*d                                  167.95    171.17      1.9%
un*d                                 260.69    275.33      5.6%
uni*                                 602.41    577.37     -4.2%
unit*                               1016.26   1041.67      2.5%
united states                        617.28    549.45    -11.0%
united~0.6                            12.22     12.93      5.9%
united~0.75                           53.88     56.78      5.4%
unit~0.5                              12.58     13.19      4.9%
unit~0.7                              52.41     54.93      4.8%

(2) bulkVInt VS patchedFrameOfRef3 (with my own int[] <-> byte[] conversion, still in.readBytes(..))

Query                              QPS bulkVInt  QPS patchedFrameOfRef3  Pct diff
united states                        388.50    363.24     -6.5%
united states~3                      234.80    223.56     -4.8%
+nebraska +states                   1138.95   1016.26    -10.8%
+united +states                      671.14    607.90     -9.4%
doctimesecnum:[1 TO 6]               418.24    441.89      5.7%
doctitle:.*[Uu]nited.*               489.00    522.74      6.9%
spanFirst(unit, 5)                  1246.88   1127.40     -9.6%
spanNear([unit, state], 10, true)    514.14    473.71     -7.9%
states                              1612.90   1488.10     -7.7%
u*d                                  170.77    167.31     -2.0%
un*d                                 261.37    264.48      1.2%
uni*                                 609.38    602.41     -1.1%
unit*                               1028.81   1052.63      2.3%
united states                        614.25    564.33     -8.1%
united~0.6                            12.05     12.11      0.5%
united~0.75                           53.16     54.97      3.4%
unit~0.5                              12.43     12.50      0.6%
unit~0.7                              52.81     53.23      0.8%

(3) bulkVInt VS patchedFrameOfRef5 (with in.readInt() in a loop, no conversion)

Query                              QPS bulkVInt  QPS patchedFrameOfRef5-withReadInt  Pct diff
united states                        391.24    366.70     -6.3%
united states~3                      235.40    235.07     -0.1%
+nebraska +states                   1137.66   1072.96     -5.7%
+united +states                      673.40    642.26     -4.6%
doctimesecnum:[1 TO 6]               414.25    407.66     -1.6%
doctitle:.*[Uu]nited.*               492.61    538.21      9.3%
spanFirst(unit, 5)                  1253.13   1175.09     -6.2%
spanNear([unit, state], 10, true)    511.25    483.56     -5.4%
states                              1642.04   1490.31     -9.2%
u*d                                  166.78    160.28     -3.9%
un*d                                 261.64    255.36     -2.4%
uni*                                 609.38    593.47     -2.6%
unit*                               1026.69
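The two byte[] to int[] conversion strategies compared in this comment can be sketched with the standard library alone. The class below is a hypothetical illustration, not code from the patch: it assumes big-endian byte order (the default for java.nio.ByteBuffer) and stands in for whatever the codecs do after in.readBytes().

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical sketch of manual vs. ByteBuffer/IntBuffer conversion. */
final class IntByteConversionSketch {

    /** Manual decoding: four big-endian bytes are shifted and OR-ed into one int. */
    static int[] manual(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        for (int i = 0, j = 0; i < out.length; i++, j += 4) {
            out[i] = ((bytes[j] & 0xFF) << 24)
                   | ((bytes[j + 1] & 0xFF) << 16)
                   | ((bytes[j + 2] & 0xFF) << 8)
                   |  (bytes[j + 3] & 0xFF);
        }
        return out;
    }

    /** Same decoding through an IntBuffer view over the byte array. */
    static int[] viaIntBuffer(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        ByteBuffer.wrap(bytes).asIntBuffer().get(out);
        return out;
    }

    /** Encodes ints as big-endian bytes, used here only to build test input. */
    static byte[] toBytes(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocate(ints.length * 4);
        buf.asIntBuffer().put(ints);
        return buf.array();
    }

    public static void main(String[] args) {
        int[] src = {0, 1, -1, 123456789, Integer.MIN_VALUE};
        byte[] bytes = toBytes(src);
        System.out.println(Arrays.equals(manual(bytes), src));
        System.out.println(Arrays.equals(viaIntBuffer(bytes), src));
    }
}
```

Both methods produce the same int[]; which one is faster depends on the JVM and JIT, which is why the comment measures the codecs separately.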
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:
----------------------------

    Attachment: LUCENE-2903.patch

This patch further improves the PForDelta codec (PForDeltaFixedIntBlockCodec). I used 3 different implementations (3 codecs) for IndexInput/IndexOutput. In particular:

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[] <-> byte[] manually. Its corresponding Java code is: PForDeltaFixedIntBlockCodec.java
2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[] <-> byte[] via ByteBuffer/IntBuffer. Its corresponding Java code is: PForDeltaFixedIntBlockWithByteBufferCodec.java
3. PatchedFrameOfRef5 - uses in.readInt() in a loop and needs no conversion. Its corresponding Java code is: PForDeltaFixedIntBlockWithReadIntCodec.java

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992809#comment-12992809 ]

hao yan commented on LUCENE-2903:
---------------------------------

just uploaded. Sorry.

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992237#comment-12992237 ]

hao yan commented on LUCENE-2903:
---------------------------------

I tried moving memory allocation out of readBlock() into BlockReader's constructor. It improves the performance a little. I also tried using ByteBuffer/IntBuffer to replace my manual conversion between byte[] and int[]. It makes things worse.

The following is my result for 0.1M data:

(1) BulkVInt VS PatchedFrameOfRef3

Query                              QPS bulkVInt  QPS patchedFrameOfRef3  Pct diff
united states                        393.55    362.84     -7.8%
united states~3                      243.84    236.80     -2.9%
+nebraska +states                   1140.25    998.00    -12.5%
+united +states                      687.76    633.31     -7.9%
doctimesecnum:[1 TO 6]               413.56    427.53      3.4%
doctitle:.*[Uu]nited.*               510.46    534.47      4.7%
spanFirst(unit, 5)                  1240.69   1108.65    -10.6%
spanNear([unit, state], 10, true)    511.77    463.18     -9.5%
states                              1626.02   1483.68     -8.8%
u*d                                  164.23    162.79     -0.9%
un*d                                 257.53    252.97     -1.8%
uni*                                 607.53    591.02     -2.7%
unit*                               1024.59   1043.84      1.9%
united states                        627.35    578.70     -7.8%
united~0.6                            11.51     11.36     -1.3%
united~0.75                           52.58     53.57      1.9%
unit~0.5                              12.08     11.93     -1.2%
unit~0.7                              50.98     51.30      0.6%

(2) FrameOfRef VS PatchedFrameOfRef3

Query                              QPS patchedFrameOfRef  QPS patchedFrameOfRef3  Pct diff
united states                        314.76    362.71     15.2%
united states~3                      227.53    237.08      4.2%
+nebraska +states                   1075.27   1025.64     -4.6%
+united +states                      646.41    626.57     -3.1%
doctimesecnum:[1 TO 6]               412.88    429.37      4.0%
doctitle:.*[Uu]nited.*               481.70    528.82      9.8%
spanFirst(unit, 5)                  1060.45   1118.57      5.5%
spanNear([unit, state], 10, true)    409.33    467.73     14.3%
states                              1353.18   1479.29      9.3%
u*d                                  158.91    165.98      4.4%
un*d                                 237.36    256.41      8.0%
uni*                                 560.22    593.12      5.9%
unit*                                946.97   1043.84     10.2%
united states                        431.22    583.09     35.2%
united~0.6                            10.91     11.37      4.2%
united~0.75                           50.30     53.30      5.9%
unit~0.5                              11.54     11.94      3.5%
unit~0.7                              47.38     50.38      6.3%

(3) PatchedFrameOfRef VS PatchedFrameOfRef3

Query                              QPS FrameOfRef  QPS patchedFrameOfRef3  Pct diff
united states                        326.26    360.49     10.5%
united states~3                      226.50    234.69      3.6%
+nebraska +states                   1077.59   1021.45     -5.2%
+united +states                      648.51    630.52     -2.8%
doctimesecnum:[1 TO 6]               324.46    428.45     32.0%
doctitle:.*[Uu]nited.*               485.44    527.70      8.7%
spanFirst(unit, 5)                  1007.05       .11     10.3%
spanNear([unit, state], 10, true)    446.03    465.55      4.4%
states                              1449.28   1459.85      0.7%
u*d                                  158.43    161.79      2.1%
un*d                                 246.37    256.28      4.0%
uni*                                 548.85    594.88      8.4%
unit*                                920.81   1042.75     13.2%
united states                        450.65    576.37     27.9%
united~0.6                            11.07     11.26      1.7%
united~0.75                           50.70     52.60      3.8%
unit~0.5                              11.64     11.76      1.0%
unit~0.7                              49.04     50.70      3.4%

Improvement of PForDelta Codec
------------------------------
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE_2903.patch, LUCENE_2903.patch
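The first change mentioned in this comment, moving memory allocation out of readBlock() and into the reader's constructor, can be sketched as follows. The class is a hypothetical stand-in and not Lucene's BlockReader: it reads from a java.nio.IntBuffer instead of an IndexInput, and the name ReusedBufferBlockReader is made up for this example.

```java
import java.nio.IntBuffer;

/** Hypothetical sketch: the block buffer is allocated once and reused by every readBlock() call. */
final class ReusedBufferBlockReader {
    private final IntBuffer in;   // stand-in for the real input (an assumption)
    private final int[] buffer;   // allocated in the constructor, not per block

    ReusedBufferBlockReader(IntBuffer in, int blockSize) {
        this.in = in;
        this.buffer = new int[blockSize];
    }

    /**
     * Fills the shared buffer with the next block and returns it.
     * The returned array is overwritten by the next call, so callers
     * must consume it before reading another block.
     */
    int[] readBlock() {
        in.get(buffer);
        return buffer;
    }

    public static void main(String[] args) {
        IntBuffer data = IntBuffer.wrap(new int[]{0, 1, 2, 3, 4, 5, 6, 7});
        ReusedBufferBlockReader reader = new ReusedBufferBlockReader(data, 4);
        int[] first = reader.readBlock();
        System.out.println("first block starts with " + first[0]);
        int[] second = reader.readBlock();
        System.out.println("second block starts with " + second[0]);
        System.out.println("same array reused: " + (first == second));
    }
}
```

The trade-off is the usual one for buffer reuse: no per-block allocation or garbage, at the cost of the caller not being able to hold on to a previous block.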
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991220#comment-12991220 ]

hao yan commented on LUCENE-2903:

Hi, Michael. Did you try FrameOfRef and PatchedFrameOfRef?

Improvement of PForDelta Codec
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan
Attachments: LUCENE_2903.patch, LUCENE_2903.patch

There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. FrameOfRef is a very basic one that is essentially a binary encoding (and may result in a huge index size). PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.) In particular, the changes are:

1. I fixed the bug in my previous version (in Lucene-1410.patch), where the old PForDelta did not support very large exceptions (since Simple16 does not support very large numbers). This has been fixed in the new LCPForDelta.
2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec.
3. The performance test results are:
   1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, and slightly worse than BulkVInt.
   2) My NewPForDelta codec results in the smallest index size among all 4 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
   3) All performance test results were obtained by running with -server instead of -client.

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991222#comment-12991222 ]

hao yan commented on LUCENE-2903:

And using intbuffer.set/get certainly complicates the PForDelta algorithm a lot.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991221#comment-12991221 ]

hao yan commented on LUCENE-2903:

Hi, Paul. I tested ByteBuffer/IntBuffer; it is not faster than converting between int[] and byte[] manually.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990214#comment-12990214 ]

hao yan commented on LUCENE-2903:

I think the above step essentially also needs to do the int-byte-int conversion, so there is no reason it would save more than doing the conversion manually.
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990480#comment-12990480 ]

hao yan commented on LUCENE-2903:

Yes. The other PFOR impls (FrameOfRef and PatchedFrameOfRef) are even slower (as long as you set -server when you run them). I am also wondering why. Actually, I think the Wikipedia data is kind of biased. Do you have any other data sets?
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989754#comment-12989754 ]

hao yan commented on LUCENE-2903:

Hi, Paul. Thanks for the suggestions. I just uploaded a new patch which renames the codec to PatchedFrameOfRef3.

I actually have a question to ask. The BulkVInt codec writes the compressed byte stream as a chunk of bytes. However, in the PForDelta-related codecs the compressed results are ints, so I have to either write single ints in a loop, or first convert the int array to a byte array and then call out.writeBytes(). Do you know of any smarter way to write an int array to an IndexOutput?

Another thing I tried was to make PForDelta itself produce byte-wise compressed results. However, in my experiments it slows PForDelta down significantly. Also, I do not think the NIO buffer used in FrameOfRef and PatchedFrameOfRef helps, since essentially it is the same as first converting the int array to a byte array and then calling writeBytes().

Do you have any good suggestions? Thanks!
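For readers following the question above, the manual int[]/byte[] conversion can be sketched as below. This is a sketch under assumptions, not the code from the patch: the class name IntByteConversion and both method names are made up for illustration, and big-endian byte order is assumed. The resulting byte[] is what would then be handed to a single bulk write such as the out.writeBytes() call mentioned in the comment.

```java
/** Hypothetical helper for illustration; names are not taken from the patch. */
final class IntByteConversion {
    /** Packs the first count ints into dst as 4 big-endian bytes each. */
    static void intsToBytes(int[] src, int count, byte[] dst) {
        for (int i = 0, j = 0; i < count; i++) {
            int v = src[i];
            dst[j++] = (byte) (v >>> 24);
            dst[j++] = (byte) (v >>> 16);
            dst[j++] = (byte) (v >>> 8);
            dst[j++] = (byte) v;
        }
    }

    /** Inverse of intsToBytes: rebuilds count ints from 4 * count bytes. */
    static void bytesToInts(byte[] src, int count, int[] dst) {
        for (int i = 0, j = 0; i < count; i++, j += 4) {
            dst[i] = ((src[j] & 0xFF) << 24)
                    | ((src[j + 1] & 0xFF) << 16)
                    | ((src[j + 2] & 0xFF) << 8)
                    | (src[j + 3] & 0xFF);
        }
    }
}
```

The `& 0xFF` masks matter on the read side: Java bytes are signed, so without them a byte such as 0xFE would sign-extend and corrupt the higher bits of the rebuilt int.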
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:

Attachment: LUCENE_2903.patch

This patch renames the NewPForDeltaCodec to PatchedFrameOfRef3 to follow the tradition. It also adds back the BulkVInt allones trick. (I removed it accidentally in the last patch.)
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989872#comment-12989872 ]

hao yan commented on LUCENE-2903:

Yes, using ByteBuffer.asIntBuffer() is the same as converting an int/byte array to a byte/int array. I think the underlying implementation of ByteBuffer.asIntBuffer() cannot avoid that conversion. I also tried ByteBuffer/IntBuffer, though; the result is worse, which makes sense since it may incur extra costs. Where to holler? :)
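The equivalence claimed in the comment above can be checked with a small sketch. It is an illustration only: the class IntBufferView and its two methods are hypothetical names, and the comparison says nothing about speed, only that both routes produce the same bytes. It relies on ByteBuffer's default big-endian byte order.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

/** Hypothetical comparison helper; not code from the patch. */
final class IntBufferView {
    /** Writes the ints through an IntBuffer view of a heap ByteBuffer. */
    static byte[] viaIntBuffer(int[] values) {
        ByteBuffer bytes = ByteBuffer.allocate(values.length * 4); // big-endian by default
        IntBuffer ints = bytes.asIntBuffer(); // view sharing the same storage
        ints.put(values);
        return bytes.array();
    }

    /** Writes the same ints with explicit shifts, 4 big-endian bytes each. */
    static byte[] manual(int[] values) {
        byte[] out = new byte[values.length * 4];
        for (int i = 0, j = 0; i < values.length; i++) {
            int v = values[i];
            out[j++] = (byte) (v >>> 24);
            out[j++] = (byte) (v >>> 16);
            out[j++] = (byte) (v >>> 8);
            out[j++] = (byte) v;
        }
        return out;
    }
}
```

Both methods return identical arrays for the same input, which matches the observation in the thread that the buffer view does the same work as the manual loop.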
[jira] Created: (LUCENE-2903) Improvement of PForDelta Codec
Improvement of PForDelta Codec
Key: LUCENE-2903
URL: https://issues.apache.org/jira/browse/LUCENE-2903
Project: Lucene - Java
Issue Type: Improvement
Reporter: hao yan

There are 3 versions of PForDelta implementations in the Bulk Branch: FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2. FrameOfRef is a very basic one that is essentially a binary encoding (and may result in a huge index size). PatchedFrameOfRef is the implementation based on the original version of PForDelta in the literature. PatchedFrameOfRef2 is my previous implementation, which is improved this time. (The codec name is changed to NewPForDelta.) In particular, the changes are:

1. I fixed the bug in my previous version (in Lucene-1410.patch), where the old PForDelta did not support very large exceptions (since Simple16 does not support very large numbers). This has been fixed in the new LCPForDelta.
2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other two PForDelta implementations in the bulk branch (FrameOfRef and PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the CodecProvider and PForDeltaFixedIntBlockCodec.
3. The performance test results are:
   1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for almost all kinds of queries, and slightly worse than BulkVInt.
   2) My NewPForDelta codec results in the smallest index size among all 4 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
   3) All performance test results were obtained by running with -server instead of -client.
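To make the terms in the description above concrete, here is a toy sketch of a patched frame-of-reference block. It is an assumption-laden illustration and not any of the codecs named in the issue: the class PatchedForSketch is hypothetical, the exceptions are kept uncompressed in plain arrays (the patch compresses them with Simple16), and an exception's slot simply holds 0. Values that fit in `bits` bits are packed into fixed-width slots; values that do not fit are recorded as exceptions and patched back in after unpacking.

```java
import java.util.Arrays;

/** Toy patched frame-of-reference coder; a sketch, not the codec from the patch. */
final class PatchedForSketch {
    final int bits;      // width of every regular slot, 1..31
    final int count;     // number of encoded values
    final long[] packed; // count slots of `bits` bits each
    final int[] excPos;  // positions of values that did not fit
    final int[] excVal;  // their full values, uncompressed in this sketch

    private PatchedForSketch(int bits, int count, long[] packed, int[] excPos, int[] excVal) {
        this.bits = bits;
        this.count = count;
        this.packed = packed;
        this.excPos = excPos;
        this.excVal = excVal;
    }

    static PatchedForSketch encode(int[] values, int bits) {
        long[] packed = new long[(int) (((long) values.length * bits + 63) / 64)];
        int[] pos = new int[values.length];
        int[] val = new int[values.length];
        int exceptions = 0;
        for (int i = 0; i < values.length; i++) {
            int v = values[i];
            if ((v >>> bits) != 0) { // too wide for a slot: record a patch
                pos[exceptions] = i;
                val[exceptions++] = v;
                v = 0;               // the slot keeps a placeholder
            }
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[word] |= ((long) v) << shift;
            if (shift + bits > 64) { // slot straddles two longs
                packed[word + 1] |= ((long) v) >>> (64 - shift);
            }
        }
        return new PatchedForSketch(bits, values.length, packed,
                Arrays.copyOf(pos, exceptions), Arrays.copyOf(val, exceptions));
    }

    int[] decode() {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        for (int k = 0; k < excPos.length; k++) {
            out[excPos[k]] = excVal[k]; // patch the exceptions back in
        }
        return out;
    }
}
```

With a small `bits` the packed part shrinks but more values become exceptions, so the choice of `bits` trades slot size against patch-list size. A plain frame-of-reference coder without patching would instead have to pick `bits` wide enough for the largest value in the block.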
[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-2903:

Attachment: LUCENE_2903.patch

Patch for the improvement of PForDeltaFixedIntBlockCodec
[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec
[ https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989532#comment-12989532 ]

hao yan commented on LUCENE-2903:

Hi, Robert. Sorry, that was a mistake. I commented that one out just for debugging, to see whether it affects performance. I should have changed it back. I will attach a new patch. Thanks for pointing that out.
[jira] Updated: (LUCENE-1410) PFOR implementation
[ https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hao yan updated LUCENE-1410:

Attachment: LUCENE-1410.patch

This patch adds codec support for the PForDelta compression algorithm. Changes by Hao Yan (hyan2...@gmail.com).

In summary, I added five files to support and test the codec.

In Src:
1. org.apache.lucene.index.codecs.pfordelta.PForDelta.java
2. org.apache.lucene.index.codecs.pfordelta.Simple16.java
3. org.apache.lucene.index.codecs.PForDeltaFixedBlockCodec.java
4. org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutputWithGetElementNum.java

In Test:
5. org.apache.lucene.index.codecs.intblock.TestPForDeltaFixedIntBlockCodec.java

In particular:

1) The first class, PForDelta, is the core implementation of the PForDelta algorithm. It compresses exceptions using Simple16, which is implemented in the second class, Simple16.
2) The third class, PForDeltaFixedBlockCodec, is similar to org.apache.lucene.index.codecs.mockintblock.MockFixedIntBlockCodec in Test, except that it uses PForDelta to encode the data in the buffer.
3) The fourth class is almost the same as org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutput, except that it provides an additional public function to retrieve the value of the upto field, which is a private field in FixedIntBlockIndexOutput. I added this public function because the number of elements in the block that have meaningful values is not always equal to the blockSize or the buffer size: the last block/buffer of a stream of data usually contains fewer values. In that case, I fill all elements after the meaningful ones with 0s, so we always compress one entire block.
4) The last class is the unit test for PForDeltaFixedIntBlockCodec, which is very similar to org.apache.lucene.index.codecs.mintblock.TestIntBlockCodec. I also changed the LuceneTestCase class to add the new PForDeltaFixedIntBlockCodec.

The unit tests and all Lucene tests have passed.

PFOR implementation
Key: LUCENE-1410
URL: https://issues.apache.org/jira/browse/LUCENE-1410
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Paul Elschot
Priority: Minor
Fix For: Bulk Postings branch
Attachments: autogen.tgz, for-summary.txt, LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, TestPFor2.java, TestPFor2.java
Original Estimate: 21840h
Remaining Estimate: 21840h

Implementation of Patched Frame of Reference.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
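The zero-filling of the last, partially filled block described in this message can be sketched as follows. The class BlockPadding and the method padBlock are hypothetical names chosen for the example; only the idea (fill everything from upto to the end of the buffer with 0s, then compress the whole block) comes from the message.

```java
import java.util.Arrays;

/** Hypothetical helper illustrating the padding step; not code from the patch. */
final class BlockPadding {
    /**
     * Zero-fills buffer[upto .. buffer.length) so that a full block can be
     * compressed even when only the first upto elements are meaningful.
     */
    static int[] padBlock(int[] buffer, int upto) {
        Arrays.fill(buffer, upto, buffer.length, 0);
        return buffer;
    }
}
```

The reader still needs the real element count (the upto value) from somewhere, since padded zeros are indistinguishable from genuine zero values once decoded. That is presumably why the fourth class exposes it.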
[jira] Created: (LUCENE-2750) add Kamikaze 3.0.1 into Lucene
add Kamikaze 3.0.1 into Lucene
Key: LUCENE-2750
URL: https://issues.apache.org/jira/browse/LUCENE-2750
Project: Lucene - Java
Issue Type: Sub-task
Components: contrib/*
Reporter: hao yan

Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It achieves significantly better performance than Kamikaze 2.0.0 in terms of both compressed size and decompression speed. The main difference between the two versions is that Kamikaze 3.0.x uses a much more efficient implementation of the PForDelta compression algorithm. My goal is to integrate the highly efficient PForDelta implementation into a Lucene codec.