[jira] [Commented] (LUCENE-1879) Parallel incremental indexing

2011-06-30 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058072#comment-13058072
 ] 

hao yan commented on LUCENE-1879:
-

Hi, Michael

Is there any recent progress on this topic? I am very interested in this!

 Parallel incremental indexing
 -

 Key: LUCENE-1879
 URL: https://issues.apache.org/jira/browse/LUCENE-1879
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/index
Reporter: Michael Busch
Assignee: Michael Busch
 Fix For: 4.0

 Attachments: parallel_incremental_indexing.tar


 A new feature that allows building parallel indexes and keeping them in sync 
 on a docID level, independent of the choice of the MergePolicy/MergeScheduler.
 Find details on the wiki page for this feature:
 http://wiki.apache.org/lucene-java/ParallelIncrementalIndexing 
 Discussion on java-dev:
 http://markmail.org/thread/ql3oxzkob7aqf3jd

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3096) MultiSearcher does not work correctly with Not on NumericRange

2011-05-16 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034289#comment-13034289
 ] 

hao yan commented on LUCENE-3096:
-

Thanks! Uwe!



 MultiSearcher does not work correctly with Not on NumericRange
 --

 Key: LUCENE-3096
 URL: https://issues.apache.org/jira/browse/LUCENE-3096
 Project: Lucene - Java
  Issue Type: Bug
  Components: core/search
Affects Versions: 3.0.2
Reporter: John Wang
 Fix For: 3.1


 Hi, Keith
 My colleague xiaoyang and I just confirmed that this is actually due to a
 Lucene bug in MultiSearcher. In particular,
 if we search with Not on NumericRange and we use MultiSearcher, we
 get wrong search results (however, if we use IndexSearcher, the
 result is correct). Basically, the Not on NumericRange has no
 effect with MultiSearcher. We suspect the cause is the createWeight()
 function in MultiSearcher and hope you can help us fix this Lucene
 bug. I attached the code to reproduce this case. Please check it
 out.
 In the attached code, I have two separate functions:
 (1) testNumericRangeSingleSearcher(Query query)
 where I create 6 documents, with a field called id = 1,2,3,4,5,6
 respectively. Then I search with the query
 +MatchAllDocs -NumericRange(3,3). The expected result is
 5 hits, since document 3 is MUST_NOT.
 (2) testNumericRangeMultiSearcher(Query query)
 where I create 2 RAMDirectory instances, each of which has 3 documents,
 1,2,3 and 4,5,6. Then I search with the same query as above using
 MultiSearcher. The expected result should also be 5 hits.
 However, from (1) we get 5 hits, as expected, while in (2) we
 get 6 hits, which is wrong.
 We also tried this with our zoie/bobo open source tools and got
 the same results, because our multi-bobo-browser is built on
 MultiSearcher in Lucene.
 I already emailed the Lucene community group. Hopefully we can get some
 feedback soon.
 If you have any further concerns, please let me know!
 Thank you very much!
 Code: (based on Lucene 3.0.x)
 import java.io.IOException;
 import java.io.PrintStream;
 import java.text.DecimalFormat;
 import org.apache.lucene.analysis.WhitespaceAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.document.NumericField;
 import org.apache.lucene.index.CorruptIndexException;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.FieldCache;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.MatchAllDocsQuery;
 import org.apache.lucene.search.MultiSearcher;
 import org.apache.lucene.search.NumericRangeQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.ScoreDoc;
 import org.apache.lucene.search.Searchable;
 import org.apache.lucene.search.Sort;
 import org.apache.lucene.search.SortField;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.search.BooleanClause.Occur;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.LockObtainFailedException;
 import org.apache.lucene.store.RAMDirectory;
 import com.convertlucene.ConvertFrom2To3;
 public class TestNumericRange
 {
   public final static void main(String[] args)
   {
     try
     {
       // +MatchAllDocs -NumericRange(3,3): match everything except numId == 3
       BooleanQuery query = new BooleanQuery();
       query.add(NumericRangeQuery.newIntRange("numId", 3, 3, true, true), Occur.MUST_NOT);
       query.add(new MatchAllDocsQuery(), Occur.MUST);
       testNumericRangeSingleSearcher(query);
       testNumericRangeMultiSearcher(query);
     }
     catch(Exception e)
     {
       e.printStackTrace();
     }
   }
   public static void testNumericRangeSingleSearcher(Query query)
       throws CorruptIndexException, LockObtainFailedException, IOException
   {
     String[] ids = {"1", "2", "3", "4", "5", "6"};
     Directory directory = new RAMDirectory();
     IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(),
         IndexWriter.MaxFieldLength.UNLIMITED);
     // Index each id both as a stored string field and as a numeric field
     for (int i = 0; i < ids.length; i++)
     {
       Document doc = new Document();
       doc.add(new Field("id", ids[i], Field.Store.YES, Field.Index.NOT_ANALYZED));
       doc.add(new NumericField("numId").setIntValue(Integer.valueOf(ids[i])));
       writer.addDocument(doc);
     }
     writer.close();
     IndexSearcher searcher = new IndexSearcher(directory);
     TopDocs docs = searcher.search(query, 10);
     System.out.println("SingleSearcher: testNumericRange: hitNum: " + docs.totalHits);
     for (ScoreDoc doc : docs.scoreDocs)
     {
       System.out.println(searcher.explain(query, doc.doc));
     }
     searcher.close();

[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-16 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995436#comment-12995436
 ] 

hao yan commented on LUCENE-2903:
-

Thank you both! Thanks for testing my codec so quickly, Michael!

RE: One question: it looks like this PFOR impl can only handle up to 28
bit wide ints? Which means... could it fail on some cases?
Though I suppose you would never see too many of these immense ints in
one block, and so they'd always be encoded as exceptions and so it's
actually safe...?

Hao: This won't fail. In my PFOR impl, I first call checkBigNumbers() to see if
there is any number >= 2^28. If there is, I force encoding the lower 4
bits using the 128 4-bit slots. Thus, all exceptions left to Simple16 are <
2^28, which can definitely be handled. So, there are no failure cases! :)
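
The arithmetic behind that claim can be checked with a small sketch. This is not the code from the patch: the class and method names below are hypothetical, and it only assumes that values are unsigned 32-bit ints and that Simple16 can hold a value of at most 28 bits. Once the low 4 bits go into a slot, whatever is left is at most 28 bits wide.

```java
public class BigNumberSplit {
    // Assumption: Simple16 can hold a single value of at most 28 bits.
    static final int SIMPLE16_MAX_BITS = 28;

    /** Returns true if any value in the block is >= 2^28 (treated as unsigned). */
    static boolean hasBigNumbers(int[] block) {
        for (int v : block) {
            if ((v >>> SIMPLE16_MAX_BITS) != 0) {
                return true;
            }
        }
        return false;
    }

    /** Low b bits, which would go into the fixed-width slot. */
    static int lowBits(int v, int b) {
        return v & ((1 << b) - 1);
    }

    /** Remaining high bits, which would be left to the exception encoder. */
    static int highBits(int v, int b) {
        return v >>> b;
    }

    public static void main(String[] args) {
        int[] block = {5, 17, 1 << 30, 0xFFFFFFFF};
        int b = 4; // hypothetical: slot width forced to 4 bits for a "big" block
        System.out.println("hasBigNumbers=" + hasBigNumbers(block));
        for (int v : block) {
            int high = highBits(v, b);
            boolean fits = (high >>> SIMPLE16_MAX_BITS) == 0;
            System.out.println(Integer.toHexString(v) + " -> low=" + lowBits(v, b)
                + " high=" + Integer.toHexString(high) + " fitsSimple16=" + fits);
        }
    }
}
```

Even for the largest 32-bit value, the part left over after removing 4 low bits is 0x0FFFFFFF, which is below 2^28.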

BTW, my PFOR impl produces a smaller index than VInt and the other PFOR impls.
Thus, if the use case is real-time search, which requires loading the index from
disk into memory frequently, my PFOR impl may save even more.


  





 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch, LUCENE-2903.patch, for_pfor.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch:
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 FrameOfRef is a very basic one which is essentially a binary encoding
 (may result in huge index size).
 PatchedFrameOfRef is the implementation based on the original version of
 PForDelta in the literature.
 PatchedFrameOfRef2 is my previous implementation, which is improved this
 time. (The codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug in my previous version (in Lucene-1410.patch), where the
 old PForDelta did not support very large exceptions (since
 Simple16 does not support very large numbers). This has been fixed in
 the new LCPForDelta.
 2. I changed the PForDeltaFixedIntBlockCodec. It is now faster than the other
 two PForDelta implementations in the bulk branch (FrameOfRef and
 PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the
 CodecProvider and PForDeltaFixedIntBlockCodec.
 3. The performance test results are:
 1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef
 for almost all kinds of queries, and slightly worse than BulkVInt.
 2) My NewPForDelta codec results in the smallest index size among all 4
 methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
 3) All performance test results were obtained by running with -server
 instead of -client.




[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-15 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE-2903.patch

This new patch provides PForDeltaFixedIntBlockWithIntBufferCodec
(PatchedFrameOfRef4), which improves on the performance of its previous
counterparts (PatchedFrameOfRef4,5,6). Note that this PatchedFrameOfRef4 is
different from the previous PatchedFrameOfRef4.

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch, LUCENE-2903.patch, LUCENE_2903.patch, 
 LUCENE_2903.patch





[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-15 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE-2903.patch

This patch improves the performance of the previous PatchedFrameOfRef4 and removes
PatchedFrameOfRef5 and PatchedFrameOfRef6. The performance
of PatchedFrameOfRef4 is now better than BulkVInt and comparable to
PatchedFrameOfRef in my tests.

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch





[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-15 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: (was: LUCENE_2903.patch)

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch





[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-15 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: (was: LUCENE-2903.patch)

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch





[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-15 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: (was: LUCENE-2903.patch)

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992687#comment-12992687
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Robert and Michael

In order to test whether ByteBuffer/IntBuffer works better than manual
int[] <-> byte[] conversion, I have now separated them into 3 different codecs.
All of them use the same PForDelta implementation, except that they use
different IndexInput/IndexOutput handling, as follows.

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[] <-> byte[]
manually. Its corresponding Java code is: PForDeltaFixedIntBlockCodec.java

2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[] <-> byte[]
via ByteBuffer/IntBuffer. Its corresponding Java code is:
PForDeltaFixedIntBlockWithByteBufferCodec.java

3. PatchedFrameOfRef5 - uses in.readInt() in a loop, so it does not need any
conversion. Its corresponding Java code is:
PForDeltaFixedIntBlockWithReadIntCodec.java
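
To make the three read paths concrete, here is a minimal standalone sketch. It is not taken from the patch: the class and method names are hypothetical, big-endian byte order is assumed, and java.io.DataInputStream.readInt() is used only as a stand-in for the in.readInt() loop so the example runs without Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class BytesToInts {
    /** Variant 1: manual conversion, assembling each int from 4 big-endian bytes. */
    static int[] manual(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        for (int i = 0, j = 0; i < out.length; i++, j += 4) {
            out[i] = ((bytes[j] & 0xFF) << 24)
                   | ((bytes[j + 1] & 0xFF) << 16)
                   | ((bytes[j + 2] & 0xFF) << 8)
                   |  (bytes[j + 3] & 0xFF);
        }
        return out;
    }

    /** Variant 2: bulk conversion through an IntBuffer view (big-endian by default). */
    static int[] viaIntBuffer(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        ByteBuffer.wrap(bytes).asIntBuffer().get(out);
        return out;
    }

    /** Variant 3: one readInt() call per value, no intermediate byte[] conversion. */
    static int[] viaReadInt(byte[] bytes) {
        int[] out = new int[bytes.length / 4];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            for (int i = 0; i < out.length; i++) {
                out[i] = in.readInt();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = {0, 0, 0, 1, 0, 0, 1, 0, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFE};
        System.out.println(java.util.Arrays.toString(manual(bytes)));
        System.out.println(java.util.Arrays.toString(viaIntBuffer(bytes)));
        System.out.println(java.util.Arrays.toString(viaReadInt(bytes)));
    }
}
```

All three variants decode the same values; they differ only in how much per-int work happens in Java code versus inside the buffer or stream classes, which is the difference the three codecs are meant to measure.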

I tested them against BulkVInt on MacOS. The detailed results are attached.
Here are the conclusions:

1) Yes, Michael and Robert, you guys are right! ByteBuffer/IntBuffer are faster
than my manual conversion between byte[]/int[]. I guess the reason I thought they
were worse is that I did not separate the codecs before, so the test results
were not stable due to JVM/JIT effects.

2) PatchedFrameOfRef4 is still worse than BulkVInt for many kinds of
queries. However, it seems to do better for fuzzy queries and
wildcard queries.

3) Of course, PatchedFrameOfRef3,4,5 are all better than
PatchedFrameOfRef and FrameOfRef for almost all queries.

4) The new patch has just been uploaded; please check it out.

The following are the experimental results for 0.1M data.

(1) bulkVInt VS patchedFrameOfRef4 (withByteBuffer, in.readBytes(..) )

Query                              QPS bulkVInt  QPS patchedFrameOfRef4-withByteBuffer  Pct diff
united states                            389.26                                 361.79     -7.1%
united states~3                          234.52                                 228.99     -2.4%
+nebraska +states                       1138.95                                 992.06    -12.9%
+united +states                          670.69                                 603.86    -10.0%
doctimesecnum:[1 TO 6]                   415.28                                 447.83      7.8%
doctitle:.*[Uu]nited.*                   496.03                                 522.47      5.3%
spanFirst(unit, 5)                      1176.47                                1086.96     -7.6%
spanNear([unit, state], 10, true)        502.26                                 423.73    -15.6%
states                                  1612.90                                1453.49     -9.9%
u*d                                      167.95                                 171.17      1.9%
un*d                                     260.69                                 275.33      5.6%
uni*                                     602.41                                 577.37     -4.2%
unit*                                   1016.26                                1041.67      2.5%
united states                            617.28                                 549.45    -11.0%
united~0.6                                12.22                                  12.93      5.9%
united~0.75                               53.88                                  56.78      5.4%
unit~0.5                                  12.58                                  13.19      4.9%
unit~0.7                                  52.41                                  54.93      4.8%
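
For reading these tables: the "Pct diff" column appears to be the relative QPS change of the second codec against the first, which is consistent with the rows above. A minimal sketch of that calculation follows; the class and method names are hypothetical, and the formula is inferred from the numbers rather than stated in the benchmark output.

```java
import java.util.Locale;

public class PctDiff {
    /** Relative change of the candidate QPS against the baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return (candidateQps - baselineQps) / baselineQps * 100.0;
    }

    /** Formats the value with one decimal, as in the tables. */
    static String format(double pct) {
        return String.format(Locale.ROOT, "%.1f%%", pct);
    }

    public static void main(String[] args) {
        // Two rows taken from the bulkVInt vs patchedFrameOfRef4 table
        System.out.println(format(pctDiff(389.26, 361.79))); // united states
        System.out.println(format(pctDiff(415.28, 447.83))); // doctimesecnum:[1 TO 6]
    }
}
```

A negative value means the second codec answered fewer queries per second than the baseline, a positive value means more.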

(2) bulkVInt VS patchedFrameOfRef3 (with my own int[] <-> byte[] conversion,
still in.readBytes(..))

Query                              QPS bulkVInt  QPS patchedFrameOfRef3  Pct diff
united states                            388.50                  363.24     -6.5%
united states~3                          234.80                  223.56     -4.8%
+nebraska +states                       1138.95                 1016.26    -10.8%
+united +states                          671.14                  607.90     -9.4%
doctimesecnum:[1 TO 6]                   418.24                  441.89      5.7%
doctitle:.*[Uu]nited.*                   489.00                  522.74      6.9%
spanFirst(unit, 5)                      1246.88                 1127.40     -9.6%
spanNear([unit, state], 10, true)        514.14                  473.71     -7.9%
states                                  1612.90                 1488.10     -7.7%
u*d                                      170.77                  167.31     -2.0%
un*d                                     261.37                  264.48      1.2%
uni*                                     609.38                  602.41     -1.1%
unit*                                   1028.81                 1052.63      2.3%
united states                            614.25                  564.33     -8.1%
united~0.6                                12.05                   12.11      0.5%
united~0.75                               53.16                   54.97      3.4%
unit~0.5                                  12.43                   12.50      0.6%
unit~0.7                                  52.81                   53.23      0.8%


(3) bulkVInt VS patchedFrameOfRef5 (with in.readInt() in a loop, no
conversion)

Query                              QPS bulkVInt  QPS patchedFrameOfRef5-withReadInt  Pct diff
united states                            391.24                              366.70     -6.3%
united states~3                          235.40                              235.07     -0.1%
+nebraska +states                       1137.66                             1072.96     -5.7%
+united +states                          673.40                              642.26     -4.6%
doctimesecnum:[1 TO 6]                   414.25                              407.66     -1.6%
doctitle:.*[Uu]nited.*                   492.61                              538.21      9.3%
spanFirst(unit, 5)                      1253.13                             1175.09     -6.2%
spanNear([unit, state], 10, true)        511.25                              483.56     -5.4%
states                                  1642.04                             1490.31     -9.2%
u*d                                      166.78                              160.28     -3.9%
un*d                                     261.64                              255.36     -2.4%
uni*                                     609.38                              593.47     -2.6%
unit*                                   1026.69

[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE-2903.patch

This patch further improves the PForDelta codec (PForDeltaFixedIntBlockCodec).
I used 3 different implementations (3 codecs) for IndexInput/IndexOutput. In
particular,

1. PatchedFrameOfRef3 - uses in.readBytes() and converts int[] <-> byte[]
manually. Its corresponding Java code is: PForDeltaFixedIntBlockCodec.java

2. PatchedFrameOfRef4 - uses in.readBytes() and converts int[] <-> byte[] via
ByteBuffer/IntBuffer. Its corresponding Java code is:
PForDeltaFixedIntBlockWithByteBufferCodec.java

3. PatchedFrameOfRef5 - uses in.readInt() in a loop, so it does not need any
conversion. Its corresponding Java code is:
PForDeltaFixedIntBlockWithReadIntCodec.java




 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-09 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992809#comment-12992809
 ] 

hao yan commented on LUCENE-2903:
-

just uploaded. Sorry. 

 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE-2903.patch, LUCENE_2903.patch, LUCENE_2903.patch





[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-08 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992237#comment-12992237
 ] 

hao yan commented on LUCENE-2903:
-

I tried moving memory allocation out of readBlock() into BlockReader's
constructor. It improves the performance a little. I also tried using
ByteBuffer/IntBuffer to replace my manual conversion between byte[]/int[]. It
made things worse.
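
The allocation change can be illustrated with a small standalone sketch. It does not use the real BlockReader interface from the bulk branch; the class below is hypothetical and only shows the general idea of allocating the block buffer once in the constructor and reusing it on every readBlock() call.

```java
public class BlockReaderSketch {
    // Allocated once per reader instead of once per readBlock() call.
    private final int[] buffer;

    BlockReaderSketch(int blockSize) {
        this.buffer = new int[blockSize];
    }

    /**
     * Copies one block from the source into the reusable buffer and returns it.
     * The caller must consume the values before the next readBlock() call,
     * because the same array is overwritten each time.
     */
    int[] readBlock(int[] source, int offset) {
        System.arraycopy(source, offset, buffer, 0, buffer.length);
        return buffer;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        BlockReaderSketch reader = new BlockReaderSketch(4);
        int[] first = reader.readBlock(data, 0);
        System.out.println(first[0] + ".." + first[3]);
        int[] second = reader.readBlock(data, 4);
        System.out.println(second[0] + ".." + second[3]);
        System.out.println("same array reused: " + (first == second));
    }
}
```

The trade-off is that the returned array is only valid until the next call, which is fine for a decoder that consumes one block at a time.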

The following is my result for 0.1M data:
(1) BulkVInt vs patchedFrameOfRef3
Query                              QPS bulkVInt  QPS patchedFrameOfRef3  Pct diff
united states                            393.55                  362.84     -7.8%
united states~3                          243.84                  236.80     -2.9%
+nebraska +states                       1140.25                  998.00    -12.5%
+united +states                          687.76                  633.31     -7.9%
doctimesecnum:[1 TO 6]                   413.56                  427.53      3.4%
doctitle:.*[Uu]nited.*                   510.46                  534.47      4.7%
spanFirst(unit, 5)                      1240.69                 1108.65    -10.6%
spanNear([unit, state], 10, true)        511.77                  463.18     -9.5%
states                                  1626.02                 1483.68     -8.8%
u*d                                      164.23                  162.79     -0.9%
un*d                                     257.53                  252.97     -1.8%
uni*                                     607.53                  591.02     -2.7%
unit*                                   1024.59                 1043.84      1.9%
united states                            627.35                  578.70     -7.8%
united~0.6                                11.51                   11.36     -1.3%
united~0.75                               52.58                   53.57      1.9%
unit~0.5                                  12.08                   11.93     -1.2%
unit~0.7                                  50.98                   51.30      0.6%

(2) FrameOfRef VS PatchedFrameOfRef3
Query                              QPS patchedFrameOfRef  QPS patchedFrameOfRef3  Pct diff
united states                                     314.76                  362.71     15.2%
united states~3                                   227.53                  237.08      4.2%
+nebraska +states                                1075.27                 1025.64     -4.6%
+united +states                                   646.41                  626.57     -3.1%
doctimesecnum:[1 TO 6]                            412.88                  429.37      4.0%
doctitle:.*[Uu]nited.*                            481.70                  528.82      9.8%
spanFirst(unit, 5)                               1060.45                 1118.57      5.5%
spanNear([unit, state], 10, true)                 409.33                  467.73     14.3%
states                                           1353.18                 1479.29      9.3%
u*d                                               158.91                  165.98      4.4%
un*d                                              237.36                  256.41      8.0%
uni*                                              560.22                  593.12      5.9%
unit*                                             946.97                 1043.84     10.2%
united states                                     431.22                  583.09     35.2%
united~0.6                                         10.91                   11.37      4.2%
united~0.75                                        50.30                   53.30      5.9%
unit~0.5                                           11.54                   11.94      3.5%
unit~0.7                                           47.38                   50.38      6.3%


(3) PatchedFrameOfRef vs PatchedFrameOfRef3
Query                              QPS FrameOfRef  QPS patchedFrameOfRef3  Pct diff
united states                              326.26                  360.49     10.5%
united states~3                            226.50                  234.69      3.6%
+nebraska +states                         1077.59                 1021.45     -5.2%
+united +states                            648.51                  630.52     -2.8%
doctimesecnum:[1 TO 6]                     324.46                  428.45     32.0%
doctitle:.*[Uu]nited.*                     485.44                  527.70      8.7%
spanFirst(unit, 5)                        1007.05                     .11     10.3%
spanNear([unit, state], 10, true)          446.03                  465.55      4.4%
states                                    1449.28                 1459.85      0.7%
u*d                                        158.43                  161.79      2.1%
un*d                                       246.37                  256.28      4.0%
uni*                                       548.85                  594.88      8.4%
unit*                                      920.81                 1042.75     13.2%
united states                              450.65                  576.37     27.9%
united~0.6                                  11.07                   11.26      1.7%
united~0.75                                 50.70                   52.60      3.8%
unit~0.5                                    11.64                   11.76      1.0%
unit~0.7                                    49.04                   50.70      3.4%
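
The "Pct diff" column appears to be the relative change of the second QPS column over the first; the rows above are consistent with that reading. A small sketch of the arithmetic (the class and method names are made up for illustration):

```java
import java.util.Locale;

final class PctDiffSketch {
    /** Relative change of the candidate QPS over the baseline QPS, in percent. */
    static double pctDiff(double baselineQps, double candidateQps) {
        return (candidateQps - baselineQps) / baselineQps * 100.0;
    }

    /** Formats with one decimal place, as in the tables. */
    static String format(double pct) {
        return String.format(Locale.ROOT, "%.1f%%", pct);
    }

    public static void main(String[] args) {
        // "united states", table (1): 393.55 -> 362.84
        System.out.println(format(pctDiff(393.55, 362.84))); // prints -7.8%
        // "united states", table (2): 314.76 -> 362.71
        System.out.println(format(pctDiff(314.76, 362.71))); // prints 15.2%
    }
}
```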




 Improvement of PForDelta Codec
 --

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan
 Attachments: LUCENE_2903.patch, LUCENE_2903.patch


 There are 3 versions of PForDelta implementations in the Bulk Branch: 
 FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.
 The FrameOfRef is a very basic one which is essentially a binary encoding 
 (may result in huge index size).
 The PatchedFrameOfRef is the implementation based on the original version of 
 PForDelta in the literature.
 The PatchedFrameOfRef2 is my previous implementation, which is improved this 
 time. (The codec name is changed to NewPForDelta.)
 In particular, the changes are:
 1. I fixed the bug of my previous version (in Lucene-1410.patch), where the 
 old PForDelta does not support very large exceptions (since
 the Simple16 does not support very large numbers). Now this has been fixed in 
 the new LCPForDelta.
 2. I changed the 

[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991220#comment-12991220
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Michael

Did you try FrameOfRef and PatchedFrameOfRef? 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991222#comment-12991222
 ] 

hao yan commented on LUCENE-2903:
-

And using IntBuffer.put/get certainly complicates the PForDelta algorithm a lot.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-06 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12991221#comment-12991221
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Paul

I tested ByteBuffer/IntBuffer; it is not faster than converting between int[] 
and byte[] manually. 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990214#comment-12990214
 ] 

hao yan commented on LUCENE-2903:
-

I think the above step essentially also needs to do the int-byte-int 
conversion, so there is no reason it would be cheaper than doing it manually.




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-03 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12990480#comment-12990480
 ] 

hao yan commented on LUCENE-2903:
-

Yes. The other PFOR implementations (FrameOfRef and PatchedFrameOfRef) are even 
slower (as long as you set -server when you run them). I am also wondering why. 
Actually, I think the Wikipedia data is somewhat biased. Do you have any other 
data sets? 




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989754#comment-12989754
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Paul. Thanks for the suggestions. I just uploaded a new patch which renames 
the codec to PatchedFrameOfRef3. 

I actually have a question. The BulkVInt codec writes the compressed byte 
stream as a chunk of bytes. However, in the PForDelta-related codecs the 
compressed results are ints, so I have to either write single ints in a loop, 
or first convert the int array to a byte array and then call out.writeBytes(). 
Do you know a smarter way to write an int array to an IndexOutput? 

Another thing I tried is making PForDelta itself produce byte-wise compressed 
results. However, in my experiments this slows PForDelta down 
significantly. Also, I do not think the NIO buffers used in FrameOfRef and 
PatchedFrameOfRef help, since they essentially do the same thing: first 
convert the int array to a byte array and then call writeBytes().

Do you have any good suggestions? Thanks! 
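
For reference, the "convert, then write in one call" option can be sketched as follows. This is only an illustration: the class name is hypothetical, a ByteArrayOutputStream stands in for the IndexOutput, and big-endian byte order is an assumption, not something taken from the patch.

```java
import java.io.ByteArrayOutputStream;

final class IntArrayWriterSketch {
    /** Packs the first count ints into bytes (big-endian) for a single bulk write. */
    static byte[] toBytes(int[] ints, int count) {
        byte[] out = new byte[count * 4];
        for (int i = 0, j = 0; i < count; i++) {
            int v = ints[i];
            out[j++] = (byte) (v >>> 24);
            out[j++] = (byte) (v >>> 16);
            out[j++] = (byte) (v >>> 8);
            out[j++] = (byte) v;
        }
        return out;
    }

    /** Stand-in for one bulk out.writeBytes(...) call on the real output. */
    static void writeBlock(ByteArrayOutputStream out, int[] ints, int count) {
        byte[] bytes = toBytes(ints, count);
        out.write(bytes, 0, bytes.length);
    }
}
```

The alternative, one write call per int, avoids the temporary byte[] but pays the per-call overhead for every value.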




[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE_2903.patch

This patch renames NewPForDeltaCodec to PatchedFrameOfRef3 to follow the 
naming tradition.

It also adds back the BulkVInt all-ones trick. (I removed it accidentally in the 
last patch.)




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-02 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989872#comment-12989872
 ] 

hao yan commented on LUCENE-2903:
-

Yes, using ByteBuffer.asIntBuffer() is the same as converting between int and 
byte arrays manually. I think the underlying implementation of 
ByteBuffer.asIntBuffer() cannot avoid that conversion. I also tried 
ByteBuffer/IntBuffer, and the result is worse, which makes sense since it may 
incur extra costs.
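
A small sketch of the NIO variant discussed here, using only java.nio; the class name is hypothetical. For a heap buffer with the default big-endian order, the int view stores each int as four bytes in the backing array, so the resulting bytes match what a manual shift-and-mask loop would produce:

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

final class IntBufferViewSketch {
    /** Converts ints to bytes through an IntBuffer view of a heap ByteBuffer. */
    static byte[] toBytesViaView(int[] ints) {
        // ByteBuffer.allocate() returns a heap buffer; byte order is big-endian by default.
        ByteBuffer bytes = ByteBuffer.allocate(ints.length * 4);
        IntBuffer view = bytes.asIntBuffer();
        // Bulk put through the view; the data lands in the ByteBuffer's backing array.
        view.put(ints);
        return bytes.array();
    }
}
```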

Where to holler? :) 




[jira] Created: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-01 Thread hao yan (JIRA)
Improvement of PForDelta Codec
--

 Key: LUCENE-2903
 URL: https://issues.apache.org/jira/browse/LUCENE-2903
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: hao yan


There are 3 PForDelta implementations in the Bulk Branch: 
FrameOfRef, PatchedFrameOfRef, and PatchedFrameOfRef2.

FrameOfRef is a very basic one which is essentially a binary encoding (it may 
result in a huge index size).
PatchedFrameOfRef is the implementation based on the original version of 
PForDelta in the literature.
PatchedFrameOfRef2 is my previous implementation, which is improved this 
time. (The codec name is changed to NewPForDelta.)

In particular, the changes are:
1. I fixed the bug in my previous version (in Lucene-1410.patch), where the old 
PForDelta did not support very large exceptions (since 
Simple16 does not support very large numbers). This has been fixed in 
the new LCPForDelta.

2. I changed the PForDeltaFixedIntBlockCodec. Now it is faster than the other 
two PForDelta implementations in the bulk branch (FrameOfRef and 
PatchedFrameOfRef). The codec's name is NewPForDelta, as you can see in the 
CodecProvider and PForDeltaFixedIntBlockCodec.

3. The performance test results are:
1) My NewPForDelta codec is faster than FrameOfRef and PatchedFrameOfRef for 
almost all kinds of queries, and slightly slower than BulkVInt.
2) My NewPForDelta codec can result in the smallest index size among all 4 
methods (FrameOfRef, PatchedFrameOfRef, BulkVInt, and itself).
3) All performance test results were obtained by running with -server instead 
of -client.
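
To illustrate item 1: Simple16 packs values into the 28 data bits of a 32-bit word (the other 4 bits select the layout), so a single value of 2^28 or more cannot be encoded directly. The sketch below uses hypothetical names and is not the code from the patch; it only shows how exceptions for a bit width b can be counted and checked against that limit, assuming 0 <= b < 32 and non-negative inputs.

```java
final class PForExceptionSketch {
    // Simple16 uses 4 selector bits and 28 data bits per 32-bit word.
    static final int SIMPLE16_DATA_BITS = 28;

    /** Counts the values in the block that do not fit in b bits (the exceptions). */
    static int countExceptions(int[] block, int b) {
        int n = 0;
        for (int v : block) {
            if ((v >>> b) != 0) {
                n++;
            }
        }
        return n;
    }

    /** True if the value can be stored as a single Simple16 element. */
    static boolean fitsSimple16(int value) {
        return (value >>> SIMPLE16_DATA_BITS) == 0;
    }

    /** The part of an exception above the low b bits (one possible layout, assumed here). */
    static int highBits(int value, int b) {
        return value >>> b;
    }
}
```

An encoder that hands whole exception values to Simple16 therefore needs a fallback for values that fail fitsSimple16(); storing only highBits() is one way to shrink them, though whether the patch does this is not stated here.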




[jira] Updated: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-01 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-2903:


Attachment: LUCENE_2903.patch

Patch for the improvement of PForDeltaFixedIntBlockCodec




[jira] Commented: (LUCENE-2903) Improvement of PForDelta Codec

2011-02-01 Thread hao yan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12989532#comment-12989532
 ] 

hao yan commented on LUCENE-2903:
-

Hi, Robert

Sorry, that was a mistake. I commented that one out just for debugging, to see 
whether it affects performance. I should have changed it back. I will attach a 
new patch. 

Thanks for pointing that out. 




[jira] Updated: (LUCENE-1410) PFOR implementation

2010-12-16 Thread hao yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hao yan updated LUCENE-1410:


Attachment: LUCENE-1410.patch

This patch adds codec support for the PForDelta compression algorithm.


Changes by Hao Yan (hyan2...@gmail.com)

In summary, I added five files to support and test the codec.

In Src,
1.  org.apache.lucene.index.codecs.pfordelta.PForDelta.java
2.  org.apache.lucene.index.codecs.pfordelta.Simple16.java
3.  org.apache.lucene.index.codecs.PForDeltaFixedBlockCodec.java
4.  
org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutputWithGetElementNum.java

In Test,
5.  
org.apache.lucene.index.codecs.intblock.TestPForDeltaFixedIntBlockCodec.java

1)  The first class, PForDelta, is the core implementation
of the PForDelta algorithm. It compresses exceptions using Simple16,
which is implemented in the second class, Simple16.
2)  The third class, PForDeltaFixedBlockCodec, is similar to
org.apache.lucene.index.codecs.mockintblock.MockFixedIntBlockCodec in
Test, except that it uses PForDelta to encode the data in the buffer.
3)  The fourth class is almost the same as
org.apache.lucene.index.codecs.intblock.FixedIntBlockIndexOutput,
except that it provides an additional public function to retrieve the
value of the upto field, which is a private field in
FixedIntBlockIndexOutput. I added this public function because
the number of elements in a block that hold meaningful values is not 
always equal to the blockSize or the buffer
size: the last block/buffer of a stream of data usually
contains fewer elements. In that case, I fill all elements after the 
meaningful ones with 0s. Thus, we always compress one entire block.

4)  The last class is the unit test for PForDeltaFixedIntBlockCodec,
which is very similar to
org.apache.lucene.index.codecs.mintblock.TestIntBlockCodec.
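
The zero-padding of a partially filled last block described in item 3) can be sketched as follows. The class and method names are hypothetical; upto is the number of meaningful elements, as in the description above.

```java
import java.util.Arrays;

final class BlockPaddingSketch {
    /**
     * Zero-fills the slots [upto, buffer.length) in place, so that a full
     * block can be handed to the compressor even when the stream ended early.
     */
    static int[] padBlock(int[] buffer, int upto) {
        Arrays.fill(buffer, upto, buffer.length, 0);
        return buffer;
    }
}
```

Because the buffer is reused between blocks, the tail may still hold values from the previous block; clearing it keeps stale data out of the compressed output.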

I also changed the LuceneTestCase class to add the new
PForDeltaFixedIntBlockCodec.

The unit tests and all Lucene tests have passed.


 PFOR implementation
 ---

 Key: LUCENE-1410
 URL: https://issues.apache.org/jira/browse/LUCENE-1410
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Paul Elschot
Priority: Minor
 Fix For: Bulk Postings branch

 Attachments: autogen.tgz, for-summary.txt, 
 LUCENE-1410-codecs.tar.bz2, LUCENE-1410.patch, LUCENE-1410.patch, 
 LUCENE-1410.patch, LUCENE-1410b.patch, LUCENE-1410c.patch, 
 LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java, 
 TestPFor2.java, TestPFor2.java

   Original Estimate: 21840h
  Remaining Estimate: 21840h

 Implementation of Patched Frame of Reference.




[jira] Created: (LUCENE-2750) add Kamikaze 3.0.1 into Lucene

2010-11-08 Thread hao yan (JIRA)
add Kamikaze 3.0.1 into Lucene
--

 Key: LUCENE-2750
 URL: https://issues.apache.org/jira/browse/LUCENE-2750
 Project: Lucene - Java
  Issue Type: Sub-task
  Components: contrib/*
Reporter: hao yan


Kamikaze 3.0.1 is the updated version of Kamikaze 2.0.0. It can achieve 
significantly better performance than Kamikaze 2.0.0 in terms of both 
compressed size and decompression speed. The main difference between the two 
versions is that Kamikaze 3.0.x uses a much more efficient implementation of the 
PForDelta compression algorithm. My goal is to integrate this highly efficient 
PForDelta implementation into a Lucene codec.
