Re: IndexWriter.deleteDocuments(Query query)

John Wang Wed, 01 Apr 2009 15:37:57 -0700

a code snippet is worth 1000 words :)


// one well-known term per document; the uid itself travels in the payload
private static final Term UID_TERM = new Term("uid_payload", "_UID");

private static class SinglePayloadTokenStream extends TokenStream {
   private Token token = new Token(UID_TERM.text(), 0, 0);
   private byte[] buffer = new byte[4];
   private boolean returnToken = false;

   // encode the uid into 4 bytes, low byte first, and arm the stream
   void setUID(int uid) {
     buffer[0] = (byte) (uid);
     buffer[1] = (byte) (uid >> 8);
     buffer[2] = (byte) (uid >> 16);
     buffer[3] = (byte) (uid >> 24);
     token.setPayload(new Payload(buffer));
     returnToken = true;
   }

   // emit exactly one token per setUID call, then signal end of stream
   public Token next() throws IOException {
     if (returnToken) {
       returnToken = false;
       return token;
     } else {
       return null;
     }
   }
 }
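
One thing to watch: setUID writes the low byte first, so whatever reads the payload back has to use the same order. The bytesToInt used further down is not shown in this mail; the following is only a minimal sketch of a decoder that matches the encoding above. The class name UidPayloadCodec and the intToBytes helper are made up for illustration and are not part of Lucene or of our code.

```java
// Hypothetical helper, not part of Lucene: round-trips an int through the
// same 4-byte, low-byte-first layout that setUID writes into the payload.
final class UidPayloadCodec {
    private UidPayloadCodec() {}

    /** Encodes uid low byte first (same order as setUID). */
    static byte[] intToBytes(int uid) {
        return new byte[] {
            (byte) uid,
            (byte) (uid >> 8),
            (byte) (uid >> 16),
            (byte) (uid >> 24)
        };
    }

    /** Decodes 4 bytes written low byte first back into an int. */
    static int bytesToInt(byte[] b) {
        // mask with 0xFF so negative bytes do not sign-extend into the upper bits
        return (b[0] & 0xFF)
             | ((b[1] & 0xFF) << 8)
             | ((b[2] & 0xFF) << 16)
             | ((b[3] & 0xFF) << 24);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 255, 256, -1, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int uid : samples) {
            if (bytesToInt(intToBytes(uid)) != uid) {
                throw new AssertionError("round trip failed for " + uid);
            }
        }
        System.out.println("round trip ok");
    }
}
```

The 0xFF masks are the part that is easy to get wrong: without them any byte of 0x80 or above corrupts the higher bits.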


When building docs:

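// call singlePayloadTokenStream.setUID(...) for this doc first;
// next() only emits the token after setUID has been called
f = new Field(UID_TERM.field(), singlePayloadTokenStream);
doc.add(f);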


When we load the index, we do:


int maxDoc = reader.maxDoc();
_uidArray = new int[maxDoc];
TermPositions tp = null;
byte[] payloadBuffer = new byte[4];       // four bytes for an int
try
{
  tp = reader.termPositions(UID_TERM);
  int idx = 0;
  while (tp.next())
  {
    int doc = tp.doc();
    assert doc < maxDoc;

    while (idx < doc) _uidArray[idx++] = -1; // fill the gap

    tp.nextPosition();                       // must advance before reading the payload
    tp.getPayload(payloadBuffer, 0);
    int uid = bytesToInt(payloadBuffer);     // decode low byte first, matching setUID
    if (uid < _minUID) _minUID = uid;
    if (uid > _maxUID) _maxUID = uid;
    _uidArray[idx++] = uid;
  }
  while (idx < maxDoc) _uidArray[idx++] = -1; // fill the tail after the last doc with a uid
}
finally
{
  if (tp != null)
  {
    tp.close();
  }
}



This is actually code Mike B. posted a while back.
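
For point 2 in the quoted mail below (walking the docid->uid array and collecting the docids whose uid is in the deleted set), here is a rough sketch. It assumes the array is filled the way the loading code above does it, with -1 marking docids that have no uid, and that real uids are never -1. The class and method names are hypothetical, not our actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: one linear pass over the docid->uid array,
// no uid->docid lookup needed.
final class DeletedDocScan {
    private DeletedDocScan() {}

    /** Returns the docids whose uid appears in deletedUids, in docid order. */
    static List<Integer> findDeletedDocIds(int[] uidArray, Set<Integer> deletedUids) {
        List<Integer> docIds = new ArrayList<Integer>();
        for (int docId = 0; docId < uidArray.length; docId++) {
            int uid = uidArray[docId];
            // -1 is assumed to mean "no uid for this docid", so skip it
            if (uid != -1 && deletedUids.contains(uid)) {
                docIds.add(docId);
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] uidArray = {10, -1, 12, 13};   // docid 1 is a gap
        Set<Integer> deleted = new HashSet<Integer>(Arrays.asList(10, 12, 99));
        System.out.println(findDeletedDocIds(uidArray, deleted)); // prints [0, 2]
    }
}
```

The cost is one pass over maxDoc entries per batch of deletes, independent of how many uids are in the deleted set.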


-John


On Wed, Apr 1, 2009 at 2:29 PM, Michael McCandless <[email protected]> wrote:

> On Wed, Apr 1, 2009 at 5:22 PM, John Wang <[email protected]> wrote:
> > Hi Michael:
> >
> >    1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term)
> > is doing under the cover.
>
> This part I understand :)
>
> >    2) We iterate the docid->uid mapping: for each docid, get the
> > corresponding uid and check whether it is in the deleted set. If so,
> > add the docid to the list. There is no uid->docid lookup needed.
>
> Does this mean you iterate all docs in the index, and only when you
> come across a UID that's deleted, you add to deleted set?
>
> Do you have a separate payload field per document?  (I'm still unclear
> how you use payloads to encode the full docID -> UID map).
>
> >      However, in our sharded architecture, we partition by continuous uids,
> > in which case we keep both mappings since we know the range of the uids.
> > In that case, the uid->docid mapping is available.
>
> OK
>
> Mike
>
