Re: BytesRef violates the principle of least astonishment

Olivier Binda Tue, 19 May 2015 22:21:26 -0700

My take:
Indeed, BytesRef is mutable.

This is for performance reasons: it avoids unnecessary object creation and unnecessary copying. It also works around the Java "issue" that, for performance, you usually need to pass an array together with an offset and a length to methods, but you don't want to create a new array every time you do that.

In your case, you are supposed to copy your bytes because, indeed, the BytesRef will change every time you call a Lucene method on it (it is mutable), and the array it points to will change too, because these might be internal arrays of readers/buffers/codecs

(and you don't know the internal working of those)...
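To make the "copy your bytes" advice concrete, here is a minimal sketch that runs on the plain JDK. `Slice` and `ReusingEnum` are hypothetical stand-ins invented for this illustration, not Lucene classes; they only imitate an iterator that hands back the same mutable slice over a reused buffer. In Lucene itself, `BytesRef.deepCopyOf(text)` plays the role that `deepCopy()` plays here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DeepCopyDemo {
    /** Hypothetical stand-in for a mutable byte slice (array + offset + length). */
    static final class Slice {
        byte[] bytes;
        int offset;
        int length;

        Slice(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        /** Copies the referenced bytes into a fresh array owned by the new slice. */
        Slice deepCopy() {
            byte[] copy = Arrays.copyOfRange(bytes, offset, offset + length);
            return new Slice(copy, 0, copy.length);
        }

        String utf8ToString() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical iterator that reuses one buffer and one Slice for every term. */
    static final class ReusingEnum {
        private final String[] terms;
        private int pos = 0;
        private final byte[] buffer = new byte[64]; // assumes terms shorter than 64 bytes
        private final Slice shared = new Slice(buffer, 0, 0);

        ReusingEnum(String... terms) {
            this.terms = terms;
        }

        Slice next() {
            if (pos >= terms.length) {
                return null;
            }
            byte[] utf8 = terms[pos++].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(utf8, 0, buffer, 0, utf8.length); // overwrite in place
            shared.length = utf8.length;
            return shared; // same object on every call
        }
    }

    /** Collects all terms, optionally deep-copying each slice before keeping it. */
    static List<String> collect(boolean copy, String... terms) {
        ReusingEnum termsEnum = new ReusingEnum(terms);
        List<Slice> kept = new ArrayList<>();
        Slice s;
        while ((s = termsEnum.next()) != null) {
            kept.add(copy ? s.deepCopy() : s);
        }
        List<String> out = new ArrayList<>();
        for (Slice k : kept) {
            out.add(k.utf8ToString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(false, "term", "testing")); // every entry aliases the shared slice
        System.out.println(collect(true, "term", "testing"));  // every entry owns its own bytes
    }
}
```

Without the copy, every kept entry reads whatever the iterator wrote last; with the copy, each entry keeps the value it had when it was collected.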


Also, in my opinion,
Lucene rocks


On 05/20/2015 06:19 AM, Trejkaz wrote:

Hi all.

The Lucene 4 migration guide "helpfully" suggests to work with
BytesRef directly rather than converting to string, but I disagree.
Take the following example of building up a List<Term> by iterating a
TermsEnum. I think it is written in a fairly straight-forward fashion.
I added some println which aren't really there, to illustrate the
place I have my breakpoints.

     protected List<Term> toList(String field, TermsEnum termsEnum)
             throws IOException {
         List<Term> terms = new LinkedList<>();
         BytesRef text;
         //noinspection NestedAssignment
         while((text = termsEnum.next()) != null) {
             Term term = new Term(field, text);
             System.out.println("in loop: " + term);
             terms.add(term);
         }
         System.out.println("at end: " + terms);
         return terms;
     }

When you actually try to call this, weird shit happens.

     in loop: content:term
     at end: [content:testing]
     in loop: content:extractor
     at end: [content:for]

Basically, by the time you exit the while loop, the BytesRef you put
into the Term has changed to point to the next term in the index. So
okay, so BytesRef is mutable. I hate mutable stuff, but luckily we
have clone() on this class, so I'll just clone it when creating the
term:

             Term term = new Term(field, text.clone());

Now the output is:

     in loop: content:term
     at end: [content:test]
     in loop: content:extractor
     at end: [content:forractor]

WTF?

Now it seems like it clones the length of the slice but not the actual
data, and the actual data has still changed underneath it. Great. So
basically, the only safe way to use BytesRef is to treat it like a hot
potato and immediately call utf8ToString() to get hold of an object
you can trust.

             Term term = new Term(field, text.utf8ToString());

And then finally you get:

     in loop: content:term
     at end: [content:term]
     in loop: content:extractor
     at end: [content:extractor]
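The odd "test" and "forractor" values seen with clone() are what you would expect if the clone keeps the old offset and length but still shares the underlying array. The sketch below reproduces that mechanism with plain JDK arrays; `readThroughShallowClone` is a hypothetical helper written for this illustration, and the code is an assumption about the behaviour, not Lucene's implementation.

```java
import java.nio.charset.StandardCharsets;

public class ShallowCloneDemo {
    /**
     * Simulates reading a shallow clone (saved offset and length, shared array)
     * after the shared array has been overwritten with the next term.
     */
    static String readThroughShallowClone(String first, String next) {
        byte[] a = first.getBytes(StandardCharsets.UTF_8);
        byte[] b = next.getBytes(StandardCharsets.UTF_8);
        byte[] shared = new byte[Math.max(a.length, b.length)];

        // The iterator writes the first term into its internal buffer.
        System.arraycopy(a, 0, shared, 0, a.length);

        // A shallow clone remembers only where to look, not the bytes themselves.
        int clonedOffset = 0;
        int clonedLength = a.length;

        // The iterator advances and overwrites the start of the same buffer.
        System.arraycopy(b, 0, shared, 0, b.length);

        // Reading through the clone mixes new bytes with the old length.
        return new String(shared, clonedOffset, clonedLength, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(readThroughShallowClone("term", "testing"));  // test
        System.out.println(readThroughShallowClone("extractor", "for")); // forractor
    }
}
```

A longer next term gets truncated to the old length, and a shorter one leaves the tail of the previous term visible, which matches both outputs above.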

I will probably eventually formalise this in our code by making
utility wrappers which don't expose BytesRef to the caller, since it's
so easy to do the wrong thing with it.

They say a good measure of the quality of a library is the number of
times you say "WTF" while trying to figure out how to use it. I have
already lost count.

TX

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


