[ https://issues.apache.org/jira/browse/LUCENE-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2944:
--------------------------------
Attachment: LUCENE-2944.patch
{quote}
I think producer should own the BytesRef and if consumer wants to hang onto it,
it should make a deep copy? This is consistent w/ TermAttribute...
{quote}
Here's a new patch implementing it this way. I refactored
TermToBytesRefAttribute into two methods, getBytesRef() and hash(). I find
this less confusing, and it removes some BytesRefs that were being created
needlessly here and there (e.g. in the queryparser). It also lets an attribute
do things like pre-size its reusable BytesRef to a huge size, or whatever else
custom attributes might want to do.
Here is the consumer code sample I added from the javadoc:
{code}
/*
 * Consumers of this attribute call getBytesRef() up-front, and then
 * invoke hash() for each term. Example:
 */
final TermToBytesRefAttribute termAtt =
    tokenStream.getAttribute(TermToBytesRefAttribute.class);
final BytesRef bytes = termAtt.getBytesRef();

while (tokenStream.incrementToken()) {
  /*
   * You must call termAtt.hash() even if you don't need the hashCode:
   * it encodes the term value (internally it might be a char[], etc.)
   * into the bytes.
   */
  int hashCode = termAtt.hash();

  if (isInteresting(bytes)) {
    /*
     * Because the bytes are reused by the attribute (like
     * CharTermAttribute's char[] buffer), you should make a copy if you
     * need persistent access to them; otherwise they will be rewritten
     * across calls to incrementToken().
     */
    doSomethingWith(new BytesRef(bytes));
  }
}
{code}
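To make the contract concrete without pulling in Lucene, here is a minimal, self-contained sketch of the same pattern. Slice and TermSource are hypothetical stand-ins I made up for BytesRef and the attribute; they are not the classes in the patch, and the hash is just Arrays.hashCode for illustration. It only tries to show why holding onto the shared reference goes wrong and why the deep copy fixes it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReusedBytesSketch {
    /** Hypothetical stand-in for BytesRef: a length-limited view of a byte array. */
    static final class Slice {
        byte[] bytes = new byte[16];
        int length;

        Slice() {}

        /** Deep copy: the new slice gets its own array. */
        Slice(Slice other) {
            this.bytes = Arrays.copyOf(other.bytes, other.length);
            this.length = other.length;
        }

        String utf8() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical stand-in for the attribute: it owns one reusable Slice. */
    static final class TermSource {
        private final Slice shared = new Slice();
        private final String[] terms;
        private int pos = -1;

        TermSource(String... terms) { this.terms = terms; }

        Slice getBytesRef() { return shared; }

        boolean incrementToken() { return ++pos < terms.length; }

        /** Encodes the current term into the shared slice and returns a hash. */
        int hash() {
            byte[] encoded = terms[pos].getBytes(StandardCharsets.UTF_8);
            if (encoded.length > shared.bytes.length) {
                shared.bytes = new byte[encoded.length];
            }
            System.arraycopy(encoded, 0, shared.bytes, 0, encoded.length);
            shared.length = encoded.length;
            return Arrays.hashCode(encoded);
        }
    }

    /** Keeps every term, either as a deep copy or as the shared reference. */
    static List<String> collect(boolean copy, String... terms) {
        TermSource source = new TermSource(terms);
        Slice bytes = source.getBytesRef();   // fetched once, up-front
        List<Slice> kept = new ArrayList<>();
        while (source.incrementToken()) {
            source.hash();                    // fills the shared slice
            kept.add(copy ? new Slice(bytes) : bytes);
        }
        List<String> out = new ArrayList<>();
        for (Slice s : kept) {
            out.add(s.utf8());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(true, "foo", "bar", "baz"));
        System.out.println(collect(false, "foo", "bar", "baz"));
    }
}
```

With copying, the consumer ends up with foo, bar, baz. Without it, every kept entry is the same object, so all three read back as baz, the last term written.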
> BytesRef reuse bugs in QueryParser and analysis.jsp
> ---------------------------------------------------
>
> Key: LUCENE-2944
> URL: https://issues.apache.org/jira/browse/LUCENE-2944
> Project: Lucene - Java
> Issue Type: Bug
> Reporter: Robert Muir
> Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2944.patch, LUCENE-2944.patch, LUCENE-2944.patch,
> LUCENE-2944_option2.patch
>
>
> Some code uses BytesRef as if it were a "String", in this case consumers of
> TermToBytesRefAttribute.
> The thing is, while our general implementation works on char[] and then
> populates the consumer's BytesRef,
> not all TermToBytesRefAttribute implementations do this. Specifically, ICU
> collation reuses the bytes and simply sets the pointers:
> {noformat}
> @Override
> public int toBytesRef(BytesRef target) {
>   collator.getRawCollationKey(toString(), key);
>   target.bytes = key.bytes;
>   target.offset = 0;
>   target.length = key.size;
>   return target.hashCode();
> }
> {noformat}
> Most of the blame falls on me as I added this to the queryparser in
> LUCENE-2514.
> Attached is a patch so that these consumers re-use a 'spare' and copy the
> bytes when they are going to make a long lasting object such as a Term.
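
The aliasing described in the issue can be reproduced without Lucene or ICU. The sketch below is only an illustration under my own assumptions: Slice, AliasingProducer and KeptTerm are hypothetical stand-ins for BytesRef, a pointer-setting toBytesRef(target) implementation, and a long-lived Term. It contrasts handing each term a fresh target (which still ends up sharing the producer's buffer) with the 'spare' plus copy approach.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SpareCopySketch {
    /** Hypothetical stand-in for BytesRef. */
    static final class Slice {
        byte[] bytes = new byte[0];
        int offset;
        int length;

        String utf8() {
            return new String(bytes, offset, length, StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical producer that points the target at its own internal buffer. */
    static final class AliasingProducer {
        private final byte[] key = new byte[32]; // terms here are assumed to fit

        void toBytesRef(String term, Slice target) {
            byte[] encoded = term.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(encoded, 0, key, 0, encoded.length);
            target.bytes = key;              // no copy: target aliases the buffer
            target.offset = 0;
            target.length = encoded.length;
        }
    }

    /** Hypothetical long-lived object, standing in for a Term. */
    static final class KeptTerm {
        final Slice value;
        KeptTerm(Slice value) { this.value = value; }
    }

    /** Copies only the live bytes into a fresh array. */
    static Slice deepCopy(Slice src) {
        Slice copy = new Slice();
        copy.bytes = Arrays.copyOfRange(src.bytes, src.offset, src.offset + src.length);
        copy.length = src.length;
        return copy;
    }

    /** Builds two long-lived terms, with or without the spare-and-copy fix. */
    static String[] buildTwoTerms(boolean copyFromSpare) {
        AliasingProducer producer = new AliasingProducer();
        Slice spare = new Slice();
        String[] inputs = {"apple", "kiwi"};
        KeptTerm[] terms = new KeptTerm[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            Slice target = copyFromSpare ? spare : new Slice();
            producer.toBytesRef(inputs[i], target);
            terms[i] = new KeptTerm(copyFromSpare ? deepCopy(target) : target);
        }
        return new String[] {terms[0].value.utf8(), terms[1].value.utf8()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildTwoTerms(true)));
        System.out.println(Arrays.toString(buildTwoTerms(false)));
    }
}
```

With the spare and copy, the two kept terms stay apple and kiwi. Without the copy, the first term still points at the producer's buffer with its old length of 5, so after the second call it reads back as kiwie.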