[
https://issues.apache.org/jira/browse/LUCENE-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870219#comment-13870219
]
Robert Muir commented on LUCENE-5393:
-------------------------------------
Mikhail, I think it's fairly involved; there are two issues:
# ensuring client calls clone() when it has to save for later
# ensuring no lucene code tries to change any codec-private bytes
These are two different things. Currently the clone() stuff is never a problem,
but it will take some work to ensure everything stays functional:
* relaxation of some current tests (the current behavior is actually tested,
see BaseDocValuesFormatTestCase.testCodecUsesOwnBytes etc).
* fixing/testing the behavior of Sorted/SortedSet enums, as in some codecs
these are backed by the BinaryDocValues API as well
* adding tests for all consumers of these things.
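To make the first issue concrete, here is a minimal sketch of a consumer that must copy when it saves a value for later. It does not use the real Lucene API: ReusingSource, SaveForLater, and the fixed 16-byte scratch buffer are hypothetical stand-ins for a codec that hands out its own shared bytes instead of cloning them.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a codec that reuses one private buffer across calls. */
final class ReusingSource {
    private final byte[][] values;
    // Codec-private buffer shared by every get() call (assumes values of at most 16 bytes).
    private final byte[] scratch = new byte[16];
    private int length;

    ReusingSource(byte[][] values) {
        this.values = values;
    }

    /** Returns the shared buffer; its contents are only valid until the next get(). */
    byte[] get(int doc) {
        length = values[doc].length;
        System.arraycopy(values[doc], 0, scratch, 0, length);
        return scratch;
    }

    /** Number of valid bytes in the buffer returned by the last get(). */
    int length() {
        return length;
    }
}

final class SaveForLater {
    /** Copies the current value so it survives later get() calls. */
    static byte[] copyOf(ReusingSource src, int doc) {
        byte[] shared = src.get(doc);
        return Arrays.copyOf(shared, src.length());
    }
}
```

A caller that only compares or decodes the bytes right away can use get() directly; only a caller that keeps the value across another get() needs copyOf().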
The second part (changing bytes) is actually unrelated, and some user of Direct
or Memory or whatever can probably cause big trouble today already. But your
idea is really cool, e.g. on init the Asserting could take some checksum or
something and verify it on close. To do that, we should probably fix Asserting
to be able to wrap any codec (currently it is hardcoded). Or maybe Asserting
isn't the best place to do it, but I like the idea.
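A rough sketch of the checksum idea, under stated assumptions: ChecksumGuard is a hypothetical helper, not an existing Lucene class, and CRC32 is just one possible checksum. It records a checksum of the codec-private bytes on init and verifies it on close, so any consumer that modified the bytes in between is caught.

```java
import java.util.zip.CRC32;

/** Hypothetical guard: checksums codec-private bytes on init and verifies them on close. */
final class ChecksumGuard implements AutoCloseable {
    private final byte[] guarded;
    private final long expected;

    ChecksumGuard(byte[] guarded) {
        this.guarded = guarded;
        this.expected = checksum(guarded);
    }

    private static long checksum(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** True if the guarded bytes still match the checksum taken on init. */
    boolean isIntact() {
        return checksum(guarded) == expected;
    }

    /** Fails loudly if anything changed the guarded bytes since init. */
    @Override
    public void close() {
        if (!isIntact()) {
            throw new IllegalStateException("codec-private bytes were modified");
        }
    }
}
```

Wrapping the bytes once per reader keeps the cost out of the per-document path, at the price of only reporting that a modification happened, not which caller did it.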
In general, before I do this, I want to do a hack patch with luceneutil
faceting to get a more formal version of the benchmark you ran, so we know the
benefits.
> remove codec byte[] cloning in BinaryDocValues api
> --------------------------------------------------
>
> Key: LUCENE-5393
> URL: https://issues.apache.org/jira/browse/LUCENE-5393
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
>
> I can attack this (at least in trunk/5.0, we can discuss if/when it should
> happen for 4.x).
> See the mailing list for more discussion. This was done intentionally, to
> prevent lots of reuse bugs.
> The issue is very simple: lots of old fieldcache-type logic has this pattern,
> because things used to be immutable Strings or because it relies on things
> being in a large array:
> {code}
> byte[] b1 = get(doc1);
> byte[] b2 = get(doc2);
> // some code that expects b1 to be unchanged.
> {code}
> Currently each get() internally clones the bytes, for safety. But this is
> really bad for code like faceting (which is going to decompress integers and
> never needs to save bytes), and it's even stupid for things like
> fieldcomparator (where in general it's doing comparisons, and only rarely
> needs to save a copy of the bytes for later).
> I can address it with lots of tests (I have added a lot in general anyway
> since adding this TODO, but more would make me feel safer).