Re: Purging unused fields during merges

Erick Erickson Mon, 18 Jan 2016 12:34:04 -0800

If you look with Luke, for instance, the fields that no document
references any more are still reported as being in the index. They
serve no useful purpose and could theoretically be purged (I've seen
schema designs with many thousands of fields get redesigned, for
instance).


So it seems like at the point N segments are merged if no document in
those segments refers to a particular field, that field could be
omitted from the final segment, at least in theory.
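
To make the idea concrete, here is a toy sketch in Java. It is not
Lucene code: Segment, doc and mergeAndPrune are made-up names, and it
ignores field numbers, postings and the raw stored-field bytes that a
real merge has to deal with. It only shows the bookkeeping: the merged
segment declares just the fields that some surviving document uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: every type and method here is hypothetical, not a Lucene API.
public class FieldPruneSketch {

    /** A segment: the field names it declares, plus its live (non-deleted) docs. */
    public static final class Segment {
        public final Set<String> declaredFields;
        public final List<Map<String, String>> liveDocs; // field name -> value

        public Segment(Set<String> declaredFields, List<Map<String, String>> liveDocs) {
            this.declaredFields = declaredFields;
            this.liveDocs = liveDocs;
        }
    }

    /** Builds a document from alternating field names and values. */
    public static Map<String, String> doc(String... keyValues) {
        Map<String, String> d = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            d.put(keyValues[i], keyValues[i + 1]);
        }
        return d;
    }

    /** Merges segments, declaring only fields that some live document uses. */
    public static Segment mergeAndPrune(List<Segment> segments) {
        Set<String> used = new TreeSet<>();
        List<Map<String, String>> docs = new ArrayList<>();
        for (Segment s : segments) {
            for (Map<String, String> d : s.liveDocs) {
                used.addAll(d.keySet()); // a field survives only if a doc has it
                docs.add(d);
            }
        }
        // Declared-but-unused fields (dyn_1, dyn_2 below) are simply not carried over.
        return new Segment(used, docs);
    }

    public static void main(String[] args) {
        Segment a = new Segment(
                new TreeSet<>(Arrays.asList("id", "title", "dyn_1")),
                Arrays.asList(doc("id", "1", "title", "foo")));
        Segment b = new Segment(
                new TreeSet<>(Arrays.asList("id", "dyn_2")),
                Arrays.asList(doc("id", "2")));

        Segment merged = mergeAndPrune(Arrays.asList(a, b));
        System.out.println(merged.declaredFields); // prints [id, title]
    }
}
```

Whether the real merge code can afford that extra pass over the field
metadata is exactly the part I can't judge.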

Now, not knowing the segment merge code well (OK, not at all), I don't
know how _practical_ that is. And the last thing I'd advocate is
slowing down the normal merging process to handle a case where someone
made a mistake in the first place, when they can re-index the corpus
with the _right_ schema...


On Mon, Jan 18, 2016 at 12:01 PM, Adrien Grand <[email protected]> wrote:
> Hi Erick,
>
> It is not clear to me what remaining data you are looking to get rid of.
>
> You could purge unused field numbers by calling IndexWriter.addIndexes on a
> reader wrapper that removes unused fields. But this is not something that
> Lucene would do by itself. The fact that field numbers are reused is useful
> in order to be able to copy raw bytes when merging stored fields (otherwise
> you would have to decode/recode everything in order to remap field numbers).
>
> Maybe what you remember are these 2 issues that improved the memory usage of
> FieldInfos in the sparse case?
>  - https://issues.apache.org/jira/browse/LUCENE-6325
>  - https://issues.apache.org/jira/browse/LUCENE-6630
>
> On Mon, Jan 18, 2016 at 20:37, Erick Erickson <[email protected]> wrote:
>>
>> I _swear_ I've seen this go by before, but can't find it.
>>
>> Let's say I have removed _all_ documents from my index that mention a
>> particular field (dynamic in this case). Merging segments apparently
>> does not remove that data from the merged segment, and the extra
>> information survives restarts, optimizations and all that. I did get
>> the fields to disappear by chance when I had a single segment and
>> updated _all_ the docs in that segment without the extra fields, so
>> some quick tests will succeed if you're not careful.
>>
>> Usually, the answer is "who cares? An 'extra' field or two won't have
>> any effect". In this case, though, there are over 20K extra dynamic
>> fields that were added by mistake (which is why I like to remove
>> dynamic field definitions from Solr when possible).
>>
>> This is true on 5x.
>>
>> Ring any bells?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>

