[
https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268972#comment-15268972
]
Otis Gospodnetic commented on LUCENE-7253:
------------------------------------------
My take on this:
# sparse fields are indeed not an abuse case
# my understanding of what Robert is saying is that he agrees with 1), but that
the current implementation is not geared for 1), and that if the existing DV
code were just modified slightly to improve performance, it would not be the
right implementation
# Robert didn't actually mention -1 explicitly until David brought that up,
although we all know that Robert could always throw in his -1 in the end, after
a contributor has already spent hours or days making changes, just to have them
rejected... (but this is a general Lucene project problem that, I think,
nobody has actually tried solving directly because it'd be painful)
# Robert actually proposed "The correct solution is to have a more next/advance
type api geared at forward iteration rather than one that mimics an array. Then
nulls can be handled in typical ways in various situations (eg rle). It should
be possible esp that scoring is in order.", so my take is that if a contributor
did exactly what Robert wants then this could potentially be accepted
# I assume the "correct approach" involves more changes, more coding, and more
time. I assume it would be useful to make a simpler and maybe not acceptable
change first in order to get some numbers and see if it's even worth investing
time in the "correct approach"
# If the numbers look good then, because of a potential -1 from Robert, whoever
takes on this challenge would have to be very clear, before any additional dev
work, about what Robert wants, what he would -1, and what he would let in
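To make the "next/advance type api" from Robert's quote a bit more concrete, here is a minimal sketch of forward-only iteration over just the documents that have a value. Everything in it (the class name SparseNumericIterator, the NO_MORE_DOCS sentinel, the parallel-array layout) is an assumption made for illustration and is not existing Lucene DV code. It only shows why a consumer would touch one entry per document with a value instead of one entry per document.

```java
import java.util.Arrays;

/** Hypothetical forward-only iterator over the documents that have a value. */
final class SparseNumericIterator {
    /** Sentinel returned once the iterator is exhausted (assumed name). */
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds;  // sorted ascending, only docs that have a value
    private final long[] values; // values[i] belongs to docIds[i]
    private int pos = -1;        // -1 means "not started yet"

    SparseNumericIterator(int[] docIds, long[] values) {
        if (docIds.length != values.length) {
            throw new IllegalArgumentException("docIds and values must have the same length");
        }
        this.docIds = docIds;
        this.values = values;
    }

    /** Current doc ID, -1 before the first call, NO_MORE_DOCS when exhausted. */
    int docID() {
        if (pos < 0) {
            return -1;
        }
        return pos >= docIds.length ? NO_MORE_DOCS : docIds[pos];
    }

    /** Moves to the next doc that has a value; docs without one are never visited. */
    int nextDoc() {
        if (pos < docIds.length) {
            pos++;
        }
        return docID();
    }

    /** Moves to the first doc >= target that lies after the current position. */
    int advance(int target) {
        int from = pos + 1;
        if (from >= docIds.length) {
            pos = docIds.length;
            return NO_MORE_DOCS;
        }
        int idx = Arrays.binarySearch(docIds, from, docIds.length, target);
        pos = idx >= 0 ? idx : -idx - 1; // insertion point if target is absent
        return docID();
    }

    /** Value of the current doc; only valid while positioned on a real doc. */
    long longValue() {
        return values[pos];
    }
}
```

A merge loop over such an iterator would look like `for (int doc = it.nextDoc(); doc != SparseNumericIterator.NO_MORE_DOCS; doc = it.nextDoc())`, so its cost would follow the number of values rather than maxDoc.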
> Sparse data in doc values and segments merging
> -----------------------------------------------
>
> Key: LUCENE-7253
> URL: https://issues.apache.org/jira/browse/LUCENE-7253
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 5.5, 6.0
> Reporter: Pawel Rog
> Labels: performance
>
> Doc Values were optimized recently to efficiently store sparse data.
> Unfortunately there is still a big problem with Doc Values merges for sparse
> fields. In an index of 1 billion documents, it does not seem to matter
> whether all documents have a value for this field or only 1 document has a
> value. Segment merge time is the same in both cases. In most cases this is
> not a problem, but there are several cases in which one can expect to have
> many fields with sparse doc values.
> I can describe an example. During performance tests of a system with a large
> number of sparse fields I realized that Doc Values merges are a bottleneck. I
> had hundreds of different numeric fields. Each document contained only a small
> subset of all fields. The average document contained 5-7 different numeric
> values. As you can see, data was very sparse in these fields. It turned out
> that the ingestion process was CPU-bound. Most of the CPU time was spent in
> DocValues related methods (SingletonSortedNumericDocValues#setDocument,
> DocValuesConsumer$10$1#next, DocValuesConsumer#isSingleValued,
> DocValuesConsumer$4$1#setNext, ...) - mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing
> them with a smaller number of denser fields. This helped a lot but
> complicated field naming.
> I am not very familiar with the Doc Values source code, but I have a small
> suggestion for how to improve Doc Values merges for sparse fields. I realized
> that Doc Values producers and consumers use Iterators. Let's take the example
> of numeric Doc Values. Would it be possible to replace the Iterator which
> "travels" through all documents with an Iterator over the collection of
> non-empty values? Of course this would require storing an object (instead of
> a number) which contains the value and the document ID. Such an iterator could
> significantly improve merge time for sparse Doc Values fields. IMHO this won't
> cause big overhead for dense structures but it can be a game changer for
> sparse structures.
> This is what happens in NumericDocValuesWriter on flush:
> {code}
> dvConsumer.addNumericField(fieldInfo,
>     new Iterable<Number>() {
>       @Override
>       public Iterator<Number> iterator() {
>         return new NumericIterator(maxDoc, values, docsWithField);
>       }
>     });
> {code}
> Before this happens, during addValue, this loop is executed to fill holes:
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in
> NumericDocValuesWriter. I know pending is a PackedLongValues and it wouldn't
> be good to replace it with a different class (some kind of list) because this
> may break DV performance for dense fields. I hope someone can suggest
> interesting solutions for this problem :).
> It would be great if a discussion about sparse Doc Values merge performance
> could start here.
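For comparison with the hole-filling loop quoted above, here is a rough sketch of a buffer that records only (docID, value) pairs. The class SparseNumericBuffer and its methods are hypothetical and are not part of NumericDocValuesWriter. It uses plain arrays rather than PackedLongValues and does not address the dense-field performance concern raised in the description. The toDense method is included only to show that the hole-filled view can still be rebuilt from the sparse pairs.

```java
import java.util.Arrays;

/** Hypothetical write-side buffer that stores one entry per doc that has a value. */
final class SparseNumericBuffer {
    private int[] docIds = new int[8];
    private long[] values = new long[8];
    private int size = 0;

    /** Records a value; docs in between are simply skipped, no holes are filled. */
    void addValue(int docID, long value) {
        if (docID < 0 || (size > 0 && docID <= docIds[size - 1])) {
            throw new IllegalArgumentException("docIDs must be non-negative and strictly increasing");
        }
        if (size == docIds.length) {
            // Grow both parallel arrays together.
            docIds = Arrays.copyOf(docIds, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        docIds[size] = docID;
        values[size] = value;
        size++;
    }

    /** Number of buffered entries, which is the number of docs with a value. */
    int size() {
        return size;
    }

    /** Doc ID of the i-th buffered entry. */
    int docId(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("entry " + i);
        }
        return docIds[i];
    }

    /** Value of the i-th buffered entry. */
    long value(int i) {
        if (i < 0 || i >= size) {
            throw new IndexOutOfBoundsException("entry " + i);
        }
        return values[i];
    }

    /** Rebuilds the dense, hole-filled view: one slot per doc, 'missing' where absent. */
    long[] toDense(int maxDoc, long missing) {
        long[] dense = new long[maxDoc];
        Arrays.fill(dense, missing);
        for (int i = 0; i < size; i++) {
            if (docIds[i] >= maxDoc) {
                throw new IllegalArgumentException("docID " + docIds[i] + " >= maxDoc " + maxDoc);
            }
            dense[docIds[i]] = values[i];
        }
        return dense;
    }
}
```

With two values at docs 5 and 1000, this buffer holds 2 entries, while the hole-filled representation holds 1001 slots. Whether that trade-off is acceptable for dense fields is exactly the open question in the description.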