[ https://issues.apache.org/jira/browse/LUCENE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Otis Gospodnetic updated LUCENE-7253:
-------------------------------------
    Fix Version/s: master (7.0)

> Make sparse doc values and segments merging more efficient
> -----------------------------------------------------------
>
>                 Key: LUCENE-7253
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7253
>             Project: Lucene - Core
>          Issue Type: Improvement
>    Affects Versions: 5.5, 6.0
>            Reporter: Pawel Rog
>              Labels: performance
>             Fix For: master (7.0)
>
>
> Doc Values were recently optimized to store sparse data efficiently.
> Unfortunately, there is still a big problem with Doc Values merges for sparse
> fields. For an index with 1 billion documents, it does not seem to matter
> whether every document has a value for a field or only one document does:
> segment merge time is the same in both cases. In most cases this is not a
> problem, but there are several cases in which one can expect many fields
> with sparse doc values.
> I can describe an example. During performance tests of a system with a large
> number of sparse fields, I realized that Doc Values merges were a bottleneck.
> I had hundreds of different numeric fields. Each document contained only a
> small subset of all fields; the average document contained 5-7 different
> numeric values. As you can see, the data in these fields was very sparse. It
> turned out that the ingestion process was CPU-bound. Most of the CPU time was
> spent in DocValues-related methods
> (SingletonSortedNumericDocValues#setDocument, DocValuesConsumer$10$1#next,
> DocValuesConsumer#isSingleValued, DocValuesConsumer$4$1#setNext, ...),
> mostly during segment merging.
> Adrien Grand suggested reducing the number of sparse fields and replacing
> them with a smaller number of denser fields. This helped a lot but
> complicated field naming.
> I am not very familiar with the Doc Values source code, but I have a small
> suggestion for improving Doc Values merges for sparse fields. I noticed
> that Doc Values producers and consumers use Iterators.
> Let's take numeric Doc Values as an example. Would it be possible to replace
> the Iterator that "travels" through all documents with an Iterator over the
> collection of non-empty values? Of course, this would require storing an
> object (instead of a number) that contains the value and the document ID.
> Such an iterator could significantly improve merge time for sparse Doc
> Values fields. IMHO this won't cause much overhead for dense structures, but
> it could be a game changer for sparse structures.
> This is what happens in NumericDocValuesWriter on flush:
> {code}
> dvConsumer.addNumericField(fieldInfo,
>                            new Iterable<Number>() {
>                              @Override
>                              public Iterator<Number> iterator() {
>                                return new NumericIterator(maxDoc, values, docsWithField);
>                              }
>                            });
> {code}
> Before this happens, during addValue, this loop is executed to fill holes:
> {code}
> // Fill in any holes:
> for (int i = (int)pending.size(); i < docID; ++i) {
>   pending.add(MISSING);
> }
> {code}
> It turns out that the variable called pending is used only internally in
> NumericDocValuesWriter. I know pending is a PackedLongValues and it wouldn't
> be good to replace it with a different class (some kind of list), because
> this might break DV performance for dense fields. I hope someone can suggest
> interesting solutions to this problem :).
> It would be great if a discussion about sparse Doc Values merge performance
> could start here.
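
The iterator idea in the description can be illustrated with a rough sketch. This is not Lucene code and not a proposed patch: SparseNumericSketch, DocValue, denseIterator, sparseIterator and countSteps are hypothetical names invented for this example, and a plain java.util.List stands in for whatever storage a real implementation would use. It only contrasts an iterator that visits every document up to maxDoc with one that visits just the documents that have a value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical illustration only; none of these types exist in Lucene. */
final class SparseNumericSketch {

  /** A (docID, value) pair for a document that actually has a value. */
  static final class DocValue {
    final int docID;
    final long value;

    DocValue(int docID, long value) {
      this.docID = docID;
      this.value = value;
    }
  }

  /** Dense view: one step per document, null where the field is missing. */
  static Iterator<Long> denseIterator(final int maxDoc, final List<DocValue> sorted) {
    return new Iterator<Long>() {
      private int doc = 0; // next docID to emit
      private int pos = 0; // next unread entry in the docID-sorted list

      @Override
      public boolean hasNext() {
        return doc < maxDoc;
      }

      @Override
      public Long next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Long result = null;
        if (pos < sorted.size() && sorted.get(pos).docID == doc) {
          result = sorted.get(pos++).value;
        }
        doc++;
        return result;
      }
    };
  }

  /** Sparse view: one step per stored value, regardless of maxDoc. */
  static Iterator<DocValue> sparseIterator(List<DocValue> sorted) {
    return sorted.iterator();
  }

  /** Counts how many times next() must be called to exhaust an iterator. */
  static int countSteps(Iterator<?> it) {
    int steps = 0;
    while (it.hasNext()) {
      it.next();
      steps++;
    }
    return steps;
  }

  public static void main(String[] args) {
    // Two documents out of one million have a value for this field.
    List<DocValue> values = new ArrayList<>();
    values.add(new DocValue(3, 42L));
    values.add(new DocValue(999_999, 7L));
    int maxDoc = 1_000_000;

    System.out.println("dense steps:  " + countSteps(denseIterator(maxDoc, values)));
    System.out.println("sparse steps: " + countSteps(sparseIterator(values)));
  }
}
```

In this sketch the dense iterator takes maxDoc steps no matter how few documents have a value, while the sparse iterator takes one step per stored value. The cost is that each sparse entry has to carry its document ID, which is the storage trade-off mentioned in the description.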