Mike,

Each of my flushed segments is fully ordered by time, but TieredMergePolicy or
LogByteSizeMergePolicy will pick arbitrary segments to merge and disturb this
arrangement, and I wanted some kind of control over this.
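Roughly, what I have today is the TimeMergePolicy I described in my earlier
mail below. A minimal, untested sketch (4.x-era APIs, so signatures may differ
by version; "leastTime" is just a placeholder for whatever diagnostics key the
codec writes at flush):

    import java.io.IOException;

    import org.apache.lucene.index.LogMergePolicy;
    import org.apache.lucene.index.SegmentCommitInfo;

    public class TimeMergePolicy extends LogMergePolicy {

      public TimeMergePolicy() {
        minMergeSize = 0;               // same spirit as LogDocMergePolicy
        maxMergeSize = Long.MAX_VALUE;  // never exclude a segment by "size"
      }

      @Override
      protected long size(SegmentCommitInfo info) throws IOException {
        String leastTime = info.info.getDiagnostics().get("leastTime");
        if (leastTime == null) {
          return sizeBytes(info);       // fall back to plain byte size
        }
        // Newer segments (bigger timestamps) report a smaller "size", so the
        // LogMergePolicy levels end up grouping time-adjacent segments.
        return Long.MAX_VALUE - Long.parseLong(leastTime);
      }
    }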
But, like you pointed out, going only by time-adjacent merges can be
disastrous. Is there a way to mix both time and size to arrive at a somewhat
[less-than-accurate] global order of segment merges? For example, attempt a
time-adjacent merge only when the sizes of the segments involved are not
extremely skewed, etc...

--
Ravi

On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> You want to focus merging on the segments containing newer documents?
> Why? This seems somewhat dangerous...
>
> Not taking into account the "true" segment size can lead to very, very
> poor merge decisions... you should turn on IndexWriter's infoStream
> and do a long-running test to convince yourself the merging is being
> sane.
>
> Mike
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> <ravikumar.govindara...@gmail.com> wrote:
> > Thanks Mike,
> >
> > Will try your suggestion. I will try to describe the actual use-case
> > itself.
> >
> > There is a requirement for merging time-adjacent segments [append-only,
> > rolling time-series data].
> >
> > All documents have a timestamp affixed, and during flush I need to note
> > down the least timestamp across all documents, through the Codec.
> >
> > Then I define a TimeMergePolicy extends LogMergePolicy and define the
> > segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> >
> > LogMergePolicy will auto-arrange levels of segments according to time
> > and proceed with merges. The latest segments will be smaller in size and
> > preferred during merges over older, bigger segments.
> >
> > Do you think such an approach will be fine, or are there better ways to
> > solve this?
> >
> > --
> > Ravi
> >
> >
> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Somewhere in those numeric trie terms are the exact integers from your
> >> documents, encoded.
> >>
> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
> >> value back from the BytesRef term.
> >>
> >> But you need to filter out the "higher level" terms, e.g. using
> >> NumericUtils.getPrefixCodedLongShift(term) == 0. Or use
> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum. I believe
> >> all the terms you want come first, so once you hit a term where
> >> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
> >> checking.
> >>
> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> <ravikumar.govindara...@gmail.com> wrote:
> >> > I use a Codec to flush data. All methods delegate to the actual
> >> > Lucene42Codec, except for intercepting one single field. This field
> >> > is indexed as an IntField [Numeric-Trie], with precisionStep=4.
> >> >
> >> > The purpose of the Codec is as follows:
> >> >
> >> > 1. Note the first BytesRef for this field
> >> > 2. During the finish() call [TermsConsumer.java], note the last
> >> >    BytesRef for this field
> >> > 3. Convert both the first/last BytesRef to their respective integers
> >> > 4. Store these 2 ints in the segment-info diagnostics
> >> >
> >> > The problem with this approach is that the first/last BytesRef is
> >> > totally different from the actual "int" values I try to index.
> >> > I guess this is because Numeric-Trie expands all the integers into
> >> > its own format of BytesRefs. Hence my Codec stores the wrong values
> >> > in the segment diagnostics.
> >> >
> >> > Is there a way I can record the actual min/max int values correctly
> >> > in my codec and still support NumericRange search?
> >> >
> >> > --
> >> > Ravi
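P.S. Just to check that I am reading the NumericUtils suggestion correctly:
would a helper along these lines (untested sketch, 4.x method names, using the
int variants since the field is an IntField) recover the true min/max for the
diagnostics?

    import java.io.IOException;

    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    final class IntFieldMinMax {
      final int min, max;
      IntFieldMinMax(int min, int max) { this.min = min; this.max = max; }

      // Decode only the full-precision (shift == 0) trie terms; they sort
      // first, so the first one is the min and the last one before a shifted
      // term is the max. Returns null if the field has no terms.
      static IntFieldMinMax fromTerms(TermsEnum termsEnum) throws IOException {
        Integer min = null;
        int max = 0;
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          if (NumericUtils.getPrefixCodedIntShift(term) > 0) {
            break;  // only lower-precision trie terms remain
          }
          max = NumericUtils.prefixCodedToInt(term);
          if (min == null) {
            min = max;
          }
        }
        return min == null ? null : new IntFieldMinMax(min, max);
      }
    }

If that is right, the two ints would then go into the segment-info
diagnostics as strings, as in step 4 above.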