Hi

LogMP *always* picks adjacent segments together. Therefore, if you have
segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then
LogMP will pick e.g. S1-S4, S2-S4 or S2-S3. But it always picks adjacent
segments, in a row (i.e. it doesn't skip segments).

I guess what both Mike and I don't understand is why you insist on merging
based on the timestamp of each segment. That is, if the segments' order,
timestamp-wise, isn't as I described above, then merging them anyway won't
hurt - they will simply remain unsorted, so no harm is done.

Maybe MergePolicy isn't what you need here. If you can record the min/max
timestamp of each segment somewhere, you can use a MultiReader to wrap the
sorted list of IndexReaders (actually SegmentReaders). Then your "reader"
always traverses segments from new to old.
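
Something along these lines (just a rough sketch against the 4.x APIs;
maxTimestampOf() is a made-up helper that should return whatever timestamp
you recorded for each segment, e.g. in the segment diagnostics):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.Directory;

public class TimeOrderedReader {

  /** Wraps the segments of the index in a MultiReader, newest segment first. */
  public static IndexReader open(Directory dir) throws Exception {
    DirectoryReader dr = DirectoryReader.open(dir);
    List<AtomicReader> leaves = new ArrayList<AtomicReader>();
    for (AtomicReaderContext ctx : dr.leaves()) {
      leaves.add(ctx.reader());
    }
    // sort the segment readers by the timestamp you recorded, newest first
    Collections.sort(leaves, new Comparator<AtomicReader>() {
      @Override
      public int compare(AtomicReader a, AtomicReader b) {
        return Long.compare(maxTimestampOf(b), maxTimestampOf(a));
      }
    });
    // don't let the MultiReader close the shared leaves; close 'dr' yourself
    return new MultiReader(leaves.toArray(new IndexReader[leaves.size()]), false);
  }

  // Hypothetical helper: return the max timestamp you stored for this segment.
  private static long maxTimestampOf(AtomicReader reader) {
    throw new UnsupportedOperationException("read the timestamp you recorded");
  }
}

You then search an IndexSearcher over that MultiReader and stop collecting
once you've gone back far enough in time, as Mike described.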

If this approach won't address your issue, then you can merge based on
timestamps - there's nothing wrong with it. What Mike suggested is that
you benchmark your application with this merge policy for a long period of
time (a few hours/days, depending on your indexing rate), because what might
happen is that your merges are always unbalanced, and your indexing
performance will degrade because of the unbalanced amount of IO that happens
during the merges.
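
For that benchmark it helps to turn on IndexWriter's infoStream so you can
review every merge decision. Rough sketch (the Version constant and the
index path are just example values, adapt them to your setup):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.PrintStreamInfoStream;
import org.apache.lucene.util.Version;

public class InfoStreamExample {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
        new StandardAnalyzer(Version.LUCENE_46));
    // log merge (and flush) activity to stdout; grep the output for "merge"
    iwc.setInfoStream(new PrintStreamInfoStream(System.out));
    // iwc.setMergePolicy(...); // plug in the merge policy under test here
    IndexWriter writer = new IndexWriter(dir, iwc);
    // ... index as usual, then inspect the log for unbalanced merges ...
    writer.close();
  }
}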

Shai


On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> @Mike,
>
> I had suggested the same approach in one of my previous mails, where-by
> each segment records min/max timestamps in seg-info diagnostics and use it
> for merging adjacent segments.
>
> "Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. "
>
> But you have expressed reservations
>
> "This seems somewhat dangerous...
>
> Not taking into account the "true" segment size can lead to very very
> poor merge decisions ... you should turn on IndexWriter's infoStream
> and do a long running test to convince yourself the merging is being
> sane."
>
> Will merging be disastrous, if I choose a TimeMergePolicy? I will also test
> and verify, but it's always great to hear finer points from experts.
>
> @Shai,
>
> LogByteSizeMP categorizes "adjacency" by "size", whereas it would be better
> if "timestamp" is used in my case
>
> Sure, I need to wrap this in an SMP to make sure that the newly-created
> segment is also in sorted-order
>
> --
> Ravi
>
>
>
> On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera <ser...@gmail.com> wrote:
>
> > Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks
> adjacent
> > segments and SortingMP ensures the merged segment is also sorted.
> >
> > Shai
> >
> >
> > On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
> > ravikumar.govindara...@gmail.com> wrote:
> >
> > > Yes exactly as you have described.
> > >
> > > Ex: Consider Segment[S1,S2,S3 & S4] are in reverse-chronological order
> > and
> > > goes for a merge
> > >
> > > While SortingMergePolicy will correctly solve the merge-part, it does
> not
> > > however play any role in picking segments to merge right?
> > >
> > > SMP internally delegates to TieredMergePolicy, which might pick S1&S4
> to
> > > merge disturbing the global-order. Ideally only "adjacent" segments
> > should
> > > be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
> > >
> > > Can there be a better selection of segments to merge in this case, so
> as
> > to
> > > maintain a semblance of global-ordering?
> > >
> > > --
> > > Ravi
> > >
> > >
> > >
> > > On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > > > OK, I see (early termination).
> > > >
> > > > That's a challenge, because you really want the docs sorted backwards
> > > > from how they were added right?  And, e.g., merged and then searched
> > > > in "reverse segment order"?
> > > >
> > > > I think you should be able to do this w/ SortingMergePolicy?  And
> then
> > > > use a custom collector that stops after you've gone back enough in
> > > > time for a given search.
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > >
> > > > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> > > > <ravikumar.govindara...@gmail.com> wrote:
> > > > > Mike,
> > > > >
> > > > > All our queries need to be sorted by timestamp field, in descending
> > > order
> > > > > of time. [latest-first]
> > > > >
> > > > > Each segment is sorted in itself. But TieredMergePolicy picks
> > arbitrary
> > > > > segments and merges them [even with SortingMergePolicy etc...]. I
> am
> > > > trying
> > > > > to avoid this and see if an approximate global ordering of segments
> > [by
> > > > > time-stamp field] can be maintained via merge.
> > > > >
> > > > > Ex: TopN results will only examine recent 2-3 smaller segments
> > > > [best-case]
> > > > > and return, without examining older and bigger segments.
> > > > >
> > > > > I do not know the terminology, may be "Early Query Termination
> Across
> > > > > Segments" etc...?
> > > > >
> > > > > --
> > > > > Ravi
> > > > >
> > > > >
> > > > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > > > > luc...@mikemccandless.com> wrote:
> > > > >
> > > > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
> > total
> > > > >> order.
> > > > >>
> > > > >> Only TieredMergePolicy merges out-of-order segments.
> > > > >>
> > > > >> I don't understand why you need to encouraging merging of the more
> > > > >> recent (by your "time" field) segments...
> > > > >>
> > > > >> Mike McCandless
> > > > >>
> > > > >> http://blog.mikemccandless.com
> > > > >>
> > > > >>
> > > > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> > > > >> <ravikumar.govindara...@gmail.com> wrote:
> > > > >> > Mike,
> > > > >> >
> > > > >> > Each of my flushed segment is fully ordered by time. But
> > > > >> TieredMergePolicy
> > > > >> > or LogByteSizeMergePolicy is going to pick arbitrary
> time-segments
> > > and
> > > > >> > disturb this arrangement and I wanted some kind of control on
> > this.
> > > > >> >
> > > > >> > But like you pointed-out, going by only be time-adjacent merges
> > can
> > > be
> > > > >> > disastrous.
> > > > >> >
> > > > >> > Is there a way to mix both time and size to arrive at a somewhat
> > > > >> > [less-than-accurate] global order of segment merges.
> > > > >> >
> > > > >> > Like attempt a time-adjacent merge, provided size of segments is
> > not
> > > > >> > extremely skewed etc...
> > > > >> >
> > > > >> > --
> > > > >> > Ravi
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > > > >> > luc...@mikemccandless.com> wrote:
> > > > >> >
> > > > >> >> You want to focus merging on the segments containing newer
> > > documents?
> > > > >> >> Why?  This seems somewhat dangerous...
> > > > >> >>
> > > > >> >> Not taking into account the "true" segment size can lead to
> very
> > > very
> > > > >> >> poor merge decisions ... you should turn on IndexWriter's
> > > infoStream
> > > > >> >> and do a long running test to convince yourself the merging is
> > > being
> > > > >> >> sane.
> > > > >> >>
> > > > >> >> Mike
> > > > >> >>
> > > > >> >> Mike McCandless
> > > > >> >>
> > > > >> >> http://blog.mikemccandless.com
> > > > >> >>
> > > > >> >>
> > > > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> > > > >> >> <ravikumar.govindara...@gmail.com> wrote:
> > > > >> >> > Thanks Mike,
> > > > >> >> >
> > > > >> >> > Will try your suggestion. I will try to describe the actual
> > > > use-case
> > > > >> >> itself
> > > > >> >> >
> > > > >> >> > There is a requirement for merging time-adjacent segments
> > > > >> [append-only,
> > > > >> >> > rolling time-series data]
> > > > >> >> >
> > > > >> >> > All Documents have a timestamp affixed and during flush I
> need
> > to
> > > > note
> > > > >> >> down
> > > > >> >> > the least timestamp for all documents, through Codec.
> > > > >> >> >
> > > > >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and
> > > define
> > > > the
> > > > >> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> > > > >> >> >
> > > > >> >> > LogMergePolicy will auto-arrange levels of segments according
> > > time
> > > > and
> > > > >> >> > proceed with merges. Latest segments will be lesser in size
> and
> > > > >> preferred
> > > > >> >> > during merges than older and bigger segments
> > > > >> >> >
> > > > >> >> > Do you think such an approach will be fine or there are
> better
> > > > ways to
> > > > >> >> > solve this?
> > > > >> >> >
> > > > >> >> > --
> > > > >> >> > Ravi
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > > > >> >> > luc...@mikemccandless.com> wrote:
> > > > >> >> >
> > > > >> >> >> Somewhere in those numeric trie terms are the exact integers
> > > from
> > > > >> your
> > > > >> >> >> documents, encoded.
> > > > >> >> >>
> > > > >> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get
> the
> > > int
> > > > >> >> >> value back from the BytesRef term.
> > > > >> >> >>
> > > > >> >> >> But you need to filter out the "higher level" terms, e.g.
> > using
> > > > >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> > > > >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I
> > > > believe
> > > > >> >> >> all the terms you want come first, so once you hit a term
> > where
> > > > >> >> >> .getPrefixCodedLongShift is > 0, that's your max term and
> you
> > > can
> > > > >> stop
> > > > >> >> >> checking.
> > > > >> >> >>
> > > > >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved,
> so
> > > > that
> > > > >> >> >> you can e.g. pull your own TermsEnum and iterate the terms
> > > > yourself.
> > > > >> >> >>
> > > > >> >> >> Mike McCandless
> > > > >> >> >>
> > > > >> >> >> http://blog.mikemccandless.com
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> > > > >> >> >> <ravikumar.govindara...@gmail.com> wrote:
> > > > >> >> >> > I use a Codec to flush data. All methods delegate to
> actual
> > > > >> >> >> Lucene42Codec,
> > > > >> >> >> > except for intercepting one single-field. This field is
> > > indexed
> > > > as
> > > > >> an
> > > > >> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
> > > > >> >> >> >
> > > > >> >> >> > The purpose of the Codec is as follows
> > > > >> >> >> >
> > > > >> >> >> > 1. Note the first BytesRef for this field
> > > > >> >> >> > 2. During finish() call [TermsConsumer.java], note the
> last
> > > > >> BytesRef
> > > > >> >> for
> > > > >> >> >> > this field
> > > > >> >> >> > 3. Converts both the first/last BytesRef to respective
> > > integers
> > > > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > > > >> >> >> >
> > > > >> >> >> > The problem with this approach is that, first/last
> BytesRef
> > is
> > > > >> totally
> > > > >> >> >> > different from the actual "int" values I try to index. I
> > > guess,
> > > > >> this
> > > > >> >> is
> > > > >> >> >> > because Numeric-Trie explodes all the integers into it's
> own
> > > > >> format of
> > > > >> >> >> > BytesRefs. Hence my Codec stores the wrong values in
> > > > >> >> segment-diagnostics
> > > > >> >> >> >
> > > > >> >> >> > Is there a way I can record actual min/max int-values
> > > correctly
> > > > in
> > > > >> my
> > > > >> >> >> codec
> > > > >> >> >> > and still support NumericRange search?
> > > > >> >> >> >
> > > > >> >> >> > --
> > > > >> >> >> > Ravi
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > >
> > > >
> > > >
> > >
> >
>
