Re: Actual min and max-value of NumericField during codec flush

Ravikumar Govindarajan Wed, 12 Feb 2014 03:05:54 -0800

Mike,

All our queries need to be sorted by the timestamp field, in descending
order of time [latest-first].


Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
segments and merges them [even with SortingMergePolicy etc...]. I am trying
to avoid this and see if an approximate global ordering of segments [by
time-stamp field] can be maintained via merge.

Ex: A TopN query would examine only the 2-3 most recent, smaller segments
[best-case] and return, without examining the older and bigger segments.

I do not know the right terminology; maybe "Early Query Termination Across
Segments"?
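
The idea above can be sketched in plain Java, without any Lucene classes. This is only an illustration under stated assumptions: `TimeSegment`, `EarlyTerminationSketch`, `topN` and `segmentsVisited` are hypothetical names, each segment is assumed non-empty and already sorted latest-first, and segments are visited in descending order of their newest timestamp. Once the current top N is full and the next segment's newest timestamp cannot beat the oldest hit collected so far, the remaining (older) segments are skipped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a segment: non-empty, timestamps sorted latest-first. */
final class TimeSegment {
    final long[] timestampsDesc;
    TimeSegment(long... timestampsDesc) { this.timestampsDesc = timestampsDesc; }
    long maxTimestamp() { return timestampsDesc[0]; }
}

final class EarlyTerminationSketch {
    /** Number of segments opened by the last topN call (kept only for illustration). */
    static int segmentsVisited;

    static List<Long> topN(List<TimeSegment> segments, int n) {
        segmentsVisited = 0;
        if (n <= 0) return new ArrayList<>();

        // Visit the segments holding the newest data first.
        List<TimeSegment> ordered = new ArrayList<>(segments);
        ordered.sort((a, b) -> Long.compare(b.maxTimestamp(), a.maxTimestamp()));

        // Min-heap: the oldest timestamp of the current top N sits on top.
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (TimeSegment seg : ordered) {
            // Every remaining segment is at most this new, so none can contribute.
            if (heap.size() == n && seg.maxTimestamp() <= heap.peek()) break;
            segmentsVisited++;
            for (long ts : seg.timestampsDesc) {
                if (heap.size() < n) {
                    heap.add(ts);
                } else if (ts > heap.peek()) {
                    heap.poll();
                    heap.add(ts);
                } else {
                    break; // the rest of this segment is older still
                }
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

The early exit only pays off when segment time ranges overlap little, which is why the merge order matters here.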

--
Ravi


On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
[email protected]> wrote:

> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
> order.
>
> Only TieredMergePolicy merges out-of-order segments.
>
> I don't understand why you need to encourage merging of the more
> recent (by your "time" field) segments...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> <[email protected]> wrote:
> > Mike,
> >
> > Each of my flushed segments is fully ordered by time. But
> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary
> > time-segments and disturb this arrangement, and I wanted some kind of
> > control over this.
> >
> > But like you pointed out, going only by time-adjacent merges can be
> > disastrous.
> >
> > Is there a way to mix both time and size to arrive at a somewhat
> > [less-than-accurate] global order of segment merges?
> >
> > For example, attempt a time-adjacent merge, provided the sizes of the
> > segments are not extremely skewed, etc.
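
One possible reading of "time-adjacent, but not extremely skewed" is sketched below in plain Java. It is not a Lucene MergePolicy and makes no claim about how any existing policy works; `SegInfo`, `TimeAdjacentMergeSketch`, `pickMerge`, `mergeFactor` and `maxSkew` are hypothetical names chosen for illustration. Segments are ordered by their least timestamp, every window of `mergeFactor` neighbours is checked, windows whose largest segment exceeds `maxSkew` times the smallest are rejected, and the cheapest remaining window (by total bytes) is returned.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-segment summary: name, least timestamp, size in bytes. */
final class SegInfo {
    final String name;
    final long minTime;
    final long sizeBytes;
    SegInfo(String name, long minTime, long sizeBytes) {
        this.name = name;
        this.minTime = minTime;
        this.sizeBytes = sizeBytes;
    }
}

final class TimeAdjacentMergeSketch {
    /** Returns the chosen time-adjacent segments, or an empty list if no window qualifies. */
    static List<SegInfo> pickMerge(List<SegInfo> segments, int mergeFactor, double maxSkew) {
        if (mergeFactor < 2) throw new IllegalArgumentException("mergeFactor must be >= 2");

        // Order by time so that any contiguous window is time-adjacent.
        List<SegInfo> byTime = new ArrayList<>(segments);
        byTime.sort((a, b) -> Long.compare(a.minTime, b.minTime));

        List<SegInfo> best = new ArrayList<>();
        long bestTotal = Long.MAX_VALUE;
        for (int start = 0; start + mergeFactor <= byTime.size(); start++) {
            List<SegInfo> window = byTime.subList(start, start + mergeFactor);
            long min = Long.MAX_VALUE, max = 0, total = 0;
            for (SegInfo s : window) {
                min = Math.min(min, s.sizeBytes);
                max = Math.max(max, s.sizeBytes);
                total += s.sizeBytes;
            }
            // Size guard: reject windows where one segment dwarfs the others.
            boolean balanced = max <= maxSkew * Math.max(1L, min);
            if (balanced && total < bestTotal) {
                bestTotal = total;
                best = new ArrayList<>(window);
            }
        }
        return best;
    }
}
```

Whether such a rule merges sanely over a long run would still need to be checked, for instance with IndexWriter's infoStream as suggested earlier in the thread.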
> >
> > --
> > Ravi
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > [email protected]> wrote:
> >
> >> You want to focus merging on the segments containing newer documents?
> >> Why?  This seems somewhat dangerous...
> >>
> >> Not taking into account the "true" segment size can lead to very very
> >> poor merge decisions ... you should turn on IndexWriter's infoStream
> >> and do a long running test to convince yourself the merging is being
> >> sane.
> >>
> >> Mike
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> >> <[email protected]> wrote:
> >> > Thanks Mike,
> >> >
> >> > Will try your suggestion. I will try to describe the actual use-case
> >> > itself.
> >> >
> >> > There is a requirement for merging time-adjacent segments [append-only,
> >> > rolling time-series data].
> >> >
> >> > All documents have a timestamp affixed, and during flush I need to note
> >> > down the least timestamp across all documents, through the Codec.
> >> >
> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and define the
> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> >> >
> >> > LogMergePolicy will auto-arrange levels of segments according to time
> >> > and proceed with merges. The latest segments will be smaller in size and
> >> > preferred during merges over older and bigger segments.
> >> >
> >> > Do you think such an approach will be fine, or are there better ways to
> >> > solve this?
> >> >
> >> > --
> >> > Ravi
> >> >
> >> >
> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> >> > [email protected]> wrote:
> >> >
> >> >> Somewhere in those numeric trie terms are the exact integers from your
> >> >> documents, encoded.
> >> >>
> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
> >> >> value back from the BytesRef term.
> >> >>
> >> >> But you need to filter out the "higher level" terms, e.g. using
> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
> >> >> all the terms you want come first, so once you hit a term where
> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
> >> >> checking.
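
The filtering described above can be illustrated with a simplified model in plain Java. This is not Lucene's byte layout and does not call NumericUtils: `TrieTerm`, `TrieMinMaxSketch`, `index` and `minMax` are hypothetical names, and a term is modelled as just a shift plus the value's remaining sign-flipped high bits. The assumption carried over from the message is that terms sort by shift first, so all full-precision (shift == 0) terms come before the lower-precision ones. For an IntField, the Int variants of the NumericUtils helpers would presumably be the ones to use in real code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Simplified model of one trie term (NOT Lucene's actual encoding). */
final class TrieTerm implements Comparable<TrieTerm> {
    final int shift;   // number of low bits dropped
    final int prefix;  // sign-flipped value, shifted right by `shift`
    TrieTerm(int shift, int prefix) { this.shift = shift; this.prefix = prefix; }

    /** Exact only for shift == 0; lower-precision terms decode to a truncated value. */
    int decode() { return (prefix << shift) ^ 0x80000000; }

    @Override public int compareTo(TrieTerm o) {
        if (shift != o.shift) return Integer.compare(shift, o.shift);
        return Integer.compareUnsigned(prefix, o.prefix);
    }
}

final class TrieMinMaxSketch {
    /** Emits one term per precision level for every value, then sorts them. */
    static List<TrieTerm> index(int[] values, int precisionStep) {
        List<TrieTerm> terms = new ArrayList<>();
        for (int v : values) {
            int sortable = v ^ 0x80000000; // flip sign bit so unsigned order equals numeric order
            for (int shift = 0; shift < 32; shift += precisionStep) {
                terms.add(new TrieTerm(shift, sortable >>> shift));
            }
        }
        Collections.sort(terms);
        return terms;
    }

    /** Scans only the leading shift == 0 terms and returns {min, max}. */
    static int[] minMax(List<TrieTerm> sortedTerms) {
        int min = 0, max = 0;
        boolean seen = false;
        for (TrieTerm t : sortedTerms) {
            if (t.shift > 0) break; // first lower-precision term: stop
            int v = t.decode();
            if (!seen) { min = v; seen = true; }
            max = v;
        }
        if (!seen) throw new IllegalArgumentException("no full-precision terms");
        return new int[] {min, max};
    }
}
```

In this model the very last term of the field belongs to the coarsest precision level, which is why decoding it directly does not give the maximum indexed value.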
> >> >>
> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> >> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >>
> >> >>
> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> >> <[email protected]> wrote:
> >> >> > I use a Codec to flush data. All methods delegate to the actual
> >> >> > Lucene42Codec, except for intercepting one single field. This field
> >> >> > is indexed as an IntField [Numeric-Trie...], with precisionStep=4.
> >> >> >
> >> >> > The purpose of the Codec is as follows
> >> >> >
> >> >> > 1. Note the first BytesRef for this field
> >> >> > 2. During the finish() call [TermsConsumer.java], note the last
> >> >> >    BytesRef for this field
> >> >> > 3. Convert both the first/last BytesRef to their respective integers
> >> >> > 4. Store these 2 ints in segment-info diagnostics
> >> >> >
> >> >> > The problem with this approach is that the first/last BytesRef is
> >> >> > totally different from the actual "int" values I try to index. I
> >> >> > guess this is because Numeric-Trie explodes all the integers into
> >> >> > its own format of BytesRefs. Hence my Codec stores the wrong values
> >> >> > in segment-diagnostics.
> >> >> >
> >> >> > Is there a way I can record the actual min/max int-values correctly
> >> >> > in my codec and still support NumericRange search?
> >> >> >
> >> >> > --
> >> >> > Ravi
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: [email protected]
> >> >> For additional commands, e-mail: [email protected]
> >> >>
> >> >>
