Re: Merge information in segment files

Mikhail Khludnev Fri, 16 Nov 2012 04:44:06 -0800

Alan,

It might be an off-topic question, but why do you consider "all zero for new
data" to be the main case?
I have two counterexamples:
- I have click-through data for some product, but it was out of stock while
the index was being built, so it was skipped when the EFF was loaded.
Once it is back in the warehouse, we index this product but give it the default
click-through rank even though its rank is present in the file. Why?
- Another issue arises if the click-through rank is keyed not by product id
(primary key) but by brand or some other field. The problem is the same: you
know that D&G products are highly clicked, but you apply the default rank instead.


I agree with Michael McC that the core problem of wasting CPU & IO is the old
dispute about top-level data structures vs per-segment ones: FieldCache vs
UnInvertedField; DocSet vs CachingWrapperFilter; Solr vs
Lucene & ElasticSearch, etc. I hope sooner or later we will have an alternative
per-segment EFF impl and will be able to choose the trade-off for each
particular case.
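
To make the per-segment idea concrete, here is a minimal sketch. It is not
Lucene or Solr code: the class PerSegmentEffCache, its methods, and the
plain-Map stand-ins for segments and for the external file are all
hypothetical. It assumes a segment's contents never change once written, so a
cached array stays valid until the segment disappears, and it ignores
invalidation when the external file itself is rewritten.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-segment cache of external field values, keyed by segment name. */
class PerSegmentEffCache {
    private final Map<String, float[]> bySegment = new HashMap<>();
    private int loads = 0;

    /**
     * Syncs the cache with the current set of segments.
     *
     * @param segments     segment name -> external keys of its docs, in docid order (assumed input)
     * @param external     external key -> value, as read from the external file (assumed input)
     * @param defaultValue value used only when the key is absent from the file
     */
    void refresh(Map<String, List<String>> segments,
                 Map<String, Float> external,
                 float defaultValue) {
        // Drop entries for segments that no longer exist, e.g. merged away.
        bySegment.keySet().retainAll(segments.keySet());

        for (Map.Entry<String, List<String>> e : segments.entrySet()) {
            if (bySegment.containsKey(e.getKey())) {
                continue; // unchanged segment: keep its cached values
            }
            // New segment (flushed or merged): look every doc up by key,
            // so existing click data is not replaced by the default.
            List<String> keys = e.getValue();
            float[] values = new float[keys.size()];
            for (int i = 0; i < values.length; i++) {
                Float v = external.get(keys.get(i));
                values[i] = (v != null) ? v : defaultValue;
            }
            bySegment.put(e.getKey(), values);
            loads++;
        }
    }

    /** Value for a segment-local docid; the segment must have been seen by refresh(). */
    float value(String segment, int docId) {
        return bySegment.get(segment)[docId];
    }

    /** Number of segments loaded so far, to show how much work a refresh did. */
    int loadCount() {
        return loads;
    }
}
```

With something like this, a commit that only adds a small segment costs one
small load, a merge costs a load of the merged segment only, and a newly
indexed product still picks up its rank if the file already has one.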


On Fri, Nov 16, 2012 at 4:17 PM, Alan Woodward <
[email protected]> wrote:

> Do you think it's worth promoting to a first-class API?  Just a boolean -
> isMerged(), or something.
>
> On 16 Nov 2012, at 12:11, Michael McCandless wrote:
>
> > We do actually record this, in the segments "diagnostics" field ...
> > but that format is something that can suddenly "change" (ie it's not
> > an API w/ back compat).
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Fri, Nov 16, 2012 at 7:01 AM, Alan Woodward
> > <[email protected]> wrote:
> >> Hi all,
> >>
> >> Is there any way of finding out if a segment is the result of a merge,
> >> or if it's just new data?  I can't find anything in SegmentInfo that
> >> records this - if it isn't there, I'll open a JIRA.
> >>
> >> Here's the use case:  I need to reload ExternalFileField data when
> >> segments are merged, as the internal docids will all have changed,
> >> invalidating the EFF caches.  However, new segments can just use default
> >> values (the EFF is used to store things like click rates, which are all
> >> zero for new data).  At the moment, caches are refreshed after every
> >> commit.  But cache reloading is heavy - if we can restrict it to only
> >> reload after a merge, then we save a lot of wasted CPU and IO cycles.
> >>
> >> Thanks,
> >> Alan Woodward
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <[email protected]>
