Alan,

This might be an off-topic question, but why do you consider "all zero for new data" to be the major case? I have two counterexamples:

- I have a click-through rank for some product, but the product was out of stock when the index was built, so it was ignored while loading the EFF. After it is supplied to the warehouse, we index the product but give it the default click-through rank even though it is present in the file. Why?
- The other issue arises when the click-through rank is keyed not by product id (the primary key) but by brand or some other field. The problem is the same: you know that D&G products are highly clicked, but you apply the default rank instead.
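To make the first case concrete, here is a rough Java sketch. It is not Solr or Lucene code: the class, the method names and the "sku-1" key are all made up for illustration, and the external file is modelled as a plain in-memory map. It only shows the difference between defaulting every new document and checking the file before falling back to the default. The same lookup applies if the key is a brand rather than a product id.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; none of these names exist in Solr or Lucene. */
class ExternalRankSketch {
    static final float DEFAULT_RANK = 0.0f;

    /** Builds a cache from the external file, keeping only keys that are in the index right now. */
    static Map<String, Float> loadForIndexedKeys(Map<String, Float> externalFile, Set<String> indexedKeys) {
        Map<String, Float> cache = new HashMap<>();
        for (Map.Entry<String, Float> e : externalFile.entrySet()) {
            if (indexedKeys.contains(e.getKey())) {
                cache.put(e.getKey(), e.getValue());
            }
        }
        return cache;
    }

    /** "New data gets the default": consults only the cache built at load time. */
    static float rankFromCacheOnly(Map<String, Float> cache, String key) {
        return cache.getOrDefault(key, DEFAULT_RANK);
    }

    /** Alternative: a key missing from the cache is looked up in the file before defaulting. */
    static float rankWithFileFallback(Map<String, Float> cache, Map<String, Float> externalFile, String key) {
        Float cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        return externalFile.getOrDefault(key, DEFAULT_RANK);
    }

    public static void main(String[] args) {
        Map<String, Float> file = new HashMap<>();
        file.put("sku-1", 0.8f); // has click-through data, but was out of stock at load time

        Set<String> indexedAtLoadTime = new HashSet<>(); // sku-1 was not in the index yet
        Map<String, Float> cache = loadForIndexedKeys(file, indexedAtLoadTime);

        // sku-1 is indexed later, in a new segment, without a cache reload.
        System.out.println("cache only:    " + rankFromCacheOnly(cache, "sku-1"));
        System.out.println("file fallback: " + rankWithFileFallback(cache, file, "sku-1"));
    }
}
```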

I agree with Michael McC that the core problem of wasting CPU and IO is the old dispute about top-level data structures vs per-segment ones: FieldCache vs UnInvertedField, DocSet vs CachingWrapperFilter, Solr vs Lucene & ElasticSearch, etc. I hope that sooner or later we will have an alternative per-segment EFF implementation and will be able to choose the trade-off for a particular case.

On Fri, Nov 16, 2012 at 4:17 PM, Alan Woodward <[email protected]> wrote:

> Do you think it's worth promoting to a first-class API? Just a boolean -
> isMerged(), or something.
>
> On 16 Nov 2012, at 12:11, Michael McCandless wrote:
>
> > We do actually record this, in the segments "diagnostics" field ...
> > but that format is something that can suddenly "change" (ie it's not
> > an API w/ back compat).
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Fri, Nov 16, 2012 at 7:01 AM, Alan Woodward
> > <[email protected]> wrote:
> >> Hi all,
> >>
> >> Is there any way of finding out if a segment is the result of a merge,
> >> or if it's just new data? I can't find anything in SegmentInfo that
> >> records this - if it isn't there, I'll open a JIRA.
> >>
> >> Here's the use case: I need to reload ExternalFileField data when
> >> segments are merged, as the internal docids will all have changed,
> >> invalidating the EFF caches. However, new segments can just use default
> >> values (the EFF is used to store things like click rates, which are all
> >> zero for new data). At the moment, caches are refreshed after every
> >> commit. But cache reloading is heavy - if we can restrict it to only
> >> reload after a merge, then we save a lot of wasted CPU and IO cycles.
> >>
> >> Thanks,
> >> Alan Woodward
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
<[email protected]>
