On Mon, Aug 4, 2014 at 3:52 PM, Garrett Barton <[email protected]> wrote:
> So been messing with the MR style api a little more and I don't really like
> it. The difference of having multiple things running in the JVM vs
> independent turns out to be a reasonable enough difference that introduces
> a whole lot of 'well we could do...' talk.
>

I have been playing with an MR api and I agree that I do not like it. Plus it
would likely promote misuse.

>
> So instead how about a middle ground. I still like the concept of a
> BlurContext as it gives us an entry point to bail out of code using the
> existing AtomicBoolean jazz with the progress() method. We could provide a
> few types of BlurContexts, one that hard timed out after a set time, one
> that had a time limit per BlurIndex, and another that had a timeout for
> inactivity (think rt write Commands with no data coming in). We also get
> the counters, always nice, and I'd like to see us move to parameter
> retrieval like MR does with the context.getConf().getxxx() style vs the
> Object[]. Unless someone has a really good reason as to wanting to keep
> Object arrays??
>

I don't like a config get model to pass arguments because this is supposed to
be a low latency call. I think that we could do a simple map of arguments if
that is needed.

I'm trying to come up with a higher level api that would allow for more
complex calls, but would also handle the process, merge, merge model. I will
try and post to the wiki at some point today.

Aaron

> Thoughts?
>
> I can update the wiki to give a cleaner example if anyone thinks that's a
> good idea?
>
> ~Garrett
>
>
> On Fri, Aug 1, 2014 at 7:24 PM, Garrett Barton <[email protected]>
> wrote:
>
> > How about this?
> >
> > public abstract class Command<T1, T2> implements Serializable {
> >
> >   public abstract void mergeFinal(Iterable<T2> results, BlurContext<T2> context)
> >       throws IOException;
> >   public abstract void mergeLocal(Iterable<T1> results, BlurContext<T2> context)
> >       throws IOException;
> >   public abstract void processIndex(BlurIndex blurIndex, BlurContext<T1> context)
> >       throws IOException;
> >
> > }
> >
> > Where BlurContext<T> looks like:
> >
> > public abstract class BlurContext<T1> implements Serializable {
> >
> >   public abstract void write(T1 object) throws IOException;
> >   public abstract void progress();
> >   public abstract void incCounter(String counter);
> >   public abstract void setCounter(String counter, long num);
> >
> >   public abstract Object[] getArgs();
> >   public abstract void setArgs(Object[] args);
> > }
> >
> >
> > Probably looks really familiar.. :)
> >
> > By providing the Iterable interface our implementation behind the scenes
> > could be running through each call to processIndex, that way we don't have
> > to realize the full List<T1> like the current implementation does. It's a
> > step in the right direction, now real memory usage is contained within the
> > Command as opposed to message passing. It's not total streaming but we have
> > removed one complete copy of intermediate results from ram.
> >
> > I also like the BlurContext idea more and more, we might not know all the
> > things we want to expose as hooks (blockcache, tmp disk access,
> > blurConfig??) up front but this gives us an api compatible way to extend
> > that without junking the core interface.
> >
> > The one last thing was while talking with Aaron he mentioned maybe
> > separating what the shardserver does from the controller server. And this
> > is because it might give us more freedom to integrate with other bulk
> > processing/streaming engines which ideally will hit the shards directly and
> > not pull data back via the controllers.
> > I'm not sure how that would look yet, it's hard to get out of the
> > mindset that shards and controllers look the same api wise.
> >
> > Anyways, hopefully this will spawn more ideas!
> >
> >
> > On Thu, Jul 31, 2014 at 1:30 PM, Tim Williams <[email protected]>
> > wrote:
> >
> >> On Thu, Jul 31, 2014 at 12:55 PM, Aaron McCurry <[email protected]>
> >> wrote:
> >> > We could do that, however we likely would need a way to have the
> >> > implementation create an initial return object so that a merge could be
> >> > incremental.
> >> >
> >> > For example:
> >> >
> >> > Let's say that we are aggregating counts and we have a custom Counts
> >> > object and we are going to merge each Result as it finishes.
> >> >
> >> > public Counts merge(Counts existing, Result result) {
> >> >   Counts mergedCounts = new Counts();
> >> >   // Do some counting and merging of existing Counts.
> >> >   return mergedCounts;
> >> > }
> >> >
> >> > So we could do one of three things. We could allow existing to be null
> >> > if it's the first merge call or we could have a second method that
> >> > doesn't take an existing argument.
> >> >
> >> > public Counts initial(Result result) {
> >> >   ...
> >> > }
> >> >
> >> > The last option I see is to use varargs like:
> >> >
> >> > public Counts merge(Result result, Counts... existing) {
> >> >   Counts mergedCounts = new Counts();
> >> >   // Do some counting and merging of existing Counts.
> >> >   return mergedCounts;
> >> > }
> >> >
> >> > This is at least a little cleaner in that it's implied that existing
> >> > could be absent or null as well as allowing multiple items to be merged
> >> > at the same time.
> >> >
> >> > What do you think?
> >>
> >> Yeah, they feel kinda awkward... what about having the command hold
> >> its state internally, then merge(Result r) is asking to merge r onto
> >> itself?
> >>
> >> Thanks,
> >> --tim
> >>
> >
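
For reference, Tim's suggestion at the bottom of the thread (the command holds its own state and merge(Result r) folds r into it) could look roughly like the sketch below. This is only an illustration: Result, CountsCommand, and totals() are hypothetical stand-ins invented for the example, not Blur types. The point is that there is no "existing" argument, so neither the null-on-first-call case nor a separate initial() method is needed.

```java
import java.util.HashMap;
import java.util.Map;

public class StatefulMergeSketch {

  // Hypothetical stand-in for a per-shard result; not Blur's actual Result type.
  static final class Result {
    final Map<String, Long> counts;

    Result(Map<String, Long> counts) {
      this.counts = counts;
    }
  }

  // Hypothetical command that owns its running totals, so merge() needs
  // no "existing" parameter and no special handling for the first call.
  static final class CountsCommand {
    private final Map<String, Long> totals = new HashMap<>();

    // Fold one result into the command's own state.
    void merge(Result r) {
      for (Map.Entry<String, Long> e : r.counts.entrySet()) {
        totals.merge(e.getKey(), e.getValue(), Long::sum);
      }
    }

    Map<String, Long> totals() {
      return totals;
    }
  }

  public static void main(String[] args) {
    Map<String, Long> shard1 = new HashMap<>();
    shard1.put("a", 1L);
    shard1.put("b", 2L);
    Map<String, Long> shard2 = new HashMap<>();
    shard2.put("a", 2L);

    CountsCommand cmd = new CountsCommand();
    cmd.merge(new Result(shard1)); // first merge, nothing special required
    cmd.merge(new Result(shard2)); // later merges accumulate onto the same state

    System.out.println("a=" + cmd.totals().get("a") + ", b=" + cmd.totals().get("b"));
    // prints "a=3, b=2"
  }
}
```

The trade-off is that a stateful command is no longer safe to share across concurrent merges without synchronization, which the sketch does not attempt to address.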

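
The hard-timeout flavor of BlurContext that Garrett describes, where progress() is the bail-out point backed by an AtomicBoolean, might be sketched as follows. Everything here is an assumption made for illustration: HardTimeoutContext, AbortedException, cancel(), and the injectable clock are hypothetical names, and the sketch covers only the "hard timed out after a set time" variant, not the per-BlurIndex or inactivity variants.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.LongSupplier;

public class ContextTimeoutSketch {

  // Hypothetical exception used to unwind a command that should stop.
  static final class AbortedException extends RuntimeException {
    AbortedException(String message) {
      super(message);
    }
  }

  // Hypothetical hard-timeout context; not an actual Blur class.
  static final class HardTimeoutContext {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final LongSupplier clockMillis; // injectable so the sketch can be tested
    private final long deadlineMillis;

    HardTimeoutContext(long timeoutMillis, LongSupplier clockMillis) {
      this.clockMillis = clockMillis;
      this.deadlineMillis = clockMillis.getAsLong() + timeoutMillis;
    }

    // Lets another thread flip the flag to cancel the command.
    void cancel() {
      running.set(false);
    }

    // Commands call this periodically; it throws once the command
    // has been cancelled or the deadline has passed.
    void progress() {
      if (!running.get()) {
        throw new AbortedException("cancelled");
      }
      if (clockMillis.getAsLong() >= deadlineMillis) {
        running.set(false);
        throw new AbortedException("hard timeout exceeded");
      }
    }
  }

  public static void main(String[] args) {
    long[] now = {0L}; // fake clock for the demo
    HardTimeoutContext ctx = new HardTimeoutContext(100, () -> now[0]);

    ctx.progress(); // within the time limit, returns normally
    now[0] = 150;   // simulate time passing beyond the limit
    try {
      ctx.progress();
      System.out.println("still running");
    } catch (AbortedException e) {
      System.out.println("aborted: " + e.getMessage());
      // prints "aborted: hard timeout exceeded"
    }
  }
}
```

An inactivity timeout would likely need a separate watchdog that flips the flag when progress() has not been called for a while, since a stalled command never reaches the check itself.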