pig-user  

RE: MapReduceLauncher static fields

Michael Harris
Fri, 04 Apr 2008 12:09:24 -0700

Ben,

Thanks for getting back to me. Ideally the stats would be attached to a
dump/store command. I tried to hack together a solution by making those
fields non-static, making MapReduceLauncher serializable, adding a
getProgress method to MapReduceLauncher instances, and modifying the
update points to use a particular MapReduceLauncher instance rather than
the static calls they made before. Then I modified PigServer to have a
method getProgress(String alias):

        public double getProgress(String id) {
                ExecutionEngine ee = pigContext.getExecutionEngine();
                if (ee instanceof HExecutionEngine) {
                        HExecutionEngine he = (HExecutionEngine) ee;
                        LogicalPlan lp = aliases.get(id);
                        POMapreduce mapRed = (POMapreduce) he.getPhysicalOpTable().get(
                                        he.getPhysicalKey(lp.getRoot()));
                        return mapRed.getMapReduceLauncher().getProgress();
                }
                return -1;
        }

I have only spent a few hours with the Pig code, so I'm not sure this is
even correct, but it seems to work really well except in the case where
a set of queries needs several jobs to complete: the results are totally
inaccurate until it gets to the final job. It's not really a big deal,
since it's just an internal tool and my users can live with no status
updates, but I thought it would be a nice touch. I have looked at the
roadmap for Pig and see that querying for progress is on there. I just
wanted to make sure you guys think of my scenario (a thread-safe,
end-user-facing application) when you add it :)
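
To make the multi-job case concrete, here is a rough sketch of how
progress could be averaged over a whole job set instead of reflecting
only the job that is currently running. JobSetProgress and all of its
methods are hypothetical names, not Pig or Hadoop classes, and the
sketch assumes the number of jobs in the set is known up front and that
every job counts equally:

```java
/**
 * Hypothetical per-query progress tracker (not part of Pig).
 * One instance per submitted query, so concurrent queries share no state.
 */
final class JobSetProgress {
    private final int totalJobs;              // assumed known before the first job starts
    private int completedJobs = 0;
    private double currentJobProgress = 0.0;  // fraction of the running job, 0.0 to 1.0

    JobSetProgress(int totalJobs) {
        if (totalJobs <= 0) {
            throw new IllegalArgumentException("totalJobs must be positive");
        }
        this.totalJobs = totalJobs;
    }

    /** Called from the update points of the job that is currently running. */
    synchronized void updateCurrentJob(double fraction) {
        // Clamp so a bad report cannot push the total outside [0, 1].
        currentJobProgress = Math.max(0.0, Math.min(1.0, fraction));
    }

    /** Called once when the running job completes. */
    synchronized void jobFinished() {
        if (completedJobs < totalJobs) {
            completedJobs++;
        }
        currentJobProgress = 0.0;
    }

    /** Overall progress of the job set, with every job weighted equally. */
    synchronized double getProgress() {
        if (completedJobs >= totalJobs) {
            return 1.0;
        }
        return (completedJobs + currentJobProgress) / totalJobs;
    }
}
```

With four jobs, a first job that is half done reports 0.125 overall
rather than 0.5. Equal weighting is only a guess; jobs of very different
sizes would need per-job weights.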

-Michael

-----Original Message-----
From: Benjamin Reed [EMAIL PROTECTED] 
Sent: Friday, April 04, 2008 11:29 AM
To: pig-user@incubator.apache.org
Subject: Re: MapReduceLauncher static fields

The statistics are not updated in a thread-safe way. They are global
statistics, so they span all jobs, and since they aren't thread-safe
they may be wrong. Apart from the numbers, I think the rest should be
thread-safe, assuming that the underlying Hadoop code is thread-safe,
which it looks to be.

I would think that for your application the stats should really be
attached to an object that represents the store or dump call, right? (Or
at least be accessible through that object.)
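
As an illustration only, attaching the stats to a per-call object might
look something like the sketch below. QueryHandle and QueryRegistry are
made-up names, not existing Pig classes, and the sketch assumes the
caller picks a unique id for each store or dump call:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical handle for one store or dump call (not a Pig class). */
final class QueryHandle {
    // volatile: written by the launcher thread, read by web request threads.
    private volatile double progress = 0.0;

    void setProgress(double fraction) {
        progress = fraction;
    }

    double getProgress() {
        return progress;
    }
}

/** Hypothetical registry that maps a caller-chosen id to its handle. */
final class QueryRegistry {
    private final ConcurrentMap<String, QueryHandle> handles =
            new ConcurrentHashMap<>();

    /** Returns the handle for this id, creating it on first use. */
    QueryHandle register(String id) {
        return handles.computeIfAbsent(id, key -> new QueryHandle());
    }

    /** Returns -1 for an unknown id, like the getProgress method in this thread. */
    double getProgress(String id) {
        QueryHandle handle = handles.get(id);
        return handle == null ? -1 : handle.getProgress();
    }

    /** Drops the handle once the query's results have been collected. */
    void remove(String id) {
        handles.remove(id);
    }
}
```

Because each query writes only to its own handle, concurrent queries
cannot overwrite each other's numbers the way shared static fields can.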

ben

Michael Harris wrote:
> Hello,
>
> I have written a pig application that does a fixed set of queries
> on-demand through a web interface. I am trying to get the progress of
> the queries from the PigServer, but I have noticed that the source of
> the progress data is all static fields in the MapReduceLauncher.
> Clearly my webapp must be able to handle multiple concurrent pig
> queries (and be thread-safe) and I would like to report the progress
> of each individual query (job set) to the end user.  Do these static
> fields indicate that I would get the progress of multiple concurrent
> queries initiated by different PigServer instances? or would I get the
> overall progress of the MapReduceLauncher for all queries currently
> being executed?
>
> Thanks,
> Michael