Re: Lookup HashMap available within the Map

2008-11-30 Thread Shane Butler
Given the goal of shared data accessible across the Map instances,
can someone please explain some of the differences between using:
- setNumTasksToExecutePerJvm() and then having statically declared
data initialised in Mapper.configure(); and
- a MultithreadedMapRunner?

Regards,
Shane


On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote:
 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM.
  So, yes, you could access a static cache from Mapper.configure().

 Doug




Re: Lookup HashMap available within the Map

2008-11-30 Thread tim robertson
Hi Shane,

I can't explain that, but I can say that with 0.19.0 I am using
setNumTasksToExecutePerJvm(-1) and then initializing statically
declared data in the Map configure successfully now.  It really is
educated guesswork for the tuning parameters though - I am profiling
the app for memory usage locally and then determining by trial and
error how much additional memory I need for the node's Hadoop
framework activities, in order to set the -Xmx params and map tasks
per node for the different EC2 sizes.  A little dirty perhaps, but I
am still learning
(http://biodivertido.blogspot.com/2008/11/reproducing-spatial-joins-using-hadoop.html).
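For anyone following along, the JVM-reuse setup Tim describes would look roughly like this against the 0.19-era org.apache.hadoop.mapred API (a sketch only; the MyJob class name and the -Xmx value are placeholders, not from the thread):

```java
// Sketch: configuring JVM reuse for a job (Hadoop 0.19, old mapred API).
import org.apache.hadoop.mapred.JobConf;

public class MyJob {
  public static JobConf configureJob() {
    JobConf conf = new JobConf(MyJob.class);
    // -1 means "no limit": every task of this job scheduled on a node
    // reuses that node's task JVM, so a static cache survives across tasks.
    conf.setNumTasksToExecutePerJvm(-1);
    // Heap sized by profiling locally, plus headroom for the framework.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    return conf;
  }
}
```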

I'm interested to know when one would use a MultithreadedMapRunner also.

Cheers

Tim

On Sun, Nov 30, 2008 at 11:22 PM, Shane Butler [EMAIL PROTECTED] wrote:
 Given the goal of shared data accessible across the Map instances,
 can someone please explain some of the differences between using:
 - setNumTasksToExecutePerJvm() and then having statically declared
 data initialised in Mapper.configure(); and
 - a MultithreadedMapRunner?

 Regards,
 Shane


 On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote:
 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM.
  So, yes, you could access a static cache from Mapper.configure().

 Doug





Re: Lookup HashMap available within the Map

2008-11-28 Thread Saptarshi Guha
The more I use it, the more I realize Hadoop is not built around shared
memory.  For this type of thing, use TSpaces (IBM); that way you can
have a flag to load it once and allow for sharing.
Regards
Saptarshi


On Tue, Nov 25, 2008 at 3:42 PM, Chris K Wensel [EMAIL PROTECTED] wrote:
 cool. If you need a hand with Cascading stuff, feel free to ping me on the
 mail list or #cascading irc. lots of other friendly folk there already.

 ckw

 On Nov 25, 2008, at 12:35 PM, tim robertson wrote:

 Thanks Chris,

 I have a different test running, then will implement that.  Might give
 cascading a shot for what I am doing.

 Cheers

 Tim


 On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote:

 Hey Tim

 The .configure() method is what you are looking for, I believe.

 It is called once per task, which in the default case is once per jvm.

 Note Jobs are broken into parallel tasks; each task handles a portion of
 the input data.  So while you may create your map 100 times, because
 there are 100 tasks, it will only be created once per jvm.

 I hope this makes sense.

 chris

 On Nov 25, 2008, at 11:46 AM, tim robertson wrote:

 Hi Doug,

 Thanks - it is not so much I want to run in a single JVM - I do want a
 bunch of machines doing the work, it is just I want them all to have
 this in-memory lookup index, that is configured once per job.  Is
 there some hook somewhere that I can trigger a read from the
 distributed cache, or is a Mapper.configure() the best place for this?
 Can it be called multiple times per Job meaning I need to keep some
 static synchronised indicator flag?

 Thanks again,

 Tim


 On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED]
 wrote:

 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single
 JVM.
 So, yes, you could access a static cache from Mapper.configure().

 Doug



 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/







-- 
Saptarshi Guha - [EMAIL PROTECTED]


Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi all,

If I want to have an in-memory lookup HashMap that is available in
my Map class, where is the best place to initialise this please?

I have a shapefile with polygons, and I wish to create the polygon
objects in memory on each node's JVM and have the map able to pull
back the objects by id from some HashMap<Integer, Geometry>.

Is perhaps the best way to just have a static initialiser that is
synchronised so that it only gets run once and called during
Map.configure()?  This feels a little dirty.
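The "dirty" pattern described above can be sketched in plain Java, independent of the Hadoop API (the shapefile parsing is replaced by a stub, and all names here are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a lazily built, shared lookup map, guarded so it is only
// built once per JVM no matter how many tasks ask for it.
public class PolygonLookup {
  private static Map<Integer, String> lookup;  // stands in for HashMap<Integer, Geometry>
  private static int buildCount = 0;           // just to observe how often we build

  // Called from each task's configure(); synchronised so the first
  // caller builds the map and later callers reuse it.
  public static synchronized Map<Integer, String> get() {
    if (lookup == null) {
      lookup = new HashMap<>();
      lookup.put(1, "polygon-1");  // in the real case: parse the shapefile here
      lookup.put(2, "polygon-2");
      buildCount++;
    }
    return lookup;
  }

  public static synchronized int timesBuilt() { return buildCount; }
}
```

With JVM reuse enabled, repeated calls to get() from successive tasks in the same JVM build the map only the first time.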

Thanks for advice on this,

Tim


Re: Lookup HashMap available within the Map

2008-11-25 Thread Alex Loddengaard
You should use the DistributedCache:

http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/


and


http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache


Hope this helps!

Alex
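The DistributedCache flow those links describe looks roughly like this on the 0.19-era API (a sketch only; the file path and class name are placeholders, and the actual shapefile parsing is left out):

```java
// Sketch: shipping a file to every task node via DistributedCache
// (Hadoop 0.19, old mapred API).
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
  // At job setup: register an HDFS file to be copied to each task node.
  public static void setup(JobConf conf) {
    DistributedCache.addCacheFile(URI.create("/data/polygons.shp"), conf);
  }

  // In Mapper.configure(): locate the node-local copy, then parse it.
  public static Path findLocalCopy(JobConf conf) throws IOException {
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    return localFiles[0];  // read and parse the shapefile from this local path
  }
}
```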

On Tue, Nov 25, 2008 at 11:09 AM, tim robertson
[EMAIL PROTECTED]wrote:

 Hi all,

 If I want to have an in memory lookup Hashmap that is available in
 my Map class, where is the best place to initialise this please?

 I have a shapefile with polygons, and I wish to create the polygon
 objects in memory on each node's JVM and have the map able to pull
 back the objects by id from some HashMap<Integer, Geometry>.

 Is perhaps the best way to just have a static initialiser that is
 synchronised so that it only gets run once and called during the
 Map.configure() ?   This feels a little dirty.

 Thanks for advice on this,

 Tim



Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi

Thanks Alex - this will allow me to share the shapefile, but I need to
read it, parse it and store the objects in the index only once per job
per jvm.
Is Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?

Thanks

Tim


On Tue, Nov 25, 2008 at 8:12 PM, Alex Loddengaard [EMAIL PROTECTED] wrote:
 You should use the DistributedCache:
 
 http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/


 and

 
 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache


 Hope this helps!

 Alex

 On Tue, Nov 25, 2008 at 11:09 AM, tim robertson
 [EMAIL PROTECTED]wrote:

 Hi all,

 If I want to have an in memory lookup Hashmap that is available in
 my Map class, where is the best place to initialise this please?

 I have a shapefile with polygons, and I wish to create the polygon
 objects in memory on each node's JVM and have the map able to pull
 back the objects by id from some HashMap<Integer, Geometry>.

 Is perhaps the best way to just have a static initialiser that is
 synchronised so that it only gets run once and called during the
 Map.configure() ?   This feels a little dirty.

 Thanks for advice on this,

 Tim




Re: Lookup HashMap available within the Map

2008-11-25 Thread Doug Cutting

tim robertson wrote:

Thanks Alex - this will allow me to share the shapefile, but I need to
one time only per job per jvm read it, parse it and store the
objects in the index.
Is the Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?


In 0.19, with HADOOP-249, all tasks from a job can be run in a single 
JVM.  So, yes, you could access a static cache from Mapper.configure().


Doug



Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi Doug,

Thanks - it is not so much that I want to run in a single JVM - I do
want a bunch of machines doing the work, it is just that I want them
all to have this in-memory lookup index, configured once per job.  Is
there some hook somewhere that I can trigger a read from the
distributed cache, or is Mapper.configure() the best place for this?
Can it be called multiple times per job, meaning I need to keep some
static synchronised indicator flag?

Thanks again,

Tim


On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:
 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM.
  So, yes, you could access a static cache from Mapper.configure().

 Doug




Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Thanks Chris,

I have a different test running, then will implement that.  Might give
cascading a shot for what I am doing.

Cheers

Tim


On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote:
 Hey Tim

 The .configure() method is what you are looking for, I believe.

 It is called once per task, which in the default case is once per jvm.

 Note Jobs are broken into parallel tasks; each task handles a portion of the
 input data.  So while you may create your map 100 times, because there are
 100 tasks, it will only be created once per jvm.

 I hope this makes sense.

 chris

 On Nov 25, 2008, at 11:46 AM, tim robertson wrote:

 Hi Doug,

 Thanks - it is not so much I want to run in a single JVM - I do want a
 bunch of machines doing the work, it is just I want them all to have
 this in-memory lookup index, that is configured once per job.  Is
 there some hook somewhere that I can trigger a read from the
 distributed cache, or is a Mapper.configure() the best place for this?
 Can it be called multiple times per Job meaning I need to keep some
 static synchronised indicator flag?

 Thanks again,

 Tim


 On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:

 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single
 JVM.
 So, yes, you could access a static cache from Mapper.configure().

 Doug



 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/




Re: Lookup HashMap available within the Map

2008-11-25 Thread Chris K Wensel
Cool. If you need a hand with Cascading stuff, feel free to ping me on
the mailing list or #cascading IRC. Lots of other friendly folk there
already.


ckw

On Nov 25, 2008, at 12:35 PM, tim robertson wrote:


Thanks Chris,

I have a different test running, then will implement that.  Might give
cascading a shot for what I am doing.

Cheers

Tim


On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote:

Hey Tim

The .configure() method is what you are looking for, I believe.

It is called once per task, which in the default case is once per jvm.

Note Jobs are broken into parallel tasks; each task handles a portion of
the input data.  So while you may create your map 100 times, because
there are 100 tasks, it will only be created once per jvm.

I hope this makes sense.

chris

On Nov 25, 2008, at 11:46 AM, tim robertson wrote:

Hi Doug,

Thanks - it is not so much that I want to run in a single JVM - I do
want a bunch of machines doing the work, it is just that I want them
all to have this in-memory lookup index, configured once per job.  Is
there some hook somewhere that I can trigger a read from the
distributed cache, or is Mapper.configure() the best place for this?
Can it be called multiple times per job, meaning I need to keep some
static synchronised indicator flag?

Thanks again,

Tim

On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:

tim robertson wrote:

Thanks Alex - this will allow me to share the shapefile, but I need to
read it, parse it and store the objects in the index only once per job
per jvm.
Is Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?

In 0.19, with HADOOP-249, all tasks from a job can be run in a single
JVM.  So, yes, you could access a static cache from Mapper.configure().

Doug




--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/