Re: Lookup HashMap available within the Map
Given the goal of shared data accessible across the Map instances, can someone please explain some of the differences between using:

- setNumTasksToExecutePerJvm() and then having statically declared data initialised in Mapper.configure(); and
- a MultithreadedMapRunner?

Regards, Shane

On Wed, Nov 26, 2008 at 6:41 AM, Doug Cutting [EMAIL PROTECTED] wrote:
> In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug
Re: Lookup HashMap available within the Map
Hi Shane, I can't explain that, but I can say that with 0.19.0 I am successfully using setNumTasksToExecutePerJvm(-1) and then initialising statically declared data in the Map configure(). It really is educated guesswork for the tuning parameters though - I am profiling the app for memory usage locally and then determining by trial and error how much extra I need for the node's Hadoop framework activities, in order to set the -Xmx params and map tasks per node for the different EC2 sizes. A little dirty perhaps, but I am still learning (http://biodivertido.blogspot.com/2008/11/reproducing-spatial-joins-using-hadoop.html). I'm also interested to know when one would use a MultithreadedMapRunner. Cheers, Tim

On Sun, Nov 30, 2008 at 11:22 PM, Shane Butler [EMAIL PROTECTED] wrote:
> Given the goal of shared data accessible across the Map instances, can someone please explain some of the differences between using setNumTasksToExecutePerJvm() with statically declared data initialised in Mapper.configure(), and a MultithreadedMapRunner?
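[One practical difference, as I understand it (this is my reading, not stated in the thread): with setNumTasksToExecutePerJvm(-1) the tasks reusing a JVM run one after another, whereas MultithreadedMapRunner invokes map() from several threads inside a single task, so a shared lookup structure must tolerate concurrent readers and a naive initialiser could run twice. A plain-Java sketch of a guard that stays correct either way - the class name and cached payload are hypothetical, and in a real mapper configure() would call SharedCache.get():]

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCache {
    // Counts how many times the expensive load actually ran.
    static final AtomicInteger loads = new AtomicInteger();

    // volatile so the double-checked locking below is safe on Java 5+.
    private static volatile Map<Integer, String> cache;

    // Safe whether configure() is called once per JVM (sequential task
    // reuse) or concurrently from MultithreadedMapRunner's map threads.
    static Map<Integer, String> get() {
        Map<Integer, String> local = cache;
        if (local == null) {
            synchronized (SharedCache.class) {
                if (cache == null) {
                    Map<Integer, String> m = new HashMap<Integer, String>();
                    m.put(42, "some parsed object");  // hypothetical payload
                    loads.incrementAndGet();
                    // Publish read-only, so concurrent map threads cannot mutate it.
                    cache = Collections.unmodifiableMap(m);
                }
            }
            local = cache;
        }
        return local;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable r = new Runnable() { public void run() { get(); } };
        Thread t1 = new Thread(r), t2 = new Thread(r);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(loads.get());  // the load ran exactly once
    }
}
```

[The unmodifiable wrapper is the important part for the multithreaded case: the map threads only ever read, so no further locking is needed after publication.]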
Re: Lookup HashMap available within the Map
The more I use it, the more I realize Hadoop is not built around shared memory. For these types of things, use TSpaces (IBM); that way you can have a flag to load it once and allow for sharing. Regards, Saptarshi

On Tue, Nov 25, 2008 at 3:42 PM, Chris K Wensel [EMAIL PROTECTED] wrote:
> cool. If you need a hand with Cascading stuff, feel free to ping me on the mail list or #cascading irc. Lots of other friendly folk there already. ckw

-- Saptarshi Guha - [EMAIL PROTECTED]
Lookup HashMap available within the Map
Hi all, If I want to have an in-memory lookup HashMap that is available in my Map class, where is the best place to initialise this please? I have a shapefile with polygons, and I wish to create the polygon objects in memory on each node's JVM and have the map able to pull back the objects by id from some HashMap<Integer, Geometry>. Is perhaps the best way to just have a static initialiser that is synchronised so that it only gets run once, and call it during the Map.configure()? This feels a little dirty. Thanks for advice on this, Tim
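[The "synchronised static initialiser" pattern Tim describes can be sketched in plain Java. The loadPolygons() parser below is a hypothetical stand-in for the shapefile parse, and String stands in for the Geometry type; in a real mapper, Mapper.configure() would call ensureLoaded():]

```java
import java.util.HashMap;
import java.util.Map;

public class PolygonLookup {
    // Shared across all map tasks running in this JVM.
    private static Map<Integer, String> index;

    // Called from Mapper.configure(); only the first call does the work,
    // later calls return the already-built index.
    public static synchronized Map<Integer, String> ensureLoaded() {
        if (index == null) {
            index = loadPolygons();
        }
        return index;
    }

    // Hypothetical stand-in for reading and parsing the shapefile.
    private static Map<Integer, String> loadPolygons() {
        Map<Integer, String> m = new HashMap<Integer, String>();
        m.put(1, "POLYGON((0 0, 1 0, 1 1, 0 0))");
        return m;
    }

    public static void main(String[] args) {
        Map<Integer, String> a = ensureLoaded();
        Map<Integer, String> b = ensureLoaded();  // no re-parse
        System.out.println(a == b);               // same instance both times
        System.out.println(a.get(1));
    }
}
```

[Because the whole method is synchronized, two tasks entering configure() at the same moment cannot both run the parse; the second blocks and then sees the populated index.]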
Re: Lookup HashMap available within the Map
You should use the DistributedCache: http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/ and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache Hope this helps! Alex

On Tue, Nov 25, 2008 at 11:09 AM, tim robertson [EMAIL PROTECTED] wrote:
> If I want to have an in-memory lookup HashMap that is available in my Map class, where is the best place to initialise this please?
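[DistributedCache only ships the file to each node; the parse result still has to be cached so it is not rebuilt for every task. A plain-Java sketch of that "parse the localized file once per JVM" half - the tab-separated "id<TAB>wkt" format and the class name are made up for illustration, and in Hadoop the local path would come from DistributedCache.getLocalCacheFiles() inside configure():]

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CachedShapefile {
    private static Map<Integer, String> index;

    // Parse "id<TAB>wkt" lines once; every later call in this JVM
    // returns the already-parsed map without touching the file again.
    static synchronized Map<Integer, String> load(String localPath) throws IOException {
        if (index == null) {
            Map<Integer, String> m = new HashMap<Integer, String>();
            BufferedReader in = new BufferedReader(new FileReader(localPath));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    m.put(Integer.valueOf(parts[0]), parts[1]);
                }
            } finally {
                in.close();
            }
            index = m;
        }
        return index;
    }
}
```

[So the cache article's approach and the static-initialiser approach compose: DistributedCache gets the bytes onto the node, and the guard above makes sure each JVM pays the parsing cost only once.]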
Re: Lookup HashMap available within the Map
Hi, Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? Thanks, Tim

On Tue, Nov 25, 2008 at 8:12 PM, Alex Loddengaard [EMAIL PROTECTED] wrote:
> You should use the DistributedCache: http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/ and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache
Re: Lookup HashMap available within the Map
tim robertson wrote:
> Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job?

In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug
Re: Lookup HashMap available within the Map
Hi Doug, Thanks - it is not so much that I want to run in a single JVM - I do want a bunch of machines doing the work; it is just that I want them all to have this in-memory lookup index, configured once per job. Is there some hook somewhere where I can trigger a read from the distributed cache, or is Mapper.configure() the best place for this? Can it be called multiple times per job, meaning I need to keep some static synchronised indicator flag? Thanks again, Tim

On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:
> In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug
Re: Lookup HashMap available within the Map
Thanks Chris, I have a different test running, then will implement that. Might give Cascading a shot for what I am doing. Cheers, Tim

On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote:
> Hey Tim, the .configure() method is what you are looking for, I believe. It is called once per task, which in the default case is once per JVM. Note jobs are broken into parallel tasks, and each task handles a portion of the input data. So you may create your map 100 times because there are 100 tasks, but it will only be created once per JVM. I hope this makes sense. chris

-- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Lookup HashMap available within the Map
cool. If you need a hand with Cascading stuff, feel free to ping me on the mail list or #cascading irc. Lots of other friendly folk there already. ckw

On Nov 25, 2008, at 12:35 PM, tim robertson wrote:
> Thanks Chris, I have a different test running, then will implement that. Might give Cascading a shot for what I am doing.

-- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/