FW: NullPointer during startup debugging DN

2012-02-29 Thread Evert Lammerts
Cross-posting with common-user, since there's been very little activity on hdfs-user 
these last few days.

Evert


 Hi list,

 I'm having trouble starting up a DN (0.20.2) with Kerberos
 authentication and SSL enabled - I'm getting a NullPointerException
 during startup and the daemon exits. It's a bit hard to debug this
 problem; I have no idea how I'd do this from within Eclipse, for example.
 Can I do this with jdb?
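
 For the debugging itself, one option that should work (a sketch, assuming the
 stock hadoop-env.sh of 0.20.x) is to let the DN JVM listen for a debugger and
 then attach jdb, or Eclipse's remote-debug configuration, to it:

 # hadoop-env.sh on the datanode: have the JVM wait for a debugger on port 8000
 export HADOOP_DATANODE_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000 $HADOOP_DATANODE_OPTS"

 # from the debugging workstation:
 jdb -attach p-worker02.alley.sara.nl:8000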

 Some background (also see relevant snippets from debug output, hdfs-
 site, core-site and ssl-server attached):

 12/02/21 11:24:11 DEBUG security.Krb5AndCertsSslSocketConnector:
 useKerb = false, useCerts = true jetty.ssl.password :
 jetty.ssl.keypassword : 12/02/21 11:24:11 INFO mortbay.log: jetty-
 6.1.26.cloudera.1
 12/02/21 11:24:11 INFO mortbay.log: Started SelectChannelConnector@p-
 worker02.alley.sara.nl:1006
 12/02/21 11:24:11 DEBUG security.Krb5AndCertsSslSocketConnector:
 Creating new KrbServerSocket for: p-worker02.alley.sara.nl
 12/02/21 11:24:11 WARN mortbay.log: java.lang.NullPointerException
 12/02/21 11:24:11 WARN mortbay.log: failed
 krb5andcertssslsocketconnec...@p-worker02.alley.sara.nl:50475:
 java.io.IOException: !JsseListener: java.lang.NullPointerException

 I'm a bit surprised that useKrb is set to false in
 Krb5AndCertsSslSocketConnector, but looking at
 org.apache.hadoop.hdfs.server.datanode.DataNode I see that it calls
 this.infoServer.addSslListener(secInfoSocAddr, sslConf,
 needClientAuth), which sets needKrbAuth to false. I guess this is on
 purpose then, and the values in core-site are just ignored here.

 The NullPointer seems to occur at
 org.apache.hadoop.security.Krb5AndCertsSslSocketConnector, in
 newServerSocket(). useCerts is true, and I see a call to
 (SSLServerSocket) super.newServerSocket(host, port, backlog). I think
 things might go wrong there.

 This is probably due to some missing configuration. I have not set
 dfs.https.need.client.auth; since it defaults to false, I have not
 included an ssl-client.xml configuration file or key- and truststores
 for clients. I wouldn't mind doing that, but I'm not sure why I would need a
 keystore for clients - I guess the framework checks for DN-to-user
 mappings; it shouldn't need user keys.

 Any help is much appreciated!

 Evert



Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Michel Segel
Let's play devil's advocate for a second?

Why? Snappy exists.
The only advantage is that you don't have to convert from gzip to snappy and 
can process gzip files natively.

Next question is how large are the gzip files in the first place?

I don't disagree, I just want to have a solid argument in favor of it...




Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 28, 2012, at 9:50 AM, Niels Basjes ni...@basjes.nl wrote:

 Hi,
 
 Some time ago I had an idea and implemented it.
 
 Normally you can only run a single gzipped input file through a single
 mapper and thus only on a single CPU core.
 What I created makes it possible to process a Gzipped file in such a way
 that it can run on several mappers in parallel.
 
 I've put the javadoc I created on my homepage so you can read more about
 the details.
 http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec
 
 Now the question that was raised by one of the people reviewing this code
 was: Should this implementation be part of the core Hadoop feature set?
 The main reason given against it is that using it needs a bit more understanding
 of what is happening, and as such it cannot be enabled by default.
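 
 To give an idea of what enabling it would involve: registering a codec is just
 a core-site.xml setting, roughly like the sketch below (the exact class list
 depends on how the codec claims the .gz extension, and the codec class name
 here is only a placeholder):
 
 <property>
   <name>io.compression.codecs</name>
   <value>org.apache.hadoop.io.compress.DefaultCodec,nl.basjes.hadoop.io.compress.SplittableGzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
 </property>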
 
 I would like to hear from the Hadoop Core/MapReduce users what you think.
 
 Should this be
 - a part of the default Hadoop feature set so that anyone can simply enable
 it by setting the right configuration?
 - a separate library?
 - a nice idea I had fun building but that no one needs?
 - ... ?
 
 -- 
 Best regards / Met vriendelijke groeten,
 
 Niels Basjes


Hadoop fair scheduler doubt: allocate jobs to pool

2012-02-29 Thread Austin Chungath
How can I set the fair scheduler such that all jobs submitted from a
particular user group go to a pool with the group name?

I have set up the fair scheduler and I have two users: A and B (both belonging
to the user group 'hadoop').

When these users submit hadoop jobs, the jobs from A go to a pool named A
and the jobs from B go to a pool named B.
I want them to go to a pool with their group name, so I tried adding the
following to mapred-site.xml:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>group.name</value>
</property>

But instead the jobs now go to the default pool.
I want the jobs submitted by A and B to go to the pool named 'hadoop'. How
do I do that?
Also, how can I explicitly set a job to any specified pool?

I have set the allocation file (fair-scheduler.xml) like this:

<allocations>
  <pool name="hadoop">
    <minMaps>1</minMaps>
    <minReduces>1</minReduces>
    <maxMaps>3</maxMaps>
    <maxReduces>3</maxReduces>
  </pool>
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>
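
For the explicit-pool case, one approach that should work with the 0.20 fair
scheduler (a sketch based on its poolnameproperty behaviour) is to point the
scheduler at a custom job property and then set that property per job:

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

// in the job driver (the job class name is just a placeholder):
JobConf conf = new JobConf(MyJob.class);
conf.set("pool.name", "hadoop");   // this job now lands in the "hadoop" pool

Jobs that do not set pool.name should then fall back to the default pool.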

Any help is greatly appreciated.
Thanks,
Austin


TaskTracker without datanode

2012-02-29 Thread Daniel Baptista
Hi All,

I was wondering (network traffic considerations aside): is it possible to run a 
TaskTracker without a DataNode? I was hoping to test this as a means of 
temporarily scaling processing power.

Are there better approaches? I don't (currently) need the additional storage 
that a DataNode provides, and I would like to add processing power 
from time to time without worrying about data loss and decommissioning DataNodes.

Thanks, Dan.


RE: TaskTracker without datanode

2012-02-29 Thread Daniel Baptista
Forgot to mention that I am using Hadoop 0.20.2

From: Daniel Baptista
Sent: 29 February 2012 14:44
To: common-user@hadoop.apache.org
Subject: TaskTracker without datanode

Hi All,

I was wondering (network traffic considerations aside): is it possible to run a 
TaskTracker without a DataNode? I was hoping to test this as a means of 
temporarily scaling processing power.

Are there better approaches? I don't (currently) need the additional storage 
that a DataNode provides, and I would like to add processing power 
from time to time without worrying about data loss and decommissioning DataNodes.

Thanks, Dan.


Re: TaskTracker without datanode

2012-02-29 Thread Harsh J
Yes this is fine to do. TTs are not dependent on co-located DNs, but
only benefit if they are.
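
For a compute-only node, a minimal sketch of what that looks like with the
standard 0.20 scripts (paths assumed to match a stock install):

# on the new node: start only the TaskTracker daemon, no DataNode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

# when the extra capacity is no longer needed:
$HADOOP_HOME/bin/hadoop-daemon.sh stop tasktracker

Since no DataNode ever runs there, there is nothing to decommission on the
HDFS side when the node goes away; only the JobTracker sees the TT appear and
disappear.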

On Wed, Feb 29, 2012 at 8:14 PM, Daniel Baptista
daniel.bapti...@performgroup.com wrote:
 Forgot to mention that I am using Hadoop 0.20.2

 From: Daniel Baptista
 Sent: 29 February 2012 14:44
 To: common-user@hadoop.apache.org
 Subject: TaskTracker without datanode

 Hi All,

 I was wondering (network traffic considerations aside): is it possible to run 
 a TaskTracker without a DataNode? I was hoping to test this as a means 
 of temporarily scaling processing power.

 Are there better approaches? I don't (currently) need the additional storage 
 that a DataNode provides, and I would like to add processing power 
 from time to time without worrying about data loss and decommissioning 
 DataNodes.

 Thanks, Dan.



-- 
Harsh J


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Edward Capriolo
Mike,

Snappy is cool and all, but I was not overly impressed with it.

GZ compresses much better than Snappy. Last time I checked, for our log files
gzip took them down from 100MB to 40MB, while Snappy compressed them
from 100MB to 55MB. That was only with sequence files, but still that is
pretty significant if you are considering long-term storage. Also,
because the delta in the file sizes was large, I could not actually
make the argument that using sequence+snappy was faster than sequence+gz.
Sure, the MB/s rate was probably faster, but since I had more MB I was
not able to prove Snappy a win. I use it for intermediate compression
only.

Actually the raw format (plain gz, versus sequence-file gz) is significantly smaller
and faster than its sequence file counterpart.

Believe it or not, I commonly use mapred.output.compress without
sequence files. As long as I have a large number of reducers I do not
have to worry about the files being splittable, because N mappers process N
files. Generally I am happy with, say, N mappers, because the input
formats tend to create more mappers than I want, which means more
overhead and more shuffle.
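
For reference, a minimal sketch of that setup with the old mapred API (gzip
assumed as the output codec):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

// in the job driver:
JobConf conf = new JobConf();
// compress the final reducer output (equivalent to mapred.output.compress=true)
FileOutputFormat.setCompressOutput(conf, true);
// write gzipped part files instead of sequence files
FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);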

But being able to generate split info for them and process them
would be good as well. I remember that was a hot thing to do with LZO
back in the day. The pain of making an extra pass over the gz files to generate the
split info is a detractor, but it is nice to know it is there if you
want it.

Edward
On Wed, Feb 29, 2012 at 7:10 AM, Michel Segel michael_se...@hotmail.com wrote:
 Let's play devil's advocate for a second?

 Why? Snappy exists.
 The only advantage is that you don't have to convert from gzip to snappy and 
 can process gzip files natively.

 Next question is how large are the gzip files in the first place?

 I don't disagree, I just want to have a solid argument in favor of it...




 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Feb 28, 2012, at 9:50 AM, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 Some time ago I had an idea and implemented it.

 Normally you can only run a single gzipped input file through a single
 mapper and thus only on a single CPU core.
 What I created makes it possible to process a Gzipped file in such a way
 that it can run on several mappers in parallel.

 I've put the javadoc I created on my homepage so you can read more about
 the details.
 http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec

 Now the question that was raised by one of the people reviewing this code
 was: Should this implementation be part of the core Hadoop feature set?
 The main reason that was given is that this needs a bit more understanding
 on what is happening and as such cannot be enabled by default.

 I would like to hear from the Hadoop Core/Map reduce users what you think.

 Should this be
 - a part of the default Hadoop feature set so that anyone can simply enable
 it by setting the right configuration?
 - a separate library?
 - a nice idea I had fun building but that no one needs?
 - ... ?

 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
Hi,

On Wed, Feb 29, 2012 at 13:10, Michel Segel michael_se...@hotmail.comwrote:

 Let's play devil's advocate for a second?


I always like that :)


 Why?


Because then data files from other systems (like the Apache HTTP webserver)
can be processed more efficiently, without any preprocessing.

Snappy exists.


Compared to gzip: Snappy is faster, compresses a bit less and is
unfortunately not splittable.

The only advantage is that you don't have to convert from gzip to snappy
 and can process gzip files natively.


Yes, that and the fact that the files are smaller.
Note that I've described some of these considerations in the javadoc.

Next question is how large are the gzip files in the first place?


I work for the biggest webshop in the Netherlands and I'm facing a set of
logfiles that are very often > 1 GB each and are gzipped.
The first thing we do with them is parse and dissect each line in the very
first mapper. Then we store the result in (Snappy-compressed) Avro files.

I don't disagree, I just want to have a solid argument in favor of it...


:)

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Niels Basjes
Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.comwrote:
...

 But being able to generate split info for them and processing them
 would be good as well. I remember that was a hot thing to do with lzo
 back in the day. The pain of once overing the gz files to generate the
 split info is detracting but it is nice to know it is there if you
 want it.


Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.

The other two ways of splitting gzipped files require either:
- creating some kind of compression index before actually using the file
(HADOOP-6153), or
- creating a file in a format that is generated in such a way that it is
really a set of concatenated gzipped files (HADOOP-7909).

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Robert Evans
I can see a use for it, but I have two concerns about it.  My biggest concern 
is maintainability.  We have had lots of things get thrown into contrib in the 
past, very few people use them, and inevitably they start to suffer from bit 
rot.  I am not saying that it will happen with this, but if you have to ask if 
people will use it and there has been no overwhelming yes, it makes me nervous 
about it.  My second concern is with knowing when to use this.  Anything that 
adds this in would have to come with plenty of documentation about how it 
works, how it is different from the normal gzip format, explanations about what 
type of a load it might put on data nodes that hold the start of the file, etc.

Given both of these, I would prefer to see this as a GitHub project for a while 
first, and once it shows that it has a significant following, or a community 
around it, then we can pull it in. But if others disagree I am not going to 
block it. I am a -0 on pulling this in now.

--Bobby

On 2/29/12 10:00 AM, Niels Basjes ni...@basjes.nl wrote:

Hi,

On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.comwrote:
...

 But being able to generate split info for them and processing them
 would be good as well. I remember that was a hot thing to do with lzo
 back in the day. The pain of once overing the gz files to generate the
 split info is detracting but it is nice to know it is there if you
 want it.


Note that the solution I created (HADOOP-7076) does not require any
preprocessing.
It can split ANY gzipped file as-is.
The downside is that this effectively costs some additional performance
because the task has to decompress the first part of the file that is to be
discarded.

The other two ways of splitting gzipped files either require
- creating come kind of compression index before actually using the file
(HADOOP-6153)
- creating a file in a format that is gerenated in such a way that it is
really a set of concatenated gzipped files. (HADOOP-7909)

--
Best regards / Met vriendelijke groeten,

Niels Basjes



Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
I am going to try a few things today. I have a JAXBContext object that
marshals the XML; it is a static instance, but my guess at this point is
that since it lives in a separate jar from the one where the job runs, and I used
DistributedCache.addFileToClassPath, this context is being created on every call
for some reason. I don't know why that would be. I am going to create this
instance as a static field in the mapper class itself and see if that helps. I will also
add debug logging. Will post the results after trying it out.
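
For what it's worth, a minimal sketch of the static approach (assuming the
context is built for FormMLType; adjust to however FormMLUtils builds it today):

public static class Map extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  // Built once per task JVM and reused: JAXBContext.newInstance() is expensive,
  // while the context itself (unlike Marshaller/Unmarshaller) is thread-safe.
  private static final JAXBContext JAXB_CONTEXT;
  static {
    try {
      JAXB_CONTEXT = JAXBContext.newInstance(FormMLType.class);
    } catch (JAXBException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  public void map(Text key, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    // create cheap per-call Unmarshaller/Marshaller objects from JAXB_CONTEXT here
  }
}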

On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 It would be great if we can take a look at what you are doing in the UDF vs
 the Mapper.

 100x slow does not make sense for the same job/logic, its either the Mapper
 code or may be the cluster was busy at the time you scheduled MapReduce
 job?

 Thanks,
 Prashant

 On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I am comparing runtime of similar logic. The entire logic is exactly same
  but surprisingly map reduce job that I submit is 100x slow. For pig I use
  udf and for hadoop I use mapper only and the logic same as pig. Even the
  splits on the admin page are same. Not sure why it's so slow. I am
  submitting job like:
 
  java -classpath
 
 
 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
  com.services.dp.analytics.hadoop.mapred.FormMLProcessor
 
 
 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
  /examples/output1/
 
  How should I go about looking the root cause of why it's so slow? Any
  suggestions would be really appreciated.
 
 
 
  One of the things I noticed is that on the admin page of map task list I
  see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728
 but
  for pig the status is blank.
 



Re: Browse the filesystem weblink broken after upgrade to 1.0.0: HTTP 404 Problem accessing /browseDirectory.jsp

2012-02-29 Thread W.P. McNeill
I can perform HDFS operations from the command line, like hadoop fs -ls
/. Doesn't that mean that the datanode is up?
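
A quick way to double-check from the same command line (standard commands, no
web UI involved):

hadoop dfsadmin -report    # lists live and dead datanodes as the namenode sees them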


Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Robert Evans
If many people are going to use it then by all means put it in. If there is 
only one person, or a very small handful of people, that are going to use it, 
then I personally would prefer to see it as a separate project. However, Edward, 
you have convinced me that I am trying to make a logical judgment based only on 
a gut feeling and the response rate to an email chain. Thanks for that. What 
I really want to know is how well this new CompressionCodec performs in 
comparison to the regular gzip codec under various conditions, and what 
type of impact it has on network traffic and datanode load. My gut 
feeling is that the speedup is going to be relatively small except when there 
is a lot of computation happening in the mapper, and that in most cases the 
added load and network traffic will outweigh the speedup; but as with all 
performance on a complex system, gut feelings are almost worthless and hard 
numbers are what is needed to make a judgment call. Niels, I assume you have 
tested this on your cluster(s). Can you share with us some of the numbers?

--Bobby Evans

On 2/29/12 11:06 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

Too bad we cannot up the replication on the first few blocks of the
file, or put it in the distributed cache.

The contrib statement is arguable. I could make a case that the
majority of stuff should not be in hadoop-core. NLineInputFormat, for
example, is nice to have, but took a long time to get ported to the new
MapReduce API. DBInputFormat and DataDrivenDBInputFormat are sexy for sure but
do not need to be part of core. I could see Hadoop as just coming
with TextInputFormat and SequenceFileInputFormat, with everything else being
after-market from GitHub.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans ev...@yahoo-inc.com wrote:
 I can see a use for it, but I have two concerns about it.  My biggest concern 
 is maintainability.  We have had lots of things get thrown into contrib in 
 the past, very few people use them, and inevitably they start to suffer from 
 bit rot.  I am not saying that it will happen with this, but if you have to 
 ask if people will use it and there has been no overwhelming yes, it makes me 
 nervous about it.  My second concern is with knowing when to use this.  
 Anything that adds this in would have to come with plenty of documentation 
 about how it works, how it is different from the normal gzip format, 
 explanations about what type of a load it might put on data nodes that hold 
 the start of the file, etc.

 From both of these I would prefer to see this as a github project for a while 
 first, and one it shows that it has a significant following, or a community 
 with it, then we can pull it in.  But if others disagree I am not going to 
 block it.  I am a -0 on pulling this in now.

 --Bobby

 On 2/29/12 10:00 AM, Niels Basjes ni...@basjes.nl wrote:

 Hi,

 On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.comwrote:
 ...

 But being able to generate split info for them and processing them
 would be good as well. I remember that was a hot thing to do with lzo
 back in the day. The pain of once overing the gz files to generate the
 split info is detracting but it is nice to know it is there if you
 want it.


 Note that the solution I created (HADOOP-7076) does not require any
 preprocessing.
 It can split ANY gzipped file as-is.
 The downside is that this effectively costs some additional performance
 because the task has to decompress the first part of the file that is to be
 discarded.

 The other two ways of splitting gzipped files either require
 - creating come kind of compression index before actually using the file
 (HADOOP-6153)
 - creating a file in a format that is gerenated in such a way that it is
 really a set of concatenated gzipped files. (HADOOP-7909)

 --
 Best regards / Met vriendelijke groeten,

 Niels Basjes




Streaming Hadoop using C

2012-02-29 Thread Mark question
Hi guys, thought I should ask this before I use it ... will using C over
Hadoop give me the usual C memory management? For example, malloc(),
sizeof()? My guess is no, since this will all eventually be turned into
bytecode, but I need more control over memory, which is obviously hard for me
to do with Java.

Let me know of any advantages you know about streaming in C over hadoop.
Thank you,
Mark


Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
Mark,
Both streaming and pipes allow this, perhaps more so pipes at the level of the 
mapreduce task. Can you provide more details on the application?
On Feb 29, 2012, at 1:56 PM, Mark question wrote:

 Hi guys, thought I should ask this before I use it ... will using C over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over hadoop.
 Thank you,
 Mark



Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
Thanks Charles .. I'm running Hadoop for research to perform duplicate
detection methods. To go deeper, I need to understand what's slowing my
program, which usually starts with analyzing memory to predict best input
size for map task. So you're saying piping can help me control memory even
though it's running on VM eventually?

Thanks,
Mark

On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote:

 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:

  Hi guys, thought I should ask this before I use it ... will using C over
  Hadoop give me the usual C memory management? For example, malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned into
  bytecode, but I need more control on memory which obviously is hard for
 me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over hadoop.
  Thank you,
  Mark




Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
Mark,
So if I understand, it is more the memory management that you are interested 
in, rather than a need to run an existing C or C++ application in MapReduce 
platform?
Have you done profiling of the application?
C
On Feb 29, 2012, at 2:19 PM, Mark question wrote:

 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best input
 size for map task. So you're saying piping can help me control memory even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over hadoop.
 Thank you,
 Mark
 
 



Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
I've used Hadoop profiling (.prof) to show the stack trace, but it was hard
to follow. I used jConsole locally, since I couldn't find a way to set a port
number for the child processes when running them remotely. Linux commands
(top, /proc) showed me that the virtual memory is almost twice my
physical memory, which means swapping is happening, which is what I'm trying to
avoid.

So basically, is there a way to assign a port to child processes to monitor
them remotely (asked before by Xun) or would you recommend another
monitoring tool?

Thank you,
Mark
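
One approach that is sometimes used for this (a sketch, not something verified
here): give the child JVMs a fixed JMX port through mapred.child.java.opts and
keep mapred.tasktracker.map.tasks.maximum at 1 on the node being watched, so
only one child tries to bind the port; then point jConsole/VisualVM at that
host and port:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false</value>
</property>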


On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote:

 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++ application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:

  Thanks Charles .. I'm running Hadoop for research to perform duplicate
  detection methods. To go deeper, I need to understand what's slowing my
  program, which usually starts with analyzing memory to predict best input
  size for map task. So you're saying piping can help me control memory
 even
  though it's running on VM eventually?
 
  Thanks,
  Mark
 
  On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
  Mark,
  Both streaming and pipes allow this, perhaps more so pipes at the level
 of
  the mapreduce task. Can you provide more details on the application?
  On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
  Hi guys, thought I should ask this before I use it ... will using C
 over
  Hadoop give me the usual C memory management? For example, malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned into
  bytecode, but I need more control on memory which obviously is hard for
  me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over
 hadoop.
  Thank you,
  Mark
 
 




Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
The documentation on Starfish http://www.cs.duke.edu/starfish/index.html
looks promising, though I have not used it. I wonder if others on the list have found 
it more useful than setting mapred.task.profile.
C
On Feb 29, 2012, at 3:53 PM, Mark question wrote:

 I've used hadoop profiling (.prof) to show the stack trace but it was hard
 to follow. jConsole locally since I couldn't find a way to set a port
 number to child processes when running them remotely. Linux commands
 (top,/proc), showed me that the virtual memory is almost twice as my
 physical which means swapping is happening which is what I'm trying to
 avoid.
 
 So basically, is there a way to assign a port to child processes to monitor
 them remotely (asked before by Xun) or would you recommend another
 monitoring tool?
 
 Thank you,
 Mark
 
 
 On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote:
 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++ application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best input
 size for map task. So you're saying piping can help me control memory
 even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level
 of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C
 over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over
 hadoop.
 Thank you,
 Mark
 
 
 
 



Re: Streaming Hadoop using C

2012-02-29 Thread Charles Earl
I assume you have also just tried running locally and using the JDK performance 
tools (e.g. jmap) to gain insight, by configuring Hadoop to run the absolute minimum 
number of tasks?
Perhaps the discussion
http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
might be relevant?
On Feb 29, 2012, at 3:53 PM, Mark question wrote:

 I've used hadoop profiling (.prof) to show the stack trace but it was hard
 to follow. jConsole locally since I couldn't find a way to set a port
 number to child processes when running them remotely. Linux commands
 (top,/proc), showed me that the virtual memory is almost twice as my
 physical which means swapping is happening which is what I'm trying to
 avoid.
 
 So basically, is there a way to assign a port to child processes to monitor
 them remotely (asked before by Xun) or would you recommend another
 monitoring tool?
 
 Thank you,
 Mark
 
 
 On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote:
 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++ application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
 Thanks Charles .. I'm running Hadoop for research to perform duplicate
 detection methods. To go deeper, I need to understand what's slowing my
 program, which usually starts with analyzing memory to predict best input
 size for map task. So you're saying piping can help me control memory
 even
 though it's running on VM eventually?
 
 Thanks,
 Mark
 
 On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
 Mark,
 Both streaming and pipes allow this, perhaps more so pipes at the level
 of
 the mapreduce task. Can you provide more details on the application?
 On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
 Hi guys, thought I should ask this before I use it ... will using C
 over
 Hadoop give me the usual C memory management? For example, malloc() ,
 sizeof() ? My guess is no since this all will eventually be turned into
 bytecode, but I need more control on memory which obviously is hard for
 me
 to do with Java.
 
 Let me know of any advantages you know about streaming in C over
 hadoop.
 Thank you,
 Mark
 
 
 
 



Re: 100x slower mapreduce compared to pig

2012-02-29 Thread Mohit Anchlia
I think I've found the problem. There was one line of code that caused this
issue :)  that was output.collect(key, value);
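
Presumably the loop over fm.getFormattedRecords() (in the mapper quoted below)
was meant to emit each formatted row rather than re-emit the whole input value
on every iteration; a minimal sketch of that fix (assuming the row is the
intended output value):

for (String row : fm.getFormattedRecords(false)) {
  // emit the formatted row, not the original (large) input document
  output.collect(key, new Text(row));
}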

I had to add more logging to the code to get to it. For some reason kill
-QUIT didn't send the stack trace to userLogs/job/attempt/syslog; I
searched all the logs and couldn't find one. Does anyone know where
stack traces are generally sent?

On Wed, Feb 29, 2012 at 1:08 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I can't seem to find what's causing this slowness. Nothing in the logs.
 It's just painfully slow. However, the pig job, which has the same logic,
 performs great. Here is the mapper code and the pig code:


 public static class Map extends MapReduceBase
     implements Mapper<Text, Text, Text, Text> {

   public void map(Text key, Text value,
       OutputCollector<Text, Text> output,
       Reporter reporter) throws IOException {

     String line = value.toString();
     //log.info("output key:" + key + " value " + value + " value " + line);
     FormMLType f;
     try {
       f = FormMLUtils.convertToRows(line);
       FormMLStack fm = new FormMLStack(f, key.toString());
       fm.parseFormML();
       for (String row : fm.getFormattedRecords(false)) {
         output.collect(key, value);
       }
     } catch (JAXBException e) {
       log.error("Error processing record " + key, e);
     }
   }
 }

 And here is the pig udf:


 public DataBag exec(Tuple input) throws IOException {
   try {
     DataBag output = mBagFactory.newDefaultBag();
     Object o = input.get(1);
     if (!(o instanceof String)) {
       throw new IOException(
           "Expected document input to be chararray, but got "
           + o.getClass().getName());
     }
     Object o1 = input.get(0);
     if (!(o1 instanceof String)) {
       throw new IOException(
           "Expected input to be chararray, but got "
           + o.getClass().getName());
     }
     String document = (String) o;
     String filename = (String) o1;
     FormMLType f = FormMLUtils.convertToRows(document);
     FormMLStack fm = new FormMLStack(f, filename);
     fm.parseFormML();
     for (String row : fm.getFormattedRecords(false)) {
       output.add(mTupleFactory.newTuple(row));
     }
     return output;
   } catch (ExecException ee) {
     log.error("Failed to Process ", ee);
     throw ee;
   } catch (JAXBException e) {
     // TODO Auto-generated catch block
     log.error("Invalid xml", e);
     throw new IllegalArgumentException("invalid xml " +
         e.getCause().getMessage());
   }
 }

   On Wed, Feb 29, 2012 at 9:27 AM, Mohit Anchlia 
 mohitanch...@gmail.comwrote:

 I am going to try few things today. I have a JAXBContext object that
 marshals the xml, this is static instance but my guess at this point is
 that since this is in separate jar then the one where job runs and I used
 DistributeCache.addClassPath this context is being created on every call
 for some reason. I don't know why that would be. I am going to create this
 instance as static in the mapper class itself and see if that helps. I also
 add debugs. Will post the results after try it out.


 On Tue, Feb 28, 2012 at 4:18 PM, Prashant Kommireddi prash1...@gmail.com
  wrote:

 It would be great if we can take a look at what you are doing in the UDF
 vs
 the Mapper.

 100x slow does not make sense for the same job/logic, its either the
 Mapper
 code or may be the cluster was busy at the time you scheduled MapReduce
 job?

 Thanks,
 Prashant

 On Tue, Feb 28, 2012 at 4:11 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

  I am comparing runtime of similar logic. The entire logic is exactly
 same
  but surprisingly map reduce job that I submit is 100x slow. For pig I
 use
  udf and for hadoop I use mapper only and the logic same as pig. Even
 the
  splits on the admin page are same. Not sure why it's so slow. I am
  submitting job like:
 
  java -classpath
 
 
 .:analytics.jar:/hadoop-0.20.2-cdh3u3/lib/*:/root/.mohit/hadoop-0.20.2-cdh3u3/*:common.jar
  com.services.dp.analytics.hadoop.mapred.FormMLProcessor
 
 
 /examples/testfile40.seq,/examples/testfile41.seq,/examples/testfile42.seq,/examples/testfile43.seq,/examples/testfile44.seq,/examples/testfile45.seq,/examples/testfile46.seq,/examples/testfile47.seq,/examples/testfile48.seq,/examples/testfile49.seq
  /examples/output1/
 
  How should I go about looking the root cause of why it's so slow? Any
  suggestions would be really appreciated.
 
 
 
  One of the things I noticed is that on the admin page of map task list
 I
  see status as hdfs://dsdb1:54310/examples/testfile40.seq:0+134217728
 but
  for pig the status is blank.
 






Re: Invocation exception

2012-02-29 Thread Mohit Anchlia
Thanks for the example. I did look at the logs and also at the admin page
and all I see is the exception that I posted initially.

I am not sure why adding an extra jar to the classpath via DistributedCache
causes that exception. I tried to look at the Configuration code in the
hadoop.util package but it doesn't tell me much. It looks like it's throwing
on this line: configureMethod.invoke(theObject, conf); in the code below.


private static void setJobConf(Object theObject, Configuration conf) {
  //If JobConf and JobConfigurable are in classpath, AND
  //theObject is of type JobConfigurable AND
  //conf is of type JobConf then
  //invoke configure on theObject
  try {
    Class<?> jobConfClass =
        conf.getClassByName("org.apache.hadoop.mapred.JobConf");
    Class<?> jobConfigurableClass =
        conf.getClassByName("org.apache.hadoop.mapred.JobConfigurable");
    if (jobConfClass.isAssignableFrom(conf.getClass()) &&
        jobConfigurableClass.isAssignableFrom(theObject.getClass())) {
      Method configureMethod =
          jobConfigurableClass.getMethod("configure", jobConfClass);
      configureMethod.invoke(theObject, conf);
    }
  } catch (ClassNotFoundException e) {
    //JobConf/JobConfigurable not in classpath. no need to configure
  } catch (Exception e) {
    throw new RuntimeException("Error in configuring object", e);
  }
}

On Tue, Feb 28, 2012 at 9:25 PM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 If you visit the failed task attempt on the JT Web UI, you can see the
 complete, informative stack trace on it. It would point the exact line
 the trouble came up in and what the real error during the
 configure-phase of task initialization was.

 A simple attempts page goes like the following (replace job ID and
 task ID of course):


 http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964tipid=task_201202041249_3964_m_00

 Once there, find and open the All logs link to see stdout, stderr,
 and syslog of the specific failed task attempt. You'll have more info
 sifting through this to debug your issue.

 This is also explained in Tom's book under the title Debugging a Job
 (p154, Hadoop: The Definitive Guide, 2nd ed.).

 On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  It looks like adding this line causes invocation exception. I looked in
  hdfs and I see that file in that path
 
   DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
 conf);
 
  I have similar code for another jar
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
  conf); but this works just fine.
 
 
  On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  I commented reducer and combiner both and still I see the same
 exception.
  Could it be because I have 2 jars being added?
 
   On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com
 wrote:
 
  On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   For some reason I am getting invocation exception and I don't see any
  more
   details other than this exception:
  
   My job is configured as:
  
  
    JobConf conf = new JobConf(FormMLProcessor.class);

    conf.addResource("hdfs-site.xml");
    conf.addResource("core-site.xml");
    conf.addResource("mapred-site.xml");
    conf.set("mapred.reduce.tasks", "0");
    conf.setJobName("mlprocessor");

    DistributedCache.addFileToClassPath(new Path("/jars/analytics.jar"),
        conf);
    DistributedCache.addFileToClassPath(new Path("/jars/common.jar"),
        conf);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(IdentityReducer.class);
  
 
  Why would you set the Reducer when the number of reducers is set to
 zero.
  Not sure if this is the real cause.
 
 
  
    conf.setInputFormat(SequenceFileAsTextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  
   -
    java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)
  
   Caused 

Re: Invocation exception

2012-02-29 Thread Harsh J
Mohit,

I'm positive the real exception lies a few scrolls below that message
on the attempt page. Possibly a class not found issue.

The message you see on top is when something throws up an exception
while being configure()-ed. It is most likely a job config or
setup-time issue from your code or from the library code.

On Thu, Mar 1, 2012 at 5:19 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Thanks for the example. I did look at the logs and also at the admin page
 and all I see is the exception that I posted initially.

 I am not sure why adding an extra jar to the classpath in DistributedCache
 causes that exception. I tried to look at Configuration code in hadoop.util
 package but it doesn't tell much. It looks like it's throwing on this line
 configureMethod.invoke(theObject, conf); in below code.


 *private* *static* *void* setJobConf(Object theObject, Configuration conf) {

 //If JobConf and JobConfigurable are in classpath, AND

 //theObject is of type JobConfigurable AND

 //conf is of type JobConf then

 //invoke configure on theObject

 *try* {

 Class? jobConfClass =

 conf.getClassByName(org.apache.hadoop.mapred.JobConf);

 Class? jobConfigurableClass =

 conf.getClassByName(org.apache.hadoop.mapred.JobConfigurable);

 *if* (jobConfClass.isAssignableFrom(conf.getClass()) 

 jobConfigurableClass.isAssignableFrom(theObject.getClass())) {

 Method configureMethod =

 jobConfigurableClass.getMethod(configure, jobConfClass);

 configureMethod.invoke(theObject, conf);

 }

 } *catch* (ClassNotFoundException e) {

 //JobConf/JobConfigurable not in classpath. no need to configure

 } *catch* (Exception e) {

 *throw* *new* RuntimeException(Error in configuring object, e);

 }

 }

 On Tue, Feb 28, 2012 at 9:25 PM, Harsh J ha...@cloudera.com wrote:

 Mohit,

 If you visit the failed task attempt on the JT Web UI, you can see the
 complete, informative stack trace on it. It would point the exact line
 the trouble came up in and what the real error during the
 configure-phase of task initialization was.

 A simple attempts page goes like the following (replace job ID and
 task ID of course):


 http://host:50030/taskdetails.jsp?jobid=job_201202041249_3964tipid=task_201202041249_3964_m_00

 Once there, find and open the All logs link to see stdout, stderr,
 and syslog of the specific failed task attempt. You'll have more info
 sifting through this to debug your issue.

 This is also explained in Tom's book under the title Debugging a Job
 (p154, Hadoop: The Definitive Guide, 2nd ed.).

 On Wed, Feb 29, 2012 at 1:40 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  It looks like adding this line causes invocation exception. I looked in
  hdfs and I see that file in that path
 
   DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
 conf);
 
  I have similar code for another jar
  DistributedCache.*addFileToClassPath*(*new* Path(/jars/analytics.jar),
  conf); but this works just fine.
 
 
  On Tue, Feb 28, 2012 at 11:44 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 
  I commented reducer and combiner both and still I see the same
 exception.
  Could it be because I have 2 jars being added?
 
   On Mon, Feb 27, 2012 at 8:23 PM, Subir S subir.sasiku...@gmail.com
 wrote:
 
  On Tue, Feb 28, 2012 at 4:30 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   For some reason I am getting invocation exception and I don't see any
  more
   details other than this exception:
  
   My job is configured as:
  
  
   JobConf conf = *new* JobConf(FormMLProcessor.*class*);
  
   conf.addResource(hdfs-site.xml);
  
   conf.addResource(core-site.xml);
  
   conf.addResource(mapred-site.xml);
  
   conf.set(mapred.reduce.tasks, 0);
  
   conf.setJobName(mlprocessor);
  
   DistributedCache.*addFileToClassPath*(*new*
 Path(/jars/analytics.jar),
   conf);
  
   DistributedCache.*addFileToClassPath*(*new* Path(/jars/common.jar),
   conf);
  
   conf.setOutputKeyClass(Text.*class*);
  
   conf.setOutputValueClass(Text.*class*);
  
   conf.setMapperClass(Map.*class*);
  
   conf.setCombinerClass(Reduce.*class*);
  
   conf.setReducerClass(IdentityReducer.*class*);
  
 
  Why would you set the Reducer when the number of reducers is set to
 zero.
  Not sure if this is the real cause.
 
 
  
   conf.setInputFormat(SequenceFileAsTextInputFormat.*class*);
  
   conf.setOutputFormat(TextOutputFormat.*class*);
  
   FileInputFormat.*setInputPaths*(conf, *new* Path(args[0]));
  
   FileOutputFormat.*setOutputPath*(conf, *new* Path(args[1]));
  
   JobClient.*runJob*(conf);
  
   -
   *
  
   java.lang.RuntimeException*: Error in configuring object
  
   at org.apache.hadoop.util.ReflectionUtils.setJobConf(*
   ReflectionUtils.java:93*)
  
   at
  
 
 org.apache.hadoop.util.ReflectionUtils.setConf(*ReflectionUtils.java:64*)
  
   at org.apache.hadoop.util.ReflectionUtils.newInstance(*
   ReflectionUtils.java:117*)
  
   at org.apache.hadoop.mapred.MapTask.runOldMapper(*MapTask.java:387*)
  
   at 

Re: Does Hadoop 0.20.205 and Ganglia 3.1.7 compatible with each other ?

2012-02-29 Thread Merto Mertek
Varun, sorry for my late response. Today I deployed a new version and I can
confirm that the patches you provided work well. I've been running some
jobs on a 5-node cluster for an hour at full load without a core dump, so now
things work as expected.

Thank you again!

I used just your first option.

On 15 February 2012 19:53, mete efk...@gmail.com wrote:

 Well rebuilding ganglia seemed easier and Merto was testing the other so i
 though that i should give that one a chance :)
 anyway i will send you gdb details or patch hadoop and try it at my
 earliest convenience

 Cheers

 On Wed, Feb 15, 2012 at 6:59 PM, Varun Kapoor rez...@hortonworks.com
 wrote:

  The warnings about underflow are totally expected (they come from
 strtod(),
  and they will no longer occur with Hadoop-1.0.1, which applies my patch
  from HADOOP-8052), so that's not worrisome.
 
  As for the buffer overflow, do you think you could show me a backtrace of
  this core? If you can't find the core file on disk, just start gmetad
 under
  gdb, like so:
 
  $ sudo gdb path to gmetad
 
  (gdb) r --conf=path to your gmetad.conf
  ...
  ::Wait for crash::
  (gdb) bt
  (gdb) info locals
 
  If you're familiar with gdb, then I'd appreciate any additional diagnosis
  you could perform (for example, to figure out which metric's value caused
  this buffer overflow) - if you're not, I'll try and send you some gdb
  scripts to narrow things down once I see the output from this round of
  debugging.
 
  Also, out of curiosity, is patching Hadoop not an option for you? Or is
 it
  just that rebuilding (and redeploying) ganglia is the lesser of the 2
  evils? :)
 
  Varun
 
  On Tue, Feb 14, 2012 at 11:43 PM, mete efk...@gmail.com wrote:
 
   Hello Varun,
   i have patched and recompiled ganglia from source bit it still cores
  after
   the patch.
  
   Here are some logs:
   Feb 15 09:39:14 master gmetad[16487]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hadoop/slave4/metricssystem.MetricsSystem.publish_max_time.rrd):
  
  
 
 /var/lib/ganglia/rrds/hadoop/slave4/metricssystem.MetricsSystem.publish_max_time.rrd:
   converting '4.9E-324' to float: Numerical result out of range
   Feb 15 09:39:14 master gmetad[16487]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hadoop/master/metricssystem.MetricsSystem.publish_imax_time.rrd):
  
  
 
 /var/lib/ganglia/rrds/hadoop/master/metricssystem.MetricsSystem.publish_imax_time.rrd:
   converting '4.9E-324' to float: Numerical result out of range
   Feb 15 09:39:14 master gmetad[16487]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_imax_time.rrd):
  
  
 
 /var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_imax_time.rrd:
   converting '4.9E-324' to float: Numerical result out of range
   Feb 15 09:39:14 master gmetad[16487]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.snapshot_imax_time.rrd):
  
  
 
 /var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.snapshot_imax_time.rrd:
   converting '4.9E-324' to float: Numerical result out of range
   Feb 15 09:39:14 master gmetad[16487]: RRD_update
  
  
 
 (/var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_max_time.rrd):
  
  
 
 /var/lib/ganglia/rrds/hadoop/slave1/metricssystem.MetricsSystem.publish_max_time.rrd:
   converting '4.9E-324' to float: Numerical result out of range
   Feb 15 09:39:14 master gmetad[16487]: *** buffer overflow detected ***:
   gmetad terminated
  
   i am using hadoop.1.0.0 and ganglia 3.20 tarball.
  
   Cheers
   Mete
  
   On Sat, Feb 11, 2012 at 2:19 AM, Merto Mertek masmer...@gmail.com
  wrote:
  
Varun unfortunately I have had some problems with deploying a new
  version
on the cluster.. Hadoop is not picking the new build in lib folder
   despite
a classpath is set to it. The new build is picked just if I put it in
  the
$HD_HOME/share/hadoop/, which is very strange.. I've done this on all
   nodes
and can access the web, but all tasktracker are being stopped because
  of
   an
error:
   
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager:
   Cleanup...
 java.lang.InterruptedException: sleep interrupted
 at java.lang.Thread.sleep(Native Method)
 at

   
  
 
 org.apache.hadoop.filecache.TrackerDistributedCacheManager$CleanupThread.run(TrackerDistributedCacheManager.java:926)

   
   
Probably the error is the consequence of an inadequate deploy of a
  jar..
   I
will ask to the dev list how they do it or are you maybe having any
  other
idea?
   
   
   
On 10 February 2012 17:10, Varun Kapoor rez...@hortonworks.com
  wrote:
   
 Hey Merto,

 Any luck getting the patch running on your cluster?

 In case you're interested, there's now a JIRA for this:
 https://issues.apache.org/jira/browse/HADOOP-8052.

 Varun

 On Wed, Feb 8, 2012 at 7:45 PM, Varun Kapoor 
 

Re: Streaming Hadoop using C

2012-02-29 Thread Mark question
Thank you for your time and suggestions, I've already tried starfish, but
not jmap. I'll check it out.
Thanks again,
Mark

On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote:

 I assume you have also just tried running locally and using the jdk
 performance tools (e.g. jmap) to gain insight by configuring hadoop to run
 absolute minimum number of tasks?
 Perhaps the discussion

 http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
 might be relevant?
 On Feb 29, 2012, at 3:53 PM, Mark question wrote:

  I've used hadoop profiling (.prof) to show the stack trace but it was
 hard
  to follow. jConsole locally since I couldn't find a way to set a port
  number to child processes when running them remotely. Linux commands
  (top,/proc), showed me that the virtual memory is almost twice as my
  physical which means swapping is happening which is what I'm trying to
  avoid.
 
  So basically, is there a way to assign a port to child processes to
 monitor
  them remotely (asked before by Xun) or would you recommend another
  monitoring tool?
 
  Thank you,
  Mark
 
 
  On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com
 wrote:
 
  Mark,
  So if I understand, it is more the memory management that you are
  interested in, rather than a need to run an existing C or C++
 application
  in MapReduce platform?
  Have you done profiling of the application?
  C
  On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
  Thanks Charles .. I'm running Hadoop for research to perform duplicate
  detection methods. To go deeper, I need to understand what's slowing my
  program, which usually starts with analyzing memory to predict best
 input
  size for map task. So you're saying piping can help me control memory
  even
  though it's running on VM eventually?
 
  Thanks,
  Mark
 
  On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl 
 charles.ce...@gmail.com
  wrote:
 
  Mark,
  Both streaming and pipes allow this, perhaps more so pipes at the
 level
  of
  the mapreduce task. Can you provide more details on the application?
  On Feb 29, 2012, at 1:56 PM, Mark question wrote:
 
  Hi guys, thought I should ask this before I use it ... will using C
  over
  Hadoop give me the usual C memory management? For example, malloc() ,
  sizeof() ? My guess is no since this all will eventually be turned
 into
  bytecode, but I need more control on memory which obviously is hard
 for
  me
  to do with Java.
 
  Let me know of any advantages you know about streaming in C over
  hadoop.
  Thank you,
  Mark