Write a file to local disks on all nodes of a YARN cluster.

2013-12-08 Thread Jay Vyas
I want to put a file on all nodes of my cluster, so that it is locally readable
(not in HDFS).

Assuming that I can't guarantee a FUSE mount or NFS or anything of the sort
on my cluster, is there a poor man's way to do this in YARN?  Something
like:

for each node n in cluster:
    n.copyToLocal("a", "/tmp/a");

So that afterwards, all nodes in the cluster have a file "a" in /tmp.
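A hedged sketch of one poor man's approach, assuming the file has already been pushed
to HDFS (at a hypothetical /tmp/a-src) and that a throwaway map-only job is launched
with roughly one task per node; plain MapReduce/YARN does not strictly guarantee one
task per node, so this is best-effort rather than exact:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Best-effort sketch: each map task pulls the file out of HDFS onto its node's local disk.
public class LocalizeMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    FileSystem hdfs = FileSystem.get(ctx.getConfiguration());
    // copyToLocalFile writes to the node's local filesystem, not to HDFS
    hdfs.copyToLocalFile(new Path("/tmp/a-src"), new Path("/tmp/a"));
  }
}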

-- 
Jay Vyas
http://jayunit100.blogspot.com


FSMainOperations > FSContract tests?

2013-12-05 Thread Jay Vyas
Mainly @steveloughran: Is it safe to say that the *old* FS semantics are in the
FSContract tests, and the *new* FS semantics in the FSMainOps tests?

I ask this because it seems that your swift filesystem tests used the
FSContract libs as well as the FSMainOps ones.

I'm not sure why you need both; there seems to be pretty high redundancy between them.

Re: Hadoop Test libraries: Where did they go ?

2013-11-25 Thread Jay Vyas
Yup, we figured it out eventually.
The artifacts now use the test-jar directive, which creates an attached test jar
that you can reference in mvn as a test-jar dependency in your pom.

However, FYI, I haven't been able to successfully google for the quintessential
classes in the hadoop test libs, like the fs BaseContractTest, by name, so they
are now harder to find than before.

So I think it's unfortunate that they are not a top-level maven artifact.

It's misleading, as it's now very easy to assume from looking at hadoop in mvn
central that hadoop-test is just an old library that nobody updates anymore.

Just a thought, but maybe hadoop-test could be rejuvenated to point to
hadoop-common somehow?


> On Nov 25, 2013, at 4:52 AM, Steve Loughran  wrote:
> 
> I see a hadoop-common-2.2.0-tests.jar in org.apache.hadoop/hadoop-?common;
> SHA1 a9994d261d00295040a402cd2f611a2bac23972a, which resolves in a search
> engine to
> http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/2.2.0/
> 
> It looks like it is now part of the hadoop-common artifacts, you just say
> you want the test bits
> 
> http://maven.apache.org/guides/mini/guide-attached-tests.html
> 
> 
> 
>> On 21 November 2013 23:28, Jay Vyas  wrote:
>> 
>> It appears to me that
>> 
>> http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-test
>> 
>> Is no longer updated
>> 
>> Where does hadoop now package the test libraries?
>> 
>> Looking in the ".//hadoop-common-project/hadoop-common/pom.xml " file in
>> the hadoop 2X branches, I'm not sure whether or not src/test is packaged into
>> a jar anymore... but I fear it is not.
>> 
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
> 


RawLocalFileSystem, getPos and NullPointerException

2013-09-09 Thread Jay Vyas
What is the correct behaviour for getPos() in a record reader, and how
should it behave when the underlying stream is null?  It appears this can
happen in RawLocalFileSystem.  I'm not sure if it's implemented more safely
in DistributedFileSystem just yet.

I've found that the getPos() in the RawLocalFileSystem's input stream can
throw a NullPointerException if its underlying stream is closed.

I discovered this when playing with a custom record reader.

To patch it, I simply check whether a call to "stream.available()" throws an
exception, and if so, I return 0 from the getPos() function.
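A minimal sketch of that workaround inside a custom record reader (hypothetical field
name fileIn for the FSDataInputStream the reader opened; this is just the guard described
above, not the actual patch):

public long getPos() throws IOException {
  try {
    fileIn.available();      // probe: throws IOException once the underlying stream is closed
  } catch (IOException streamClosed) {
    return 0;                // report position 0 instead of letting getPos() blow up
  }
  return fileIn.getPos();    // fileIn is the FSDataInputStream opened in the reader's constructor
}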

The existing getPos() implementation is found here:

https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20/src/examples/org/apache/hadoop/examples/MultiFileWordCount.java

What should be the correct behaviour of getPos() in the RecordReader?


http://stackoverflow.com/questions/18708832/hadoop-rawlocalfilesystem-and-getpos

-- 
Jay Vyas
http://jayunit100.blogspot.com


MultiFileLineRecordReader vs CombineFileRecordReader

2013-09-07 Thread Jay Vyas
I've found that there are two different implementations of seemingly the
same class, used to implement RecordReaders for the MultiFileWordCount class:

MultiFileLineRecordReader (implemented as an inner class in some versions
of MultiFileWordCount), and

CombineFileRecordReader.

Is there any major difference between these classes, and why the redundancy?
I'm thinking maybe one was retro-added at some point, based on some git
detective work which I tried...

But I figured it might just be easier to ask here :)

-- 
Jay Vyas
http://jayunit100.blogspot.com


Mapred.system.dir: should JT start without it?

2013-08-15 Thread Jay Vyas
Is there a startup contract for mapreduce creating its own mapred.system.dir?

Also, it seems that the jobtracker can start up even if this directory was not
created / doesn't exist - I'm thinking that if that's the case, the JT should fail
up front.

Re: JobSubmissionFiles: past , present, and future?

2013-04-12 Thread Jay Vyas
To update on this: it was just pointed out to me by Matt Farrallee
that the auto-fix of permissions is a failsafe
in case of a race condition, and is not meant to mend bad permissions in all cases:

https://github.com/apache/hadoop-common/commit/f25dc04795a0e9836e3f237c802bfc1fe8a243ad

Something to keep in mind - if you see the "fixing staging permissions" error
message a lot, then there might be a more systemic problem in your fs... At least,
that was the case for us.

On Apr 12, 2013, at 6:11 AM, Jay Vyas  wrote:

> Hi guys: 
> 
> I'm curious about the changes and future of the JobSubmissionFiles class.
> 
> Grepping around on the web I'm finding some code snippets that suggest that 
> hadoop security is not handled the same way on the staging directory as 
> before:
> 
> http://javasourcecode.org/html/open-source/hadoop/hadoop-0.20.203.0/org/apache/hadoop/mapreduce/JobSubmissionFiles.java.html
> 
> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201210.mbox/%3ccaocnvr0eylsckxaocpnm7kbzwphvcdjbbx5a+azes_s6pws...@mail.gmail.com%3E
> 
> But I'm having trouble definitively pinning this to versions.
> 
> Why the difference in the if/else logic here and what is the future
> Of permissions on .staging?


JobSubmissionFiles: past , present, and future?

2013-04-12 Thread Jay Vyas
Hi guys: 

I'm curious about the changes and future of the JobSubmissionFiles class.

Grepping around on the web I'm finding some code snippets that suggest that 
hadoop security is not handled the same way on the staging directory as before:

http://javasourcecode.org/html/open-source/hadoop/hadoop-0.20.203.0/org/apache/hadoop/mapreduce/JobSubmissionFiles.java.html

http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201210.mbox/%3ccaocnvr0eylsckxaocpnm7kbzwphvcdjbbx5a+azes_s6pws...@mail.gmail.com%3E

But I'm having trouble definitively pinning this to versions.

Why the difference in the if/else logic here, and what is the future
of permissions on .staging?

Re: Non local mapper .. Is it worth it?

2012-12-06 Thread Jay Vyas
Hmm... but how can the scheduler affect the performance of a mapper if there
are no competing jobs?

I thought the scheduler only impacted the way resources were divided between
separate jobs. In my example, there are 2 mappers, 2+n files, and 1 job.

Jay Vyas 
http://jayunit100.blogspot.com

On Dec 6, 2012, at 4:39 AM, Bertrand Dechoux  wrote:

> The short answer is yes it can be worth it because your job can finish
> faster if you are not only allowing local mappers. But this is of course a
> trade off. The best performance (but not latency) can be obtained when
> using only local mappers. You should read about delay scheduling which
> allows the user to pick what is the 'best'. Fair scheduler has it for
> hadoop 1 and capacity scheduler has it but for hadoop 2.
> 
> Regards
> 
> Bertrand
> 
> On Thu, Dec 6, 2012 at 6:14 AM,  wrote:
> 
>> If there is a job with files f1 and f2, and a Mapper (m1) is running
>> against a file (f2) which is far from the local machine(m1), will the
>> overhead of copying f2 over to m1 be worth it?.
>> 
>> That is  - is the amount of resources required to read data off a
>> remote machine (m2)  worth it? Or would it be better if that remote (m2)
>> now simply processed both files (f1, f2) in turn?
>> 
>> Jay Vyas
>> http://jayunit100.blogspot.com
> 
> 
> 
> 
> -- 
> Bertrand Dechoux


Re: strategies to share information between mapreduce tasks

2012-09-26 Thread Jay Vyas
The reason this is so rare is that map/reduce tasks are, by nature,
orthogonal -- i.e. word count, batch image recognition, tera
sort: all the things hadoop is famous for are largely orthogonal tasks.
It's much more rare (I think) to see people using hadoop to do traffic
simulations or solve protein folding problems... because those tasks
require continuous signal integration.

1) First, consider rewriting it so that all communication is replaced
by state variables in a reducer, and choose your keys wisely, so that all
"communication" between machines is obviated by the fact that a single
reducer is receiving all the information relevant for it to do its task
(see the sketch at the end of this message).

2) If a small amount of state needs to be preserved or cached in real time,
to optimize the situation where two machines would otherwise redo the
same task (i.e. invoke a web service to get a piece of data, or some other
task that needs to be rate limited and not duplicated), then you can use a
fast key-value store (like you suggested), such as the ones provided by basho (
http://basho.com/) or amazon (Dynamo).

3) If you really need a lot of message passing, then you might be
better off using an inherently more integrated tool like GridGain... which
allows for sophisticated message passing between asynchronously running
processes, i.e.
http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/.


It seems like there might not be a reliable way to implement a
sophisticated message-passing architecture in hadoop, because the system is
inherently so dynamic, and is built for rapid streaming reads/writes, which
would be stifled by significant communication overhead.
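A minimal sketch of option (1), with hypothetical names, assuming all records that need
to "talk" to each other can be routed to the same key and fit in one reduce call:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// All "shared" state lives as local variables inside one reduce() call; no cross-task messaging.
public class StatefulReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
      throws IOException, InterruptedException {
    long total = 0;   // the "state variable" that every record for this key gets to influence
    long count = 0;
    for (LongWritable v : values) {
      total += v.get();
      count++;
    }
    // emit one aggregated result per key
    ctx.write(key, new LongWritable(count == 0 ? 0 : total / count));
  }
}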


Re: Python + hdfs written thrift sequence files: lots of moving parts!

2012-09-25 Thread Jay Vyas
Thanks Harsh. In any case, I'm really curious about how sequence
file headers are formatted, as the documentation in the SequenceFile
javadocs seems to be very generic.

To make my questions more concrete:

1) I notice that the FileSplit class has a getStart() function.  It is
documented as returning the place to start "processing".  Does that imply
that a FileSplit does, or does not, include a header?

http://hadoop.apache.org/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html#getStart%28%29

2) Also, it's not clear to me how compression and serialization are
related.  These are two intricately coupled aspects of HDFS file writing,
and I'm not sure what the idiom is for coordinating the compression of records
with their deserialization.


Python + hdfs written thrift sequence files: lots of moving parts!

2012-09-25 Thread Jay Vyas
Hi guys!

I'm trying to read some hadoop-outputted thrift files in plain old java
(without using SequenceFile.Reader).  The reason for this is that I

(1) want to understand the sequence file format better, and
(2) would like to be able to port this code to a language which doesn't have
robust hadoop sequence file i/o / thrift support (python); my code is linked
further down.

So, before reading forward, if anyone has:

1) Some general hints on how to create a sequence file with thrift-encoded
key/values in python, that would be very useful.
2) Some tips on the generic approach for reading a sequencefile (the
comments seem to be a bit underspecified in the SequenceFile header).

I'd appreciate it!

Now, here is my adventure into thrift/hdfs sequence file i/o:

I've written a simple stub which, I think, should be the start of a
sequence file reader (it just tries to skip the header and get straight to the
data).

But it doesn't handle compression.

http://pastebin.com/vyfgjML9

So, this code ^^ appears to fail with a cryptic error: "don't know what
type: 15".

This error comes from a case statement, which attempts to determine what
type of thrift record is being read in:
"fail 127 don't know what type: 15"

  private byte getTType(byte type) throws TProtocolException {
    switch ((byte) (type & 0x0f)) {
      case TType.STOP:
        return TType.STOP;
      case Types.BOOLEAN_FALSE:
      case Types.BOOLEAN_TRUE:
        return TType.BOOL;
      // ... other cases elided ...
      case Types.STRUCT:
        return TType.STRUCT;
      default:
        throw new TProtocolException("don't know what type: " +
            (byte) (type & 0x0f));
    }
  }

Upon further investigation, I have found that the Configuration
object is (of course) heavily utilized by the SequenceFile reader, in
particular to determine the codec.  That corroborates my hypothesis that the
data needs to be decompressed or decoded before it can be deserialized by
thrift, I believe.

So... I guess what I'm assuming is missing here is that I don't know how to
manually reproduce the codec/gzip, etc. logic inside of
SequenceFile.Reader in plain old java (i.e. without cheating and using the
SequenceFile.Reader class that is configured in our mapreduce source
code).

With my end goal being to read the file in python, I think it would be nice
to be able to read the sequencefile in java and use that as a template
(since I know that my thrift objects and serialization are working
correctly in my current java source codebase when read in through the
SequenceFile.Reader api).

Any suggestions on how I can distill the logic of the SequenceFile.Reader
class into a simplified version which is specific to my data, so that I can
start porting it into a python script capable of scanning a few real
sequencefiles off of HDFS, would be much appreciated!

In general... what are the core steps for doing i/o with sequence files
that are compressed and/or serialized in different formats?  Do we
decompress first and then deserialize?  Or do both at the same time?
Thanks!
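For reference, a hedged sketch of the known-good java path (the thing to mimic later in
python), using SequenceFile.Reader as the template. The path argument and the generated
thrift class name are hypothetical, and it assumes the values were written as BytesWritable
holding TBinaryProtocol-serialized thrift; the key point is that the reader handles the
codec/decompression, so thrift deserialization happens afterwards on the raw value bytes:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.protocol.TBinaryProtocol;

public class DumpThriftSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);   // e.g. one part-* file on HDFS
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    System.out.println("codec = " + reader.getCompressionCodec());  // decompression is handled here

    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    BytesWritable value = new BytesWritable();
    TDeserializer deser = new TDeserializer(new TBinaryProtocol.Factory());
    while (reader.next(key, value)) {
      MyThriftRecord record = new MyThriftRecord();   // hypothetical thrift-generated class
      deser.deserialize(record, Arrays.copyOf(value.getBytes(), value.getLength()));
      System.out.println(key + " -> " + record);
    }
    reader.close();
  }
}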

PS I've added an issue to github here
https://github.com/matteobertozzi/Hadoop/issues/5, for a python
SequenceFile reader.  If I get some helpful hints on this thread maybe I
can directly implement an example on matteobertozzi's python hadoop trunk.

-- 
Jay Vyas
MMSB/UCHC


Re: resetting conf/ parameters in a live cluster.

2012-08-18 Thread Jay Vyas
Hmmm... I wonder if there is a way to push conf/*.xml parameters out to all
the slaves, maybe at runtime?

On Sat, Aug 18, 2012 at 4:06 PM, Harsh J  wrote:

> Jay,
>
> Oddly, the counters limit changes (increases, anyway) needs to be
> applied at the JT, TT and *also* at the client - to take real effect.
>
> On Sat, Aug 18, 2012 at 8:31 PM, Jay Vyas  wrote:
> > Hi guys:
> >
> > I've reset my max counters as follows :
> >
> > ./hadoop-site.xml:
> >
> > <property>
> >   <name>mapreduce.job.counters.limit</name>
> >   <value>15000</value>
> > </property>
> >
> > However, a job is failing (after reducers get to 100%!) at the very end,
> > due to exceeded counter limit.  I've confirmed in my
> > code that indeed the correct counter parameter is being set.
> >
> > My hypothesis: Somehow, the name node counters parameter is effectively
> > being transferred to slaves... BUT the name node *itself* hasn't updated
> its
> > maximum counter allowance, so it throws an exception at the end of the
> job,
> > that is, they dying message from hadoop is
> >
> > " max counter limit 120 exceeded "
> >
> > I've confirmed in my job that the counter parameter is correct, when the
> > job starts... However... somehow the "120 limit exceeded" exception is
> > still thrown.
> >
> > This is in elastic map reduce, hadoop .20.205
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
MMSB/UCHC


Hadoop idioms for reporting cluster and counter stats.

2012-08-16 Thread Jay Vyas
Hi guys: I want to start automating the output of counter stats, cluster
size, etc. at the end of the main map reduce jobs which we run.  Is there
a simple way to do this?

Here is my current thought:

1) Run all jobs from a driver class (we already do this).

2) At the end of each job, intercept the global counters and write them out
to a text file.  This would presumably be on the local fs (see the sketch
below).

3) Export the local filesystem.

4) Maybe the NameNode also has access to such data, maybe via an API
(clearly, the hadoop web ui gets this data from somewhere, e.g. in the
"cluster summary" header).
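A minimal sketch of step (2), assuming the driver uses the new mapreduce API and "job"
is the org.apache.hadoop.mapreduce.Job that just finished:

import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Job;

public class CounterDumper {
  // Dump every counter group/name/value from the finished job to a local text file.
  public static void dump(Job job, String localPath) throws Exception {
    PrintWriter out = new PrintWriter(new FileWriter(localPath));
    try {
      for (CounterGroup group : job.getCounters()) {
        for (Counter counter : group) {
          out.println(group.getDisplayName() + "\t"
              + counter.getDisplayName() + "\t" + counter.getValue());
        }
      }
    } finally {
      out.close();
    }
  }
}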


-- 
Jay Vyas
MMSB/UCHC


Re: Mechanism of hadoop -jar

2012-08-11 Thread Jay Vyas
Sorry for the confusion... To be clear: it is TOTALLY okay to jar up a text
file and access it in hadoop via the Class.getResource(...) api!

1) Hadoop doesn't do anything funny with the class loader; it just uses the
plain sun JVM class loader.

2) My problem was simply that I wasn't jarring up my text file properly.
This was (obviously) causing all my mappers/reducers to not see the file.
A sketch of the working pattern is below.
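A minimal sketch of that pattern, assuming a file named myfile.txt was added at the
root of the job jar (the file name and mapper class are hypothetical):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads a text file bundled at the root of the job jar once, in setup().
public class ResourceMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private final List<String> lines = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException {
    InputStream in = ResourceMapper.class.getResourceAsStream("/myfile.txt");
    if (in == null) {
      throw new IOException("myfile.txt was not packaged into the job jar");
    }
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      lines.add(line);
    }
    reader.close();
  }
}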

Thanks for all the responses - they were helpful!


Mechanism of hadoop -jar

2012-08-11 Thread Jay Vyas
Hi guys: I'm trying to find documentation on how "hadoop jar" actually
works, i.e. how it copies/runs the jar file across the cluster, in order to
debug a jar issue.

1) Where can I get a good explanation of how the hadoop commands (i.e.
-jar) are implemented?

2) Specifically, I'm trying to access a bundled text file from a jar:

class.getResource("myfile.txt")

from inside a mapreduce job. Is it okay to do this?  Or does a class's
ability to acquire local resources change in the mapper/reducer JVMs?



-- 
Jay Vyas
MMSB/UCHC


Re: Merge Reducers Output

2012-07-30 Thread Jay Vyas
It's not clear to me that you need custom input formats.

1) getmerge might work, or

2) simply run a SINGLE-reducer job (have the mappers output a static final int
key=1, or specify numReducers=1).

In this case, only one reducer will be called, and it will read through all
the values.
On Tue, Jul 31, 2012 at 12:30 AM, Bejoy KS  wrote:

> Hi
>
> Why not use 'hadoop fs -getMerge 
> ' while copying files out of hdfs for the end users to
> consume. This will merge all the files in 'outputFolderInHdfs'  into one
> file and put it in lfs.
>
> Regards
> Bejoy KS
>
> Sent from handheld, please excuse typos.
>
> -Original Message-
> From: Michael Segel 
> Date: Mon, 30 Jul 2012 21:08:22
> To: 
> Reply-To: common-user@hadoop.apache.org
> Subject: Re: Merge Reducers Output
>
> Why not use a combiner?
>
> On Jul 30, 2012, at 7:59 PM, Mike S wrote:
>
> > Liked asked several times, I need to merge my reducers output files.
> > Imagine I have many reducers which will generate 200 files. Now to
> > merge them together, I have written another map reduce job where each
> > mapper read a complete file in full in memory, and output that and
> > then only one reducer has to merge them together. To do so, I had to
> > write a custom fileinputreader that reads the complete file into
> > memory and then another custom fileoutputfileformat to append the each
> > reducer item bytes together. this how my mapper and reducers looks
> > like
> >
> > public static class MapClass
> >     extends Mapper<NullWritable, BytesWritable, IntWritable, BytesWritable> {
> >
> >     @Override
> >     public void map(NullWritable key, BytesWritable value, Context context)
> >             throws IOException, InterruptedException {
> >         context.write(key, value);
> >     }
> > }
> >
> > public static class Reduce
> >     extends Reducer<NullWritable, BytesWritable, NullWritable, BytesWritable> {
> >
> >     @Override
> >     public void reduce(NullWritable key, Iterable<BytesWritable> values, Context context)
> >             throws IOException, InterruptedException {
> >         for (BytesWritable value : values) {
> >             context.write(NullWritable.get(), value);
> >         }
> >     }
> > }
> >
> > I still have to have one reducers and that is a bottle neck. Please
> > note that I must do this merging as the users of my MR job are outside
> > my hadoop environment and the result as one file.
> >
> > Is there better way to merge reducers output files?
> >
>
>


-- 
Jay Vyas
MMSB/UCHC


EMR classpath overriding across all mappers/reducers

2012-07-26 Thread Jay Vyas
Hi guys:

I have an EMR job which seems to be loading "old" versions of an
aws-sdk-java jar.  I looked closer and found that
the hadoop nodes I'm using in fact have an old version of the jar in $HOME/lib/,
which is causing the problem.

This is most commonly seen, for example, with jackson json jars.

What is the simplest way to specify that the correct version of the jar should
take precedence?

Initially I have tried both

1) setting "mapreduce.job.user.classpath.first" in my job, but that
seems to have no effect, and
2) exporting the "HADOOP_USER_CLASSPATH_FIRST" environment variable at
the command line before launching my jobs.

Neither seems to work properly (caveat: admittedly these are just
initial attempts... maybe I've done something minor incorrectly).

But... before I bang my head against the shell scripts - can somebody
suggest an ideal way to force a jar to be the "priority" loaded jar across
all mappers and reducers, i.e. overriding the hadoop classpath?

-- 
Jay Vyas
MMSB/UCHC


fail and kill all tasks without killing job.

2012-07-20 Thread jay vyas
Hi guys: I want my tasks to end/fail, but I don't want to kill my
entire hadoop job.

I have a hadoop job that runs 5 hadoop jobs in a row.
I'm on the last of those sub-jobs, and want to fail all of its tasks so that the
task tracker stops delegating them, and the main hadoop job can naturally come
to a close.

However, when I run "hadoop job kill-attempt / fail-attempt ", the
jobtracker seems to simply relaunch the same tasks with new ids.

How can I tell the jobtracker to give up on re-delegating them?


Simply reading a small hadoop text file.

2012-07-13 Thread Jay Vyas
Hi guys: What's the idiomatic way to iterate through the k/v pairs in a
text file?  I've been playing with SequenceFiles so much that I almost
forgot :)

My text output actually has tabs in it... so I'm not sure what the default
separator is, and whether or not there is a smart way to find the value.
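For what it's worth, TextOutputFormat's default key/value separator is a tab, so a
minimal sketch (hypothetical output path; only the first tab is treated as the separator):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DumpTextOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/output/part-r-00000"))));
    String line;
    while ((line = in.readLine()) != null) {
      int tab = line.indexOf('\t');                       // default TextOutputFormat separator
      String key = (tab >= 0) ? line.substring(0, tab) : line;
      String value = (tab >= 0) ? line.substring(tab + 1) : "";
      System.out.println(key + " -> " + value);
    }
    in.close();
  }
}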

-- 
Jay Vyas
MMSB/UCHC


fixing the java / unixPrincipal hadoop error... Ubuntu.

2012-07-08 Thread Jay Vyas
Hi guys: I ran into the following roadblock in my VM, and I'm not sure
what the right way to install sun java is.  Any suggestions?
In particular, the question is best described here:

http://stackoverflow.com/questions/11288964/sun-java-not-loading-unixprincipal-ubuntu-12#comment14859324_11288964

PS I posted this here mainly because this is a hadoop issue more than a
pure java one, since the missing class "UnixPrincipal" exception (i.e. if you
google for it) is mostly exclusive to the hadoop community.

-- 
Jay Vyas
MMSB/UCHC


Sun JDK 1.6.033: java.lang.ClassNotFoundException: com.sun.security.auth.UnixPrincipal

2012-07-01 Thread Jay Vyas
Hi guys: I'm getting this very odd error in my sun / ubuntu / hadoop run.

- I'm not running a hadoop cluster here, just some local-FS java hadoop
map/r jobs.
- The exception I'm getting on FileSystem.get(conf) is
"java.lang.ClassNotFoundException: com.sun.security.auth.UnixPrincipal"

Here are my specs :

vagrant@precise64:~/Development/workspace/pisae$ grep 'hadoop' ./ivy.xml
*
*
vagrant@precise64:~/Development/workspace/pisae$ java -version
java version "1.6.0_33"
Java(TM) SE Runtime Environment (build 1.6.0_33-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03, mixed mode)

vagrant@precise64:~/Development/workspace/pisae$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 12.04 LTS
Release:        12.04
Codename:       precise

The error :

/home/vagrant/Development/workspace/pisae/build.xml:222:
java.lang.NoClassDefFoundError: com/sun/security/auth/UnixPrincipal
at
org.apache.hadoop.security.UserGroupInformation.(UserGroupInformation.java:246)
at
org.apache.hadoop.fs.FileSystem$Cache$Key.(FileSystem.java:1436)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1337)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:244)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:122)
...
Caused by: java.lang.ClassNotFoundException:
com.sun.security.auth.UnixPrincipal
at
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1361)
at
org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1311)
at
org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1070)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 38 more

Any thoughts?  Thanks!


Re: counters docs

2012-05-31 Thread Jay Vyas
Sure I will get started on this.  Thanks for the feedback.

On Thu, May 31, 2012 at 1:29 PM, Arun C Murthy  wrote:

> You got me thinking, there probably isn't one at al.
>
> Mind opening a jira? (Better yet, file a patch, thanks!)
>
> Arun
>
> On May 30, 2012, at 5:25 PM, Jay Vyas wrote:
>
> > Hi guys : Where is the best documentation on the default hadoop counters,
> > and how to use/interpret them ?  I always seem to forget which ones are
> > important / useful when debugging   Id like to compose a quick
> > reference cheat sheet for reference when debugging large groups of
> counters
> > in chained jobs.
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
>
>


-- 
Jay Vyas
MMSB/UCHC


Job constructor deprecation

2012-05-22 Thread Jay Vyas
Hi guys: I have noticed that the javadocs for this class encourage us to
use the Job constructors, and yet they are deprecated:

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html

What is the idiomatic way to create a Job in hadoop?  And why have the Job
constructors been deprecated?
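For reference, a hedged sketch of the factory-style alternative that newer javadocs point
at (assuming a release that ships Job.getInstance; the driver class name is hypothetical):

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "my-job");   // replaces the deprecated new Job(conf, "my-job")
job.setJarByClass(MyDriver.class);           // MyDriver is a hypothetical driver class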

-- 
Jay Vyas
MMSB/UCHC


Re: Simulating cluster config parameters (in local mode)

2012-05-09 Thread Jay Vyas
Ahhh, I now know the answer.  My solution:

1) Take a simple mapred config file and remove all parameters except the one I
need to set for my local mode.
2) Put that mapred-site .xml config on my classpath.
3) Run my application.

Setting it on the JobConf, I assume, SHOULD NOT work, because this parameter is
specifically meant to defy any job-specific configuration.

On Wed, May 9, 2012 at 5:03 PM, Serge Blazhiyevskyy <
serge.blazhiyevs...@nice.com> wrote:

> You should be able to set that param on JobConf object
>
> Regards,
> Serge
>
> On 5/9/12 1:09 PM, "Jay Vyas"  wrote:
>
> >Hi guys : I need to set a cluster configuration parameter (specifically,
> >the "mapreduce.job.counters.limit") 
> > Easy ... right ?
> >
> >Well one problem : I'm running hadoop in local-mode !
> >
> >So How can I simulate this parameter so that my local mode allows me to
> >use
> >non-default cluster configruation parameters ?
> >
> >
> >--
> >Jay Vyas
> >MMSB/UCHC
>
>


-- 
Jay Vyas
MMSB/UCHC


Simulating cluster config parameters (in local mode)

2012-05-09 Thread Jay Vyas
Hi guys: I need to set a cluster configuration parameter (specifically,
"mapreduce.job.counters.limit").
Easy... right?

Well, one problem: I'm running hadoop in local mode!

So how can I simulate this parameter so that my local mode allows me to use
non-default cluster configuration parameters?


-- 
Jay Vyas
MMSB/UCHC


heap space error, low memory job, unit test

2012-05-01 Thread Jay Vyas
Hi guys:

I have a map/r job that has always worked fine, but which fails due to a
heap space error on my local machine during unit tests.

It runs in hadoop's default (local) mode, and fails during the constructor of
the MapOutputBuffer.  Any thoughts on why?

I don't do any custom memory settings in my unit tests, because they aren't
really needed - so I assume this is related to /tmp files
or something... but I can't track down the issue.

Any thoughts would be very much appreciated.

12/05/01 19:15:53 WARN mapred.LocalJobRunner: job_local_0002
java.lang.OutOfMemoryError: Java heap space
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.(MapTask.java:807)
at
org.apache.hadoop.mapred.MapTask$NewOutputCollector.(MapTask.java:557)
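One hedged workaround to try, on the assumption that the OOME comes from MapOutputBuffer
allocating its io.sort.mb-sized buffer, which can exceed a small unit-test heap (tasks run
in-process under the LocalJobRunner, so mapred.child.java.opts would not help here):

Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 1);   // shrink the map-side sort buffer for local unit tests
// (alternatively, raise the test JVM's own -Xmx, since local-mode tasks share that heap)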




-- 
Jay Vyas
MMSB/UCHC


EMR Hadoop

2012-04-29 Thread Jay Vyas
Hi guys:

1) Does anybody know if there is a VM out there which runs EMR hadoop?  I
would like to have a local VM for dev purposes that mirrors the EMR hadoop
instances.

2) How does EMR's hadoop differ from apache hadoop and Cloudera's hadoop?

-- 
Jay Vyas
MMSB/UCHC


The meaning of FileSystem in context of OutputFormat storage

2012-04-25 Thread Jay Vyas
I just saw this line in the javadocs for OutputFormat:

"Output files are stored in a
FileSystem<http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html>.
"

It seems like an odd sentence.  What is the implication here -- is it
implying anything other than the obvious?

-- 
Jay Vyas
MMSB/UCHC


Re: understanding hadoop job submission

2012-04-25 Thread Jay Vyas
Yes - the job is submitted by the API calls in your MapReduce driver code, not by
the "hadoop jar" command itself.
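A minimal driver sketch (mapper/reducer class names are hypothetical, in the style of the
WordCount example): "hadoop jar" only runs this main(); the actual submission to the
JobTracker happens inside waitForCompletion() (or Job.submit()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);      // hypothetical mapper
    job.setReducerClass(IntSumReducer.class);       // hypothetical reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);   // <-- submission happens here
  }
}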

On Wed, Apr 25, 2012 at 3:56 AM, Devaraj k  wrote:

> Hi Arindam,
>
>hadoop jar jarFileName MainClassName
>
> The above command will not submit the job. This command only executes the
> jar file using the Main Class(Main-class present in manifest info if
> available otherwise class name(i.e MainClassName in the above command)
> passed as an argument. If we give any additional arguments in the command,
> those will be passed to the Main class args.
>
>   We can have a job submission code in the Main Class or any of the
> classes in the jar file. You can take a look into WordCount example for job
> submission info.
>
>
> Thanks
> Devaraj
>
> 
> From: Arindam Choudhury [arindamchoudhu...@gmail.com]
> Sent: Wednesday, April 25, 2012 2:14 PM
> To: common-user
> Subject: understanding hadoop job submission
>
> Hi,
>
> I am new to hadoop and I am trying to understand hadoop job submission.
>
> We submit the job using:
>
> hadoop jar some.jar name input output
>
> this in turn invoke the RunJar . But in RunJar I can not find any
> JobSubmit() or any call to JobClient.
>
> Then, how the job gets submitted to the JobTracker?
>
> -Arindam
>



-- 
Jay Vyas
MMSB/UCHC


Re: Determine the key of Map function

2012-04-23 Thread Jay Vyas
Ahh... Well, then the key will be the teacher, and the value will simply be

<-1 * # students, class_id>.

Then you will see, in the reducer, that the first 3 entries will always be
the ones you wanted.
On Mon, Apr 23, 2012 at 10:17 PM, Lac Trung  wrote:

> Hi Jay !
> I think it's a bit difference here. I want to get 30 classId for each
> teacherId that have most students.
> For example : get 3 classId.
> (File1)
> 1) Teacher1, Class11, 30
> 2) Teacher1, Class12, 29
> 3) Teacher1, Class13, 28
> 4) Teacher1, Class14, 27
> ... n ...
>
> n+1) Teacher2, Class21, 45
> n+2) Teacher2, Class22, 44
> n+3) Teacher2, Class23, 43
> n+4) Teacher2, Class24, 42
> ... n+m ...
>
> => return 3 line 1, 2, 3 for Teacher1 and line n+1, n+2, n+3 for Teacher2
>
>
> Vào 09:52 Ngày 24 tháng 4 năm 2012, Jay Vyas  đã
> viết:
>
> > Its somewhat tricky to understand exactly what you need from your
> > explanation, but I believe you want teachers who have the most students
> in
> > a given class.  So for English, i have 10 teachers teaching the class -
> and
> > i want the ones with the highes # of students.
> >
> > You can output key= , value=<-1*#ofstudent,teacherid> as the
> > values.
> >
> > The values will then be sorted, by # of students.  You can thus pick
> > teacher in the the first value of your reducer, and that will be the
> > teacher for class id = xyz , with the highes number of students.
> >
> > You can also be smart in your mapper by running a combiner to remove the
> > teacherids who are clearly not maximal.
> >
> > On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung 
> wrote:
> >
> > > Hello everyone !
> > >
> > > I have a problem with MapReduce [:(] like that :
> > > I have 4 file input with 3 fields : teacherId, classId, numberOfStudent
> > > (numberOfStudent is ordered by desc for each teach)
> > > Output is top 30 classId that numberOfStudent is max for each teacher.
> > > My approach is MapReduce like Wordcount example. But I don't know how
> to
> > > determine key for map function.
> > > I run Wordcount example, understand its code but I have no experience
> at
> > > programming MapReduce.
> > >
> > > Can anyone help me to resolve this problem ?
> > > Thanks so much !
> > >
> > >
> > > --
> > > Lạc Trung
> > > 20083535
> > >
> >
> >
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
> >
>
>
>
> --
> Lạc Trung
> 20083535
>



-- 
Jay Vyas
MMSB/UCHC


Re: Determine the key of Map function

2012-04-23 Thread Jay Vyas
It's somewhat tricky to understand exactly what you need from your
explanation, but I believe you want the teachers who have the most students in
a given class.  So for English, I have 10 teachers teaching the class, and
I want the ones with the highest # of students.

You can output key= , value=<-1*#ofstudent,teacherid> as the
values.

The values will then be sorted by # of students.  You can thus pick the
teacher in the first value of your reducer, and that will be the
teacher for class id = xyz with the highest number of students.

You can also be smart in your mapper by running a combiner to remove the
teacherids which are clearly not maximal.
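A hedged sketch of the reducer side, with hypothetical names. One caveat to the ordering
claim above: plain MapReduce does not sort reduce *values* unless a secondary sort is
configured, so this sketch instead just picks the top N inside the reducer with a small
bounded heap:

import java.io.IOException;
import java.util.Comparator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Values are assumed to be "classId,numberOfStudents" strings emitted by the mapper.
public class TopClassesReducer extends Reducer<Text, Text, Text, Text> {
  private static final int N = 30;

  @Override
  protected void reduce(Text teacherId, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    // min-heap on student count: once it holds more than N entries, evict the smallest
    PriorityQueue<String[]> top = new PriorityQueue<String[]>(N + 1, new Comparator<String[]>() {
      public int compare(String[] a, String[] b) {
        return Integer.valueOf(a[1]).compareTo(Integer.valueOf(b[1]));
      }
    });
    for (Text v : values) {
      top.add(v.toString().split(","));
      if (top.size() > N) {
        top.poll();
      }
    }
    for (String[] classAndCount : top) {
      ctx.write(teacherId, new Text(classAndCount[0] + "\t" + classAndCount[1]));
    }
  }
}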

On Mon, Apr 23, 2012 at 9:38 PM, Lac Trung  wrote:

> Hello everyone !
>
> I have a problem with MapReduce [:(] like that :
> I have 4 file input with 3 fields : teacherId, classId, numberOfStudent
> (numberOfStudent is ordered by desc for each teach)
> Output is top 30 classId that numberOfStudent is max for each teacher.
> My approach is MapReduce like Wordcount example. But I don't know how to
> determine key for map function.
> I run Wordcount example, understand its code but I have no experience at
> programming MapReduce.
>
> Can anyone help me to resolve this problem ?
> Thanks so much !
>
>
> --
> Lạc Trung
> 20083535
>



-- 
Jay Vyas
MMSB/UCHC


Re: hadoop.tmp.dir with multiple disks

2012-04-22 Thread Jay Vyas
I don't understand why multiple disks would be particularly beneficial for
a Map/Reduce job.  Wouldn't a map/reduce job be i/o *as well as CPU
bound*?  I would think that simply reading and parsing large files would
still require dedicated CPU blocks.

On Sun, Apr 22, 2012 at 3:14 AM, Harsh J  wrote:

> You can use mapred.local.dir for this purpose. It accepts a list of
> directories tasks may use, just like dfs.data.dir uses multiple disks
> for block writes/reads.
>
> On Sun, Apr 22, 2012 at 12:50 PM, mete  wrote:
> > Hello folks,
> >
> > I have a job that processes text files from hdfs on local fs (temp
> > directory) and then copies those back to hdfs.
> > I added another drive to each server to have better io performance, but
> as
> > far as i could see hadoop.tmp.dir will not benefit from multiple
> disks,even
> > if i setup two different folders on different disks. (dfs.data.dir works
> > fine). As a result the disk with temp folder set is highy utilized, where
> > the other one is a little bit idle.
> > Does anyone have an idea on what to do? (i am using cdh3u3)
> >
> > Thanks in advance
> > Mete
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
MMSB/UCHC


Re: FPGrowth job got stuck when running fpGrowth.generateTopKFrequentPatterns

2012-04-10 Thread Jay Vyas
Do you have counters for the reducers , and if so , are they continuing to
update ?

On Tue, Apr 10, 2012 at 8:45 AM, kayhan  wrote:

> Trying to mine frequent patterns of a dataset with 8000 transactions and
> 193
> attributes using mahout's parallel-frequent-pattern-mining algorithm. I run
> it in map-reduce mode on a cluster of 10 machines on windows7 and I use
> cygwin.
>
> When I run it, in reduce step, it shows completion of 100%, however it
> stucks there. No error or any other indication, but it waits there for
> hours
> and does not finish working.
> The details of the job shows that, it can not finish the part 'Processing
> FPTree: Bottom Up FP Growth > reduce'. For example the part 'Writing Top K
> patterns for: 338 > reduce' is finished in seconds.
> I post sample image of the gui that shows running tasks;
>
> http://hadoop-common.472056.n3.nabble.com/file/n3899658/Ads%C4%B1z2.png
>
> Also a part of the tasktracker log of the slave that is responsible for the
> job is posted below;
>
> /
> 2012-04-10 16:07:11,051 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> 2012-04-10 16:07:20,084 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> 2012-04-10 16:07:32,127 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> 2012-04-10 16:07:41,160 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> 2012-04-10 16:07:50,192 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> 2012-04-10 16:08:02,235 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> 2012-04-10 16:08:11,268 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201204101532_0003_r_14_1 1.0% Processing FPTree: Bottom Up FP
> Growth > reduce
> /
>
> If have any suggestions would be glad.
>
>
> --
> View this message in context:
> http://hadoop-common.472056.n3.nabble.com/FPGrowth-job-got-stuck-when-running-fpGrowth-generateTopKFrequentPatterns-tp3899658p3899658.html
> Sent from the Users mailing list archive at Nabble.com.
>



-- 
Jay Vyas
MMSB/UCHC


Re: Structuring MapReduce Jobs

2012-04-09 Thread Jay Vyas
Hi: Well-phrased question.  I think you will need to read up on
reducers, and then you will see the light.

1) In your mapper, emit (date, tradeValue) pairs.

2) Then hadoop will send the following to the reducers:

date1, tradeValues[]
date2, tradeValues[]
...

3) Then, in your reducer, you will apply the function to the whole set of
trade values.

4) Note that the mappers split on files - there is no guarantee about which
particular data will be sent to which mapper. If you want any data
to be "grouped", you will need to write a mapper that performs this
grouping on an arbitrarily large data set, and then your group-specific
statistics will have to be done at the reducer stage (see the sketch below).

Think of it this way: the mapper does the grouping of inputs for the reducers,
and the reducers then do the group-specific logic.  For example, in word
count, the mappers emit individual words - the reducers receive a large
group of numbers for each individual word, and sum them to emit a total
count.  In your case, the words are like the raw bank records, and the
function you are applying to records from a certain "date" is like the sum
function in the word count reducer.
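A hedged sketch of that pattern for Tom's trade data, keyed on TradeID (the field whose
whole group feeds the math function in the original question); the input line format and
the recalculation are placeholders:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TradeJob {

  // Mapper: group records by TradeID; assumed input line format "ID,TradeID,Date,Value".
  public static class TradeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      ctx.write(new Text(f[1]), new Text(f[2] + "," + f[3]));  // key = TradeID, value = "Date,Value"
    }
  }

  // Reducer: all (Date, Value) pairs for one TradeID arrive together; apply the function here.
  public static class TradeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text tradeId, Iterable<Text> datesAndValues, Context ctx)
        throws IOException, InterruptedException {
      for (Text dv : datesAndValues) {
        String[] f = dv.toString().split(",");
        double recalculated = Double.parseDouble(f[1]);  // placeholder for the real recalculation
        ctx.write(tradeId, new Text(f[0] + "," + recalculated));
      }
    }
  }
}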







On Mon, Apr 9, 2012 at 11:45 AM, Tom Ferguson  wrote:

> Resending my query below... it didn't seem to post first time.
>
> Thanks,
>
> Tom
> On Apr 8, 2012 11:37 AM, "Tom Ferguson"  wrote:
>
> > Hello,
> >
> > I'm very new to Hadoop and I am trying to carry out of proof of concept
> > for processing some trading data. I am from a .net background, so I am
> > trying to prove whether it can be done primarily using C#, therefore I am
> > looking at the Hadoop Streaming job (from the Hadoop examples) to call in
> > to some C# executables.
> >
> > My problem is, I am not certain of the best way to structure my jobs to
> > process the data in the way I want.
> >
> > I have data stored in an RDBMS in the following format:
> >
> > ID TradeID  Date  Value
> > -
> > 1 1  2012-01-01 12.34
> > 2 1  2012-01-02 12.56
> > 3 1  2012-01-03 13.78
> > 4 2  2012-01-04 18.94
> > 5 2  2012-05-17 19.32
> > 6 2  2012-05-18 19.63
> > 7 3  2012-05-19 17.32
> > What I want to do is take all the Dates & Values for a given TradeID into
> > a mathematical function that will spit out the same set of Dates but will
> > have recalculated all the Values. I hope that makes sense.. e.g.
> >
> > Date Value
> > ---
> > 2012-01-01 12.34
> > 2012-01-02 12.56
> > 2012-01-03 13.78
> > will have the mathematical function applied and spit out
> >
> > Date Value
> > ---
> > 2012-01-01 28.74
> > 2012-01-02 31.29
> > 2012-01-03 29.93
> > I am not exactly sure how to achieve this using Hadoop Streaming, but my
> > thoughts so far are...
> >
> >
> >1. Us Sqoop to take the data out of the RDBMS and in to HDFS and split
> >by TradeID - will this guarantee that all the the data points for a
> given
> >TradeID will be processed by the same Map task??
> >2. Write a Map task as a C# executable that will stream data in in the
> >format (ID, TradeID, Date, Value)
> >3. Gather all the data points for a given TradeID together into an
> >array (or other datastructure)
> >4. Pass the array into the mathematical function
> >5. Get the results back as another array
> >6. Stream the results back out in the format (TradeID, Date,
> >ResultValue)
> >
> > I will have around 500,000 Trade IDs, with up to 3,000 data points each,
> > so I am hoping that the data/processing will be distributed appropriately
> > by Hadoop.
> >
> > Now, this seams a little bit long winded, but is this the best way of
> > doing it, based on the constraints of having to use C# for writing my
> > tasks? In the example above I do not have a Reduce job at all. Is that
> > right in my scenario?
> >
> > Thanks for any help you can give and apologies if I am asking stupid
> > questions here!
> >
> > Kind Regards,
> >
> > Tom
> >
>



-- 
Jay Vyas
MMSB/UCHC


Re: map task execution time

2012-04-05 Thread Jay Vyas
(excuse the typo in the last email : I meant "I've been playing with Cinch"
, not "I've been with Cinch")

On Thu, Apr 5, 2012 at 7:54 AM, Jay Vyas  wrote:

> How can "hadoop job" be used to read m/r statistics ?
>
> On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma wrote:
>
>> Thanks Kai, I will try those.
>>
>> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt  wrote:
>>
>> > Hi,
>> >
>> > Am 05.04.2012 um 00:20 schrieb bikash sharma:
>> >
>> > > Is it possible to get the execution time of the constituent map/reduce
>> > > tasks of a MapReduce job (say sort) at the end of a job run?
>> > > Preferably, can we obtain this programatically?
>> >
>> >
>> > you can access the JobTracker's web UI and see the start and stop
>> > timestamps for every individual task.
>> >
>> > Since the JobTracker Java API is exposed, you can write your own
>> > application to fetch that data through your own code.
>> >
>> > Also, "hadoop job" on the command line can be used to read job
>> statistics.
>> >
>> > Kai
>> >
>> >
>> > --
>> > Kai Voigt
>> > k...@123.org
>> >
>> >
>> >
>> >
>> >
>>
>
>
>
> --
> Jay Vyas
> MMSB/UCHC
>



-- 
Jay Vyas
MMSB/UCHC


Re: map task execution time

2012-04-05 Thread Jay Vyas
How can "hadoop job" be used to read m/r statistics ?

On Thu, Apr 5, 2012 at 7:30 AM, bikash sharma wrote:

> Thanks Kai, I will try those.
>
> On Thu, Apr 5, 2012 at 3:15 AM, Kai Voigt  wrote:
>
> > Hi,
> >
> > Am 05.04.2012 um 00:20 schrieb bikash sharma:
> >
> > > Is it possible to get the execution time of the constituent map/reduce
> > > tasks of a MapReduce job (say sort) at the end of a job run?
> > > Preferably, can we obtain this programatically?
> >
> >
> > you can access the JobTracker's web UI and see the start and stop
> > timestamps for every individual task.
> >
> > Since the JobTracker Java API is exposed, you can write your own
> > application to fetch that data through your own code.
> >
> > Also, "hadoop job" on the command line can be used to read job
> statistics.
> >
> > Kai
> >
> >
> > --
> > Kai Voigt
> > k...@123.org
> >
> >
> >
> >
> >
>



-- 
Jay Vyas
MMSB/UCHC


Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'

2012-04-02 Thread Jay Vyas
Thanks J: just curious how you came to hypothesize (1) (i.e.
that the API components I'm using aren't thread-safe in my hadoop version).

I think that's a really good guess, and I would like to be able to make
those sorts of intelligent hypotheses myself.  Any reading you can point me
to for further enlightenment?

On Mon, Apr 2, 2012 at 3:16 PM, Harsh J  wrote:

> Jay,
>
> Without seeing the whole stack trace all I can say as cause for that
> exception from a job is:
>
> 1. You're using threads and the API components you are using isn't
> thread safe in your version of Hadoop.
> 2. Files are being written out to HDFS directories without following
> the OC rules. (This is negated, per your response).
>
> On Mon, Apr 2, 2012 at 7:35 PM, Jay Vyas  wrote:
> > No, my job does not write files directly to disk. It simply goes to some
> > web pages , reads data (in the reducer phase), and parses jsons into
> thrift
> > objects which are emitted via the standard MultipleOutputs API to hdfs
> > files.
> >
> > Any idea why hadoop would throw the "AlreadyBeingCreatedException" ?
> >
> > On Mon, Apr 2, 2012 at 2:52 PM, Harsh J  wrote:
> >
> >> Jay,
> >>
> >> What does your job do? Create files directly on HDFS? If so, do you
> >> follow this method?:
> >>
> >>
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
> >>
> >> A local filesystem may not complain if you re-create an existent file.
> >> HDFS' behavior here is different. This simple Python test is what I
> >> mean:
> >> >>> a = open('a', 'w')
> >> >>> a.write('f')
> >> >>> b = open('a', 'w')
> >> >>> b.write('s')
> >> >>> a.close(), b.close()
> >> >>> open('a').read()
> >> 's'
> >>
> >> Hence it is best to use the FileOutputCommitter framework as detailed
> >> in the mentioned link.
> >>
> >> On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas  wrote:
> >> > Hi guys:
> >> >
> >> > I have a map reduce job that runs normally on local file system from
> >> > eclipse, *but* it fails on HDFS running in psuedo distributed mode.
> >> >
> >> > The exception I see is
> >> >
> >> > *org.apache.hadoop.ipc.RemoteException:
> >> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:*
> >> >
> >> >
> >> > Any thoughts on why this might occur in psuedo distributed mode, but
> not
> >> in
> >> > regular file system ?
> >>
> >>
> >>
> >> --
> >> Harsh J
> >>
> >
> >
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
MMSB/UCHC


Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'

2012-04-02 Thread Jay Vyas
No, my job does not write files directly to disk. It simply goes to some
web pages, reads data (in the reducer phase), and parses JSON into thrift
objects, which are emitted via the standard MultipleOutputs API to hdfs
files.

Any idea why hadoop would throw the "AlreadyBeingCreatedException"?

On Mon, Apr 2, 2012 at 2:52 PM, Harsh J  wrote:

> Jay,
>
> What does your job do? Create files directly on HDFS? If so, do you
> follow this method?:
>
> http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
>
> A local filesystem may not complain if you re-create an existent file.
> HDFS' behavior here is different. This simple Python test is what I
> mean:
> >>> a = open('a', 'w')
> >>> a.write('f')
> >>> b = open('a', 'w')
> >>> b.write('s')
> >>> a.close(), b.close()
> >>> open('a').read()
> 's'
>
> Hence it is best to use the FileOutputCommitter framework as detailed
> in the mentioned link.
>
> On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas  wrote:
> > Hi guys:
> >
> > I have a map reduce job that runs normally on local file system from
> > eclipse, *but* it fails on HDFS running in psuedo distributed mode.
> >
> > The exception I see is
> >
> > *org.apache.hadoop.ipc.RemoteException:
> > org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:*
> >
> >
> > Any thoughts on why this might occur in psuedo distributed mode, but not
> in
> > regular file system ?
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
MMSB/UCHC


Re: namespace error after formatting namenode (pseudo distr mode).

2012-03-31 Thread Jay Vyas
Hi guys: I'm trying to reset my hadoop pseudo-distributed setup on my
local machine. I have formatted the namenode.

There are two ways to do this:

Option 1) Synchronize the datanodes so that the namespace ids are correct
(Arpit has hinted that this is the solution).

Since the datanodes are bad (i.e. the namespace ids are out of sync),
maybe I can "format" my datanodes?  Or is there some other operation
I can run which would synchronize these namespaces?  I'm not sure what
files to delete.  I tried deleting the data in dfs, but I think this might
have broken some other things in my setup.

Option 2)

Since I have done other things to corrupt my datanodes (i.e. rm -rf on the
dfs), I would, ideally, like to start my whole hadoop setup over from
scratch, but I'm not sure how to do that.  So any feedback on how to
"reinstall" hadoop would also probably solve my problem.


On Fri, Mar 30, 2012 at 11:28 PM, JAX  wrote:

> Thanks alot arpit : I will try this first thing in the morning.
>
> For now --- I need a glass of wine.
>
> Jay Vyas
> MMSB
> UCHC
>
> On Mar 30, 2012, at 10:38 PM, Arpit Gupta  wrote:
>
> > the namespace id is persisted on the datanode data directories. As you
> formatted the namenode these id's no longer match.
> >
> > So stop the datanode clean up your dfs.data.dir on your system which
> from the logs seems to be "/private/tmp/hadoop-Jpeerindex/dfs/data" and
> then start the datanode.
> >
> > --
> > Arpit Gupta
> > Hortonworks Inc.
> > http://hortonworks.com/
> >
> > On Mar 30, 2012, at 2:33 PM, Jay Vyas wrote:
> >
> >> Hi guys !
> >>
> >> This is very strange - I have formatted my namenode (psuedo distributed
> >> mode) and now Im getting some kind of namespace error.
> >>
> >> Without further ado : here is the interesting output of my logs .
> >>
> >>
> >> Last login: Fri Mar 30 19:29:12 on ttys009
> >> doolittle-5:~ Jpeerindex$
> >> doolittle-5:~ Jpeerindex$
> >> doolittle-5:~ Jpeerindex$ cat Development/hadoop-0.20.203.0/logs/*
> >> 2012-03-30 22:28:31,640 INFO
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
> >> /
> >> STARTUP_MSG: Starting DataNode
> >> STARTUP_MSG:   host = doolittle-5.local/192.168.3.78
> >> STARTUP_MSG:   args = []
> >> STARTUP_MSG:   version = 0.20.203.0
> >> STARTUP_MSG:   build =
> >>
> http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203-r
> >> 1099333; compiled by 'oom' on Wed May  4 07:57:50 PDT 2011
> >> /
> >> 2012-03-30 22:28:32,138 INFO
> org.apache.hadoop.metrics2.impl.MetricsConfig:
> >> loaded properties from hadoop-metrics2.properties
> >> 2012-03-30 22:28:32,190 INFO
> >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
> >> MetricsSystem,sub=Stats registered.
> >> 2012-03-30 22:28:32,191 INFO
> >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
> >> period at 10 second(s).
> >> 2012-03-30 22:28:32,191 INFO
> >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
> system
> >> started
> >> 2012-03-30 22:28:32,923 INFO
> >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
> ugi
> >> registered.
> >> 2012-03-30 22:28:32,959 WARN
> >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
> already
> >> exists!
> >> 2012-03-30 22:28:34,478 INFO org.apache.hadoop.ipc.Client: Retrying
> connect
> >> to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
> >> 2012-03-30 22:28:36,317 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
> >> Incompatible namespaceIDs in /private/tmp/hadoop-Jpeerindex/dfs/data:
> >> namenode namespaceID = 1829914379; datanode namespaceID = 1725952472
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:354)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:268)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1480)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1419)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1437)
> >>   at
> >>
> org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1563)
> >>   at
> >> org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1573)
> >
>



-- 
Jay Vyas
MMSB/UCHC


namespace error after formatting namenode (pseudo distr mode).

2012-03-30 Thread Jay Vyas
Hi guys!

This is very strange - I have formatted my namenode (pseudo-distributed
mode) and now I'm getting some kind of namespace error.

Without further ado, here is the interesting output of my logs:


Last login: Fri Mar 30 19:29:12 on ttys009
doolittle-5:~ Jpeerindex$
doolittle-5:~ Jpeerindex$
doolittle-5:~ Jpeerindex$ cat Development/hadoop-0.20.203.0/logs/*
2012-03-30 22:28:31,640 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = doolittle-5.local/192.168.3.78
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.203.0
STARTUP_MSG:   build =
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203-r
1099333; compiled by 'oom' on Wed May  4 07:57:50 PDT 2011
/
2012-03-30 22:28:32,138 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
loaded properties from hadoop-metrics2.properties
2012-03-30 22:28:32,190 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2012-03-30 22:28:32,191 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2012-03-30 22:28:32,191 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system
started
2012-03-30 22:28:32,923 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
registered.
2012-03-30 22:28:32,959 WARN
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already
exists!
2012-03-30 22:28:34,478 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:9000. Already tried 0 time(s).
2012-03-30 22:28:36,317 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
Incompatible namespaceIDs in /private/tmp/hadoop-Jpeerindex/dfs/data:
namenode namespaceID = 1829914379; datanode namespaceID = 1725952472
at
org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:232)
at
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:147)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:354)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:268)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1480)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1419)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1437)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1563)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1573)


Re: simple hadoop pseudo distr. mode instructions

2012-03-22 Thread Jay Vyas
great thanks Jagat !

On Fri, Mar 23, 2012 at 1:42 AM, Jagat  wrote:

> Hi Jay
>
> Just follow this to install
>
> http://jugnu-life.blogspot.in/2012/03/hadoop-installation-tutorial.html
>
> The official tutorial at link below is also useful
>
> http://hadoop.apache.org/common/docs/r1.0.1/single_node_setup.html
>
> Thanks
>
> Jagat
>
> On Fri, Mar 23, 2012 at 12:08 PM, Jay Vyas  wrote:
>
> > Hi guys : What the latest, simplest, best directions to get a tiny,
> > psuedodistributed hadoop setup running on my ubuntu machine ?
> >
> > On Wed, Mar 21, 2012 at 5:14 PM,  wrote:
> >
> > > Owen,
> > >
> > > Is there interest in reverting hadoop-2399 in 0.23.x ?
> > >
> > > - Milind
> > >
> > > ---
> > > Milind Bhandarkar
> > > Greenplum Labs, EMC
> > > (Disclaimer: Opinions expressed in this email are those of the author,
> > and
> > > do not necessarily represent the views of any organization, past or
> > > present, the author might be affiliated with.)
> > >
> > >
> > >
> > > On 3/19/12 11:20 PM, "Owen O'Malley"  wrote:
> > >
> > > >On Mon, Mar 19, 2012 at 11:05 PM, madhu phatak 
> > > >wrote:
> > > >
> > > >> Hi Owen O'Malley,
> > > >>  Thank you for that Instant reply. It's working now. Can you explain
> > me
> > > >> what you mean by "input to reducer is reused" in little detail?
> > > >
> > > >
> > > >Each time the statement "Text value = values.next();" is executed it
> > > >always
> > > >returns the same Text object with the contents of that object changed.
> > > >When
> > > >you add the Text to the list, you are adding a pointer to the same
> Text
> > > >object. At the end you have 6 copies of the same pointer instead of 6
> > > >different Text objects.
> > > >
> > > >The reason that I said it is my fault, is because I added the
> > optimization
> > > >that causes it. If you are interested in Hadoop archeology, it was
> > > >HADOOP-2399 that made the change. I also did HADOOP-3522 to improve
> the
> > > >documentation in the area.
> > > >
> > > >-- Owen
> > >
> > >
> >
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
> >
>



-- 
Jay Vyas
MMSB/UCHC


Re: Very strange Java Collection behavior in Hadoop

2012-03-22 Thread Jay Vyas
Hi guys : What are the latest, simplest, best directions to get a tiny,
pseudo-distributed hadoop setup running on my Ubuntu machine ?

On Wed, Mar 21, 2012 at 5:14 PM,  wrote:

> Owen,
>
> Is there interest in reverting hadoop-2399 in 0.23.x ?
>
> - Milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 3/19/12 11:20 PM, "Owen O'Malley"  wrote:
>
> >On Mon, Mar 19, 2012 at 11:05 PM, madhu phatak 
> >wrote:
> >
> >> Hi Owen O'Malley,
> >>  Thank you for that Instant reply. It's working now. Can you explain me
> >> what you mean by "input to reducer is reused" in little detail?
> >
> >
> >Each time the statement "Text value = values.next();" is executed it
> >always
> >returns the same Text object with the contents of that object changed.
> >When
> >you add the Text to the list, you are adding a pointer to the same Text
> >object. At the end you have 6 copies of the same pointer instead of 6
> >different Text objects.
> >
> >The reason that I said it is my fault, is because I added the optimization
> >that causes it. If you are interested in Hadoop archeology, it was
> >HADOOP-2399 that made the change. I also did HADOOP-3522 to improve the
> >documentation in the area.
> >
> >-- Owen
>
>


-- 
Jay Vyas
MMSB/UCHC
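
For reference, here is a minimal sketch of the point Owen makes above about
the reducer reusing its input Text object. The class and variable names are
made up for illustration; the fix is simply to copy the value before
buffering it:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer: the framework hands back the *same* Text instance on
// every call to values.next(), so buffering the reference itself ends up
// with N pointers to the last value.
public class CollectValuesReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {

    List<Text> buffered = new ArrayList<Text>();
    while (values.hasNext()) {
      Text value = values.next();
      // WRONG: buffered.add(value) adds the same reused object each time.
      buffered.add(new Text(value)); // copy the contents before buffering
    }
    for (Text t : buffered) {
      output.collect(key, t);
    }
  }
}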


Re: Question about accessing another HDFS

2011-12-08 Thread Jay Vyas
Can you show your code here ?  What URL protocol are you using ?

On Thu, Dec 8, 2011 at 5:47 PM, Tom Melendez  wrote:

> I'm hoping there is a better answer, but I'm thinking you could load
> another configuration file (with B.company in it) using Configuration,
> grab a FileSystem obj with that and then go forward.  Seems like some
> unnecessary overhead though.
>
> Thanks,
>
> Tom
>
> On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier 
> wrote:
> > Hi -
> >
> > We have two namenodes set up at our company, say:
> >
> > hdfs://A.mycompany.com
> > hdfs://B.mycompany.com
> >
> > From the command line, I can do:
> >
> > Hadoop fs –ls hdfs://A.mycompany.com//some-dir
> >
> > And
> >
> > Hadoop fs –ls hdfs://B.mycompany.com//some-other-dir
> >
> > I’m now trying to do the same from a Java program that uses the HDFS
> API. No luck there. I get an exception: “Wrong FS”.
> >
> > Any idea what I’m missing in my Java program??
> >
> > Thanks,
> >
> > Frank
>



-- 
Jay Vyas
MMSB/UCHC
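
For reference, the "Wrong FS" error usually means a path's scheme/authority
does not match the FileSystem object handling it. A minimal sketch of the
approach Tom describes, using the two namenode URIs from the thread (the
class name and everything else here is made up for illustration):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoClusters {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // One FileSystem handle per cluster, keyed off the full URI instead of
    // whatever fs.default.name happens to be in core-site.xml.
    FileSystem fsA = FileSystem.get(URI.create("hdfs://A.mycompany.com/"), conf);
    FileSystem fsB = FileSystem.get(URI.create("hdfs://B.mycompany.com/"), conf);

    // Fully qualified paths keep each operation on the cluster it belongs to.
    for (FileStatus s : fsA.listStatus(new Path("hdfs://A.mycompany.com/some-dir"))) {
      System.out.println(s.getPath());
    }
    for (FileStatus s : fsB.listStatus(new Path("hdfs://B.mycompany.com/some-other-dir"))) {
      System.out.println(s.getPath());
    }
  }
}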


"No HADOOP COMMON HOME set."

2011-11-17 Thread Jay Vyas
Hi guys : I followed the exact directions on the hadoop installation guide
for pseudo-distributed mode here:
http://hadoop.apache.org/common/docs/current/single_node_setup.html#Configuration

However, I get that several environment variables are not set (for
example, "HADOOP_COMMON_HOME" is not set).

Also, hadoop reported that HADOOP CONF was not set, as well.

I'm wondering whether there is a resource on how to set the environment
variables needed to run hadoop ?

Thanks.

-- 
Jay Vyas
MMSB/UCHC


Re: simple question : where is conf/hadoop-env.sh ?

2011-11-17 Thread Jay Vyas
I mean... when I look in the hadoop installation directory (after untarring
it) ... I don't see any of those files in conf ...?

Does that mean I have to manually copy them over from the templates ?

On Thu, Nov 17, 2011 at 1:54 PM, Ayon Sinha  wrote:

> $HADOOP_HOME/conf
>
> -Ayon
> See My Photos on Flickr
> Also check out my Blog for answers to commonly asked questions.
>
>
>
> ________
> From: Jay Vyas 
> To: common-user@hadoop.apache.org
> Sent: Thursday, November 17, 2011 10:40 AM
> Subject: simple question : where is conf/hadoop-env.sh ?
>
> Hi guys : I do not see a "conf/hadoop-env.sh" file, which is required for
> hadoop installation, according to standard hadoop install directions which
> I find online and in the hadoop elephant book...
>
> Any hints on how to install a basic, hybrid mode hadoop fs on my laptop ?
>
> I DO SEE a series of files in the templates directory... is it my
> responsibility to copy these into conf/ before installing hadoop
>
> Thanks :)
>
> --
> Jay Vyas
> MMSB/UCHC
>



-- 
Jay Vyas
MMSB/UCHC


simple question : where is conf/hadoop-env.sh ?

2011-11-17 Thread Jay Vyas
Hi guys : I do not see a "conf/hadoop-env.sh" file, which is required for
hadoop installation, according to standard hadoop install directions which
I find online and in the hadoop elephant book...

Any hints on how to install a basic, hybrid mode hadoop fs on my laptop ?

I DO SEE a series of files in the templates directory... is it my
responsibility to copy these into conf/ before installing hadoop ?

Thanks :)

-- 
Jay Vyas
MMSB/UCHC


reducing mappers for a job

2011-11-16 Thread Jay Vyas
Hi guys : In a shared cluster environment, what's the best way to reduce the
number of mappers per job ?  Should you do it with InputSplits ?  Or simply
toggle the values in the JobConf (i.e. increase the number of bytes in an
input split) ?





-- 
Jay Vyas
MMSB/UCHC
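
For reference, a hedged sketch of the JobConf approach mentioned above,
using the old mapred API. The input path and sizes are made up; the point is
that raising the minimum split size makes each split cover more bytes, so
fewer map tasks are launched:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class FewerMappers {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(FewerMappers.class);
    job.setJobName("fewer-mappers-example");

    // Hypothetical input path, for illustration only.
    FileInputFormat.setInputPaths(job, new Path("/user/jay/input"));

    // 256 MB minimum split size, versus the usual one-split-per-block default.
    job.setLong("mapred.min.split.size", 256L * 1024 * 1024);

    // setNumMapTasks() is only a hint; for FileInputFormat inputs the split
    // size is what actually controls how many mappers run.
    job.setNumMapTasks(4);

    // ... set mapper/reducer classes here and submit with JobClient.runJob(job)
  }
}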


Re: Server log files, order of importance ?

2011-10-31 Thread Jay Vyas
Thanks Uma : I was looking for a more general list.  Is there a good
summary of the various hadoop daemons and their logs online ?

On Mon, Oct 31, 2011 at 6:16 PM, Uma Maheswara Rao G 72686 <
mahesw...@huawei.com> wrote:

> If you want to trace one particular block associated with a file, you can
> first check the file name and find the NameSystem.allocateBlock: entry in
> your NN logs. There you can find the allocated blockID. After this, you just
> grep for this blockID in your huge logs. Take the timestamps for each
> operation based on this grep information, and you can easily trace what
> happened to that block.
>
> Regards,
> Uma
> ----- Original Message -
> From: Jay Vyas 
> Date: Tuesday, November 1, 2011 3:37 am
> Subject: Server log files, order of importance ?
> To: common-user@hadoop.apache.org
>
> > Hi guys :I wanted to go through each of the server logs on my hadoop
> > (single psuedo node) vm.
> >
> > In particular, I want to know where to look when things go wrong
> > (i.e. so I
> > can more effectively debug hadoop namenode issues in the future).
> >
> > Can someone suggest what the most important ones to start looking
> > at are ?
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
> >
>



-- 
Jay Vyas
MMSB/UCHC
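
For what it's worth, Uma's grep procedure above can also be done in a few
lines of Java. This is a rough, hypothetical sketch (the log path, file name
and blk_ pattern are assumptions, and you would run it against each log file
you have):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Step 1: find the NameSystem.allocateBlock: line for the file and pull out
// the block id.  Step 2: print every later line that mentions that block.
public class TraceBlock {
  public static void main(String[] args) throws Exception {
    String logPath = args[0];   // e.g. the namenode log
    String fileName = args[1];  // the HDFS file you care about

    Pattern blockIdPattern = Pattern.compile("(blk_-?\\d+)");
    String blockId = null;

    BufferedReader in = new BufferedReader(new FileReader(logPath));
    String line;
    while ((line = in.readLine()) != null) {
      if (blockId == null) {
        if (line.contains("NameSystem.allocateBlock:") && line.contains(fileName)) {
          Matcher m = blockIdPattern.matcher(line);
          if (m.find()) {
            blockId = m.group(1);
            System.out.println("Allocated: " + line);
          }
        }
      } else if (line.contains(blockId)) {
        System.out.println(line);  // the timestamps here tell the block's story
      }
    }
    in.close();
  }
}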


Server log files, order of importance ?

2011-10-31 Thread Jay Vyas
Hi guys : I wanted to go through each of the server logs on my hadoop
(single pseudo node) VM.

In particular, I want to know where to look when things go wrong (i.e. so I
can more effectively debug hadoop namenode issues in the future).

Can someone suggest what the most important ones to start looking at are ?

-- 
Jay Vyas
MMSB/UCHC


Re: getting there (EOF exception).

2011-10-30 Thread Jay Vyas
Harsh ! that was the trick !

I changed fs.default.name to 0.0.0.0 from "localhost".

Then, my Java code could easily connect with no problems to my remote hadoop
namenode !!!

Thanks !

In summary: if you need to connect to the namenode remotely, make sure it is
serving on 0.0.0.0, not localhost and not 127.0.0.1 (for those of you who,
like me, didn't realize it: localhost != 0.0.0.0).

thank you thank you thank you

On Mon, Oct 31, 2011 at 12:21 AM, Harsh J  wrote:

> What is your fs.default.name set to? It'd bind to the hostname provided
> in that.
>
> On Mon, Oct 31, 2011 at 9:38 AM, JAX  wrote:
> > Thanks! Yes i agree ... But Are you sure 8020? 8020 serves on 127.0.0.1
> (rather than 0.0.0.0) ... Thus it is inaccessible to outside
> clients...That is very odd Why would that be the case ? Any
> insights ( using cloud eras hadoop vm).
> >
> > Sent from my iPad
> >
> > On Oct 30, 2011, at 11:48 PM, Harsh J  wrote:
> >
> >> Hey Jay,
> >>
> >> I believe this may be related to your other issues as well, but 50070
> is NOT the port you want to connect to. 50070 serves over HTTP, while
> default port (fs.default.name), for IPC connections is 8020, or whatever
> you have configured.
> >>
> >> On 31-Oct-2011, at 5:17 AM, Jay Vyas wrote:
> >>
> >>> Hi  guys : What is the meaning of an EOF exception when trying to
> connect
> >>> to Hadoop by creating a new FileSystem object ?  Does this simply mean
> >>> the system cant be read ?
> >>>
> >>> java.io.IOException: Call to /172.16.112.131:50070 failed on local
> >>> exception: java.io.EOFException
> >>>   at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
> >>>   at org.apache.hadoop.ipc.Client.call(Client.java:1107)
> >>>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
> >>>   at $Proxy0.getProtocolVersion(Unknown Source)
> >>>   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
> >>>   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
> >>>   at
> >>> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
> >>>   at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:213)
> >>>   at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:180)
> >>>   at
> >>>
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
> >>>   at
> >>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
> >>>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
> >>>   at
> >>> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
> >>>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
> >>>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
> >>>   at sb.HadoopRemote.main(HadoopRemote.java:35)
> >>> Caused by: java.io.EOFException
> >>>   at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>>   at
> >>>
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:812)
> >>>   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:720)
> >>>
> >>> --
> >>> Jay Vyas
> >>> MMSB/UCHC
> >>
> >
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
MMSB/UCHC
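
For completeness, a minimal client-side sketch of the two points made in
this thread: talk to the IPC port (8020 by default, or whatever is in
core-site.xml), not the 50070 HTTP port, and make sure the namenode is bound
to an externally reachable address. The IP address is the one from the
thread; everything else is illustrative:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 8020 is the IPC port the DFS client speaks to; 50070 only serves the
    // HTTP status page (and HFTP), which is why hdfs:// against it fails
    // with an EOFException.
    FileSystem fs = FileSystem.get(URI.create("hdfs://172.16.112.131:8020/"), conf);

    // Cheap sanity check that the RPC connection actually works.
    System.out.println("Root listing has "
        + fs.listStatus(new Path("/")).length + " entries");
  }
}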


getting there (EOF exception).

2011-10-30 Thread Jay Vyas
Hi guys : What is the meaning of an EOF exception when trying to connect
to Hadoop by creating a new FileSystem object ?  Does this simply mean
the system can't be read ?

java.io.IOException: Call to /172.16.112.131:50070 failed on local
exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1139)
at org.apache.hadoop.ipc.Client.call(Client.java:1107)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
at
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:213)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:180)
at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
at sb.HadoopRemote.main(HadoopRemote.java:35)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:812)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:720)

-- 
Jay Vyas
MMSB/UCHC


namenode / mr jobs work... but external connection fails (video link)

2011-10-29 Thread Jay Vyas
Hi guys :

Well, I finally got my hadoop working just fine by using a different VM.

This is a little crazy, but I decided to make it easier to see my scenario,
so I put up a little 4-minute video.

http://www.youtube.com/watch?v=HxoLgDmXeb4

The resolution is bad, but you can see that

1) (at the end) a MapReduce job runs fine on the node itself using the
examples jar, and
2) (beginning) ports such as 50070 are generally open, but my Java code
(running on the host machine in Eclipse) fails to connect to the VM.

Any thoughts ?


** Note : The hadoop versions may not match up exactly. Would that result
in a connection error ?  I assume not, since the hdfs:// protocol is not
anything new.

I assume that the hdfs protocol shouldn't need any additional security.


Re: writing to hdfs via java api

2011-10-28 Thread Jay Vyas
Thanks Tom : That's interesting.

First, I tried, and it complained that the input directory didn't exist, so I
ran
$> hadoop fs -mkdir /user/cloudera/input

Then, I tried to do this :

$> hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar grep input output2
'dfs[a-z.]+'

And it seemed to start working... But then it abruptly printed "Killed"
somehow at the end of the job [scroll down] ?

Maybe this is related to why I can't connect ?!

1) the hadoop jar output:
11/10/14 21:34:43 WARN util.NativeCodeLoader: Unable to
load native-hadoop library for your platform... using builtin-java classes
where applicable
11/10/14 21:34:43 WARN snappy.LoadSnappy: Snappy native library not loaded
11/10/14 21:34:43 INFO mapred.FileInputFormat: Total input paths to process
: 0
11/10/14 21:34:44 INFO mapred.JobClient: Running job: job_201110142010_0009
11/10/14 21:34:45 INFO mapred.JobClient:  map 0% reduce 0%
11/10/14 21:34:55 INFO mapred.JobClient:  map 0% reduce 100%
11/10/14 21:34:57 INFO mapred.JobClient: Job complete: job_201110142010_0009
11/10/14 21:34:57 INFO mapred.JobClient: Counters: 14
11/10/14 21:34:57 INFO mapred.JobClient:   Job Counters
11/10/14 21:34:57 INFO mapred.JobClient: Launched reduce tasks=1
11/10/14 21:34:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5627
11/10/14 21:34:57 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
11/10/14 21:34:57 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
11/10/14 21:34:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=5050
11/10/14 21:34:57 INFO mapred.JobClient:   FileSystemCounters
11/10/14 21:34:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=53452
11/10/14 21:34:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=86
11/10/14 21:34:57 INFO mapred.JobClient:   Map-Reduce Framework
11/10/14 21:34:57 INFO mapred.JobClient: Reduce input groups=0
11/10/14 21:34:57 INFO mapred.JobClient: Combine output records=0
11/10/14 21:34:57 INFO mapred.JobClient: Reduce shuffle bytes=0
11/10/14 21:34:57 INFO mapred.JobClient: Reduce output records=0
11/10/14 21:34:57 INFO mapred.JobClient: Spilled Records=0
11/10/14 21:34:57 INFO mapred.JobClient: Combine input records=0
11/10/14 21:34:57 INFO mapred.JobClient: Reduce input records=0
11/10/14 21:34:57 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.
11/10/14 21:34:58 INFO mapred.FileInputFormat: Total input paths to process
: 1
11/10/14 21:34:58 INFO mapred.JobClient: Running job: job_201110142010_0010
11/10/14 21:34:59 INFO mapred.JobClient:  map 0% reduce 0%
Killed


On Fri, Oct 28, 2011 at 8:24 PM, Tom Melendez  wrote:

> Hi Jay,
>
> Some questions for you:
>
> - Does the hadoop client itself work from that same machine?
> - Are you actually able to run the hadoop example jar (in other words,
> your setup is valid otherwise)?
> - Is port 8020 actually available?  (you can telnet or nc to it?)
> - What does jps show on the namenode?
>
> Thanks,
>
> Tom
>
> On Fri, Oct 28, 2011 at 4:04 PM, Jay Vyas  wrote:
> > Hi guys : Made more progress debugging my hadoop connection, but still
> > haven't got it working..  It looks like my VM (cloudera hadoop) won't
> > let me in.  I find that there is no issue connecting to the name node -
> that
> > is , using hftp and 50070..
> >
> > via standard HFTP as in here :
> >
> > //This method works fine - connecting directly to hadoop's namenode and
> > querying the filesystem
> > public static void main1(String[] args) throws Exception
> >{
> >String uri = "hftp://155.37.101.76:50070/";;
> >
> >System.out.println( "uri: " + uri );
> >Configuration conf = new Configuration();
> >
> >FileSystem fs = FileSystem.get( URI.create( uri ), conf );
> >fs.printStatistics();
> >}
> >
> >
> > But unfortunately, I can't get into hdfs . Any thoughts on this ?  I
> am
> > modifying the uri to access port 8020
> > which is what is in my core-site.xml .
> >
> >   // This fails, resulting (trys to connect over and over again,
> eventually
> > gives up printing "already tried to connect 20 times")
> >public static void main(String[] args)
> >{
> >try {
> >String uri = "hdfs://155.37.101.76:8020/";
> >
> >System.out.println( "uri: " + uri );
> >Configuration conf = new Configuration();
> >
> >FileSystem fs = FileSystem.get( URI.create( uri ), conf );
> >fs.printStatistics();
> >} catch (Exception e) 

Re: writing to hdfs via java api

2011-10-28 Thread Jay Vyas
Hi guys : Made more progress debugging my hadoop connection, but still
haven't got it working...  It looks like my VM (cloudera hadoop) won't
let me in.  I find that there is no issue connecting to the name node - that
is, using hftp and 50070...

via standard HFTP as in here :

//This method works fine - connecting directly to hadoop's namenode and
querying the filesystem
public static void main1(String[] args) throws Exception
{
String uri = "hftp://155.37.101.76:50070/";;

System.out.println( "uri: " + uri );
Configuration conf = new Configuration();

FileSystem fs = FileSystem.get( URI.create( uri ), conf );
fs.printStatistics();
}


But unfortunately, I can't get into hdfs. Any thoughts on this ?  I am
modifying the uri to access port 8020,
which is what is in my core-site.xml.

   // This fails (tries to connect over and over again, and eventually
   // gives up, printing "already tried to connect 20 times")
public static void main(String[] args)
{
try {
String uri = "hdfs://155.37.101.76:8020/";

System.out.println( "uri: " + uri );
Configuration conf = new Configuration();

FileSystem fs = FileSystem.get( URI.create( uri ), conf );
fs.printStatistics();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}

The error message is :

11/10/28 19:03:38 INFO ipc.Client: Retrying connect to server: /
155.37.101.76:8020. Already tried 0 time(s).
11/10/28 19:03:39 INFO ipc.Client: Retrying connect to server: /
155.37.101.76:8020. Already tried 1 time(s).
11/10/28 19:03:40 INFO ipc.Client: Retrying connect to server: /
155.37.101.76:8020. Already tried 2 time(s).
11/10/28 19:03:41 INFO ipc.Client: Retrying connect to server: /
155.37.101.76:8020. Already tried 3 time(s).

Any thoughts on this would *really* be appreciated ... Thanks guys.


writing to hdfs via java api

2011-10-27 Thread Jay Vyas
I found a way to connect to hadoop via hftp, and it works fine (read only) :

uri = "hftp://172.16.xxx.xxx:50070/";;

System.out.println( "uri: " + uri );
Configuration conf = new Configuration();

FileSystem fs = FileSystem.get( URI.create( uri ), conf );
fs.printStatistics();

However, it appears that hftp is read only, and I want to read/write as well
as copy files, that is, I want to connect over hdfs. How can I enable hdfs
connections so that I can edit the actual, remote filesystem using the
File / Path APIs ?  Are there ssh settings that have to be set before I can
do this ?

I tried to change the protocol above from "hftp" -> "hdfs", but I got the
following exception ...

Exception in thread "main" java.io.IOException: Call to /
172.16.112.131:50070 failed on local exception: java.io.EOFException at
org.apache.hadoop.ipc.Client.wrapException(Client.java:1139) at
org.apache.hadoop.ipc.Client.call(Client.java:1107) at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at
$Proxy0.getProtocolVersion(Unknown Source) at
org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398) at
org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384) at
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111) at
org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:213) at
org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:180) at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514) at
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67) at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548) at
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530) at
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228) at
sb.HadoopRemote.main(HadoopRemote.java:24)
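
Since the question above is specifically about getting read/write access
(which hftp does not provide), here is a minimal write/copy sketch over
hdfs://, assuming the connection issue from the EOF-exception thread above
has been resolved. The host and port are the ones from this thread's
exception; the paths and sample text are made up for illustration:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // hdfs:// against the IPC port, not hftp://, gives read/write access.
    FileSystem fs = FileSystem.get(URI.create("hdfs://172.16.112.131:8020/"), conf);

    // Write a new file directly.
    Path out = new Path("/tmp/hello.txt");
    FSDataOutputStream stream = fs.create(out, true /* overwrite */);
    stream.write("hello from the java api\n".getBytes("UTF-8"));
    stream.close();

    // Copy a local file up into the cluster.
    fs.copyFromLocalFile(new Path("/tmp/local-file.txt"),
                         new Path("/tmp/remote-copy.txt"));

    System.out.println("Wrote " + fs.getFileStatus(out).getLen()
        + " bytes to " + out);
  }
}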