Re: hbase and hypertable comparison

2011-05-26 Thread Edward Choi

Thanks for the clear answer Andy. 
The comparison was actually conducted by the hypertable dev team, so I guess it 
wasn't all that fair to hbase. 
I have regained my confidence in hbase once more :)

Ed

From mp2893's iPhone

On 2011. 5. 26., at 12:03 AM, Andrew Purtell apurt...@apache.org wrote:

 I think I can speak for all of the HBase devs when I say that, in our opinion, 
 this vendor benchmark was designed by hypertable to demonstrate a specific 
 feature of their system -- autotuning -- in such a way that HBase was, obviously, 
 not tuned. Nobody from the HBase project was consulted on the results or asked 
 to do such tuning, as is common courtesy when running a competitive benchmark, 
 if the goal is a fair test. Furthermore, the benchmark code was not a 
 community-accepted benchmark such as YCSB. 
 
 I do not think the results are valid beyond being some vendor FUD and do not 
 warrant much comment beyond this.
 
 Best regards,
 
- Andy
 
 Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
 Tom White)
 
 
 --- On Wed, 5/25/11, edward choi mp2...@gmail.com wrote:
 
 From: edward choi mp2...@gmail.com
 Subject: hbase and hypertable comparison
 To: u...@hbase.apache.org, common-user@hadoop.apache.org
 Date: Wednesday, May 25, 2011, 12:47 AM
 I'm planning to use a NoSQL distributed database.
 I did some searching and came across a lot of database systems such as
 MongoDB, CouchDB, Hbase, Cassandra, Hypertable, etc.
 
 Since what I'll be doing is frequently reading a varying amount of data, and
 less frequently writing a massive amount of data,
 I thought Hbase or Hypertable was the way to go.
 
 I did some internet searching and found a performance comparison
 between HBase and HyperTable.
 Obviously HT dominated Hbase in every aspect (random read/write and a
 couple more).
 
 But the comparison was made with Hbase 0.20.4, and Hbase has had many
 improvements since; the current version is 0.90.3.
 
 I am curious whether the performance gap is still large between Hbase and HT.
 I am running Hadoop already so I wanted to go with Hbase, but the performance
 gap was so big that it made me reconsider.
 
 Any opinions please?
 


Re: Sorting ...

2011-05-26 Thread Luca Pireddu
On May 25, 2011 22:15:50 Mark question wrote:
 I'm using SequenceFileInputFormat, but then what to write in my mappers?
 
   each mapper is taking a split from the SequenceInputFile then sort its
 split ?! I don't want that..
 
 Thanks,
 Mark
 
 On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote:
  On May 25, 2011 01:43:22 Mark question wrote:
   Thanks Luca, but what other way to sort a directory of sequence files?
   
   I don't plan to write a sorting algorithm in mappers/reducers, but
   hoping to use the sequenceFile.sorter instead.
   
   Any ideas?
   
   Mark
  


If you want to achieve a global sort, then look at how TeraSort does it:

http://sortbenchmark.org/YahooHadoop.pdf

The idea is to partition the data so that all keys in part[i] are <= all keys 
in part[i+1].  Each partition is individually sorted, so to read the data in 
globally sorted order you simply have to traverse it starting from the first 
partition and working your way to the last one.

If your keys are already what you want to sort by, then you don't even need a 
mapper (just use the default identity map).
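
To make the partitioning concrete, here is a rough sketch of a range partitioner
over Text keys (my own illustration, not the actual TeraSort code). The split
points are hard-coded for the example, whereas TeraSort derives them by sampling
the input; if I remember correctly, Hadoop also ships a TotalOrderPartitioner and
InputSampler in org.apache.hadoop.mapred.lib that do exactly this for you.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends each key to the range it falls into, so every key in part[i] is <=
// every key in part[i+1]. numPartitions must be BOUNDARIES.length + 1.
public class RangePartitioner extends Partitioner<Text, Text> {
  // Hypothetical, pre-sampled boundaries (in sorted order).
  private static final Text[] BOUNDARIES = {
      new Text("g"), new Text("n"), new Text("t") };

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    for (int i = 0; i < BOUNDARIES.length; i++) {
      if (key.compareTo(BOUNDARIES[i]) < 0) {
        return i;                 // key belongs to range i
      }
    }
    return BOUNDARIES.length;     // key is >= the last boundary
  }
}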



-- 
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452


Re: one question about hadoop

2011-05-26 Thread Luke Lu
Hadoop embeds jetty directly into hadoop servers with the
org.apache.hadoop.http.HttpServer class for servlets. For jsp, web.xml
is auto generated with the jasper compiler during the build phase. The
new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop
HttpServer and doesn't need web.xml and/or jsp support either.
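
For illustration, here is a rough sketch of using that embedded wrapper directly.
The constructor and addServlet signatures are from my memory of the 0.20 source,
so treat them as approximate, and the server name has to match a webapps/<name>
resource on the classpath:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.http.HttpServer;

public class EmbeddedJettyExample {
  // A trivial servlet registered programmatically -- no web.xml involved.
  public static class PingServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException {
      resp.getWriter().println("pong");
    }
  }

  public static void main(String[] args) throws Exception {
    // "static" should correspond to a webapps/static directory on the classpath.
    HttpServer server = new HttpServer("static", "0.0.0.0", 50099, true);
    server.addServlet("ping", "/ping", PingServlet.class);
    server.start();
  }
}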

On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote:
 hi,admin:

    I'm a newcomer from China.
    I want to know how Jetty combines with hadoop.
    I can't find the file named web.xml that would usually exist in a
 system that uses Jetty.
    I'll be very happy to receive your answer.
    If you have any questions, please feel free to contact me.

 Best Regards,

 Jack



can our problem be handled by hadoop

2011-05-26 Thread Mirko Kämpf
Hello,

we are working on a scientific project to analyze information spread in
networks. Our simulations are independent from each other, but we need a
large number of runs and we have to collect all the data for the interpretation
of results by our reporting tools. So my idea was to use hadoop as a base, with
its distributed filesystem. We could start independent runs on each node
of the cluster and at the end collect the data for the calculation of
average values. The simulation tool is written in java and consists of about
2 MB of jar files.
Is this a situation where hadoop can help us?
One fact is that we want to parallelize the production of large data sets.
Best wishes
Mirko


Re: can our problem be handled by hadoop

2011-05-26 Thread Mathias Herberts
Hi,

seems like the perfect use case for MapReduce, yep.
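
One common pattern, roughly sketched below (my own illustration, not code from
your project): put one simulation's parameter set per line in a text file, use
NLineInputFormat so each map task gets exactly one line, have the mapper call
into your existing simulation jar where the placeholder comment sits, and let a
single reducer (or a follow-up job) compute the averages over the collected
results. The linespermap property name is from memory, so double-check it.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SimulationDriver {
  public static class SimMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text paramLine,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // Call into your simulation jar here; for the sketch we just echo a
      // placeholder result for the given parameter line.
      out.collect(paramLine, new Text("result-placeholder"));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SimulationDriver.class);
    conf.setJobName("independent-simulations");
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1); // one run per map
    conf.setMapperClass(SimMapper.class);
    conf.setNumReduceTasks(1); // one reducer can gather results for averaging
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}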

2011/5/26 Mirko Kämpf mirko.kae...@googlemail.com:
 Hello,

 we are working on a scientific project to analyze information spread in
 networks. Our simulations are independent from each other, but we need a
 large number of runs and we have to collect all the data for the interpretation
 of results by our reporting tools. So my idea was to use hadoop as a base, with
 its distributed filesystem. We could start independent runs on each node
 of the cluster and at the end collect the data for the calculation of
 average values. The simulation tool is written in java and consists of about
 2 MB of jar files.
 Is this a situation where hadoop can help us?
 One fact is that we want to parallelize the production of large data sets.
 Best wishes
 Mirko



Re: Comparing

2011-05-26 Thread Juan P.
Harsh,
Thanks for your response, it was very helpful.
There are still a couple of things which are not really clear to me though.
You say that Keys have got to be compared by the MR framework. But I'm
still not 100% sure why keys are sorted. I thought what hadoop did was,
during shuffling it chose which keys went to which reducer and then for each
key/value it checked the key and sent it to the correct node. If that were
the case, then a good equals implementation would be enough. So why, instead
of just *shuffling*, does the MR framework *sort* the items?

Also, you were very clear about the use of RawComparator, thank you. Do you
know how RawComparable works though?

Again, thanks for your help!
Cheers,
Pony

On Thu, May 26, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote:

 Pony,

 Keys have got to be compared by the MR framework somehow, and the way
 it does when you use Writables is by ensuring that your Key is of a
 Writable + Comparable type (WritableComparable).

 If you specify a specific comparator class, then that will be used;
 else the default WritableComparator will get asked if it can supply a
 comparator for use with your key type.

 AFAIK, the default WritableComparator wraps around RawComparator and
 does indeed deserialize the writables before applying the compare
 operation. The RawComparator's primary idea is to give you a pair of
 raw byte sequences to compare directly. Certain other serialization
 libraries (Apache Avro is one) provide ways to compare using bytes
 itself (Across different types), which can end up being faster when
 used in jobs.

 Hope this clears up your confusion.
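 
 A rough sketch of the two options for a custom long-valued key (my own
 illustration, not framework code): implement WritableComparable for the
 object-level comparison, and optionally register a WritableComparator subclass
 that compares the serialized bytes so the sort can skip deserialization.
 
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
 import org.apache.hadoop.io.WritableComparable;
 import org.apache.hadoop.io.WritableComparator;
 
 public class UserIdKey implements WritableComparable<UserIdKey> {
   private long id;
 
   public void write(DataOutput out) throws IOException { out.writeLong(id); }
   public void readFields(DataInput in) throws IOException { id = in.readLong(); }
 
   // Object-level comparison: used when no raw comparator is registered.
   public int compareTo(UserIdKey o) {
     return id < o.id ? -1 : (id == o.id ? 0 : 1);
   }
 
   // Byte-level comparison over the serialized form (an 8-byte long).
   public static class Comparator extends WritableComparator {
     public Comparator() { super(UserIdKey.class); }
     public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
       long x = readLong(b1, s1);
       long y = readLong(b2, s2);
       return x < y ? -1 : (x == y ? 0 : 1);
     }
   }
 
   static {
     // Registering it makes the framework pick the raw comparator for this key.
     WritableComparator.define(UserIdKey.class, new Comparator());
   }
 }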

 On Tue, May 24, 2011 at 2:06 AM, Juan P. gordoslo...@gmail.com wrote:
  Hi guys,
  I wanted to get your help with a couple of questions which came up while
  looking at the Hadoop Comparator/Comparable architecture.
 
  As I see it before each reducer operates on each key, a sorting algorithm
 is
  applied to them. *Why does Hadoop need to do that?*
 
  If I implement my own class and I intend to use it as a Key I must allow
 for
  instances of my class to be compared. So I have 2 choices: I can
 implement
  WritableComparable or I can register a WritableComparator for my
  class. Should I fail to do either, would the Job fail?
  If I register my WritableComparator which does not use the Comparable
  interface at all, does my Key need to implement WritableComparable?
  If I don't implement my Comparator and my Key implements
 WritableComparable,
  does it mean that Hadoop will deserialize my Keys twice? (once for
 sorting,
  and once for reducing)
  What is RawComparable used for?
 
  Thanks for your help!
  Pony
 



 --
 Harsh J



Re: Sorting ...

2011-05-26 Thread Robert Evans
Also, if you want something that is fairly fast and a lot less dev work to get 
going, you might want to look at Pig. It can do a distributed order by that 
is fairly good.

--Bobby Evans

On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote:

On May 25, 2011 22:15:50 Mark question wrote:
 I'm using SequenceFileInputFormat, but then what to write in my mappers?

   each mapper is taking a split from the SequenceInputFile then sort its
 split ?! I don't want that..

 Thanks,
 Mark

 On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote:
  On May 25, 2011 01:43:22 Mark question wrote:
   Thanks Luca, but what other way to sort a directory of sequence files?
  
   I don't plan to write a sorting algorithm in mappers/reducers, but
   hoping to use the sequenceFile.sorter instead.
  
   Any ideas?
  
   Mark
 


If you want to achieve a global sort, then look at how TeraSort does it:

http://sortbenchmark.org/YahooHadoop.pdf

The idea is to partition the data so that all keys in part[i] are <= all keys
in part[i+1].  Each partition is individually sorted, so to read the data in
globally sorted order you simply have to traverse it starting from the first
partition and working your way to the last one.

If your keys are already what you want to sort by, then you don't even need a
mapper (just use the default identity map).



--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
Pula 09010 (CA), Italy
Tel:  +39 0709250452



Help with pigsetup

2011-05-26 Thread Mohit Anchlia
I sent this to the pig apache user mailing list but have got no response.
Not sure if that list is still active.

Thought I would post here in case someone is able to help me.

I am in the process of installing and learning pig. I have a hadoop
cluster, and when I try to run pig in mapreduce mode it errors out:

Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1

Error before Pig is launched

ERROR 2999: Unexpected internal error. Failed to create DataStorage

java.lang.RuntimeException: Failed to create DataStorage
   at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
   at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
   at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
   at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
   at org.apache.pig.PigServer.init(PigServer.java:226)
   at org.apache.pig.PigServer.init(PigServer.java:215)
   at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
   at org.apache.pig.Main.run(Main.java:452)
   at org.apache.pig.Main.main(Main.java:107)
Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
failed on local exception: java.io.EOFException
   at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
   at org.apache.hadoop.ipc.Client.call(Client.java:743)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
   at $Proxy0.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
   at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
   at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
   at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
   at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
   at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
   ... 9 more
Caused by: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:375)
   at 
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)


Re: Help with pigsetup

2011-05-26 Thread Harsh J
I think Jonathan Coveney's reply on user@pig answered your question.
It's basically an issue of a hadoop version difference between the one the
Pig 0.8.1 release got bundled with vs. the Hadoop 0.20.203 release, which
is newer.

On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I sent this to pig apache user mailing list but have got no response.
 Not sure if that list is still active.

 thought I will post here if someone is able to help me.

 I am in process of installing and learning pig. I have a hadoop
 cluster and when I try to run pig in mapreduce mode it errors out:

 Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1

 Error before Pig is launched
 
 ERROR 2999: Unexpected internal error. Failed to create DataStorage

 java.lang.RuntimeException: Failed to create DataStorage
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
       at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
       at org.apache.pig.PigServer.init(PigServer.java:226)
       at org.apache.pig.PigServer.init(PigServer.java:215)
       at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
       at org.apache.pig.Main.run(Main.java:452)
       at org.apache.pig.Main.main(Main.java:107)
 Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
 failed on local exception: java.io.EOFException
       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
       at org.apache.hadoop.ipc.Client.call(Client.java:743)
       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
       at $Proxy0.getProtocolVersion(Unknown Source)
       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
       at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
       at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
       at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
       ... 9 more
 Caused by: java.io.EOFException
       at java.io.DataInputStream.readInt(DataInputStream.java:375)
       at 
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)




-- 
Harsh J


Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
For some reason I don't see that reply from Jonathan in my Inbox. I'll
try to google it.

What should be my next step in that case? I can't use pig then?

On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
 I think Jonathan Coveney's reply on user@pig answered your question.
 Its basically an issue of hadoop version differences between the one
 Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which
 is newer.

 On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 I sent this to pig apache user mailing list but have got no response.
 Not sure if that list is still active.

 thought I will post here if someone is able to help me.

 I am in process of installing and learning pig. I have a hadoop
 cluster and when I try to run pig in mapreduce mode it errors out:

 Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1

 Error before Pig is launched
 
 ERROR 2999: Unexpected internal error. Failed to create DataStorage

 java.lang.RuntimeException: Failed to create DataStorage
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
       at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
       at org.apache.pig.PigServer.init(PigServer.java:226)
       at org.apache.pig.PigServer.init(PigServer.java:215)
       at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
       at org.apache.pig.Main.run(Main.java:452)
       at org.apache.pig.Main.main(Main.java:107)
 Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
 failed on local exception: java.io.EOFException
       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
       at org.apache.hadoop.ipc.Client.call(Client.java:743)
       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
       at $Proxy0.getProtocolVersion(Unknown Source)
       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
       at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
       at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
       at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
       ... 9 more
 Caused by: java.io.EOFException
       at java.io.DataInputStream.readInt(DataInputStream.java:375)
       at 
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)




 --
 Harsh J



Re: Help with pigsetup

2011-05-26 Thread Jonathan Coveney
I'll repost it here then :)

Here is what I had to do to get pig running with a different version of
Hadoop (in my case, the cloudera build but I'd try this as well):

build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when
you run pig, put the pig-withouthadoop.jar on your classpath as well as your
hadoop jar. In my case, I found that scripts only worked if I additionally
manually registered the antlr jar:

register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar;

2011/5/26 Mohit Anchlia mohitanch...@gmail.com

 For some reason I don't see that reply from Jonathan in my Inbox. I'll
 try to google it.

 What should be my next step in that case? I can't use pig then?

 On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
  I think Jonathan Coveney's reply on user@pig answered your question.
  Its basically an issue of hadoop version differences between the one
  Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which
  is newer.
 
  On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I sent this to pig apache user mailing list but have got no response.
  Not sure if that list is still active.
 
  thought I will post here if someone is able to help me.
 
  I am in process of installing and learning pig. I have a hadoop
  cluster and when I try to run pig in mapreduce mode it errors out:
 
  Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1
 
  Error before Pig is launched
  
  ERROR 2999: Unexpected internal error. Failed to create DataStorage
 
  java.lang.RuntimeException: Failed to create DataStorage
at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
at org.apache.pig.PigServer.init(PigServer.java:226)
at org.apache.pig.PigServer.init(PigServer.java:215)
at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
at org.apache.pig.Main.run(Main.java:452)
at org.apache.pig.Main.main(Main.java:107)
  Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
  failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
... 9 more
  Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 
 
 
 
  --
  Harsh J
 



Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote:
 I'll repost it here then :)

 Here is what I had to do to get pig running with a different version of
 Hadoop (in my case, the cloudera build but I'd try this as well):


 build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when
 you run pig, put the pig-withouthadoop.jar on your classpath as well as your
 hadoop jar. In my case, I found that scripts only worked if I additionally
 manually registered the antlr jar:

Thanks Jonathan! I will give it a shot.


 register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar;

Is this a windows command? Sorry, have not used this before.


 2011/5/26 Mohit Anchlia mohitanch...@gmail.com

 For some reason I don't see that reply from Jonathan in my Inbox. I'll
 try to google it.

 What should be my next step in that case? I can't use pig then?

 On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
  I think Jonathan Coveney's reply on user@pig answered your question.
  Its basically an issue of hadoop version differences between the one
  Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which
  is newer.
 
  On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I sent this to pig apache user mailing list but have got no response.
  Not sure if that list is still active.
 
  thought I will post here if someone is able to help me.
 
  I am in process of installing and learning pig. I have a hadoop
  cluster and when I try to run pig in mapreduce mode it errors out:
 
  Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1
 
  Error before Pig is launched
  
  ERROR 2999: Unexpected internal error. Failed to create DataStorage
 
  java.lang.RuntimeException: Failed to create DataStorage
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
        at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
        at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.init(PigServer.java:226)
        at org.apache.pig.PigServer.init(PigServer.java:215)
        at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:452)
        at org.apache.pig.Main.main(Main.java:107)
  Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
  failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
        at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
  Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 
 
 
 
  --
  Harsh J
 




Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
I've built pig-withouthadoop.jar and have copied it to my linux box.
Now how do I put hadoop-core-0.20.203.0.jar and pig-withouthadoop.jar
in the classpath. Is it by using CLASSPATH variable?

On Thu, May 26, 2011 at 10:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote:
 I'll repost it here then :)

 Here is what I had to do to get pig running with a different version of
 Hadoop (in my case, the cloudera build but I'd try this as well):


 build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when
 you run pig, put the pig-withouthadoop.jar on your classpath as well as your
 hadoop jar. In my case, I found that scripts only worked if I additionally
 manually registered the antlr jar:

 Thanks Jonathan! I will give it a shot.


 register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar;

 Is this a windows command? Sorry, have not used this before.


 2011/5/26 Mohit Anchlia mohitanch...@gmail.com

 For some reason I don't see that reply from Jonathan in my Inbox. I'll
 try to google it.

 What should be my next step in that case? I can't use pig then?

 On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
  I think Jonathan Coveney's reply on user@pig answered your question.
  Its basically an issue of hadoop version differences between the one
  Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which
  is newer.
 
  On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I sent this to pig apache user mailing list but have got no response.
  Not sure if that list is still active.
 
  thought I will post here if someone is able to help me.
 
  I am in process of installing and learning pig. I have a hadoop
  cluster and when I try to run pig in mapreduce mode it errors out:
 
  Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1
 
  Error before Pig is launched
  
  ERROR 2999: Unexpected internal error. Failed to create DataStorage
 
  java.lang.RuntimeException: Failed to create DataStorage
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
        at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
        at
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.init(PigServer.java:226)
        at org.apache.pig.PigServer.init(PigServer.java:215)
        at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:452)
        at org.apache.pig.Main.main(Main.java:107)
  Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
  failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
        at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
  Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
 
 
 
 
  --
  Harsh J
 





Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
I added it to PIG_CLASSPATH and got past that error, but now I get a
different error. Looks like I need to add some other jars, but I'm not sure
which ones.

export 
PIG_CLASSPATH=$HADOOP_CONF_DIR:$HADOOP_HOME/hadoop-core-0.20.203.0.jar:$PIG_HOME/../pig-withouthadoop.jar

ERROR 2998: Unhandled internal error.
org/apache/commons/configuration/Configuration

java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:37)
at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.clinit(DefaultMetricsSystem.java:34)
at 
org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:196)
at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
at 
org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
at 
org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1418)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109)
at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
at 
org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:196)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:116)
at org.apache.pig.impl.PigContext.connect(PigContext.java:187)
at org.apache.pig.PigServer.init(PigServer.java:243)
at org.apache.pig.PigServer.init(PigServer.java:228)
at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:46)
at org.apache.pig.Main.run(Main.java:484)
at org.apache.pig.Main.main(Main.java:108)
Caused by: java.lang.ClassNotFoundException:
org.apache.commons.configuration.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)


On Thu, May 26, 2011 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I've built pig-withouthadoop.jar and have copied it to my linux box.
 Now how do I put hadoop-core-0.20.203.0.jar and pig-withouthadoop.jar
 in the classpath. Is it by using CLASSPATH variable?

 On Thu, May 26, 2011 at 10:18 AM, Mohit Anchlia mohitanch...@gmail.com 
 wrote:
 On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com 
 wrote:
 I'll repost it here then :)

 Here is what I had to do to get pig running with a different version of
 Hadoop (in my case, the cloudera build but I'd try this as well):


 build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when
 you run pig, put the pig-withouthadoop.jar on your classpath as well as your
 hadoop jar. In my case, I found that scripts only worked if I additionally
 manually registered the antlr jar:

 Thanks Jonathan! I will give it a shot.


 register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar;

 Is this a windows command? Sorry, have not used this before.


 2011/5/26 Mohit Anchlia mohitanch...@gmail.com

 For some reason I don't see that reply from Jonathan in my Inbox. I'll
 try to google it.

 What should be my next step in that case? I can't use pig then?

 On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote:
  I think Jonathan Coveney's reply on user@pig answered your question.
  Its basically an issue of hadoop version differences between the one
  Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which
  is newer.
 
  On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I sent this to pig apache user mailing list but have got no response.
  Not sure if that list is still active.
 
  thought I will post here if someone is able to help me.
 
  I am in process of installing and learning pig. I have a hadoop
  cluster and when I try to run pig in mapreduce mode it errors out:
 
  Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1
 
  Error before Pig is launched
  
  ERROR 2999: Unexpected internal error. Failed to create DataStorage
 
  java.lang.RuntimeException: Failed to create 

java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal

2011-05-26 Thread subhransu
Hello Geeks,
 I am a newbie to hadoop and I have currently installed hadoop-0.20.203.0.
I am running the sample programs that are part of this package but am getting this error.

Any pointers to fix this???

~/Hadoop/hadoop-0.20.203.0 788 bin/hadoop jar
hadoop-examples-0.20.203.0.jar sort
java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal
 at
org.apache.hadoop.security.UserGroupInformation.clinit(UserGroupInformation.java:246)
 at java.lang.J9VMInternals.initializeImpl(Native Method)
 at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
 at org.apache.hadoop.mapred.JobClient.init(JobClient.java:449)
 at org.apache.hadoop.mapred.JobClient.init(JobClient.java:437)
 at org.apache.hadoop.examples.Sort.run(Sort.java:82)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.examples.Sort.main(Sort.java:187)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.lang.ClassNotFoundException:
com.sun.security.auth.UnixPrincipal
 at java.net.URLClassLoader.findClass(URLClassLoader.java:434)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:653)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:358)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:619)
 ... 20 more


Also, if there are any docs/steps/threads on how to write a Hello world
hadoop program, please send them; it would be a great help.

--
View this message in context: 
http://hadoop-common.472056.n3.nabble.com/java-lang-NoClassDefFoundError-com-sun-security-auth-UnixPrincipal-tp2989927p2989927.html
Sent from the Users mailing list archive at Nabble.com.


Re: Help with pigsetup

2011-05-26 Thread Mohit Anchlia
I added all the jars in HADOOP_HOME/lib to the classpath and now I get
to the grunt prompt. Will try the tutorials and see how it behaves :)

Thanks for your help!

On Thu, May 26, 2011 at 9:56 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I sent this to pig apache user mailing list but have got no response.
 Not sure if that list is still active.

 thought I will post here if someone is able to help me.

 I am in process of installing and learning pig. I have a hadoop
 cluster and when I try to run pig in mapreduce mode it errors out:

 Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1

 Error before Pig is launched
 
 ERROR 2999: Unexpected internal error. Failed to create DataStorage

 java.lang.RuntimeException: Failed to create DataStorage
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
       at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
       at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
       at org.apache.pig.PigServer.init(PigServer.java:226)
       at org.apache.pig.PigServer.init(PigServer.java:215)
       at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55)
       at org.apache.pig.Main.run(Main.java:452)
       at org.apache.pig.Main.main(Main.java:107)
 Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310
 failed on local exception: java.io.EOFException
       at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
       at org.apache.hadoop.ipc.Client.call(Client.java:743)
       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
       at $Proxy0.getProtocolVersion(Unknown Source)
       at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
       at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207)
       at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170)
       at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
       at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
       at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
       at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
       at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
       at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
       ... 9 more
 Caused by: java.io.EOFException
       at java.io.DataInputStream.readInt(DataInputStream.java:375)
       at 
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
       at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)



Re: Sorting ...

2011-05-26 Thread Mark question
Well, I want something like TeraSort but for sequence files instead of lines
of text.
My goal is efficiency and I'm currently working with Hadoop only.
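
What I had in mind with sequenceFile.sorter is roughly the sketch below (paths
and key/value classes made up; note it runs in a single client process rather
than as a distributed job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Key/value classes must match those the sequence files were written with.
    SequenceFile.Sorter sorter =
        new SequenceFile.Sorter(fs, Text.class, IntWritable.class, conf);
    Path[] inputs = { new Path("/data/seq/part-00000"),
                      new Path("/data/seq/part-00001") };
    // Sort all inputs by key into a single output file, keeping the inputs.
    sorter.sort(inputs, new Path("/data/seq-sorted/all"), false);
  }
}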

Thanks for your suggestions,
Mark

On Thu, May 26, 2011 at 8:34 AM, Robert Evans ev...@yahoo-inc.com wrote:

 Also if you want something that is fairly fast and a lot less dev work to
 get going you might want to look at pig.  They can do a distributed order by
 that is fairly good.

 --Bobby Evans

 On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote:

 On May 25, 2011 22:15:50 Mark question wrote:
  I'm using SequenceFileInputFormat, but then what to write in my mappers?
 
each mapper is taking a split from the SequenceInputFile then sort its
  split ?! I don't want that..
 
  Thanks,
  Mark
 
  On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote:
   On May 25, 2011 01:43:22 Mark question wrote:
Thanks Luca, but what other way to sort a directory of sequence
 files?
   
I don't plan to write a sorting algorithm in mappers/reducers, but
hoping to use the sequenceFile.sorter instead.
   
Any ideas?
   
Mark
  


 If you want to achieve a global sort, then look at how TeraSort does it:

 http://sortbenchmark.org/YahooHadoop.pdf

 The idea is to partition the data so that all keys in part[i] are <= all
 keys in part[i+1].  Each partition is individually sorted, so to read the
 data in globally sorted order you simply have to traverse it starting from
 the first partition and working your way to the last one.

 If your keys are already what you want to sort by, then you don't even need
 a mapper (just use the default identity map).



 --
 Luca Pireddu
 CRS4 - Distributed Computing Group
 Loc. Pixina Manna Edificio 1
 Pula 09010 (CA), Italy
 Tel:  +39 0709250452




Re: one question about hadoop

2011-05-26 Thread Mark question
web.xml is in:

 hadoop-releaseNo/webapps/job/WEB-INF/web.xml

Mark


On Thu, May 26, 2011 at 1:29 AM, Luke Lu l...@vicaya.com wrote:

 Hadoop embeds jetty directly into hadoop servers with the
 org.apache.hadoop.http.HttpServer class for servlets. For jsp, web.xml
 is auto generated with the jasper compiler during the build phase. The
 new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop
 HttpServer and doesn't need web.xml and/or jsp support either.

 On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote:
  hi,admin:
 
 I'm a  fresh fish from China.
 I want to know how the Jetty combines with the hadoop.
 I can't find the file named web.xml that should exist in usual
 system
  that combine with Jetty.
 I'll be very happy to receive your answer.
 If you have any question,please feel free to contract with me.
 
  Best Regards,
 
  Jack
 



No. of Map and reduce tasks

2011-05-26 Thread Mohit Anchlia
How can I tell how the map and reduce tasks were spread across the
cluster? I looked at the jobtracker web page but can't find that info.

Also, can I specify how many map or reduce tasks I want to be launched?

From what I understand, it's based on the number of input files
passed to hadoop. So if I have 4 files there will be 4 map tasks
launched, and the number of reducers depends on the hashpartitioner.


How to debug why I don't get hadoop logs?

2011-05-26 Thread Gabriele Kahlout
Hello,

I'm running nutch on a hadoop cluster, but unfortunately under
hadoop_home/logs I find no datanode logs, only a jobtracker log. I've not
modified nutch's log4j.properties nor hadoop's.
On the console I get the mapred.JobClient output, plus whatever the nutch
classes log directly before running as a job.


-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains [LON] or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
 Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with X.
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).


Re: No. of Map and reduce tasks

2011-05-26 Thread jagaran das
Hi Mohit,

No. of maps - depends on the total file size / block size.
No. of reducers - you can specify it.
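
E.g., a quick sketch with the new API (the job name is made up; the reducer
count is the only thing here you set directly):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "reducer-count-example");
    // Reducers: explicitly chosen by the job.
    job.setNumReduceTasks(4);
    // Maps: not set directly; one per input split, and with FileInputFormat a
    // split is normally one HDFS block, so roughly total input size / block size.
  }
}

From a Pig script, if I remember right, the PARALLEL clause on operators like
GROUP or ORDER serves the same purpose for the number of reducers.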

Regards,
Jagaran 




From: Mohit Anchlia mohitanch...@gmail.com
To: common-user@hadoop.apache.org
Sent: Thu, 26 May, 2011 2:48:20 PM
Subject: No. of Map and reduce tasks

How can I tell how the map and reduce tasks were spread accross the
cluster? I looked at the jobtracker web page but can't find that info.

Also, can I specify how many map or reduce tasks I want to be launched?

From what I understand is that it's based on the number of input files
passed to hadoop. So if I have 4 files there will be 4 Map taks that
will be launced and reducer is dependent on the hashpartitioner.


web site doc link broken

2011-05-26 Thread Lee Fisher

Th Hadoop Common home page:
http://hadoop.apache.org/common/
has a broken link (Learn About) to the docs. It tries to use:
http://hadoop.apache.org/common/docs/stable/
which doesn't exist (404). It should probably be:
http://hadoop.apache.org/common/docs/current/
Or, someone has deleted the stable docs, which I can't help you with. :-)
Thanks.


Re: No. of Map and reduce tasks

2011-05-26 Thread Mohit Anchlia
I ran a simple pig script on this file:

-rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

that orders the contents by name. But it only created one mapper. How
can I change this to distribute it across multiple machines?

On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi Mohit,

 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.



Re: No. of Map and reduce tasks

2011-05-26 Thread James Seigel
have more data for it to process :)


On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:
 
 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log
 
 that orders the contents by name. But it only created one mapper. How
 can I change this to distribute accross multiple machines?
 
 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi Mohit,
 
 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.
 
 Regards,
 Jagaran
 
 
 
 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks
 
 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.
 
 Also, can I specify how many map or reduce tasks I want to be launched?
 
 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.
 



Re: No. of Map and reduce tasks

2011-05-26 Thread Mohit Anchlia
I think I understand that from the last 2 replies :)  But my question is: can
I change this configuration to, say, split the file into 250K chunks so that
multiple mappers can be invoked?

On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote:
 have more data for it to process :)


 On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:

 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

 that orders the contents by name. But it only created one mapper. How
 can I change this to distribute accross multiple machines?

 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote:
 Hi Mohit,

 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.





Re: No. of Map and reduce tasks

2011-05-26 Thread James Seigel
Set the input split size really low and you might get something.

I'd rather you fire up some nix commands and pack that file onto itself
a bunch of times, then put it back into hdfs and let 'er rip.
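
If you do try the low-split-size route with a plain MapReduce job, this is
roughly what I mean (new API; from a Pig script you would have to pass the
equivalent split-size property instead, assuming the loader honours it):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SmallSplitExample {
  public static void main(String[] args) throws IOException {
    Job job = new Job(new Configuration(), "small-split-example");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Cap each split at 64 KB so even a ~200 KB file yields three or four maps.
    FileInputFormat.setMaxInputSplitSize(job, 64 * 1024);
  }
}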

Sent from my mobile. Please excuse the typos.

On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I think I understand that by last 2 replies :)  But my question is can
 I change this configuration to say split file into 250K so that
 multiple mappers can be invoked?

 On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote:
 have more data for it to process :)


 On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:

 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

 that orders the contents by name. But it only created one mapper. How
 can I change this to distribute accross multiple machines?

 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in 
 wrote:
 Hi Mohit,

 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.





Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster

2011-05-26 Thread Xu, Richard
Hi Folks,

We are trying to get hbase and hadoop running on clusters, using 2 Solaris 
servers for now.

Because of the incompatibility issue between hbase and hadoop, we have to stick 
with the hadoop 0.20.2-append release.

It is very straightforward to get hadoop-0.20.203 running, but we have been stuck 
for several days with hadoop-0.20.2, even the official release, not just the 
append version.

1. Once we try to run start-mapred.sh (hadoop-daemon.sh --config $HADOOP_CONF_DIR 
start jobtracker), the following errors are shown in the namenode and jobtracker logs:

2011-05-26 12:30:29,169 WARN 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough 
replicas, still in need of 1
2011-05-26 12:30:29,175 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 
on 9000, call addBlock(/tmp/hadoop-cfadm/mapred/system/jobtracker.info, DFSCl
ient_2146408809) from 169.193.181.212:55334: error: java.io.IOException: File 
/tmp/hadoop-cfadm/mapred/system/jobtracker.info could only be replicated to 0 n
odes, instead of 1
java.io.IOException: File /tmp/hadoop-cfadm/mapred/system/jobtracker.info could 
only be replicated to 0 nodes, instead of 1
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)


2. Also, Configured Capacity is 0; we cannot put any file into HDFS.

3. On the datanode server there are no errors in the logs, but the tasktracker 
log has the following suspicious thing:
2011-05-25 23:36:10,839 INFO org.apache.hadoop.ipc.Server: IPC Server 
Responder: starting
2011-05-25 23:36:10,839 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 41904: starting
2011-05-25 23:36:10,852 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 41904: starting
2011-05-25 23:36:10,853 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 41904: starting
2011-05-25 23:36:10,853 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 
on 41904: starting
2011-05-25 23:36:10,853 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 
on 41904: starting
.
2011-05-25 23:36:10,855 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
63 on 41904: starting
2011-05-25 23:36:10,950 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker 
up at: localhost/127.0.0.1:41904
2011-05-25 23:36:10,950 INFO org.apache.hadoop.mapred.TaskTracker: Starting 
tracker tracker_loanps3d:localhost/127.0.0.1:41904


I have tried all the suggestions found so far, including
 1) removing the hadoop-name and hadoop-data folders and reformatting the namenode;
 2) cleaning up all temp files/folders under /tmp;

But nothing works.

Your help is greatly appreciated.

Thanks,

RX


Re: No. of Map and reduce tasks

2011-05-26 Thread jagaran das
If you feed in really small files, the benefit of Hadoop's big block size 
goes away.
Instead, try merging the files.
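
E.g., a rough sketch (my own, paths made up) of packing many small local files
into one SequenceFile on HDFS, keyed by file name:

import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One SequenceFile on HDFS holding (file name -> file contents) pairs.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/user/mohit/packed.seq"),
        Text.class, BytesWritable.class);
    try {
      for (String name : args) {
        File f = new File(name);
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}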

Hope that helps




From: James Seigel ja...@tynt.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: Thu, 26 May, 2011 6:04:07 PM
Subject: Re: No. of Map and reduce tasks

Set input split size really low,  you might get something.

I'd rather you fire up some nix commands and pack together that file
onto itself a bunch if times and the put it back into hdfs and let 'er
rip

Sent from my mobile. Please excuse the typos.

On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I think I understand that by last 2 replies :)  But my question is can
 I change this configuration to say split file into 250K so that
 multiple mappers can be invoked?

 On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote:
 have more data for it to process :)


 On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote:

 I ran a simple pig script on this file:

 -rw-r--r-- 1 root root   208348 May 26 13:43 excite-small.log

 that orders the contents by name. But it only created one mapper. How
 can I change this to distribute accross multiple machines?

 On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in 
wrote:
 Hi Mohit,

 No of Maps - It depends on what is the Total File Size / Block Size
 No of Reducers - You can specify.

 Regards,
 Jagaran



 
 From: Mohit Anchlia mohitanch...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thu, 26 May, 2011 2:48:20 PM
 Subject: No. of Map and reduce tasks

 How can I tell how the map and reduce tasks were spread accross the
 cluster? I looked at the jobtracker web page but can't find that info.

 Also, can I specify how many map or reduce tasks I want to be launched?

 From what I understand is that it's based on the number of input files
 passed to hadoop. So if I have 4 files there will be 4 Map taks that
 will be launced and reducer is dependent on the hashpartitioner.





Re: Are hadoop fs commands serial or parallel

2011-05-26 Thread Mapred Learn
Hi guys,
Another related question: when you do hadoop fs -copyFromLocal, or use the
API to call fs.write(), does it write to the local filesystem first before
writing to HDFS? I read that it writes to the local file-system
until the block size is reached and then writes to HDFS.
Wouldn't the HDFS client choke, if it writes to the local filesystem, when
multiple such fs -copyFromLocal commands are running? I thought at least in
fs.write(), if you provide a byte array, it should not write to the local
file-system?

Could somebody explain how fs -copyFromLocal and fs.write() work? Do they
write to the local filesystem until the block size is reached and then write to
HDFS, or write directly to HDFS?
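
For reference, the kind of parallel upload via the Java API I have in mind is
roughly the sketch below (one thread per file; the target HDFS directory is made
up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelPut {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    Thread[] workers = new Thread[args.length];
    for (int i = 0; i < args.length; i++) {
      final String local = args[i];
      workers[i] = new Thread(new Runnable() {
        public void run() {
          try {
            FileSystem fs = FileSystem.get(conf);
            // copyFromLocalFile streams the local file up to HDFS.
            fs.copyFromLocalFile(new Path(local),
                new Path("/user/jj/in/" + new Path(local).getName()));
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      t.join();
    }
  }
}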

Thanks in advance,
-JJ

On Wed, May 18, 2011 at 9:39 AM, Patrick Angeles patr...@cloudera.comwrote:

 kinda clunky but you could do this via shell:

 for $FILE in $LIST_OF_FILES ; do
  hadoop fs -copyFromLocal $FILE $DEST_PATH 
 done

 If doing this via the Java API, then, yes you will have to use multiple
 threads.

 On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.com
 wrote:

  Thanks harsh !
  That means basically both APIs as well as hadoop client commands allow
 only
  serial writes.
  I was wondering what could be other ways to write data in parallel to
 HDFS
  other than using multiple parallel threads.
 
  Thanks,
  JJ
 
  Sent from my iPhone
 
  On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:
 
   Hello,
  
   Adding to Joey's response, copyFromLocal's current implementation is
  serial
   given a list of files.
  
   On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com
   wrote:
   Thanks Joey !
   I will try to find out abt copyFromLocal. Looks like Hadoop Apis write
   serially as you pointed out.
  
   Thanks,
   -JJ
  
   On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com
 wrote:
  
   The sequence file writer definitely does it serially as you can only
   ever write to the end of a file in Hadoop.
  
   Doing copyFromLocal could write multiple files in parallel (I'm not
   sure if it does or not), but a single file would be written serially.
  
   -Joey
  
   On Tue, May 17, 2011 at 5:44 PM, Mapred Learn 
 mapred.le...@gmail.com
   wrote:
   Hi,
   My question is when I run a command from hdfs client, for eg. hadoop
  fs
   -copyFromLocal or create a sequence file writer in java code and
  append
   key/values to it through Hadoop APIs, does it internally
  transfer/write
   data
   to HDFS serially or in parallel ?
  
   Thanks in advance,
   -JJ
  
  
  
  
   --
   Joseph Echeverria
   Cloudera, Inc.
   443.305.9434
  
  
   --
   Harsh J