Customizing machines to use for different jobs

2009-06-04 Thread Rakhi Khatwani
Hi, Can we specify which subset of machines to use for different jobs? E.g. we set machine A as the namenode, and B, C, D as datanodes. Then for job 1, we have a mapreduce that runs on B and C, and for job 2, the map-reduce runs on C and D. Regards, Raakhi

Re: streaming a binary processing file

2009-06-04 Thread Jeff Hammerbacher
Hey, If you don't want to wait for the release, you could try using the latest version of Cloudera's Distribution for Hadoop (see http://www.cloudera.com/hadoop), which is based on the 0.18.3 release of Apache Hadoop but has the HADOOP-1722 patch backported (see

Re: question about when shuffle/sort start working

2009-06-04 Thread Jianmin Woo
Oh, I see. Thanks. - Jianmin From: Sharad Agarwal shara...@yahoo-inc.com To: core-user@hadoop.apache.org Sent: Thursday, June 4, 2009 12:59:12 PM Subject: Re: question about when shuffle/sort start working Jianmin Woo wrote: Do you have some sample on the

Re: Sharing object between mappers on same node (reuse.jvm ?)

2009-06-04 Thread Kevin Peterson
On Wed, Jun 3, 2009 at 10:59 AM, Tarandeep Singh tarand...@gmail.com wrote: I want to share an object (Lucene Index Writer instance) between mappers running on the same node of 1 job (not across multiple jobs). Please correct me if I am wrong - If I set the -1 for the property:

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Aaron Kimball
Can you give an example of the exact arguments you're sending on the command line? - Aaron On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote: If after I call getConf to get the conf object, I manually add the key/value pair, it's there when I need it. So it feels like

Re: How do I convert DataInput and ResultSet to array of String?

2009-06-04 Thread Aaron Kimball
e.g. for readFields(): myItems = new ArrayList<String>(); int numItems = dataInput.readInt(); for (int i = 0; i < numItems; i++) { myItems.add(Text.readString(dataInput)); } then on the serialization (write) side, send: dataOutput.writeInt(myItems.size()); for (int i = 0; i < myItems.size(); i++)
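
Aaron's snippet above can be fleshed out into a self-contained round trip. This is a sketch using plain java.io, with writeUTF/readUTF standing in for Hadoop's Text.writeString/Text.readString (an assumption made so the example runs without Hadoop on the classpath):

```java
import java.io.*;
import java.util.*;

// Sketch of the readFields()/write() pattern from the thread, using plain
// java.io instead of Hadoop's Text helpers.
public class StringListWritable {
    private List<String> myItems = new ArrayList<String>();

    public List<String> getItems() { return myItems; }
    public void add(String s) { myItems.add(s); }

    // Serialize: first the count, then each string.
    public void write(DataOutput out) throws IOException {
        out.writeInt(myItems.size());
        for (String item : myItems) {
            out.writeUTF(item);
        }
    }

    // Deserialize: read the count, then read that many strings.
    public void readFields(DataInput in) throws IOException {
        myItems = new ArrayList<String>();
        int numItems = in.readInt();
        for (int i = 0; i < numItems; i++) {
            myItems.add(in.readUTF());
        }
    }

    public static void main(String[] args) throws IOException {
        StringListWritable original = new StringListWritable();
        original.add("alpha");
        original.add("beta");

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        StringListWritable copy = new StringListWritable();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getItems()); // round trip preserves the list
    }
}
```

Writing the count first is what lets readFields() know how many elements to read back, which is the key point of the thread.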

Re: Do I need to implement Readfields and Write Functions If I have Only One Field?

2009-06-04 Thread Aaron Kimball
If you don't add any member fields, then no, I don't think you need to change anything. - Aaron On Wed, Jun 3, 2009 at 4:11 PM, dealmaker vin...@gmail.com wrote: I have the following as my type of my value object. Do I need to implement readfields and write functions? private static

Re: Fastlz coming?

2009-06-04 Thread Johan Oskarsson
We're using Lzo still, works great for those big log files: http://code.google.com/p/hadoop-gpl-compression/ /Johan Kris Jirapinyo wrote: Hi all, In the remove lzo JIRA ticket https://issues.apache.org/jira/browse/HADOOP-4874 Tatu mentioned he was going to port fastlz from C to Java and

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Ian Soboroff
bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer input output The 'prise.collopts' option doesn't appear in the JobConf. Ian Aaron Kimball aa...@cloudera.com writes: Can you give an example of the exact arguments you're

Re: Subdirectory question revisited

2009-06-04 Thread Ian Soboroff
Here's how I solved the problem using a custom InputFormat... the key part is in listStatus(), where we traverse the directory tree. Since HDFS doesn't have links this code is probably safe, but if you have a filesystem with cycles you will get trapped. Ian import java.io.IOException; import
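
Since the listStatus() part of Ian's InputFormat is cut off above, here is a sketch of the same recursive traversal applied to a local directory with java.io.File. As Ian notes, HDFS has no links so plain recursion is safe there; this version tracks visited canonical paths as a precaution against symlink cycles on other filesystems:

```java
import java.io.File;
import java.io.IOException;
import java.util.*;

// Recursive directory walk in the spirit of the custom listStatus().
public class RecursiveLister {
    public static List<File> listFiles(File dir) throws IOException {
        List<File> result = new ArrayList<File>();
        Set<String> visited = new HashSet<String>();
        walk(dir, result, visited);
        return result;
    }

    private static void walk(File dir, List<File> result,
                             Set<String> visited) throws IOException {
        if (!visited.add(dir.getCanonicalPath())) {
            return; // already seen: a link cycle, stop here
        }
        File[] entries = dir.listFiles();
        if (entries == null) return; // not a directory or not readable
        for (File entry : entries) {
            if (entry.isDirectory()) {
                walk(entry, result, visited); // descend into subdirectory
            } else {
                result.add(entry);            // collect plain files
            }
        }
    }

    public static void main(String[] args) throws IOException {
        for (File f : listFiles(new File(args.length > 0 ? args[0] : "."))) {
            System.out.println(f.getPath());
        }
    }
}
```

In the HDFS version the same loop runs over FileStatus objects from FileSystem.listStatus() instead of File.listFiles().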

Re: problem getting map input filename

2009-06-04 Thread Rares Vernica
On 6/2/09, Rares Vernica rvern...@gmail.com wrote: I have a problem getting the map input file name. Here is what I tried: public class Map extends Mapper<Object, Text, LongWritable, Text> { public void map(Object key, Text value, Context context) throws IOException,

Re: *.gz input files

2009-06-04 Thread Ian Soboroff
If your case is like mine, where you have lots of .gz files and you don't want splits in the middle of those files, you can use the code I just sent in the thread about traversing subdirectories. In brief, your RecordReader could do something like: public static class MyRecordReader
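
The reason .gz inputs must not be split is that a gzip stream can only be decompressed from the start, so a RecordReader has to consume the whole file (an InputFormat typically enforces this by returning false from isSplitable()). A minimal sketch of reading one .gz file line by line, as such a reader would:

```java
import java.io.*;
import java.util.zip.*;

// Reads an entire gzipped text file; there is no way to seek into the
// middle of the compressed stream, which is why the file is one split.
public class GzipLineReader {
    public static java.util.List<String> readLines(File gzFile) throws IOException {
        java.util.List<String> lines = new java.util.ArrayList<String>();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(gzFile)), "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line); // each line becomes one record
            }
        } finally {
            reader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readLines(new File(args[0])).size() + " lines");
    }
}
```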

Re: Task files in _temporary not getting promoted out

2009-06-04 Thread jason hadoop
Are your tasks failing or completing successfully? Failed tasks have the output directory wiped; only successfully completed tasks have the files moved up. I don't recall if the FileOutputCommitter class appeared in 0.18 On Wed, Jun 3, 2009 at 6:43 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

Letting the Mapper handle multiple lines.

2009-06-04 Thread Per Stolpe
Hi. I'm quite new to Hadoop programming, so to get a good start I started writing my own program that summarizes a column in a large tab separated file (~100 000 000 lines). My first naive implementation was quite simple, a small rework of the WordCounter example that comes with Hadoop. This
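
A plain-Java sketch of the task Per describes, summing one column of a tab-separated file: in the MapReduce version each map() call would emit the column value and a reducer would add them up, while this sequential version just shows the per-line parsing a mapper performs (column index and field count checks are illustrative choices, not Per's code):

```java
import java.io.*;

// Sum one column of a tab-separated stream, line by line.
public class ColumnSummer {
    public static long sumColumn(BufferedReader in, int column) throws IOException {
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length > column) {
                sum += Long.parseLong(fields[column].trim());
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        try {
            System.out.println(sumColumn(in, Integer.parseInt(args[1])));
        } finally {
            in.close();
        }
    }
}
```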

Re: Fastlz coming?

2009-06-04 Thread Kris Jirapinyo
Is there any documentation on that site on how we can use lzo? I don't see any entries on the wiki page of the project. I see an entry on the Hadoop wiki (http://wiki.apache.org/hadoop/UsingLzoCompression) but seems like that's more oriented towards HBase. I am on hadoop 0.19.1. Thanks, Kris

Re: Fastlz coming?

2009-06-04 Thread Matt Massie
Kris- You might take a look at some of the previous lzo threads on this list for help. See: http://www.mail-archive.com/search?q=lzo&l=core-user%40hadoop.apache.org -Matt On Jun 4, 2009, at 10:29 AM, Kris Jirapinyo wrote: Is there any documentation on that site on how we can use lzo? I

Re: Customizing machines to use for different jobs

2009-06-04 Thread Alex Loddengaard
Hi Raakhi, Unfortunately there is no built-in way of doing this. You'd have to instantiate two entirely separate Hadoop clusters to accomplish what you're trying to do, which isn't an uncommon thing to do. I'm not sure why you're hoping to have this behavior, but the fair share scheduler might

Re: Sharing object between mappers on same node (reuse.jvm ?)

2009-06-04 Thread Tarandeep Singh
Thanks Kevin for the clarification. I ran a couple of tests as well and the system behaved exactly as you had said. So now the question is, how can I achieve what I want to do - share an object (Lucene IndexWriter instance) between mappers running on the same node. I thought of running the
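
One pattern this thread points toward: with mapred.job.reuse.jvm.num.tasks set to -1, all map tasks of a job on one node run in the same JVM, so a static singleton survives across them. A sketch of the holder pattern, where SharedWriter is a hypothetical stand-in for the Lucene IndexWriter (not code from the thread):

```java
// Per-JVM shared object: lazily created once, reused by every task that
// runs in the same reused JVM on this node.
public class SharedWriterHolder {
    private static SharedWriter instance;
    private static int initCount = 0;

    // Synchronized in case the task runner calls in from multiple threads.
    public static synchronized SharedWriter get() {
        if (instance == null) {
            instance = new SharedWriter();
            initCount++;
        }
        return instance;
    }

    public static synchronized int getInitCount() { return initCount; }

    // Hypothetical placeholder for an expensive resource like IndexWriter.
    public static class SharedWriter {
        public void addDocument(String doc) { /* index the document */ }
    }

    public static void main(String[] args) {
        SharedWriter a = SharedWriterHolder.get(); // first "map task"
        SharedWriter b = SharedWriterHolder.get(); // later task, same JVM
        System.out.println(a == b); // same instance is reused
    }
}
```

Note this only works within one JVM; separate JVMs on the same node (the default without JVM reuse) each get their own instance.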

Re: Letting the Mapper handle multiple lines.

2009-06-04 Thread HRoger
I have read your code; I think you should add job.setInputFormatClass(MultiLineInputFormat.class). When you don't set that, it uses TextInputFormat and the value is Text by default. You may have thought that MultiLineInputFormat.addInputPath() would set the InputFormatClass automatically, but it doesn't do

Re: Fastlz coming?

2009-06-04 Thread Kris Jirapinyo
Thanks Matt. Hopefully we can have a new page on the hadoop wiki on how to use custom compression so that people won't have to go search through the threads to find the answer in the future. On Thu, Jun 4, 2009 at 10:33 AM, Matt Massie m...@cloudera.com wrote: Kris- You might take a look at

Processing files lying in a directory structure

2009-06-04 Thread akhil1988
Hi! I am working on applying the WordCount example to the entire Wikipedia dump. The entire English Wikipedia is around 200GB, which I have stored in HDFS in a cluster to which I have access. The problem: the Wikipedia dump contains many directories (it has a very big directory structure) containing

Subscription

2009-06-04 Thread Akhil langer
Please add me to the hadoop-core user mailing list. email address: *akhilan...@gmail.com* Thank You! Akhil

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Vasyl Keretsman
Perhaps there should not be a space between -D and your option? -Dprise.collopts= Vasyl 2009/6/4 Ian Soboroff ian.sobor...@nist.gov: bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer input output The

Cluster Setup Issues : Datanode not being initialized.

2009-06-04 Thread asif md
Hello all, I'm trying to set up a two-node remote cluster using the following tutorials { NOTE: I'm ignoring the tmp directory property in hadoop-site.xml suggested by Michael } Running Hadoop On Ubuntu Linux (Single-Node Cluster) - Michael G.

Re: Subscription

2009-06-04 Thread Aaron Kimball
You need to send a message to core-user-subscr...@hadoop.apache.org from the address you want registered. See http://hadoop.apache.org/core/mailing_lists.html - Aaron On Thu, Jun 4, 2009 at 12:10 PM, Akhil langer akhilan...@gmail.com wrote: Please, add me to the hadoop-core user mailing list.

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Tom White
Actually, the space is needed, to be interpreted as a Hadoop option by ToolRunner. Without the space it sets a Java system property, which Hadoop will not automatically pick up. Ian, try putting the options after the classname and see if that helps. Otherwise, it would be useful to see a snippet

Re: Fastlz coming?

2009-06-04 Thread Owen O'Malley
On Jun 4, 2009, at 11:19 AM, Kris Jirapinyo wrote: Hopefully we can have a new page on the hadoop wiki on how to use custom compression so that people won't have to go search through the threads to find the answer in the future. Yes, it would be extremely useful if you could start a wiki

Re: Cluster Setup Issues : Datanode not being initialized.

2009-06-04 Thread Ravi Phulari
From the logs it looks like your Hadoop cluster is facing two different issues. At the slave: 1. exception: java.net.NoRouteToHostException: No route to host in your logs. Diagnosis - one of your nodes cannot be reached correctly. Make sure you can ssh to your master and slave and passwordless ssh keys

Re: Letting the Mapper handle multiple lines.

2009-06-04 Thread Per Stolpe
I did indeed think that addInputPath() set the InputFormat class, so this is probably what has been my problem. I'll try this when I gain access to my cluster again on Monday, but I'm fairly confident that this will fix my program. Thank you very much for a good answer. Take care, I will post

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Raghu Angadi
Tom White wrote: Actually, the space is needed, to be interpreted as a Hadoop option by ToolRunner. Without the space it sets a Java system property, which Hadoop will not automatically pick up. I don't think space is required. Something like -Dfs.default.name=host:port works. I don't see
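
To make the point Tom and Raghu are debating concrete, here is a toy sketch modeled on what a GenericOptionsParser-style parser does (this is an illustration, not Hadoop's actual code): both the "-D key=value" and "-Dkey=value" forms end up in the configuration, but only if they appear where ToolRunner sees them, i.e. among the generic options before the remaining application arguments.

```java
import java.util.*;

// Toy -D parser: collects key=value pairs, passes everything else through.
public class DashDParser {
    public static Map<String, String> parse(String[] args, List<String> remaining) {
        Map<String, String> conf = new LinkedHashMap<String, String>();
        for (int i = 0; i < args.length; i++) {
            String kv = null;
            if (args[i].equals("-D") && i + 1 < args.length) {
                kv = args[++i];            // "-D key=value" form (with space)
            } else if (args[i].startsWith("-D")) {
                kv = args[i].substring(2); // "-Dkey=value" form (no space)
            }
            if (kv != null) {
                int eq = kv.indexOf('='); // assumes well-formed key=value
                conf.put(kv.substring(0, eq), kv.substring(eq + 1));
            } else {
                remaining.add(args[i]);   // application arguments pass through
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        List<String> rest = new ArrayList<String>();
        Map<String, String> conf = parse(
                new String[] {"-D", "prise.collopts=collopts", "input", "output"}, rest);
        System.out.println(conf + " remaining=" + rest);
    }
}
```

Under this model both of Ian's forms would parse; what matters is that the job uses ToolRunner (or otherwise runs GenericOptionsParser) so the -D options are consumed at all.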

Re: Cluster Setup Issues : Datanode not being initialized.

2009-06-04 Thread asif md
I can SSH both ways, i.e. from master to slave and slave to master. The datanode is getting initialized at master but the log at slave looks like this: 2009-06-04 15:20:06,066 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:

Re: Cluster Setup Issues : Datanode not being initialized.

2009-06-04 Thread Raghu Angadi
Did you try 'telnet 198.55.35.229 54310' from this datanode? The log show that it is not able to connect to master:54310. ssh from datanode does not matter. Raghu. asif md wrote: I can SSH both ways .i.e. From master to slave and slave to master. the datanode is getting intialized at
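
A Java equivalent of the "telnet host port" check Raghu suggests: try to open a TCP connection to the namenode's RPC port with a short timeout. The host and port defaults below are placeholders taken from this thread; substitute your master's address.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Quick TCP reachability check, like "telnet host port".
public class PortCheck {
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // no route, connection refused, or timed out
        } finally {
            try { socket.close(); } catch (IOException ignored) { }
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "198.55.35.229";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 54310;
        System.out.println(host + ":" + port + " reachable: "
                + canConnect(host, port, 3000));
    }
}
```

If this returns false from the slave, the problem is the network or firewall between the nodes, not Hadoop itself.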

Re: Cluster Setup Issues : Datanode not being initialized.

2009-06-04 Thread asif md
@ Ravi. Not able to do that. On Thu, Jun 4, 2009 at 5:38 PM, Raghu Angadi rang...@yahoo-inc.com wrote: Did you try 'telnet 198.55.35.229 54310' from this datanode? The log show that it is not able to connect to master:54310. ssh from datanode does not matter. Raghu. asif md wrote: I

Question about Hadoop filesystem

2009-06-04 Thread Harold Lim
How do I remove a datanode? Do I simply destroy my datanode and the namenode will automatically detect it? Is there a more elegant way to do it? Also, when I remove a datanode, does hadoop automatically re-replicate the data right away? Thanks, Harold

Re: Question about Hadoop filesystem

2009-06-04 Thread Brian Bockelman
It's in the FAQ: http://wiki.apache.org/hadoop/FAQ#17 Brian On Jun 4, 2009, at 6:26 PM, Harold Lim wrote: How do I remove a datanode? Do I simply destroy my datanode and the namenode will automatically detect it? Is there a more elegent way to do it? Also, when I remove a datanode,
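
The FAQ answer Brian points to boils down to the exclude-file mechanism. A sketch of the namenode-side configuration, where the file path is an example:

```xml
<!-- In hadoop-site.xml on the namenode: point dfs.hosts.exclude at a file
     listing the datanodes to decommission (path below is an example). -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/excludes</value>
</property>
```

After adding the datanode's hostname to that file, running bin/hadoop dfsadmin -refreshNodes starts decommissioning: the namenode re-replicates the node's blocks elsewhere before marking it decommissioned, so the node should not simply be destroyed first.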

HBase v0.19.3 with Hadoop v0.19.1?

2009-06-04 Thread Amandeep Khurana
I have a couple of questions: 1. Is Hbase 0.19.3 release stable for a production cluster? 2. Can it be deployed over Hadoop v0.19.1? ..amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz

Re: Fastlz coming?

2009-06-04 Thread Hong Tang
Using com.hadoop.compression.lzo.LzoCodec is not much different from using other codecs: add the hadoop-gpl-compression-0.1.0-dev.jar to your classpath, and add the path to the native library libgplcompression.so to the system property java.library.path. Hope this helps, Hong On Jun 4,
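
Hong's two steps can be paired with registering the codec in the configuration. A sketch, using the class names from his message (the exact codec list is an example, not a complete one):

```xml
<!-- hadoop-site.xml sketch: register the LZO codec alongside the default. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

The jar goes on the classpath and the native library directory is passed as -Djava.library.path=/path/to/native (or via the JVM options in hadoop-env.sh), so libgplcompression.so can be loaded at runtime.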

Hadoop scheduling question

2009-06-04 Thread Kristi Morton
Hi, I'm a Hadoop 17 user who is doing research with Prof. Magda Balazinska at the University of Washington on an improved progress indicator for Pig Latin. We have a question regarding how Hadoop schedules Pig Latin queries with JOIN operators. Does Hadoop schedule all MapReduce jobs in a

Re: Hadoop scheduling question

2009-06-04 Thread Pankil Doshi
Hello Kristi, I am a Research Assistant at the University of Texas at Dallas. We are working on RDF data and we come across many joins in our queries. But we are not able to carry out all joins in a single job.. we also tried our hadoop code using Pig scripts and found that for each join in the PIG script