Managing configurations

2010-08-18 Thread Mark
What is the preferred way of managing multiple configurations, i.e. development, production, etc.? Is there some way I can tell Hadoop to use a separate conf directory other than ${hadoop_home}/conf? I think I've read somewhere that one should just create multiple conf directories and then

Re: Managing configurations

2010-08-19 Thread Mark
On 8/18/10 9:45 PM, Hemanth Yamijala wrote: Mark, On Wed, Aug 18, 2010 at 10:59 PM, Markstatic.void@gmail.com wrote: What is the preferred way of managing multiple configurations.. ie development, production etc. Is there someway I can tell hadoop to use a separate conf directory

Latest Api

2010-08-19 Thread Mark
Where can I find out information on the latest hadoop api? Are the examples here using the latest? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html I am trying to go through all the example classes bundled w/ hadoop but I keep seeing the api change so I am unsure of which to

Re: Latest Api

2010-08-19 Thread Mark
On 8/19/10 7:46 AM, Mark wrote: Where can I find out information on the latest hadoop api? Are the examples here using the latest? http://hadoop.apache.org/common/docs/current/mapred_tutorial.html I am trying to go through all the example classes bundled w/ hadoop but I keep seeing the api

InputSplit and RecordReader

2010-08-19 Thread Mark
From what I understand the InputSplit is a byte slice of a particular file which is then handed off to an individual mapper for processing. Is the size of the InputSplit equal to the Hadoop block, i.e. 64/128MB? If not, what is the size? Now the RecordReader takes in bytes from the InputSplit

Basic question

2010-08-25 Thread Mark
job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); Does this mean the input to the reducer should be Text/IntWritable or the output of the reducer is Text/IntWritable? What is the inverse of this.. setInputKeyClass/setInputValueClass? Is this inferred by the

Writing to an existing directory

2010-08-26 Thread Mark
Exception in thread main org.apache.hadoop.fs.FileAlreadyExistsException: Output directory playground/output already exists. Is there any way to force writing to an existing directory? It's quite annoying to keep specifying a separate output directory on each run.. especially when my task
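There is no switch in FileOutputFormat to overwrite an existing output directory; a common workaround (an assumption here, not something stated in this thread) is to have the driver delete the directory before submitting the job. A minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CleanOutputDir {
  // Deletes the output directory if it already exists so the job can be
  // resubmitted with the same path. "playground/output" is the path from
  // the error message above; any path works.
  public static void deleteIfExists(Configuration conf, String dir) throws Exception {
    Path out = new Path(dir);
    FileSystem fs = FileSystem.get(conf);
    if (fs.exists(out)) {
      fs.delete(out, true);  // true = recursive delete
    }
  }
}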

KeyValueTextInputFormat

2010-08-26 Thread Mark
When I configure my job to use a KeyValueTextInputFormat doesn't that imply that the key and value to my mapper will be both Text? I have it set up like this and I am using the default Mapper.class ie IdentityMapper - KeyValueTextInputFormat.addInputPath(job, new Path(otherArgs[0])); but I

Re: KeyValueTextInputFormat

2010-08-27 Thread Mark
On 8/26/10 7:47 PM, newpant wrote: Hi, do you use JobConf.setInputFormat(KeyValueTextInputFormat.class) to set the input format class ? Default input format class is TextInputFormat, and the Key type is LongWritable, which store offset of lines in the file (in byte) if your reducer accept a

Ivy

2010-08-27 Thread Mark
Is there a public ivy repo that has the latest hadoop? Thanks

Re: Ivy

2010-08-28 Thread Mark
On 8/27/10 9:25 AM, Owen O'Malley wrote: On Aug 27, 2010, at 8:04 AM, Mark wrote: Is there a public ivy repo that has the latest hadoop? Thanks The hadoop jars and poms should be pushed into the central Maven repositories, which Ivy uses. -- Owen I am looking for the latest version

Classpath

2010-08-28 Thread Mark
How can I add jars to Hadoop's classpath when running MapReduce jobs for the following situations? 1) Assuming that the jars are local to the nodes that are running the job. 2) The jars are only local to the client submitting the job. I'm assuming I can just jar up all required jars into the main job

Command line arguments

2010-08-29 Thread Mark
How can I access command line arguments from a Map or Reduce job? Thanks

Job in 0.21

2010-08-29 Thread Mark
How should I be creating a new Job instance in 0.21. It looks like Job(Configuration conf, String jobName) has been deprecated. It looks like Job(Cluster cluster) is the new way but I'm unsure of how to get a handle to the current cluster. Can someone advise. Thanks!

Re: Command line arguments

2010-08-29 Thread Mark
On 8/29/10 4:26 PM, Owen O'Malley wrote: You would need to save the arguments into the Configuration (aka JobConf) that you create your job with. -- Owen Thanks I just realized that.
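A minimal sketch of the approach Owen describes, using the new mapreduce API; the property name myjob.threshold is purely illustrative:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ArgPassingExample {

  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int threshold;

    @Override
    protected void setup(Context context) {
      // Read the value back from the job configuration on the task side.
      threshold = context.getConfiguration().getInt("myjob.threshold", 0);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // ... use threshold ...
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Stash a command line argument in the Configuration at submit time.
    conf.setInt("myjob.threshold", Integer.parseInt(args[2]));
    Job job = new Job(conf, "arg passing example");
    job.setJarByClass(ArgPassingExample.class);
    job.setMapperClass(MyMapper.class);
    // ... input/output paths, output key/value classes, etc. ...
  }
}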

Re: Classpath

2010-08-30 Thread Mark
On 8/29/10 10:38 PM, Amareshwari Sri Ramadasu wrote: You can use -libjars option. On 8/29/10 10:59 AM, Markstatic.void@gmail.com wrote: How can I add jars to Hadoops classpath when running MapReduce jobs for the following situations? 1) Assuming that the jars are local the nodes that

Re: Classpath

2010-08-30 Thread Mark
On 8/30/10 7:38 AM, Mark wrote: On 8/29/10 10:38 PM, Amareshwari Sri Ramadasu wrote: You can use -libjars option. On 8/29/10 10:59 AM, Markstatic.void@gmail.com wrote: How can I add jars to Hadoops classpath when running MapReduce jobs for the following situations? 1) Assuming
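For -libjars to take effect, the driver normally has to run through ToolRunner so that GenericOptionsParser strips the generic options (-libjars, -files, -D, ...) before the job's own arguments are handled. A rough sketch, with a hypothetical driver class name:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already has the generic options applied by ToolRunner.
    Job job = new Job(getConf(), "my job");
    job.setJarByClass(MyDriver.class);
    // ... mapper, reducer, input/output paths taken from args ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // e.g.: hadoop jar myjob.jar MyDriver -libjars /path/dep1.jar,/path/dep2.jar in out
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}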

Writable questions

2010-08-31 Thread Mark
I have a question regarding outputting Writable objects. I thought all Writables know how to serialize themselves to output. For example I have an ArrayWritable of strings (or Texts) but when I output it to a file it shows up as 'org.apache.hadoop.io.arraywrita...@21f7186f' Am I missing

Re: Writable questions

2010-08-31 Thread Mark
On 8/31/10 10:04 AM, Steve Hoffman wrote: That is the default 'toString()' output of any Java object. If you want your custom Writable to print something different you have to override the toString() method. Steve On Tue, Aug 31, 2010 at 11:58 AM, Markstatic.void@gmail.com wrote: I
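A minimal sketch of the fix Steve describes: subclass ArrayWritable so the value class is fixed to Text and toString() produces something readable. The class name is illustrative:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;

public class TextArrayWritable extends ArrayWritable {
  public TextArrayWritable() {
    super(Text.class); // no-arg constructor so Hadoop can instantiate it
  }

  @Override
  public String toString() {
    // Join the elements instead of printing the default Object hash.
    StringBuilder sb = new StringBuilder();
    for (String s : toStrings()) {
      if (sb.length() > 0) sb.append(",");
      sb.append(s);
    }
    return sb.toString();
  }
}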

Re: Writable questions

2010-08-31 Thread Mark
On 8/31/10 10:07 AM, David Rosenstrauch wrote: On 08/31/2010 12:58 PM, Mark wrote: I have a question regarding outputting Writable objects. I thought all Writables know how to serialize themselves to output. For example I have an ArrayWritable of strings (or Texts) but when I output

Re: Writable questions

2010-09-02 Thread Mark
On 9/1/10 11:28 PM, Lance Norskog wrote: Wait- you want a print-to-user method or a 'serialize/deserialize' method? On Tue, Aug 31, 2010 at 2:42 PM, David Rosenstrauchdar...@darose.net wrote: On 08/31/2010 02:09 PM, Mark wrote: On 8/31/10 10:07 AM, David Rosenstrauch wrote: On 08/31/2010

Hadoop distributed mode

2010-09-06 Thread Mark
I am trying to configure our distributed Hadoop setup but for some reason my datanodes cannot connect to the namenode. 2010-09-06 04:06:05,040 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.101.102.11:9000. Already tried 0 time(s). 2010-09-06 04:06:06,041 INFO

Re: Hadoop distributed mode

2010-09-06 Thread Mark
Have any idea on what conf that may be? I am able to ssh into the namenode from the datanodes though. Looks like the name node is starting the web-server on 0.0.0.0. Is this something I should change? 2010-09-06 03:13:12,599 INFO org.mortbay.log: Started

Re: Hadoop distributed mode

2010-09-06 Thread Mark
pinging the master-hadoop from the datanode resolves fine however if I try to curl master-hadoop:50070 from the datanodes I get curl: (7) couldn't connect to host On 9/6/10 11:26 AM, Ranjib Dey wrote: ya, firewall is the most common issue. apart from that take a look at the namenode log

Re: Hadoop distributed mode

2010-09-06 Thread Mark
FYI I am running fedora 9. Anything I need to configure to allow access? On 9/6/10 11:33 AM, Mark wrote: pinging the master-hadoop from the datanode resolves fine however if I try to curl master-hadoop:50070 from the datanodes I get curl: (7) couldn't connect to host On 9/6/10 11:26 AM

Re: Hadoop distributed mode

2010-09-06 Thread Mark
Nevermind. It was the default Fedora firewall that was configured. Disabled it and everything worked as it should. Thanks for the help On 9/6/10 11:36 AM, Mark wrote: FYI I am running fedora 9. Anything I need to configure to allow access? On 9/6/10 11:33 AM, Mark wrote: pinging

Client access

2010-09-06 Thread Mark
How do I go about uploading content from a remote machine to the hadoop cluster? Do I have to first move the data to one of the nodes and then do a fs -put or is there some client I can use to just access an existing cluster? Thanks

Re: Client access

2010-09-06 Thread Mark
Thanks.. ill give that a try On 9/6/10 2:02 PM, Harsh J wrote: Java: You can use a DFSClient instance with a proper config object (Configuration) from right about anywhere - basically all that matters is the right fs.default.name value, which is your namenode's communication point. Can even
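A minimal sketch of Harsh's suggestion, here using the FileSystem API rather than DFSClient directly; the namenode host/port and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemotePut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the cluster's namenode; works from any machine
    // that can reach it over the network.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/data/file.txt"),
                         new Path("/user/mark/file.txt"));
    fs.close();
  }
}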

Namenode Datanode

2010-09-07 Thread Mark
In a small cluster (4 machines) with not many jobs can a Namenode be configured as a Datanode so it can assist in our MR tasks. If so how is this configured? Is it just simply marking it as such in the slaves file? Thanks

Re: Namenode Datanode

2010-09-07 Thread Mark
Thanks On 9/7/10 9:16 AM, abhishek sharma wrote: On Tue, Sep 7, 2010 at 12:09 PM, Markstatic.void@gmail.com wrote: In a small cluster (4 machines) with not many jobs can a Namenode be configured as a Datanode so it can assist in our MR tasks. If so how is this configured? Is it just

Error starting namenode

2010-09-08 Thread Mark
I am getting the following errors from my datanodes when I start the namenode. 2010-09-08 14:17:40,690 INFO org.apache.hadoop.ipc.RPC: Server at hadoop1/10.XXX.XXX.XX:9000 not available yet, Z... 2010-09-08 14:17:42,690 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:

Re: Error starting namenode

2010-09-09 Thread Mark
Fedora actually On 9/9/10 2:20 AM, Steve Loughran wrote: On 09/09/10 06:28, Mark wrote: I am getting the following errors from my datanodes when I start the namenode. 2010-09-08 14:17:40,690 INFO org.apache.hadoop.ipc.RPC: Server at hadoop1/10.XXX.XXX.XX:9000 not available yet, Z... 2010

Re: Error starting namenode

2010-09-09 Thread Mark
Wait what? On 9/8/10 10:34 PM, Adarsh Sharma wrote: Mark just write down the following lines in /etc/hosts file Ip-address hostname e. g. 192.168.0.111 ws-test 192.1685.0.165 rahul samr for all nodes Mark wrote: I am getting the following errors from my datanodes when I start

Re: Error starting namenode

2010-09-09 Thread Mark
nm.. got it On 9/9/10 9:55 AM, Mark wrote: Wait what? On 9/8/10 10:34 PM, Adarsh Sharma wrote: Mark just write down the following lines in /etc/hosts file Ip-address hostname e. g. 192.168.0.111 ws-test 192.1685.0.165 rahul samr for all nodes Mark wrote: I am getting

Connecting remotely

2010-09-09 Thread Mark
How would I connect remotely to HDFS as a different user? I.e., on my machine I'm logged in as mark but I want to log in as root. Thanks

Re: Connecting remotely

2010-09-09 Thread Mark
Thats it. Thanks On 9/9/10 12:59 PM, Mithila Nagendra wrote: Try sudo instead. On Thu, Sep 9, 2010 at 12:25 PM, Markstatic.void@gmail.com wrote: How would I connect remotely to HDFS as a different user? ie, in my machine im logged in as mark but I want to log in as root. Thanks

Stopping jobs

2010-09-09 Thread Mark
How would I go about gracefully stopping/aborting a job?

Submitting jobs to run in the background

2010-09-09 Thread Mark
I am trying to run a mahout example which consists of multiple jobs (4). If I put the process in the background and log off the machine then Hadoop will only finish the job it is currently running. Is there a way around this? Thanks

Question on classpath

2010-09-10 Thread Mark
If I submit a jar that has a lib directory that contains a bunch of jars, shouldn't those jars be in the classpath and available to all nodes? The reason I ask this is because I am trying to submit a jar myjar.jar that has the following structure --src \ (My source classes) -- lib \

Re: Question on classpath

2010-09-10 Thread Mark
I don't know. I'm running in a fully distributed environment, i.e. not local or pseudo. On 9/10/10 12:03 PM, Allen Wittenauer wrote: On Sep 10, 2010, at 11:53 AM, Mark wrote: If I submit a jar that has a lib directory that contains a bunch of jars, shouldn't those jars be in the classpath

Re: Question on classpath

2010-09-10 Thread Mark
If I deploy 1 jar (that contains a lib directory with all the required dependencies) shouldn't that jar inherently be distributed to all the nodes? On 9/10/10 2:49 PM, Mark wrote: I don't know. I'm running in a fully distributed environment, i.e. not local or pseudo. On 9/10/10 12:03 PM

Dumping Cassandra into Hadoop

2010-10-19 Thread Mark
As the subject implies I am trying to dump Cassandra rows into Hadoop. What is the easiest way for me to accomplish this? Thanks. Should I be looking into pig for something like this?

Java heap space error on PFPGrowth

2010-11-11 Thread Mark
I am trying to run PFPGrowth but I keep receiving this Java heap space error at the end of the first step/beginning of second step. I am using the following parameters: -method mapreduce -regex [\\t] -s 5 -g 55000 Output: .. 10/11/11 08:12:56 INFO mapred.JobClient: map 100% reduce

Hadoop/Elastic MR on AWS

2010-12-09 Thread Mark
Does anyone have any thoughts/experiences on running Hadoop in AWS? What are some pros/cons? Are there any good AMI's out there for this? Thanks for any advice.

Hive import question

2010-12-14 Thread Mark
When I load a file from HDFS into Hive I notice that the original file has been removed. Is there any way to prevent this? If not, how can I go back and dump it as a file again? Thanks

Re: Hive import question

2010-12-15 Thread Mark
Exactly what I was looking for. Thanks On 12/14/10 8:53 PM, 김영우 wrote: Hi Mark, You can use 'External table' in Hive. http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL http://wiki.apache.org/hadoop/Hive/LanguageManual/DDLHive external table does not move or delete files. - Youngwoo 2010

Hive Partitioning

2010-12-15 Thread Mark
Can someone explain what partitioning is and why it would be used.. example? Thanks

Max Jobconf exceeded

2011-03-05 Thread Mark
I'm running the Mahout Frequent Pattern Mining Job (org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver) and I keep receiving the following: Caused by: java.io.IOException: Exceeded max jobconf size: 94278797 limit: 524288 Can someone explain the cause of this and more importantly the

Hadoop heap and other memory settings

2011-03-09 Thread Mark
How should I be configuring the heap and memory settings for my cluster? Currently the only settings we use are: HADOOP_HEAPSIZE=8192 (in hadoop-env.sh) mapred.child.java.opts=8192 (in mapred-site.xml) I have a feeling these settings are completely off. The only reason we increased it this

Question on Master

2011-03-16 Thread Mark
I know the master node is responsible for the namenode and job tracker, but other than that is there any data stored on that machine? Basically what I am asking is: should there be a generous amount of free space on that machine? So for example I have a large drive I want to swap out of my master

Re: Question on Master

2011-03-16 Thread Mark
Ok thanks for the clarification. Just to be sure though.. - The master will have the ${dfs.name.dir} but not ${dfs.data.dir} - The nodes will have ${dfs.data.dir} but not ${dfs.name.dir} Is that correct? On 3/16/11 10:43 AM, Harsh J wrote: NameNode and JobTracker do not require a lot of

Cloudera Flume

2011-03-16 Thread Mark
Sorry if this is not the correct list to post this on, it was the closest I could find. We are using a taildir('/var/log/foo/') source on all of our agents. If this agent goes down and data can not be sent to the collector for some time, what happens when this agent becomes available again?

Chukwa - Lightweight agents

2011-03-20 Thread Mark
Sorry, but it doesn't look like the Chukwa mailing list exists anymore? Is there an easy way to set up lightweight agents on a cluster of machines instead of downloading the full Chukwa source (+50mb)? Has anyone built separate RPMs for the agents/collectors? Thanks

Chukwa?

2011-03-20 Thread Mark
Whats the deal with Chukwa? Mailing list doesn't look like its alive as well as any of the download options??? http://www.apache.org/dyn/closer.cgi/incubator/chukwa/ Is this project dead?

Re: Chukwa - Lightweight agents

2011-03-20 Thread Mark
is the primary mission of chukwa. If you are looking for data import, what about Flume? On Sun, Mar 20, 2011 at 9:59 AM, Mark static.void@gmail.com mailto:static.void@gmail.com wrote: Thanks but we need Chukwa to aggregate and store files from across our app servers into Hadoop. Doesn't

Re: Chukwa?

2011-03-21 Thread Mark
Is Chukwa primarily used for analytics or log aggregation? I thought it was the latter but it seems more and more it's like the former. On 3/21/11 8:27 AM, Eric Yang wrote: Chukwa is waiting on a official release of Hadoop and HBase which works together. In Chukwa trunk, Chukwa is using HBase

Setting input paths

2011-04-06 Thread Mark
How can I tell my job to include all the subdirectories and their content of a certain path? My directory structure is as follows: logs/{YEAR}/{MONTH}/{DAY} and I tried setting my input path to 'logs/' using FileInputFormat.addInputPath however I keep receiving the following error:

Re: Setting input paths

2011-04-06 Thread Mark
Ok so the behavior is a little different when using FileInputFormat.addInputPath as opposed to using pig. Ill try the glob. Thanks On 4/6/11 8:41 AM, Robert Evans wrote: I believe that opening a directory as a file will result in a file not found. You probably need to set it to a glob,
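A minimal sketch of the glob approach Robert suggests, assuming the logs/{YEAR}/{MONTH}/{DAY} layout from the original question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GlobInputExample {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "glob input example");
    // One glob component per directory level: logs/{YEAR}/{MONTH}/{DAY}
    FileInputFormat.addInputPath(job, new Path("logs/*/*/*"));
    // A narrower glob also works, e.g. a single month:
    // FileInputFormat.addInputPath(job, new Path("logs/2011/04/*"));
  }
}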

Shared lib?

2011-04-08 Thread Mark
If I have some jars that I would like including into ALL of my jobs is there some shared lib directory on HDFS that will be available to all of my nodes? Something similar to how Oozie uses a shared lib directory when submitting workflows. As of right now I've been cheating and copying these

Backing up namenode

2011-06-04 Thread Mark
How would I go about backing up our namenode data? I set up the secondarynamenode on a separate physical machine but just as a backup I would like to save the namenode data. Thanks

Re: Backing up namenode

2011-06-06 Thread Mark
? Thanks On 6/5/11 12:44 AM, sulabh choudhury wrote: Hey Mark, If you add more than one directory (comma separated) in the variable dfs.name.dir it would automatically be copied in all those location. On Sat, Jun 4, 2011 at 10:14 AM, Markstatic.void@gmail.com wrote: How would I go backing up

Configuration settings

2011-06-21 Thread Mark
We have a small 4-node cluster; the machines have 12GB of RAM and the CPUs are Quad Core Xeons. I'm assuming the defaults aren't that generous, so what are some configuration changes I should make to take advantage of this hardware? Max map tasks? Max reduce tasks? Anything else? Thanks

Writing out a single file

2011-07-05 Thread Mark
Is there any way I can write out the results of my mapreduce job into 1 local file... i.e. the opposite of getmerge? Thanks

Default Compression

2011-10-31 Thread Mark
I recently added the following to my core-site.xml: <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec </value> </property> However when I try and test a simple MR job I am

Re: Default Compression

2011-10-31 Thread Mark
That did it. Thanks On 10/31/11 12:52 PM, Joey Echeverria wrote: Try getting rid of the extra spaces and new lines. -Joey On Mon, Oct 31, 2011 at 1:49 PM, Markstatic.void@gmail.com wrote: I recently added the following to my core-site.xml <property> <name>io.compression.codecs</name> <value>

Upgrading master hardware

2011-11-16 Thread Mark
We will be adding more memory to our master node in the near future. We generally don't mind if our map/reduce jobs are unable to run for a short period but we are more concerned about the impact this may have on our HBase cluster. Will HBase continue to work while Hadoop's name-node and/or

Re: Use text vs binary output for speed?

2009-07-24 Thread Mark Kerzner
Maybe it was slow for me because I was writing from file system to HDFS, but now that I am using Amazon's MR, it will be OK. Thank you, Mark On Fri, Jul 24, 2009 at 3:19 PM, Owen O'Malley omal...@apache.org wrote: On Jul 24, 2009, at 1:15 PM, Mark Kerzner wrote: SequenceFileOutputFormat

Breaking up maps into separate files?

2009-07-24 Thread Mark Kerzner
MultipleTextOutputFormat to accomplish this? Thank you, Mark

Re: Output of a Reducer as a zip file?

2009-07-26 Thread Mark Kerzner
Now I am trying to do this: open a ZipOutputStream in the static part of the Reducer, such as in configure(), then keep writing to this stream. I see two potential problems: cleanup in case of failure - I saw this discussed - and I don't know when to close the stream. Thank you, Mark On Sun, Jul

Re: Why is single reducer called twice?

2009-07-28 Thread Mark Kerzner
Thank you, worked like a charm On Tue, Jul 28, 2009 at 12:33 AM, Ted Dunning ted.dunn...@gmail.com wrote: Not safely. But you can use the close() method to do the same thing. On Mon, Jul 27, 2009 at 8:21 AM, Mark Kerzner markkerz...@gmail.com wrote: Can I use this side effect to close

Re: Output of a Reducer as a zip file?

2009-07-28 Thread Mark Kerzner
Worked great.Thank you On Tue, Jul 28, 2009 at 7:44 AM, Jason Venner jason.had...@gmail.comwrote: You close the zip stream in the close method of the reducer. You will get an error if no data has been written to the stream. On Sun, Jul 26, 2009 at 9:35 PM, Mark Kerzner markkerz...@gmail.com
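A rough sketch (old mapred API) of the pattern Jason describes: open the ZipOutputStream in configure() and close it in close(). The output path and entry names are placeholders, and error handling is mostly omitted:

import java.io.IOException;
import java.util.Iterator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ZipReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private ZipOutputStream zip;

  @Override
  public void configure(JobConf job) {
    try {
      // Open the zip stream once, before the first reduce() call.
      FileSystem fs = FileSystem.get(job);
      zip = new ZipOutputStream(fs.create(new Path("output/result.zip"))); // placeholder path
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    zip.putNextEntry(new ZipEntry(key.toString()));
    while (values.hasNext()) {
      zip.write(values.next().toString().getBytes("UTF-8"));
    }
    zip.closeEntry();
  }

  @Override
  public void close() throws IOException {
    zip.close();  // called once, after the last reduce() call
  }
}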

conf.setNumReduceTasks(1) but the code called 3 times

2009-07-28 Thread Mark Kerzner
if this is stable. What am I missing in my understanding? Thank you, Mark

Can't turn off speculative execution

2009-07-29 Thread Mark Kerzner
I tried it in the config file and in the code, and I still get the Reduce code executed multiple times. Maybe it is just a hint to the system? Thank you, Mark On Wed, Jul 29, 2009 at 9:41 AM, Edward Capriolo edlinuxg...@gmail.comwrote: On Wed, Jul 29, 2009 at 12:58 AM, Mark Kerznermarkkerz

setting parameters for a hadoop job

2009-07-31 Thread Mark Kerzner
Hi, what is the preferred way of passing parameters to my job? I would want to put them all in a properties file and pass that file as an argument, after input and output. But I am not sure if this is recommended. Thank you, Mark

Logging when running locally

2009-08-21 Thread Mark Kerzner
Hi, when I run either on EC2 or inside the NetBeans IDE, I get a lot of logs. But when I execute the same job from the command line on my machine, the _logs in HDFS is empty. Do I need to set some switch? Thank you, Mark

Re: Where does System.out.println() go?

2009-08-24 Thread Mark Kerzner
Perfect, that's where it was! Mark On Mon, Aug 24, 2009 at 9:59 PM, Arvind Sharma arvind...@yahoo.com wrote: most of the user level log files goes under $HADOOP_HOME/logs/userlog...try there Arvind From: Mark Kerzner markkerz...@gmail.com To: core

Re: Where does System.out.println() go?

2009-08-29 Thread Mark Kerzner
/attempt folders. Regards,Sanjay Mark Kerzner-2 wrote: Hi, when I run Hadoop in pseudo-distributed mode, I can't find the log which System.out.println() goes. When I run in the IDE, I see it. When I run on EC2, it's part of the output logs. But here - do I need to set

Pregel

2009-09-03 Thread Mark Kerzner
Hi, guys, Pregel has been revealed on 8/11. What is your opinion of it, does anybody know how to get the presentation, and is anyone interested in implementing it? Thank you, Mark

Re: Pregel

2009-09-03 Thread Mark Kerzner
Then we should think of a name and create the project somewhere. Does not have to be the same place as Hadoop, can be Google code to start with... How about Madoop Mississippi (221 bridges) Danube (lotsa bridges) Mark On Thu, Sep 3, 2009 at 2:53 PM, Amandeep Khurana ama...@gmail.com wrote

Re: Pregel

2009-09-03 Thread Mark Kerzner
Brenta http://en.wikipedia.org/wiki/Brenta_(river) On Thu, Sep 3, 2009 at 3:07 PM, Mark Kerzner markkerz...@gmail.com wrote: Then we should think of a name and create the project somewhere. Does not have to be the same place as Hadoop, can be Google code to start with... How about Madoop

Re: Pregel

2009-09-03 Thread Mark Kerzner
, Sep 3, 2009 at 1:08 PM, Mark Kerzner markkerz...@gmail.com wrote: Brenta http://en.wikipedia.org/wiki/Brenta_(river) http://en.wikipedia.org/wiki/Brenta_%28river%29 On Thu, Sep 3, 2009 at 3:07 PM, Mark Kerzner markkerz...@gmail.com wrote: Then we should think of a name and create

Re: Pregel

2009-09-03 Thread Mark Kerzner
like the name Hamburg, but I could live with that. Mark On Thu, Sep 3, 2009 at 6:36 PM, Ted Dunning ted.dunn...@gmail.com wrote: Hamburg has been excessively stable for some time. If you want to do something, I would recommend contributing to Mahout. On Thu, Sep 3, 2009 at 3:51 PM, Ashutosh

Code re-use?

2009-09-08 Thread Mark Kerzner
Hi, I have some code that's common between the main class, mapper, and reducer. Can I put it only in the main class and use it from the mapper and reducer? A similar question about static variables in the main class - are they available from the mapper and reducer? Thank you, Mark

Re: Hadoop on Windows

2009-09-17 Thread Mark Kerzner
Ah, but it says about PowerSet that it runs on Amazon, so why would they pay extra for their own Windows and moreover, why would they use Windows given a choice of Linux? My personal opinion is that Microsoft has a pact with Yahoo only so that they would have an official excuse to use

JobConf in .20

2009-09-21 Thread Mark Kromer
I didn't see anything about this in the archive, so perhaps I'm doing something wrong, but I have run into a problem creating a job with the .20 release without using the deprecated JobConf class. The mapreduce.JobContext class is the replacement for the deprecated mapred.JobContext, but it

Custom Record Reader Example?

2009-10-06 Thread Mark Vigeant
() and getCurrentValue()? Thank you so much! Mark Vigeant RiskMetrics Group, Inc.

Outputting extended ascii characters in Hadoop?

2009-10-09 Thread Mark Kerzner
Hi, the strings I am writing in my reducer have characters that may present a problem, such as char represented by decimal 254, which is hex FE. It seems that instead I see hex C3, or something else is messed up. Or my understanding is messed up :) Any advice? Thank you, Mark

Re: Outputting extended ascii characters in Hadoop?

2009-10-12 Thread Mark Kerzner
Thanks, that is a great answer. My problem is that the application that reads my output accepts a comma-separated file with extended ASCII delimiters. Following your answer, however, I will try to use low-value ASCII, like 9 or 11, unless someone has a better suggestion. Thank you, Mark On Fri

Re: Outputting extended ascii characters in Hadoop?

2009-10-12 Thread Mark Kerzner
Thanks again, Todd. I need two delimiters, one for comma and one for quote. But I guess I can use ^A for quote, and keep the comma as is, and I will be good. Sincerely, Mark On Mon, Oct 12, 2009 at 10:15 PM, Todd Lipcon t...@cloudera.com wrote: Hey Mark, The most commonly used delimiter

Database to use with Hadoop

2009-10-13 Thread Mark Kerzner
do - but does it work with Hadoop? Thank you, Mark

Re: Database to use with Hadoop

2009-10-13 Thread Mark Kerzner
Thank you, all. It looks like SimpleDB may be good enough for my needs. The forums claim that you can write to it from all reducers at once, being that it is highly optimized for concurrent access. On Tue, Oct 13, 2009 at 5:30 PM, Jeff Hammerbacher ham...@cloudera.comwrote: Hey Mark, You

Re: Terminate Instances Terminating ALL EC2 Instances

2009-10-19 Thread Mark Stetzer
different AMI (OpenSolaris 2009.06 just to be very different). terminate-cluster only listed the 6 instances that were part of the cluster if I remember correctly. I have 4 security groups: default, default-master, default-slave, and mark-default. mark-default wasn't even added until after I started

How to give consecutive numbers to output records?

2009-10-27 Thread Mark Kerzner
Hi, I need to number all output records consecutively, like, 1,2,3... This is no problem with one reducer, making recordId an instance variable in the Reducer class, and setting conf.setNumReduceTasks(1) However, it is an architectural decision forced by processing need, where the reducer

Re: How to give consecutive numbers to output records?

2009-10-27 Thread Mark Kerzner
Aaron, although your notes are not a ready solution, but they are a great help. Thank you, Mark On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball aa...@cloudera.com wrote: There is no in-MapReduce mechanism for cross-task synchronization. You'll need to use something like Zookeeper

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
about the idea of JavaSpaces - I am going to check out this one. Mark On Wed, Oct 28, 2009 at 12:57 PM, Michael Klatt michael.kl...@gmail.comwrote: Hi Mark, Each mapper (or reducer) has an environment variable mapred_map_tasks (or mapred_reduce_tasks) which will describe how many tasks

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
assume that reducers are sorted. For example, if my records are sorted 1,2,...6, then one reducer would get maps 1,2,3, and the other one - maps 4,5,6. If that's the case, I need to know how the reducers are sorted. Then I could simply run the second stage. Thank you, Mark On Wed, Oct 28

Re: How to give consecutive numbers to output records?

2009-10-28 Thread Mark Kerzner
changes or new software - for, as you say, it might become a pain. Follow the rule (which I got from Scott Meyers' Effective C++) - avoid premature optimization. mark On Wed, Oct 28, 2009 at 2:22 PM, Edward Capriolo edlinuxg...@gmail.comwrote: On Wed, Oct 28, 2009 at 2:20 PM, Mark Kerzner markkerz

RE: Multiple Input Paths

2009-11-02 Thread Mark Vigeant
Ok, thank you very much Amogh, I will redesign my program. -Original Message- From: Amogh Vasekar [mailto:am...@yahoo-inc.com] Sent: Monday, November 02, 2009 11:45 AM To: common-user@hadoop.apache.org Subject: Re: Multiple Input Paths Mark, Set-up for a mapred job consumes

RE: Multiple Input Paths

2009-11-04 Thread Mark Vigeant
@hadoop.apache.org Subject: Re: Multiple Input Paths Hi Mark, A future release of Hadoop will have a MultipleInputs class, akin to MultipleOutputs. This would allow you to have a different inputformat, mapper depending on the path you are getting the split from. It uses special Delegating[mapper

Should I upgrade from 0.18.3 to the latest 0.20.1?

2009-11-10 Thread Mark Kerzner
of doing this? Other areas of concern are: - Will Amazon EMR work with the latest Hadoop? - What about Cloudera distribution or Yahoo distribution? Thank you, Mark
