Re: hadoop file system browser

2008-01-21 Thread Ted Dunning
There has been significant work on building a web-DAV interface for HDFS. I haven't heard any news for some time, however. On 1/21/08 11:32 AM, "Dawid Weiss" <[EMAIL PROTECTED]> wrote: > >> The Eclipse plug-in also features a DFS browser. > > Yep. That's all true, I don't mean to self-promot

Re: how to terminate a program in hadoop?

2008-01-21 Thread Ted Dunning
The web interface can also be used. This is handy if you are following the progress of the job via the web. Scroll to the bottom of the page. On 1/20/08 11:39 PM, "Jeff Hammerbacher" <[EMAIL PROTECTED]> wrote: > ./bin/hadoop job -Dmapred.job.tracker=: > -kill > > you can find the required c
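
A minimal sketch of the command-line route quoted above (the jobtracker address and job ID are placeholders; the ID appears on the JobTracker web page and in the job's console output):

    # point the client at the jobtracker and kill the job by ID
    bin/hadoop job -Dmapred.job.tracker=jobtracker.example.com:9001 -kill job_200801210001_0042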

Re: How to use hadoop wiht tomcat!!

2008-01-20 Thread Ted Dunning
I would say that it is generally better practice to deploy hadoop.jar in the lib directory of the war file that you are deploying so that you can change versions of hadoop more easily. Your problem is that you have dropped the tomcat support classes from your CLASSPATH in the process of getting h

Re: Does anyone have any experience with running a Map/Red node that is not also a DFS node?

2008-01-20 Thread Ted Dunning
We effectively have this situation on a significant fraction of our work-load as well. Much of our data is summarized hourly and is encrypted and compressed which makes it unsplittable. This means that the map processes are often not local to the data since the data is typically spread only to

Re: Hadoop only processing the first 64 meg block of a 2 gig file

2008-01-18 Thread Ted Dunning
> Yep, I can see all 34 blocks and view chunks of actual data from each > using the web interface (quite a nifty tool). Any other suggestions? > > --Matt > > -----Original Message- > From: Ted Dunning [mailto:[EMAIL PROTECTED] > Sent: Friday, January 18, 2008 11:2

Re: Hadoop only processing the first 64 meg block of a 2 gig file

2008-01-18 Thread Ted Dunning
Go into the web interface and look at the file. See if you can see all of the blocks. On 1/18/08 7:46 AM, "Matt Herndon" <[EMAIL PROTECTED]> wrote: > Hello, > > > > I'm trying to get Hadoop to process a 2 gig file but it seems to only be > processing the first block. I'm running the exact

Re: a question on number of parallel tasks

2008-01-16 Thread Ted Dunning
would see this: > /user/bear/output/part-0 > > I probably got confused on what the part-# means... I thought > part-# tells how many splits a file has... so far, I have only > seen part-0. When will it have part-1, 2, etc? > > &

Re: a question on number of parallel tasks

2008-01-16 Thread Ted Dunning
Parallelizing the processing of data occurs at two steps. The first is during the map phase where the input data file is (hopefully) split across multiple tasks. This should happen transparently most of the time unless you have a perverse data format or use unsplittable compression on your file

Re: how to deploy hadoop on many PCs quickly?

2008-01-16 Thread Ted Dunning
This isn't really a question about Hadoop, but is about system administration basics. You are probably missing a master boot record (MBR) on the disk. Ask a local linux expert to help you or look at the Norton documentation. On 1/16/08 4:59 AM, "Bin YANG" <[EMAIL PROTECTED]> wrote: > I use th

Re: Hadoop overhead

2008-01-16 Thread Ted Dunning
ould of course help in this case, but what about > when we process large datasets? Especially if a mapper fails. > > Reducers I also setup to use ~1 per core, slightly less. > > /Johan > > Ted Dunning wrote: >> Why so many mappers and reducers relative to the number o

Re: single output file

2008-01-15 Thread Ted Dunning
Output a constant key in the map function. On 1/15/08 9:31 PM, "Vadim Zaliva" <[EMAIL PROTECTED]> wrote: > On Jan 15, 2008, at 17:56, Peter W. wrote: > > That would output last 10 values for each key. I need > to do this across all the keys in the set. > > Vadim > >> Hello, >> >> Try using
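
A minimal sketch of the constant-key trick, written against the later generified org.apache.hadoop.mapred API (on 0.15 the signature takes WritableComparable/Writable instead; class and key names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class ConstantKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private static final Text ALL = new Text("all");   // single key => single reduce group
      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        out.collect(ALL, line);   // every record sorts into the same group at the reducer
      }
    }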

Re: single output file

2008-01-15 Thread Ted Dunning
op-user@lucene.apache.org > Sent: Tuesday, January 15, 2008 4:13:11 PM > Subject: Re: single output file > > > > On Jan 15, 2008, at 13:57, Ted Dunning wrote: > >> This is happening because you have many reducers running, only one >> of which >> gets any data

Re: single output file

2008-01-15 Thread Ted Dunning
This is happening because you have many reducers running, only one of which gets any data. Since you have combiners, this probably isn't a problem. That reducer should only get as many records as you have maps. It would be a problem if your reducer were getting lots of input records. You can

Re: writing output files in hadoop streaming

2008-01-15 Thread Ted Dunning
ed. > > Miles > > On 15/01/2008, John Heidemann <[EMAIL PROTECTED]> wrote: >> >> On Tue, 15 Jan 2008 09:09:07 PST, Ted Dunning wrote: >>> >>> Regarding the race condition, hadoop builds task specific temporary >>> directories in the output di

Re: Hadoop overhead

2008-01-15 Thread Ted Dunning
Why so many mappers and reducers relative to the number of machines you have? This just causes excess heartache when running the job. My standard practice is to run with a small factor larger than the number of cores that I have (for instance 3 tasks on a 2 core machine). In fact, I find it mos
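
A hedged hadoop-site.xml sketch of the "small factor larger than the number of cores" rule; the property below is the 0.15-era per-node task cap (later releases split it into separate map and reduce maxima), and the value is just an example for a 2-core box:

    <property>
      <name>mapred.tasktracker.tasks.maximum</name>
      <value>3</value>  <!-- roughly 1.5 task slots per core -->
    </property>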

Re: writing output files in hadoop streaming

2008-01-15 Thread Ted Dunning
Regarding the race condition, hadoop builds task specific temporary directories in the output directory, one per reduce task, that hold these output files (as long as you don't use absolute path names). When the process completes successfully, the output files from that temporary directory are mo

Re: how to deploy hadoop on many PCs quickly?

2008-01-15 Thread Ted Dunning
That's a fine way. If you already have a Linux master distribution, then rsync can distribute the hadoop software very quickly. On 1/15/08 6:26 AM, "Bin YANG" <[EMAIL PROTECTED]> wrote: > Dear colleagues, > > Right now, I have to deploy ubuntu 7.10 + hadoop 0.15 on 16 PCs. > One PC will be se
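
A hedged sketch of the rsync push (paths, version, and the use of conf/slaves as the host list are illustrative):

    # push an identical Hadoop tree from the master to every slave listed in conf/slaves
    for host in $(cat ~/hadoop-0.15.1/conf/slaves); do
      rsync -av ~/hadoop-0.15.1/ "$host":~/hadoop-0.15.1/
    done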

Re: how to write data into HDFS from a remote machine

2008-01-14 Thread Ted Dunning
Just run that same command on a different machine. On 1/14/08 4:33 AM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote: > > I have a 4 node cluster setup up & running. Every time I have to copy > data to HDFS, I copy it to name node and using "hadoop dfs copyfromlocal > ..." I copy it to HDFS.
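
A hedged sketch, run from any machine that has the Hadoop distribution plus the cluster's conf/ directory (so fs.default.name points at the namenode); paths are illustrative:

    bin/hadoop dfs -copyFromLocal /local/data/2008-01-14.log /user/me/input/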

Re: Question on running simultaneous jobs

2008-01-10 Thread Ted Dunning
Presumably the limit could be made dynamic. The limit could be max(static_limit, number of cores in cluster / # active jobs) On 1/10/08 9:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote: > this may be simple - but is this the right solution? (and i have the same > concern about hod) > >

Re: is a monolithic reduce task the right model?

2008-01-10 Thread Ted Dunning
Actually, all of my jobs tend to have one of these phases dominate the time. It isn't always the same phase that dominates, though, so the consideration isn't simple. The fact (if it is a fact) that one phase or another dominates means, however, that splitting them won't help much. On 1/10/08

Re: Question on running simultaneous jobs

2008-01-09 Thread Ted Dunning
sException within Hadoop; I believe because of the input >> dataset size (around 90 million lines). >> >> I think it is important to make a distinction between setting total >> number of map/reduce tasks and the number that can run(per job) at any >> given time.

Re: Question on running simultaneous jobs

2008-01-09 Thread Ted Dunning
You may need to upgrade, but 15.1 does just fine with multiple jobs in the cluster. Use conf.setNumMapTasks(int) and conf.setNumReduceTasks(int). On 1/9/08 11:25 AM, "Xavier Stevens" <[EMAIL PROTECTED]> wrote: > Does Hadoop support running simultaneous jobs? If so, what parameters > do I need
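
A hedged sketch of those driver-side calls (old mapred API; values and class names are illustrative; the map count is only a hint, while the reduce count is honored exactly and sets the number of output part files):

    JobConf conf = new JobConf(MyJob.class);
    conf.setJobName("example");
    conf.setNumMapTasks(200);     // hint; actual maps follow the input splits
    conf.setNumReduceTasks(16);   // exact; one part-NNNNN file per reduce
    JobClient.runJob(conf);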

Re: Limit the space used by hadoop on a slave node

2008-01-08 Thread Ted Dunning
amount of dfs used space, > reserved space, and non-dfs used space when the out of disk problem > occurs. > > Hairong > > -Original Message- > From: Ted Dunning [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 08, 2008 1:37 PM > To: hadoop-user@lucene.apache.org

Re: Limit the space used by hadoop on a slave node

2008-01-08 Thread Ted Dunning
sks take a lot of disk space. > > Hairong > > -Original Message- > From: Ted Dunning [mailto:[EMAIL PROTECTED] > Sent: Tuesday, January 08, 2008 1:13 PM > To: hadoop-user@lucene.apache.org > Subject: Re: Limit the space used by hadoop on a slave node > > > I thin

Re: Limit the space used by hadoop on a slave node

2008-01-08 Thread Ted Dunning
> wrote: > We use, > > dfs.datanode.du.pct for 0.14 and dfs.datanode.du.reserved for 0.15. > > Change was made in the Jira Hairong mentioned. > https://issues.apache.org/jira/browse/HADOOP-1463 > > Koji > >> -Original Message- >> From: Ted Dunning [mailto:[EMAIL
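
A hedged hadoop-site.xml sketch for the datanodes on 0.15 (the value is illustrative, roughly 10 GB per volume kept away from DFS; 0.14 used the dfs.datanode.du.pct fraction instead, per HADOOP-1463):

    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>10737418240</value>
    </property>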

Re: Limit the space used by hadoop on a slave node

2008-01-08 Thread Ted Dunning
I think I have seen related bad behavior on 15.1. On 1/8/08 11:49 AM, "Hairong Kuang" <[EMAIL PROTECTED]> wrote: > Has anybody tried 15.0? Please check > https://issues.apache.org/jira/browse/HADOOP-1463. > > Hairong > -Original Message- > From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTE

Re: missing VERSION files leading to failed datanodes

2008-01-08 Thread Ted Dunning
Can you put this on the wiki or as a comment on the jira? This could be (as you just noticed) a life-saver. On 1/8/08 10:48 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote: > never mind. the storageID is logged in the namenode logs. i am able to restore > the version files and add the datano

Re: missing VERSION files leading to failed datanodes

2008-01-08 Thread Ted Dunning
Dhruba, It looks from the discussion like the file was overwritten in place. Is that good practice? Normally the way that this sort of update is handled is to write a temp file, move the live file to a backup, then move the temp file to the live place. Both moves are atomic so the worst case i

Re: missing VERSION files leading to failed datanodes

2008-01-08 Thread Ted Dunning
This has bitten me as well. It used to be that I would have two possible partitions depending on which kind of machine I was on. Some machines had both partitions available, but one was much smaller. Hadoop had a nasty tendency to fill up the smaller partition. Reordering the partitions in the

Re: Jar file location

2008-01-07 Thread Ted Dunning
everything into one big fat job jar > > Am I missing something? > > Question, is the JIRA 1622 actually usable yet? I am using a about 14 > day old nightly developers build, so that should have that in that case? > > Which way would you go? > > Lars > > > Ted

Re: Jar file location

2008-01-07 Thread Ted Dunning
as following this: > http://www.mail-archive.com/[EMAIL PROTECTED]/msg02860.html > > Which I could not find on the Wiki really, although the above is a > commit. Am I missing something? > > Lars > > > Ted Dunning wrote: >> /lib is definitely the way to go. >> &g

Re: Jar file location

2008-01-07 Thread Ted Dunning
/lib is definitely the way to go. But adding gobs and gobs of stuff there makes jobs start slowly because you have to propagate a multi-megabyte blob to lots of worker nodes. I would consider adding universally used jars to the hadoop class path on every node, but I would also expect to face con
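
A sketch of the job-jar layout under discussion (contents illustrative): jars placed under lib/ inside the job jar land on the task CLASSPATH automatically, at the cost of shipping the whole blob to every worker node on each submission:

    myjob.jar
      com/example/MyMapper.class
      com/example/MyReducer.class
      lib/
        commons-math-1.1.jar
        shared-utils.jar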

Re: Under replicated block doesn't get fixed until DFS restart

2008-01-07 Thread Ted Dunning
The fsck output shows at least one file that doesn't have a replica. I have seen situations where a block would not replicate. It turned out to be due to a downed node that had not yet been marked as down. Once the system finally realized the node was down, the fsck changed from reporting low r

Re: Question for HBase users

2008-01-06 Thread Ted Dunning
is sorted out? I >> am willing to pay consulting fees if I have to. At the moment I am at a >> loss - sure I trial and error approach would keep me going forward, but >> I am on a tight deadline too and that counters that approach. >> >> Any help is appreciated. >>

Re: Question for HBase users

2008-01-05 Thread Ted Dunning
Lars, Can you dump your documents to external storage (either HDFS or ordinary file space storage)? On 1/4/08 10:01 PM, "larsgeorge" <[EMAIL PROTECTED]> wrote: > > Jim, > > I have inserted about 5million documents into HBase and translate them into > 15 languages (means I end up with about 7

Re: Under replicated block doesn't get fixed until DFS restart

2008-01-04 Thread Ted Dunning
It can take a long time to decide that a node is down. If that down node has the last copy of a file, then it won't get replicated. I run a balancing script every few hours. It wanders through the files and ups the replication of each file temporarily. This is important because initial allocat
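
A hedged sketch of that kind of balancing pass (replication factors and path are illustrative; the -w flag, where available, waits for the higher factor to be reached before the second command drops it back):

    bin/hadoop dfs -setrep -R -w 5 /user/data
    bin/hadoop dfs -setrep -R 3 /user/data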

Re: Datanode Problem

2008-01-03 Thread Ted Dunning
name in /conf/masters and > /conf/slaves files. It is working fine. > > -Original Message- > From: Ted Dunning [mailto:[EMAIL PROTECTED] > Sent: Thursday, January 03, 2008 1:00 AM > To: hadoop-user@lucene.apache.org; > public-hadoop-user-PPu3vs9EauNd/SJB6HiN2Ni2O/[EMAIL

Re: Datanode Problem

2008-01-02 Thread Ted Dunning
export HADOOP_SLAVE_SLEEP=0.1 > > # The directory where pid files are stored. /tmp by default. > # export HADOOP_PID_DIR=/var/hadoop/pids > > # A string representing this instance of hadoop. $USER by default. > # export HADOOP_IDENT_STRING=$USER > > # The scheduling priori

Re: Datanode Problem

2008-01-02 Thread Ted Dunning
Well, you have something very strange going on in your scripts. Have you looked at hadoop-env.sh? On 1/2/08 1:58 PM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote: >> /bin/bash: /root/.bashrc: Permission denied >> localhost: ssh: localhost: Name or service not known >> /bin/bash: /root/.bashr

Re: Datanode Problem

2008-01-02 Thread Ted Dunning
I don't know what your problem is, but I note that you appear to be running processes as root. This is a REALLY bad idea. It may also be related to your problem. On 1/2/08 1:33 PM, "Natarajan, Senthil" <[EMAIL PROTECTED]> wrote: > Hi, > I am new to Hadoop. I just downloaded release 0.14.4 (ha

Re: Is there an rsyncd for HDFS

2008-01-02 Thread Ted Dunning
That is a good idea. I currently use a shell script that does the rough equivalent of rsync -av, but it wouldn't be bad to have a one-liner that solves the same problem. One (slight) benefit to the scripted approach is that I get a list of directories to which files have been moved. That lets m

Re: how to create collections in the mapper class

2007-12-31 Thread Ted Dunning
I would like to point out that this is a REALLY bad idiom. You should use a static initializer. private static Map usersMap = new HashMap(); Also, since this is a static field in a very small class, there is very little reason to use a getter. No need for 7 lines of code when one will do.
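
The one-liner being recommended, with illustrative generic types added (note this is still per-JVM; each map task gets its own copy of the static):

    // initialized when the class loads, once per task JVM; no lazy getter needed
    private static Map<String, String> usersMap = new HashMap<String, String>();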

Re: how to create collections in the mapper class

2007-12-31 Thread Ted Dunning
if(getUsersMap().get(nKey)==null){ > output.collect(name, ONE); > getUsersMap().put(nKey, data[12]); > } > > .. > > > } > > > the problem is my hashmap(userMap) is always empty.Now I hope > my problem is clear. > > Thanks, > > Helen > > >

Re: Re[4]: Matrix multiplication

2007-12-29 Thread Ted Dunning
I figured. On 12/29/07 7:03 AM, "Milan Simonovic" <[EMAIL PROTECTED]> wrote: > > That's what I wanted to say :) my mistake > > Saturday, December 29, 2007, 3:55:31 PM, you wrote: > >> Actually, this isn't true (you must know this). Each element is multiplied >> by every element of the corre

Re: Re[2]: Matrix multiplication

2007-12-29 Thread Ted Dunning
Actually, this isn't true (you must know this). Each element is multiplied by every element of the corresponding row or column of the other matrix. This is (thankfully) much less communication. On 12/29/07 6:48 AM, "Milan Simonovic" <[EMAIL PROTECTED]> wrote: > Ordinary matrix multiplication m

Re: Re[2]: Matrix multiplication

2007-12-29 Thread Ted Dunning
The most surprising thing about hadoop is the degree to which you are exactly correct. My feeling is that what is really happening is that the pain is moving (and moderating) to the process of adopting map-reduce as a programming paradigm. Once you do that, the pain is largely over. On 12/29/07

Re: Matrix multiplication

2007-12-28 Thread Ted Dunning
For dense matrix multiplication, the key problem is that you have O(n^3) arithmetic operations and O(n^2) element fetches. Most conventional machines now have nearly 10^2 or larger ratio between the speed of the arithmetic processor and memory so for n > 100, you should be able to saturate the ar
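
A back-of-the-envelope version of that ratio for the classical algorithm, with the usual constants assumed:

    \frac{\text{arithmetic ops}}{\text{element fetches}} \approx \frac{2n^{3}}{3n^{2}} = \frac{2}{3}\,n

so once n comfortably exceeds the roughly 10^2 processor-to-memory speed ratio, the multiply becomes compute-bound rather than memory-bound.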

Re: how to create collections in the mapper class

2007-12-28 Thread Ted Dunning
This sounds like there is a little bit of confusion going on here. It is common for people who are starting with Hadoop to be surprised that static fields of the mapper do not get shared across all parallel instances of the map function. This is, of course, because you are running many m

Re: Performance issues with large map/reduce processes

2007-12-27 Thread Ted Dunning
That is a very small heap. The reduces, in particular, would benefit substantially from having more memory. Other than that (and having fewer reduces), I am at a bit of a loss. I know that others are working on comparably sized problems without much difficulty. There might be an interaction w

Re: Performance issues with large map/reduce processes

2007-12-27 Thread Ted Dunning
Can you say a bit more about your processes? Are they truly parallel maps without any shared state? Are you getting a good limit on maximum number of maps and reduces per machine? How are you measuring these times? Do they include shuffle time as well as map time? Do they include time before

Re: HashMap which can spill to disk for Hadoop?

2007-12-26 Thread Ted Dunning
Sounds much better to me. On 12/26/07 7:53 AM, "Eric Baldeschwieler" <[EMAIL PROTECTED]> wrote: > > With a secondary sort on the values during the shuffle, nothing would > need to be kept in memory, since it could all be counted in a single > scan. Right? Wouldn't that be a much more efficien

Re: Do people put their master node in the slave list - 0.15.1

2007-12-26 Thread Ted Dunning
My namenode and jobtracker are both on a machine that is a datanode and has a tasktracker as well. It is also less well outfitted than yours. I have no problems, but my data is encrypted which might make the CPU/disk trade-offs very different. On 12/26/07 12:11 PM, "Jason Venner" <[EMAIL PROTE

Re: how to pass user parameter for the mapper

2007-12-26 Thread Ted Dunning
That would be a fine way to solve the problem. You can also pass data in to the maps via the key since the key has little use for most maps. On 12/25/07 8:09 PM, "Norbert Burger" <[EMAIL PROTECTED]> wrote: > How should I approach this? Is overriding InputFileFormat so that the > header data i
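
Besides the key-smuggling trick mentioned here, a common alternative (not part of this thread, so treat it as an assumption) is to carry small parameters in the JobConf; a hedged sketch against the old mapred API, with made-up property and class names:

    // driver side
    conf.set("myapp.header.line", headerLine);

    // mapper side
    public class HeaderAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private String header;
      public void configure(JobConf job) {          // called once per task, before any map()
        header = job.get("myapp.header.line");
      }
      public void map(LongWritable k, Text v,
                      OutputCollector<Text, Text> out, Reporter r) throws IOException {
        out.collect(new Text(header), v);           // illustrative use of the parameter
      }
    }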

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

2007-12-25 Thread Ted Dunning
orly with the standard input > split size as the mean time to finishing a split is very small, vrs > gigantic memory requirements for large split sizes. > > Time to play with parameters again ... since the answer doesn't appear > to be in working memory for the list. > > &

Re: question on Hadoop configuration for non cpu intensive jobs - 0.15.1

2007-12-25 Thread Ted Dunning
What are your mappers doing that they run out of memory? Or is it your reducers? Often, you can write this sort of program so that you don't have higher memory requirements for larger splits. On 12/25/07 1:52 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote: > We have tried reducing the number of

Re: Appropriate use of Hadoop for non-map/reduce tasks?

2007-12-25 Thread Ted Dunning
Ahhh. My previous comments assumed that "long-lived" meant jobs that run for days and days and days (essentially forever). 15-minute jobs with a finite work-list are actually a pretty good match for map-reduce as implemented by Hadoop. On 12/25/07 10:04 AM, "Kirk True" <[EMAIL PROTECTED]> wro

Re: Appropriate use of Hadoop for non-map/reduce tasks?

2007-12-21 Thread Ted Dunning
1 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote: > On Fri, Dec 21, 2007 at 12:43:38PM -0800, Ted Dunning wrote: >> >> * if you need some kind of work-flow, hadoop won't help (but it won't hurt >> either) >> > > Lets start a discussion around this, seems to be something lots of folks could > use...

Re: Appropriate use of Hadoop for non-map/reduce tasks?

2007-12-21 Thread Ted Dunning
Sorry. I meant to answer that. The short answer is that hadoop is often reasonable for this sort of problem, BUT * if you have lots of little files, you may do better otherwise * if you can't handle batch-oriented, merge-based designs, then map-reduce itself isn't going to help you much * if

Re: DFS Block Allocation

2007-12-20 Thread Ted Dunning
Yeah... We have that as well, but I put strict limits on how many readers are allowed on any NFS data source. With well organized reads, even a single machine can cause serious load on an ordinary NFS server. I have had very bad experiences where lots of maps read from a single source; the worst

Re: DFS Block Allocation

2007-12-20 Thread Ted Dunning
doop distcp" using multiple trackers to upload files in > parallel. > > Thanks, > > Rui > > - Original Message > From: Ted Dunning <[EMAIL PROTECTED]> > To: hadoop-user@lucene.apache.org > Sent: Thursday, December 20, 2007 6:01:50 PM > Subje

Re: Appropriate use of Hadoop for non-map/reduce tasks?

2007-12-20 Thread Ted Dunning
Map-reduce is just one way of organizing your computation. If you have something simpler, then I would say that you are doing fine. There are plenty of tasks that are best served by a DAG of simple tasks. Systems like Amazon's simple queue (where tasks come back to life if they aren't "finished"

Re: DFS Block Allocation

2007-12-20 Thread Ted Dunning
On 12/20/07 5:52 PM, "C G" <[EMAIL PROTECTED]> wrote: > Ted, when you say "copy in the distro" do you need to include the > configuration files from the running grid? You don't need to actually start > HDFS on this node do you? You are correct. You only need the config files (and the hadoo

Re: DFS Block Allocation

2007-12-20 Thread Ted Dunning
Just copy the hadoop distro directory to the other machine and use whatever command you were using before. A program that uses hadoop just have to have access to all of the nodes across the net. It doesn't assume anything else. On 12/20/07 2:35 PM, "Jeff Eastman" <[EMAIL PROTECTED]> wrote:

Re: Some Doubts of hadoop functionality

2007-12-20 Thread Ted Dunning
you rebuild the name >>>> node. There is no official solution for the high availability problem. >>>> Most hadoop systems work on batch problems where an hour or two of >>>> downtime >>>> every few years is not a problem. >> >> Actu

Re: DFS Block Allocation

2007-12-20 Thread Ted Dunning
Yes. I try to always upload data from a machine that is not part of the cluster for exactly that reason. I still find that I need to rebalance due to a strange problem in placement. My datanodes have 10x different sized HDFS disks and I suspect that the upload is picking datanodes uniformly rath

Re: Some Doubts of hadoop functionality

2007-12-20 Thread Ted Dunning
Well, we are kind of a poster child for this kind of reliability calculus. We opted for Mogile for real-time serving because we could see how to split the master into shards and how to do HA on it. For batch oriented processes where a good processing model is important, we use hadoop. I would ha

Re: Error on slave node log

2007-12-20 Thread Ted Dunning
What happened here is that you formatted the name node but have data left over from the previous incarnation of the namenode. The namenode can't deal with that situation. On 12/19/07 11:25 PM, "M.Shiva" <[EMAIL PROTECTED]> wrote: > > /**

Re: Some Doubts of hadoop functionality

2007-12-20 Thread Ted Dunning
On 12/19/07 11:17 PM, "M.Shiva" <[EMAIL PROTECTED]> wrote: > 1.Did Separate machines/nodes needed for Namenode ,Jobtracker, Slavenodes No. I run my namenode and job-tracker on one of my storage/worker nodes. You can run everything on a single node and still get some interesting results becau

Re: HashMap which can spill to disk for Hadoop?

2007-12-19 Thread Ted Dunning
You should also be able get quite a bit of mileage out of special purpose HashMaps. In general, java generic collections incur large to huge penalties for certain special cases. If you have one of these special cases or can put up with one, then you may be able to get 1+ order of magnitude impr

Re: question on file, inputformats and outputformats

2007-12-17 Thread Ted Dunning
However, it depended > upon the file output formats I used in the first step.Because I > got so confused, I thought it would be more important to nail down the > correct output format in the first step. > > -- Jim > > On Dec 17, 2007 10:24 PM, Ted Dunning <[EMAIL PROT

Re: question on file, inputformats and outputformats

2007-12-17 Thread Ted Dunning
the second step, or > were you asking me why I never set it in the second step? > > > On Dec 17, 2007 10:09 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: >> >> You never set the input format in the second step. >> >> But I think you want to stay

Re: question on file, inputformats and outputformats

2007-12-17 Thread Ted Dunning
alues are > clear Text, and they can subsequently be read by > KeyValueTextInputFormat. > > On Dec 17, 2007 10:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: >> >> >> I thought that is what your input file already was. The >> KeyValueTextInputFormat should

Re: question on file, inputformats and outputformats

2007-12-17 Thread Ted Dunning
You never set the input format in the second step. But I think you want to stay with your KeyValueTextInputFormat for input and TextOutputFormat for output. On 12/17/07 7:03 PM, "Jim the Standing Bear" <[EMAIL PROTECTED]> wrote: > > So that's a part of the reason that I am having trouble conn
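
A hedged sketch of that wiring for the second step (old mapred API; driver class name illustrative):

    JobConf step2 = new JobConf(MyDriver.class);
    step2.setInputFormat(KeyValueTextInputFormat.class);   // reads "key<TAB>value" text from step one
    step2.setOutputFormat(TextOutputFormat.class);         // writes "key<TAB>value" text again
    step2.setOutputKeyClass(Text.class);
    step2.setOutputValueClass(Text.class);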

Re: question on file, inputformats and outputformats

2007-12-17 Thread Ted Dunning
I thought that is what your input file already was. The KeyValueTextInputFormat should read your input as-is. When you write out your intermediate values, just make sure that you use TextOutputFormat and put "DIR" as the key and the directory name as the value (same with files). On 12/17/07 6

Re: question on file, inputformats and outputformats

2007-12-17 Thread Ted Dunning
Part of your problem is that you appear to be using a TextInputFormat (the default input format). The TIF produces keys that are LongWritable and values that are Text. Other input formats produce different types. With recent versions of hadoop, classes that extend InputFormatBase can (and I th

Re: How is the hadoop marked in the UK/Europe?

2007-12-17 Thread Ted Dunning
Hadoop is new technology. You aren't going to find opportunities to work with it via job agencies. That said, there is a growing trend towards scalable systems in general and Hadoop in particular. Lately, it seems that everywhere I turn around, I find another startup company using hadoop. I ju

Re: How can the reducer be invoked lazily?

2007-12-16 Thread Ted Dunning
Devaraj is correct that there is no mechanism to create reduce tasks only as necessary, but remember that each reducer does many reductions. This means that empty ranges rarely have a large, unbalanced effect. If this is still a problem you can do two things: first, you can use the hash of th
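
A hedged sketch of the "hash of the key" idea as a partitioner (generified old mapred API; the default HashPartitioner already behaves this way, so this only matters where a custom partitioner is concentrating keys):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SpreadByHashPartitioner implements Partitioner<Text, Text> {
      public void configure(JobConf job) {}
      public int getPartition(Text key, Text value, int numReduceTasks) {
        // mask the sign bit so the modulus is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }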

Re: map/reduce and Lucene integration question

2007-12-13 Thread Ted Dunning
Yes. On 12/13/07 12:22 PM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote: > On Thu, Dec 13, 2007 at 11:31:49AM -0800, Ted Dunning wrote: >> After indexing, indexes are moved to multiple query servers. ... (how nutch >> works) With this architecture, you g

Re: map/reduce and Lucene integration question

2007-12-13 Thread Ted Dunning
: > On Thu, Dec 13, 2007 at 11:03:50AM -0800, Ted Dunning wrote: >> >> I don't think so (but I don't run nutch) >> >> To actually run searches, the search engines copy the index to local >> storage. Having them in HDFS is very nice, however, as a way to

Re: map/reduce and Lucene integration question

2007-12-13 Thread Ted Dunning
I don't think so (but I don't run nutch) To actually run searches, the search engines copy the index to local storage. Having them in HDFS is very nice, however, as a way to move them to the right place. On 12/13/07 10:59 AM, "Eugeny N Dzhurinsky" <[EMAIL PROTECTED]> wrote: > On Thu, Dec 13,

Re: Question on Critical Region size for SequenceFile next/write - 0.15.1

2007-12-12 Thread Ted Dunning
It seems reasonable that (de)-serialization could be done in a threaded fashion and then just block on the (read) write itself. That would explain the utilization, which I suspect is close to 1/N where N is the number of processors. On 12/12/07 2:07 PM, "Jason Venner" <[EMAIL PROTECTED]> wrote:

Re: Error Nutchwax Search

2007-12-12 Thread Ted Dunning
I guess it would be even more of a surprise, then. :-) On 12/12/07 1:36 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote: >> Using gcj successfully would be a bit of a surprise. > > GCJ 4.2 does NOT work.

Re: some questions about hadoop

2007-12-12 Thread Ted Dunning
3. There is currently no security. Weak user level security will appear soon (but you will still be able to lie about who you are). Stronger security is in the works, but you should expect to protect a Hadoop cluster from the outside. 2. High availability is inherent in hadoop's map-reduce s

Re: Error Nutchwax Search

2007-12-12 Thread Ted Dunning
Hadoop *normally* uses the Sun JDK. Using gcj successfully would be a bit of a surprise. On 12/11/07 11:54 PM, "aonewa" <[EMAIL PROTECTED]> wrote: > > hadoop use gcj java but St.Ack said to try SUN's JDK that means modify code > in hadoop, yes or no? > > > stack-3 wrote: >> >> Try SUN's J

Re: commodity vs. high perf machines: which would you rather

2007-12-11 Thread Ted Dunning
Absolutely. Or on a machine scaling page. On 12/11/07 12:43 PM, "Chris Fellows" <[EMAIL PROTECTED]> wrote: > Does this belong in the FAQ?

Re: HDFS tool and replication questions...

2007-12-10 Thread Ted Dunning
More to the specific point, yes, all 100 nodes will wind up storing data for large files because blocks should be assigned pretty much at random. The exception is files that originate on a datanode. There, the local node gets one copy of each block. Replica blocks follow the random rule, howeve

Re: HDFS tool and replication questions...

2007-12-10 Thread Ted Dunning
The web interface to the namenode will let you drill down to the file itself. That will tell you where the blocks are (scroll down to the bottom). You can also use hadoop fsck. For example: [EMAIL PROTECTED]:~/hadoop-0.15.1$ bin/hadoop fsck /user/rmobin/data/11/30Statu
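
A hedged sketch of the fsck drill-down (path and flags illustrative; the extra flags print per-file block lists and the datanodes holding each block):

    bin/hadoop fsck /user/me/data -files -blocks -locations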

Re: MapReduce Job on XML input

2007-12-10 Thread Ted Dunning
Can you post a Jira and a patch? On 12/10/07 1:12 AM, "Alan Ho" <[EMAIL PROTECTED]> wrote: > I've written a xml input splitter based on a Stax parser. Its much better than > StreamXMLRecordReader > > - Original Message > From: Peter Thygesen <[EMAIL PROTECTED]> > To: hadoop-user@lucen

Re: Mapper Out of Memory

2007-12-06 Thread Ted Dunning
There is a bug in the GZipInputStream on java 1.5 that can cause an out-of-memory error on a malformed gzip input. It is possible that you are trying to treat this input as a splittable file which is causing your maps to be fed from chunks of the gzip file. Those chunks would be ill-formed, of c
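
If that is what is happening, one way out (a hedged sketch, not necessarily what was done here) is to mark the format unsplittable so each gzip file goes intact to a single map; note the single-"t" spelling of isSplitable in the old mapred API, and the class name is illustrative:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // never split, so a .gz stream is always decompressed from its start
      }
    }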

Re: Comparing Hadoop to Apple Xgrid?

2007-12-05 Thread Ted Dunning
up DNS and hadoop won't run"?). Item (B) is probably a bad thing for hadoop given the bandwidth required for the shuffle phase. Item (C) is inherent in map-reduce and is pretty neutral either way. On 12/5/07 9:23 AM, "Ted Dunning" <[EMAIL PROTECTED]> wrote: > >

Re: Comparing Hadoop to Apple Xgrid?

2007-12-05 Thread Ted Dunning
Sorry about not addressing this. (and I appreciate your gentle prod) The Xgrid would likely work well on these problems. They are, after all, nearly trivial to parallelize because of clean communication patterns. Consider an alternative problem of solving n-body gravitational dynamics for n >

Re: Comparing Hadoop to Apple Xgrid?

2007-12-04 Thread Ted Dunning
IF you are looking at large numbers of independent images then hadoop should be close to perfect for this analysis (the problem is embarrassingly parallel). If you are looking at video, then you can still do quite well by building what is essentially a probabilistic list of recognized items in th

Re: Hbase for dynamic web site?

2007-12-04 Thread Ted Dunning
It is conceivable that memcache would eventually have only or mostly active objects in memory while hbase might have active pages/tablets/groups of objects. That might give memcache a bit of an edge. Another thing that happens with memcache is that memcache can hold the results of a complex jo

Re: Multiple keys

2007-12-03 Thread Ted Dunning
There is the largely undocumented record stream stuff. You define your records in an IDL-like language which compiles to java code. I haven't used it, but it doesn't look particularly hard. I believe that this stuff includes definitions of comparators. Also, if you just put concatenated keys i

Re: Thank you for coming! Bay Area Hadoop Get-Together this Friday (11/30)

2007-12-03 Thread Ted Dunning
/30) > > Hi, > > It is getting closer to Friday and I wanted to remind everyone that we > will be meeting at Gordon Biersch in Palo Alto at 5pm this Fri (11/30): > http://upcoming.yahoo.com/event/324051/ > > No formal agenda, but we might have the opportunity to checkou

Re: Running Hadoop on FreeBSD

2007-12-02 Thread Ted Dunning
'dfs[a-z.]+' > > I got: > > Error occurred during initialization of VM > Could not reserve enough space for object heap > Could not create the Java virtual machine. > > Thanks, > > Rui > > > - Original Message > From: Ted Dunning <[EMAIL PRO

Re: Running Hadoop on FreeBSD

2007-12-02 Thread Ted Dunning
s. > > Thanks, > > Rui > > - Original Message > From: Ted Dunning <[EMAIL PROTECTED]> > To: hadoop-user@lucene.apache.org > Sent: Sunday, December 2, 2007 6:43:36 AM > Subject: Re: Running Hadoop on FreeBSD > > > > You should be able to run it wit

Re: Running Hadoop on FreeBSD

2007-12-02 Thread Ted Dunning
You should be able to run it without any changes or recompilation. Hadoop is written in Java, after all. On 11/30/07 10:38 PM, "Rui Shi" <[EMAIL PROTECTED]> wrote: > Did anyone port and run Hadoop on FreeBSD clusters?

Re: Hbase for dynamic web site?

2007-11-30 Thread Ted Dunning
Are you already using memcache and related approaches? On 11/30/07 9:46 AM, "Mike Perkowitz" <[EMAIL PROTECTED]> wrote: > > > Hello! We have a web site currently built on linux/apache/mysql/php. Most > pages do some mysql queries and then stuff the results into php/html > templates. We've be

Re: Very weak mapred performance on small clusters with a massive amount of small files

2007-11-30 Thread Ted Dunning
> Joydeep Sen Sarma wrote: >> Would it help if the multifileinputformat bundled files into splits based on >> their location? (wondering if remote copy speed is a bottleneck in map) >> If you are going to access the files many times after they are generated - >> wri
