[ANNOUNCEMENT] Hadoop is a TLP

2008-01-16 Thread Doug Cutting
Apache's board this morning voted to make Hadoop a top-level project (TLP). The initial project management committee (PMC) for Hadoop will be composed of the following Hadoop committers: * Andrzej Bialecki [EMAIL PROTECTED] * Doug Cutting [EMAIL PROTECTED

Re: Question on running simultaneous jobs

2008-01-10 Thread Doug Cutting
Aaron Kimball wrote: Multiple students should be able to submit jobs and if one student's poorly-written task is grinding up a lot of cycles on a shared cluster, other students still need to be able to test their code in the meantime; I think a simple approach to address this is to limit the

Re: is a monolithic reduce task the right model?

2008-01-10 Thread Doug Cutting
Joydeep Sen Sarma wrote: - what if current reduce tasks were broken into separate copy, sort and reduce tasks? we would get much smaller units of recovery and scheduling. thoughts? If copy, sort and reduce are not scheduled together then it would be very hard to ensure they run on the same

Re: Question on running simultaneous jobs

2008-01-10 Thread Doug Cutting
Joydeep Sen Sarma wrote: if the cluster is unused - why restrict parallelism? if someone's willing to wake up at 4am to beat the crowd - they would just absolutely hate this. [It would be better to make your comments in Jira. ] But if someone starts a long-running job at night that uses the

Re: Question on running simultaneous jobs

2008-01-10 Thread Doug Cutting
Runping Qi wrote: An improvement over Doug's proposal is to make the limit soft in the following sense: 1. A job is entitled to run up to the limit number of tasks. 2. If there are free slots and no other job waits for their entitled slots, a job can run more tasks than the limit. 3. When a job

Re: Question on running simultaneous jobs

2008-01-10 Thread Doug Cutting
Joydeep Sen Sarma wrote: can we suspend jobs (just unix suspend) instead of killing them? We could, but they'd still consume RAM and disk. The RAM might eventually get paged out, but relying on that is probably a bad idea. So, this could work for tasks that don't use much memory and whose

Re: Question on Critical Region size for SequenceFile next/write - 0.15.1

2007-12-12 Thread Doug Cutting
Jason Venner wrote: On investigating, we discovered that the entirety of the next(key,value) and the entirety of the write( key, value) are synchronized on the file object. This causes all threads to back up on the serialization/deserialization. I'm not sure what you want to happen here.

Re: Question on Critical Region size for SequenceFile next/write - 0.15.1

2007-12-12 Thread Doug Cutting
Ted Dunning wrote: It seems reasonable that (de)-serialization could be done in threaded fashion and then just block on the (read) write itself. That would require a buffer per thread, e.g., replacing Writer#buffer with a ThreadLocal of DataOutputBuffers. The deflater-related objects would
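For reference, a minimal sketch of the per-thread buffer idea, assuming the org.apache.hadoop.io.DataOutputBuffer API of that era; the class and method names here are illustrative, not the actual patch:

    import org.apache.hadoop.io.DataOutputBuffer;

    // One serialization buffer per thread, so only the raw file write
    // needs to stay synchronized on the file object.
    public class PerThreadBuffer {
      private static final ThreadLocal<DataOutputBuffer> BUFFER =
          new ThreadLocal<DataOutputBuffer>() {
            protected DataOutputBuffer initialValue() {
              return new DataOutputBuffer();
            }
          };

      public static DataOutputBuffer get() {
        DataOutputBuffer b = BUFFER.get();
        b.reset(); // reuse the backing byte array across records
        return b;
      }
    }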

Re: Mapper Out of Memory

2007-12-06 Thread Doug Cutting
Rui Shi wrote: It is hard to believe that you need to enlarge heap size given the input size is only 10MB. In particular, you don't load all input at the same time. As for the program logic, not much fancy stuff, mostly cutting and sorting. So GC should be able to handle... Out-of-memory

Re: Removing nodes from the cluster?

2007-11-16 Thread Doug Cutting
Nate Carlson wrote: I'm testing out a Hadoop cluster on EC2.. we've currently got 20 nodes, and for some silly reason, I started the dfs daemon on all of the nodes. I'd like to drop back down to 3 nodes after we've finished testing the apps; is there any way to pull the other nodes from dfs

Re: anyone at apachecon?

2007-11-14 Thread Doug Cutting
Owen O'Malley wrote: Is anyone at ApacheCon this week? I'll be there tomorrow and Friday and will attend the BOF. See you soon, Doug

Re: anyone at apachecon?

2007-11-14 Thread Doug Cutting
John Wang wrote: What is the exact time and location for the Thursday night roundtable? http://wiki.apache.org/apachecon/BirdsOfaFeatherUs07 Doug

Re: Hadoop 0.15.0 - Reporter issue w/ timing out

2007-11-10 Thread Doug Cutting
Devaraj Das wrote: There has been a change in the way progress reporting is done since 0.14. The application has to explicitly send the status (incrCounter doesn't send any status). Even if the application hasn't made any progress, it is okay to call setStatus with the earlier
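For context, a minimal sketch of explicit status reporting under the old mapred API (generic signatures as in later 0.x releases; treat the details as an approximation):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class StatusMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      enum MyCounters { RECORDS }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        // incrCounter alone no longer counts as a status report in 0.14+,
        // so call setStatus explicitly to keep the task from timing out.
        reporter.setStatus("at byte offset " + key.get());
        reporter.incrCounter(MyCounters.RECORDS, 1);
        out.collect(value, key);
      }
    }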

Re: Tech Talk: Dryad

2007-11-09 Thread Doug Cutting
Stu Hood wrote: The slide comparing the time taken to spill to disk between vertices vs operating purely in memory (around minute 26) is definitely something to think about. I have not had a chance to watch the video yet, but, in MapReduce, if the intermediate dataset is larger than the RAM

Re: Tech Talk: Dryad

2007-11-09 Thread Doug Cutting
Vuk Ercegovac wrote: If there is a (reasonably) simple solution that addresses failures (correctness and cost), would there be interest? Sure, if it provides some significant benefits too. A good benchmark might be swapping randomly-generated keys and values at each stage, so it becomes a

Re: performance of multiple map-reduce operations

2007-11-06 Thread Doug Cutting
Chris Dyer wrote: For one computation I've been working on lately, over 25% of the time is spent in the last 10% of each map/reduce operation (this has to do with the natural distribution of my input data and would be unavoidable even given an optimal partitioning). During this time, I have

Re: Very weak mapred performance on small clusters with a massive amount of small files

2007-11-06 Thread Doug Cutting
André Martin wrote: I was thinking of a similar solution/optimization but I have the following problem: We have a large distributed system that consists of several spider/crawler nodes - pretty much like a web crawler system - every node writes its gathered data directly to the DFS. So there

Re: performance of multiple map-reduce operations

2007-11-06 Thread Doug Cutting
Joydeep Sen Sarma wrote: One of the controversies is whether in the presence of failures, this makes performance worse rather than better (kind of like udp vs. tcp - what's better depends on error rate). The probability of a failure per job will increase non-linearly as the number of nodes

Hadoop release 0.15.0 available

2007-11-05 Thread Doug Cutting
Hadoop release 0.15.0 is now available. This release contains many improvements, new features, bug fixes and optimizations. For more release details and downloads, visit: http://lucene.apache.org/hadoop/releases.html Notably, this release contains the first working version of HBase:

Re: Ant build: touch in init breaks build

2007-11-02 Thread Doug Cutting
The problem is that these template files are in subversion but are not included in the released sources. Most folks who build are using sources checked out from subversion and hence do not have this issue. The sources included with releases should be buildable, but I don't think we should

Re: /tmp/hadoop-${user.name} interpreted wrong when using Cygwin

2007-11-02 Thread Doug Cutting
Holger Stenzhorn wrote: I am using Hadoop under Cygwin with the default settings, so hadoop.tmp.dir is set to /tmp/hadoop-${user.name} via hadoop-default.xml. Now when I start using Hadoop it creates a directory c:\tmp\hadoop-holste (as holste is my user name obviously). But

Re: How to Setup Hbase in 10 mintues

2007-10-30 Thread Doug Cutting
Holger Stenzhorn wrote: This fix is exactly the same as done for hadoop-daemon.sh (and introduced into the Subversion repository already). Which begs the question: could HBase use hadoop-daemon.sh directly? If not, could hadoop-daemon.sh be modified to support HBase? Maintaining two

Re: can jobs be launched recursively within a mapper ?

2007-10-30 Thread Doug Cutting
Johnson, Jorgen wrote: Create a QueueInputFormat, which provides a RecordReader implementation that pops values off a globally accessible queue*. This would require filling the queue with values prior to loading the map/red job. This would allow the mappers to cram values back into the

Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

2007-10-24 Thread Doug Cutting
Lance Amundsen wrote: I am starting to wonder if it might be indeed impossible to get map jobs running w/o writing to the file system as in, not w/o some major changes to the job and task tracker code. I was thinking about creating an InputFormat that does no file I/O, instead is queue

Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

2007-10-18 Thread Doug Cutting
Lance Amundsen wrote: There's lots of references on decreasing DFS block size to increase maps to record ratios. What is the easiest way to do this? Is it possible with the standard SequenceFile class? You could specify the block size in the Configuration parameter to
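A sketch of that suggestion: pass a Configuration with a smaller block size to SequenceFile.createWriter. The dfs.block.size key name is my assumption for this era; verify against your release:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallBlockWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Smaller blocks mean more splits, hence more maps per file.
        conf.setInt("dfs.block.size", 1024 * 1024); // 1MB instead of the 64MB default
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/small-blocks.seq"),
            Text.class, IntWritable.class);
        writer.append(new Text("key"), new IntWritable(1));
        writer.close();
      }
    }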

Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

2007-10-18 Thread Doug Cutting
Lance Amundsen wrote: Example: let's say I have 10K one second jobs and I want the whole thing to run 2 seconds. I currently see no way for Hadoop to achieve this, That's right. That has not been a design goal to date. Tasks are typically expected to last at least several seconds. To fix

Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

2007-10-18 Thread Doug Cutting
Lance Amundsen wrote: Thx, I'll give that a try. Seems to me a method to tell hadoop to split a file every n key/value pairs would be logical. Or maybe a createSplitBoundary when appending key/value records? Splits should not require examining the data: that's not scalable. So they're

Re: jdk6 on darwin

2007-10-12 Thread Doug Cutting
Michael Bieniosek wrote: Does anybody know if there is a jdk6 available for Mac? I checked the apple developer site, and there doesn't seem to be one available, despite blogs from last year claiming apple was distributing it. Since I do my development work on a Mac, switching to jdk6 would

Re: jdk6 on darwin

2007-10-12 Thread Doug Cutting
Colin Evans wrote: I'm a bit confused by this discussion though. How would compiling the jars with Java 1.5 and running on 1.6 degrade performance (assuming that the jars don't use any new 1.6 APIs)? It won't. The claim is just that running with Java 1.5 degrades performance significantly.

Re: HBase performance

2007-10-12 Thread Doug Cutting
Jonathan Hendler wrote: Since Vertica is also a distributed database, I think it may be interesting to the newbies like myself on the list. To keep the conversation topical - while it's true there's a major campaign of PR around Vertica, I'd be interested in hearing more about how HBase

Re: Hadoop on Windows

2007-10-10 Thread Doug Cutting
Nick Lothian wrote: That turns out to be a Unix vs DOS line endings thing (!). Running the following commands fixed that: dos2unix.exe /cygdrive/c/dev/prog/hadoop-0.14.1/conf/masters dos2unix.exe /cygdrive/c/dev/prog/hadoop-0.14.1/conf/slaves That should not be required. When you install

Re: 14.1 to 14.2

2007-10-10 Thread Doug Cutting
Stu Hood wrote: Is it necessary to run the -upgrade operation to take a cluster from 0.14.1 to 0.14.2? None of the release pages say... No. Bugfix releases should be compatible. Doug

Re: Hadoop Get-Together Details

2007-10-02 Thread Doug Cutting
Erich Nachbar wrote: Could we use the Hadoop Wiki for this or do we need to setup a separate Wiki (which I would not prefer)? Hadoop wiki is fine for this. Doug

Re: Multicore nodes

2007-10-01 Thread Doug Cutting
Toby DiPasquale wrote: In short, yes. Hadoop's code takes advantage of multiple native threads and you can tune the level of concurrency in the system by setting mapred.map.tasks and mapred.reduce.tasks to take advantage of multiple cores on the nodes which have them. More importantly, you

Re: Hadoop Get-Together Details

2007-09-27 Thread Doug Cutting
C G wrote: Are there any other east coast developers interested in a Boston-area get together? FYI, I'll be at ApacheCon in Atlanta this November 14th and 15th, which might be a good place for a Hadoop BOF. http://www.us.apachecon.com/ Doug

Re: a million log lines from one job tracker startup

2007-09-26 Thread Doug Cutting
kate rhodes wrote: It retries as fast as it can. Yes, I can see that. It seems we should either insert a call to 'sleep(1000)' at JobTracker.java line 696, or remove that while loop altogether, since JobTracker#startTracker() will already retry on a one-second interval. In the latter
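Purely as illustration of the proposed fix (not the actual JobTracker code), a retry loop that pauses instead of spinning:

    public class RetryLoop {
      // Hypothetical stand-in for the startup call at issue: keep retrying,
      // but sleep a second between attempts so the log isn't flooded.
      public static void startWithRetry(Runnable tryToStart)
          throws InterruptedException {
        while (true) {
          try {
            tryToStart.run();
            return; // started successfully
          } catch (RuntimeException e) {
            System.err.println("start failed, retrying in 1s: " + e);
            Thread.sleep(1000);
          }
        }
      }
    }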

Re: Reduce Performance

2007-09-21 Thread Doug Cutting
Ross Boucher wrote: My cluster has 4 machines on it, so based on the recommendations on the wiki, I set my reduce count to 8. Unfortunately, the performance was less than ideal. Specifically, when the map functions had finished, I had to wait an additional 40% of the total job time just for

Re: Hadoop uses Client VM?

2007-09-18 Thread Doug Cutting
Toby DiPasquale wrote: Why does Hadoop use the Client JVM? I've been told that you should almost never use the Client JVM and instead use the Server JVM for anything even remotely long-running. Is the Server JVM less stable? It doesn't specify the client JVM, rather it just doesn't specify the

Re: JOIN-type operations with Hadoop...

2007-09-18 Thread Doug Cutting
Ted Dunning wrote: Is there any way to add our support to your proposal? Would that even help? Yes, please. Join the incubator-general mailing list and participate in the discussion. Your opinion is welcome there. Only votes from folks on the Incubator's PMC are binding, but votes from

Re: rack-awareness for hdfs

2007-09-17 Thread Doug Cutting
Jeff Hammerbacher wrote: has anyone leveraged the ability of datanodes to specify which datacenter and rack they live in? if so, any evidence of performance improvements? it seems that rack-awareness is only leveraged in block replication, not in task execution. It often doesn't make a big

Hadoop release 0.14.1 available

2007-09-05 Thread Doug Cutting
Release 0.14.1 fixes bugs in 0.14.0. For release details and downloads, visit: http://lucene.apache.org/hadoop/releases.html Thanks to all who contributed to this release! Doug

Re: Compression using Hadoop...

2007-09-04 Thread Doug Cutting
Ted Dunning wrote: I have to say, btw, that the source tree structure of this project is pretty ornate and not very parallel. I needed to add 10 source roots in IntelliJ to get a clean compile. In this process, I noticed some circular dependencies. Would the committers be open to some small

Re: Using Map/Reduce without HDFS?

2007-08-31 Thread Doug Cutting
mfc wrote: How can this get higher on the priority list? Even just a single appender. Fundamentally, priorities are set by those that do the work. As a volunteer organization, we can't assign tasks. Folks must volunteer to do the work. Y! has volunteered more than others on Hadoop, but

Re: Using Map/Reduce without HDFS?

2007-08-31 Thread Doug Cutting
Ted Dunning wrote: Presumably this won't be the kind of thing an outsider could do easily. There are no outsiders here, I hope! We try to conduct everything in the open, from design through implementation and testing. If you feel that you're missing discussions, please ask questions. Some

Re: Compression using Hadoop...

2007-08-31 Thread Doug Cutting
Arun C Murthy wrote: One way to reap benefits of both compression and better parallelism is to use compressed SequenceFiles: http://wiki.apache.org/lucene-hadoop/SequenceFile Of course this means you will have to do a conversion from .gzip to .seq file and load it onto hdfs for your job,

Re: FW: Removing files after processing

2007-08-28 Thread Doug Cutting
I think this is related to HADOOP-1558: https://issues.apache.org/jira/browse/HADOOP-1558 Per-job cleanups that are not run clientside must be run in a separate JVM, since we, as a rule, don't run user code in long-lived daemons. Doug Stu Hood wrote: Does anyone have any ideas on this

Re: FW: Removing files after processing

2007-08-28 Thread Doug Cutting
Matt Kent wrote: I would find it useful to have some sort of listener mechanism, where you could register an object to be notified of a job completion event and then respond to it accordingly. There is a job completion notification feature, configured via the job.end.notification.url property
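A sketch of using that feature; the $jobId/$jobStatus placeholder substitution is my understanding of how the notification URL works, so verify against your release:

    import org.apache.hadoop.mapred.JobConf;

    public class NotifyingJob {
      public static JobConf withNotification(JobConf conf) {
        // Hadoop issues an HTTP GET to this URL when the job finishes,
        // substituting $jobId and $jobStatus (assumed placeholder names).
        conf.set("job.end.notification.url",
                 "http://example.com/jobdone?id=$jobId&status=$jobStatus");
        return conf;
      }
    }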

Re: Problem submitting a job with hadoop 0.14.0

2007-08-24 Thread Doug Cutting
Thomas Friol wrote: Another question: why is 'hadoop.tmp.dir' user.name dependent? We need a directory that each user can write to without interfering with other users. If we didn't include the username, then different users would share the same tmp directory. This can cause

Re: Poly-reduce?

2007-08-24 Thread Doug Cutting
Ted Dunning wrote: It isn't hard to implement these programs as multiple fully fledged map-reduces, but it appears to me that many of them would be better expressed as something more like a map-reduce-reduce program. [ ... ] Expressed conventionally, this would have to write all of the user

Re: Issues with 0.14.0...

2007-08-24 Thread Doug Cutting
Michael Stack wrote: You might try backing out the HADOOP-1708 patch. It changed the test guarding the log message you report below. HADOOP-1708 isn't in 0.14.0. Doug

Re: Reduce Performance

2007-08-23 Thread Doug Cutting
Thorsten Schuett wrote: During the copy phase of reduce, the cpu load was very low and vmstat showed constant reads from the disk at ~15MB/s and bursty writes. At the same time, data was sent over the loopback device at ~15MB/s. I don't see what else could limit the performance here. The disk

Hadoop release 0.14.0 available

2007-08-21 Thread Doug Cutting
New features in release 0.14.0 include: - Better checksums in HDFS. Checksums are no longer stored in parallel HDFS files, but are stored directly by datanodes alongside blocks. This is more efficient for the namenode and also improves data integrity. - Pipes: A C++ API for MapReduce -

Re: Is mapred-default.xml read for dfs config?

2007-08-16 Thread Doug Cutting
Yes, that sounds correct. However it will probably change in 0.15, since so many folks have found it confusing. Exactly how it will change is still a matter of open debate. https://issues.apache.org/jira/browse/HADOOP-785 Doug Michael Bieniosek wrote: The wiki page

Re: Working with the output files of a hadoop application

2007-08-15 Thread Doug Cutting
Sebastien Rainville wrote: I am new to Hadoop. Looking at the documentation, I figured out how to write map and reduce functions but now I'm stuck... How do we work with the output file produced by the reducer? For example, the word count example produces a file with words as keys and the number

Re: To solve the checksum errors on the non-ecc mem machines.

2007-08-14 Thread Doug Cutting
Daeseong Kim wrote: To solve the checksum errors on the non-ecc memory machines, I modified some codes in DFSClient.java and DataNode.java. The idea is very simple. The original CHUNK structure is {chunk size}{chunk data}{chunk size}{chunk data}... The modified CHUNK structure is {chunk

Re: Specifying external jars in the classpath for Hadoop

2007-08-14 Thread Doug Cutting
Eyal Oren wrote: As far as I understand (that's what we do anyway), you have to submit one jar that contains all your dependencies (except for dependencies on hadoop libs), including external jars. The easiest is probably to use maven/ant to build such a big jar externally with all its

Re: Error reporting from map function

2007-07-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I've written a map task that will on occasion not compute the correct result. This can easily be detected, at which point I'd like the map task to report the error and terminate the entire map/reduce job. Does anyone know of a way I can do this? You can easily kill
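One client-side possibility (a sketch under assumed old-API signatures, not necessarily the approach Doug goes on to describe): poll a user-defined error counter and kill the job when it fires:

    import java.io.IOException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class KillOnError {
      enum Errors { BAD_RESULT } // incremented by map tasks on bad output

      public static void run(JobConf conf)
          throws IOException, InterruptedException {
        RunningJob job = new JobClient(conf).submitJob(conf);
        while (!job.isComplete()) {
          Thread.sleep(5000);
          if (job.getCounters().getCounter(Errors.BAD_RESULT) > 0) {
            job.killJob(); // terminate the entire map/reduce job
            break;
          }
        }
      }
    }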

Re: NameNode failover procedure

2007-07-20 Thread Doug Cutting
Andrzej Bialecki wrote: So far I learned that the secondary namenode keeps refreshing periodically its backup copies of fsimage and editlog files, and if the primary namenode disappears, it's the responsibility of the cluster admin to notice this, shut down the cluster, switch the configs

Re: undelete

2007-07-17 Thread Doug Cutting
Since Hadoop 0.12, if you configure fs.trash.interval to a non-zero value then 'bin/hadoop dfs -rm' will move things to a trash directory instead of immediately removing them. The Trash is periodically emptied of older items. Perhaps we should change the default value for this to 60 (one
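For reference, a minimal sketch of enabling the trash programmatically (the value is in minutes, as described above):

    import org.apache.hadoop.conf.Configuration;

    public class TrashConfig {
      public static Configuration withTrash() {
        Configuration conf = new Configuration();
        // Non-zero enables the trash; deleted files linger about this long.
        conf.setInt("fs.trash.interval", 60); // minutes
        return conf;
      }
    }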

Re: HDFS replica management

2007-07-17 Thread Doug Cutting
Phantom wrote: Here is the scenario I was concerned about. Consider three nodes in the system A, B and C which are placed say in different racks. Let us say that the disk on A fries up today. Now the blocks that were stored on A are not going to re-replicated (this is my understanding but I

Re: HDFS replica management

2007-07-17 Thread Doug Cutting
Phantom wrote: I am sure re-replication is not done on every heartbeat miss since that would be very expensive and inefficient. At the same time you cannot really tell if a node is partitioned away, crashed or just slow. Is it threshold based i.e I missed N heartbeats so re-replicate ? Yes,

Re: Trying to run nutch: no address associated with name

2007-07-12 Thread Doug Cutting
In the slaves file, 'localhost' should only be used alone, not with other hosts, since 'localhost' is not a name that other hosts can use to refer to a host. It's equivalent to 127.0.0.1, the loopback address. So, if you're specifying more than one host, it's best to use real hostnames or IP

Re: Setting number of Maps

2007-07-03 Thread Doug Cutting
You could define an InputFormat whose InputSplits are not files, but rather simply have a field that is a complex number. The complex field would be written and read by Writable#write() and Writable#readFields. This InputFormat would ignore the input directory, since it is not a file-based
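A sketch of such a split, with illustrative names; the old-API InputSplit carries its own Writable serialization, so the complex field travels with the split:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;

    public class ComplexSplit implements InputSplit {
      private double re, im; // the complex number this split represents

      public ComplexSplit() {} // no-arg constructor required for readFields
      public ComplexSplit(double re, double im) { this.re = re; this.im = im; }

      public void write(DataOutput out) throws IOException {
        out.writeDouble(re);
        out.writeDouble(im);
      }
      public void readFields(DataInput in) throws IOException {
        re = in.readDouble();
        im = in.readDouble();
      }
      public long getLength() { return 0; }  // no underlying bytes
      public String[] getLocations() { return new String[0]; } // no locality
    }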

Re: Multi-case dfs.name.dir

2007-06-25 Thread Doug Cutting
KrzyCube wrote: I found the File[] editFiles in FSEditLog.java, then I traced the call stack and found that dfs.name.dir can be configured with multiple values. Does this mean the NameNode data can be split into pieces, or is replication just set to the number of the dirs that

Re: Examples of chained MapReduce?

2007-06-22 Thread Doug Cutting
James Kennedy wrote: So far I've had trouble finding examples of MapReduce jobs that are kicked off by some one-time process that in turn kicks off other MapReduce jobs long after the initial driver process is dead. This would be more distributed and fault tolerant since it removes

Re: map task in initializing phase for too long

2007-06-21 Thread Doug Cutting
Jun Rao wrote: I am wondering if anyone has experienced this problem. Sometimes when I ran a job, a few map tasks (often just one) hang in the initializing phase for more than 3 minutes (it normally finishes in a couple seconds). They will eventually finish, but the whole job is slowed down

Re: map task in initializing phase for too long

2007-06-21 Thread Doug Cutting
Raghu Angadi wrote: Doug Cutting wrote: Owen wrote: One side note is that all of the servers have a servlet such that if you do http://node:port/stacks you'll get a stack trace of all the threads in the server. I find that useful for remote debugging. *smile* Although if it is a task jvm

Re: Cluster efficiency

2007-06-21 Thread Doug Cutting
Mathijs Homminga wrote: Is there a way to easily determine the efficiency of my cluster? Example: - there are 5 slaves which can each handle 1 task at a time - there is one job, split into 5 sub tasks (5 maps and 5 reduces) - 4 slaves finish their tasks in 1 minute - 1 slave finishes its tasks

Re: MapFile inner workings

2007-06-20 Thread Doug Cutting
Every 128th key is held in memory. So if you've got 1M keys in a MapFile, then opening a MapFile.Reader would read about 8K keys into memory. Binary search is used on these in-memory keys, so that a maximum of 127 entries must be scanned per random access. Doug Phantom wrote: Hi All I know
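A usage sketch, assuming the era's MapFile.Reader constructor that takes a directory name string:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Opening the reader loads every 128th key; get() binary-searches
        // those in-memory keys, then scans at most 127 entries on disk.
        MapFile.Reader reader = new MapFile.Reader(fs, "/data/my.map", conf);
        Text value = new Text();
        reader.get(new Text("some-key"), value);
        reader.close();
      }
    }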

Re: hdfsOpenFile() API

2007-06-14 Thread Doug Cutting
Phantom wrote: Which would mean that if I want to have my logs to reside in HDFS I will have to move them using copyFromLocal or some version thereof and then run Map/Reduce process against them ? Am I right ? Yes. HDFS is probably not currently suitable for directly storing log output as it

Re: Can Hadoop MapReduce be used without using HDFS

2007-06-11 Thread Doug Cutting
Neeraj Mahajan wrote: I read from Hadoop docs that the task scheduler tries to execute the task closer to the data. Can this functionality be applied without using HDFS? How? You can subclass LocalFileSystem and override getFileCacheHints() to return the host where the file is known to be
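A sketch of that override; the getFileCacheHints signature is my recollection of the era's FileSystem API, and the host lookup is hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.LocalFileSystem;
    import org.apache.hadoop.fs.Path;

    public class HintedLocalFileSystem extends LocalFileSystem {
      // Report the host known to hold this (non-HDFS) file so the
      // scheduler can place tasks near the data.
      public String[][] getFileCacheHints(Path f, long start, long len)
          throws IOException {
        return new String[][] { { hostFor(f) } };
      }

      private String hostFor(Path f) {
        return "datahost-01"; // hypothetical mapping from path to host
      }
    }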

Re: Bad concurrency bug in 0.12.3?

2007-06-01 Thread Doug Cutting
Calvin Yu wrote: The problem seems to be with the MapTask's (MapTask.java) sort progress thread (line #196) not stopping after the sort is completed, and hence the call to join() (line# 190) never returns. This is because that thread is only catching the InterruptedException, and not checking

Re: Bad concurrency bug in 0.12.3?

2007-06-01 Thread Doug Cutting
a thread dump of the hang up. Calvin On 6/1/07, Doug Cutting [EMAIL PROTECTED] wrote: Calvin Yu wrote: The problem seems to be with the MapTask's (MapTask.java) sort progress thread (line #196) not stopping after the sort is completed, and hence the call to join() (line# 190) never returns

Re: Pipe/redirection to HDFS?

2007-06-01 Thread Doug Cutting
Mark Meissonnier wrote: Sweet. It works. Thanks. Someone should put it on this wiki page http://wiki.apache.org/lucene-hadoop/hadoop-0.1-dev/bin/hadoop_dfs I don't have editing privileges. Anyone can create themselves a wiki account and edit pages. Just use the Login button at the top of

Re: Configuration and Hadoop cluster setup

2007-05-29 Thread Doug Cutting
Phantom wrote: (1) Set my fs.default.name to hdfs://host:port and also specify it in the JobConf configuration. Copy my sample input file into HDFS using bin/hadoop fs -put from my local file system. I then need to specify this file to my WordCount sample as input. Should I specify this file

Re: slowness in hadoop reduce phase when using distributed mode

2007-05-03 Thread Doug Cutting
What version of Hadoop are you using? On what sort of a cluster? How big is your dataset? Doug moonwatcher wrote: hey guys, i've setup hadoop in distributed mode (jobtracker, tasktracker, and hdfs daemons), and observing that the map phase executes really quickly but the reduce phase

Re: Many Checksum Errors

2007-05-02 Thread Doug Cutting
Dennis Kubes wrote: Do we know if this is a hardware issue. If it is possibly a software issue I can dedicate some resources to tracking down bugs. I would just need a little guidance on where to start looking? We don't know. The checksum mechanism is designed to catch hardware problems.

Re: Serializing code to nodes: no can do?

2007-04-24 Thread Doug Cutting
Pedro Guedes wrote: For this I need to be able to register new steps in my chain and pass them to hadoop to execute as a mapreduce job. I see two choices here: 1 - build a .job archive (main-class: mycrawler, submits jobs thru JobClient) with my new steps and dependencies in the 'lib/'

Re: Running on multiple CPU's

2007-04-17 Thread Doug Cutting
Eelco Lempsink wrote: I'm not trying to run it on a cluster though, only on one host with multiple CPU's. So I guess the local filesystem is shared and therefore it should be fine. Yes, that should be fine. However, if I try with fs.default.name set to file:///tmp/hadoop-test/ still

Re: Running on multiple CPU's

2007-04-16 Thread Doug Cutting
Eelco Lempsink wrote: Inspired by http://www.mail-archive.com/[EMAIL PROTECTED]/msg02394.html I'm trying to run Hadoop on multiple CPU's, but without using HDFS. To be clear: you need some sort of shared filesystem, if not HDFS, then NFS, S3, or something else. For example, the job client

bandwidth (Was: Re: Running on multiple CPU's)

2007-04-16 Thread Doug Cutting
Please use a new subject when starting a new topic. jafarim wrote: Sorry if being off topic, but we experienced a very low bandwidth with hadoop while copying files to/from the cluster (some 1/100 comparing to plain samba share). The bandwidth did not improve at all by adding nodes to the

Re: Running on multiple CPU's

2007-04-16 Thread Doug Cutting
Ken Krugler wrote: Has anybody been using Hadoop with ZFS? Would ZFS count as a readily available shared file system that scales appropriately? Sun's ZFS? I don't think that's distributed, is it? Does it provide a single namespace across an arbitrarily large cluster? From the

Re: bandwidth (Was: Re: Running on multiple CPU's)

2007-04-16 Thread Doug Cutting
jafarim wrote: On linux and jvm6 with normal IDE disks and a giga ethernet switch with corresponding NIC and with hadoop 0.9.11's HDFS. We wrote a C program by using the native libs provided in the package but then we tested again with distcp. The scenario was as follows: We ran the test on a

Re: Using Hadoop for Record storage

2007-04-12 Thread Doug Cutting
Andy Liu wrote: I'm exploring the possibility of using the Hadoop records framework to store these document records on disk. Here are my questions: 1. Is this a good application of the Hadoop records framework, keeping in mind that my goals are speed and scalability? I'm assuming the answer

Re: Large data sets

2007-02-06 Thread Doug Cutting
Konstantin Shvachko wrote: 200 bytes per file is theoretically correct, but rather optimistic :-( From a real system memory utilization I can see that HDFS uses 1.5-2K per file. And since each real file is internally represented by two files (1 real + 1 crc) the real estimate per file should

Re: Best practice for in memory data?

2007-01-25 Thread Doug Cutting
Johan Oskarsson wrote: Any advice on how to solve this problem? I think your current solutions sound reasonable. Would it be possible to somehow share a hashmap between tasks? Not without running multiple tasks in the same JVM. We could implement a mode where child tasks are run directly

Re: How to use MultithreadedMapRunner and MapRunner with the same hadoop-site.xml

2007-01-25 Thread Doug Cutting
Gu wrote: How can I use MultithreadedMapRunner in some cases, and MapRunner in others, for different jobs? Use JobConf#setMapRunnerClass() on jobs where you want to override the default MapRunner with, e.g., MultithreadedMapRunner. Do I have to use one hadoop-site.xml for one job? But I
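A sketch of that per-job override (MultithreadedMapRunner lives in org.apache.hadoop.mapred.lib, if memory serves):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    public class RunnerChoice {
      public static void useMultithreaded(JobConf conf) {
        // Overrides the default MapRunner for this job only;
        // other jobs keep the single-threaded default.
        conf.setMapRunnerClass(MultithreadedMapRunner.class);
      }
    }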

Re: Hadoop + Lucene integration: possible? how?

2007-01-15 Thread Doug Cutting
Andrzej Bialecki wrote: It's possible to use Hadoop DFS to host a read-only Lucene index and use it for searching (Nutch has an implementation of FSDirectory for this purpose), but the performance is not stellar ... Right, the best practice is to copy Lucene indexes to local drives in order

Re: s3

2007-01-08 Thread Doug Cutting
Tom White wrote: And what do people think of the following. We already have a bunch of stuff up in S3 that we'd like to use as input to a hadoop mapreduce job, only it wasn't put there by hadoop so it doesn't have the hadoop format where file-is-actually-a-list-of-blocks. [ ... ] The best

Re: s3

2007-01-08 Thread Doug Cutting
Tom White wrote: This sounds like a good plan. I wonder whether the existing block-based s3 scheme should be renamed (as s3block or similar) so s3 is the scheme that stores raw files as you describe? Perhaps s3fs would be best for the full FileSystem implementation, and simply s3 for direct

Re: default JAVA_HOME in hadoop-env.sh

2007-01-03 Thread Doug Cutting
Shannon -jj Behrens wrote: The default JAVA_HOME in hadoop-env.sh is /usr/bin/java. This is confusing because /usr/bin/java is a binary, not a directory. On my system, this resulted in: $ hadoop namenode -format /usr/local/hadoop-install/hadoop/bin/hadoop: 122: /usr/bin/java/bin/java: not

Re: Hadoop on Ubuntu 6.10

2007-01-03 Thread Doug Cutting
Can you please file a bug in Jira for this? https://issues.apache.org/jira/browse/HADOOP Select CREATE NEW ISSUE. Create yourself a Jira account if you don't already have one. Thanks, Doug Shannon -jj Behrens wrote: I'm using Hadoop on Ubuntu 6.10. I ran into: $ start-all.sh starting

Re: HadoopStreaming

2007-01-03 Thread Doug Cutting
Shannon -jj Behrens wrote: There's no link to http://wiki.apache.org/lucene-hadoop/HadoopStreaming on http://wiki.apache.org/lucene-hadoop/. It would be really nice if there were one. Please add one. Anyone can help maintain the wiki. Simply create yourself an account and edit the page.

Re: Urgent: Production Issues

2006-12-21 Thread Doug Cutting
Jagadeesh wrote: Over the past day we have managed to migrate our clusters from 0.7.2 to 0.9.0. Thanks for sharing your experiences. Please note that there is now a 0.9.2 release. There should be no compatibility issues upgrading from 0.9.0 to 0.9.2, and a number of bugs are fixed, so I

Re: How to say Hadoop

2006-12-08 Thread Doug Cutting
Owen O'Malley wrote: I think Hadoop is pronounced as h a: - d u: p with the emphasis on the second syllable. (key: http://en.wikipedia.org/wiki/IPA_chart_for_English) I believe the first vowel there is properly ae (as in cat), but in rapid speech this unstressed vowel turns to a schwa, so

Re: MapFile.get() has a bug?

2006-11-28 Thread Doug Cutting
Albert Chern wrote: Every time the size of the map file hits a multiple of the index interval, an index entry is written. Therefore, it is possible that an index entry is not added for the first occurrence of a key, but one of the later ones. The reader will then seek to one of those instead

Re: Mapredtest failure

2006-11-09 Thread Doug Cutting
Brendan Melville wrote: in hadoop-site.xml I had mapred.map.tasks and mapred.reduce.tasks set. Right, these parameters should be specified in mapred-default.xml, so that they do not override application code. This is a common confusion. Someday we should perhaps alter the configuration

Re: Help in setting Hadoop on multiple servers

2006-11-07 Thread Doug Cutting
howard chen wrote: 2006-11-07 21:53:35,492 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.lang.RuntimeException: Bad mapred.job.tracker: local To run distributed, you must configure mapred.job.tracker and fs.default.name to be host:port pairs on all
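For reference, a minimal sketch of the two settings in question, with hypothetical host names:

    import org.apache.hadoop.mapred.JobConf;

    public class DistributedConfig {
      public static JobConf distributed() {
        JobConf conf = new JobConf();
        // Both must be real host:port pairs, not "local", on every node.
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001");
        return conf;
      }
    }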

Re: why to force a single reduce task for Local Runner?

2006-11-07 Thread Doug Cutting
Feng Jiang wrote: look at the code: job.setNumReduceTasks(1); // force a single reduce task why? Is there any difficulty there to allow multiple reduce tasks? There is not a strong reason why a single reduce task is required. This code attempts to implement things as simply
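For contrast, on a real cluster the application picks the reduce count itself; only the local runner pins it to one. A minimal sketch:

    import org.apache.hadoop.mapred.JobConf;

    public class ReduceCount {
      public static void configure(JobConf conf) {
        // Honored when running against a real JobTracker;
        // the LocalJobRunner forces this back to 1 for simplicity.
        conf.setNumReduceTasks(8);
      }
    }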

Re: Help in setting Hadoop on multiple servers

2006-11-06 Thread Doug Cutting
howard chen wrote: but when I stop-all --config...it show... no jobtracker to stop serverA: Login Success! serverB: Login Success! serverB: no tasktracker to stop It looks like the tasktracker crashed on startup. Login to ServerB and look in its logs to see what happened. Doug

  1   2   >