On Jun 25, 2009, at 9:42 PM, Mark Kerzner wrote:
my guess, as good as anybody's, is that Pregel is to large graphs what
Hadoop is to large datasets.
I think it is much more likely a language that allows you to easily
define fixed point algorithms. I would imagine a distributed
On Thu, Jun 18, 2009 at 9:29 PM, Zheng Shao zs...@facebook.com wrote:
Yuntao Jia, our intern this summer, did a simple performance benchmark for
Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper: A
Comparison of Approaches to Large-Scale Data Analysis
It should be noted
Keys and values can be large. They are certainly capped above by
Java's 2GB limit on byte arrays. More practically, you will have
problems running out of memory with keys or values of 100 MB. There is
no restriction that a key/value pair fits in a single hdfs block, but
performance would suffer.
On Jun 18, 2009, at 8:45 AM, Leon Mergen wrote:
Could you perhaps elaborate on that 100 MB limit? Is it due to the
Java VM heap size? If so, could it, for example, be increased to
512MB by setting mapred.child.java.opts to '-Xmx512m'?
A couple of points:
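For reference, a config sketch (illustrative values, not from the thread) showing how the per-task child JVM heap is set with mapred.child.java.opts:

```xml
<!-- hadoop-site.xml (or set on the JobConf): give each child task JVM
     a 512MB heap; keys/values much larger than ~100MB will still be
     memory-hungry because they are buffered in full. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```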
On Jun 18, 2009, at 10:56 AM, pmg wrote:
Each line from FileA gets compared with every line from FileB1,
FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
In the general case, I'd define an InputFormat that takes two
directories, computes the input splits for each
On Wed, Jun 17, 2009 at 12:26 PM, Chuck Lam chuck@gmail.com wrote:
an alternative is to create a new WritableComparator and then set it
in the JobConf object with the method setOutputKeyComparatorClass().
You can use IntWritable.Comparator as a start.
The important part of that is to
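A plain-Java sketch (no Hadoop dependency; class and method names are illustrative) of what IntWritable.Comparator does underneath: decode the 4-byte big-endian serialized int from each buffer and compare the values. A custom comparator for, say, descending order would simply negate the result.

```java
// Illustrative sketch: compare two serialized ints directly from their
// byte buffers, the way a raw comparator for IntWritable keys would.
public class RawIntCompare {
    // Decode a 4-byte big-endian signed int, as java.io.DataOutput writes it.
    static int readInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Ascending comparison of the serialized values; negate for descending.
    public static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }
}
```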
Sorry, I forget how much isn't clear to people who are just starting.
FileInputFormat creates FileSplits. The serialization is very stable
and can't be changed without breaking things. The reason that pipes
can't stringify it is that the string form of input splits is
ambiguous (and since
*Sigh* We need Avro for input splits.
That is the expected behavior. It would be great if someone wrote a
C++ FileInputSplit class that took a binary string and converted it
back to a filename, offset, and length.
-- Owen
On Jun 10, 2009, at 2:43 AM, Lukáš Vlček wrote:
I am wondering how Google implemented the driving directions function
in Maps. More specifically, how did they make it so fast? I asked a
Google engineer about this and all he told me is just that there are a
bunch of MapReduce
Take a look at Arun's slide deck on Hadoop performance:
http://bit.ly/EDCg3
It is important to get io.sort.mb large enough, and io.sort.factor
should be closer to 100 than 10. I'd also use large block sizes
to reduce the number of maps. Please see the deck for other important
OK, I can understand your point, but I am sure that some people have
been trying to use the map-reduce programming model to do CFD, or
other scientific computing.
Any experience in this area from the list?
I know of one project that assumes it has an entire Hadoop cluster,
and generates the
On Jun 4, 2009, at 11:19 AM, Kris Jirapinyo wrote:
Hopefully we can have a new page on the hadoop wiki on how to
use custom compression so that people won't have to search through
the threads to find the answer in the future.
Yes, it would be extremely useful if you could start a wiki
On May 28, 2009, at 5:15 AM, Stuart White wrote:
I need to process a dataset that contains text records of fixed length
in bytes. For example, each record may be 100 bytes in length
The update to the terasort example has an InputFormat that does
exactly that. The key is 10 bytes and the
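The record-boundary arithmetic such an InputFormat's record reader performs can be sketched in plain Java (no Hadoop dependency; the 100-byte length is just the figure from the question):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: split a buffer into fixed-length records by stepping through it
// recordLength bytes at a time; a trailing partial record is ignored.
public class FixedLengthRecords {
    public static List<byte[]> split(byte[] data, int recordLength) {
        List<byte[]> records = new ArrayList<>();
        for (int off = 0; off + recordLength <= data.length; off += recordLength) {
            records.add(Arrays.copyOfRange(data, off, off + recordLength));
        }
        return records;
    }
}
```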
On May 20, 2009, at 3:07 AM, Tom White wrote:
Why does mapred depend on hdfs? MapReduce should only depend on the
FileSystem interface, shouldn't it?
Yes, I should have been consistent. In terms of compile-time
dependences, mapred only depends on core.
-- Owen
On May 15, 2009, at 3:25 PM, Aaron Kimball wrote:
Yikes. So part of sqoop would wind up in one source repository, and
part in
another? This makes my head hurt a bit.
I'd say rather that Sqoop is in Mapred and the adapter to Hive is in
Hive.
I'm also not convinced how that helps.
On May 18, 2009, at 3:42 AM, Steve Loughran wrote:
Presumably its one of those hard-to-reproduce race conditions that
only surfaces under load on a big cluster so is hard to replicate in
a unit test, right?
Yes. It reliably happens on a 100TB or larger sort, but almost never
happens on
We have observed that the default jvm on RedHat 5 can cause
significant data corruption in the map/reduce shuffle for those using
Hadoop 0.20. In particular, the guilty jvm is:
java version 1.6.0_05
Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
Java HotSpot(TM) Server VM (build
On May 15, 2009, at 2:05 PM, Aaron Kimball wrote:
In either case, there's a dependency there.
You need to split it so that there are no cycles in the dependency
tree. In the short term it looks like:
avro:
core: avro
hdfs: core
mapred: hdfs, core
hive: mapred, core
pig: mapred, core
On May 11, 2009, at 8:48 AM, Stuart White wrote:
I'd like to be able to access my job's counters from within my
Reducer's configure() method (so I can know how many records were
output from my mappers). Is this possible? Thanks!
The counters aren't available and if they were the
On May 7, 2009, at 12:38 PM, Foss User wrote:
Where can I find this example. I was not able to find it in the
src/examples directory.
It is in 0.20.
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/SecondarySort.java
-- Owen
Best is to use one user for map/reduce and another for hdfs. Neither
of them should be root or real users. With the setuid patch
(HADOOP-4490), it is possible to run the jobs as the submitted user.
Note that if you do that, you no doubt want to block certain system
uids (bin, mysql, etc.)
-- Owen
If you use custom key types, you really should be defining a
RawComparator. It will perform much much better.
-- Owen
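The byte-level comparison such a RawComparator typically falls back on (mirroring the semantics of WritableComparator.compareBytes) can be sketched in plain Java:

```java
// Sketch: lexicographic comparison of two serialized keys, byte by byte,
// without deserializing them into objects first. This is why raw
// comparators are fast: no object allocation per comparison.
public class RawBytes {
    public static int compareBytes(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;   // compare as unsigned bytes
            int b = b2[s2 + i] & 0xff;
            if (a != b) return a - b;
        }
        return l1 - l2;                  // shorter key sorts first on a tie
    }
}
```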
On Apr 22, 2009, at 10:44 PM, Koji Noguchi wrote:
Nigel,
When you have time, could you release 0.18.4 that contains some of the
patches that make our clusters 'stable'?
Is it just the patches that have already been applied to the 18
branch? Or are there more?
-- Owen
On Apr 14, 2009, at 11:10 AM, Saptarshi Guha wrote:
Thanks. I am using 0.19, and to confirm, the map and combiner (in
the map jvm) are run in *different* threads at the same time?
And the change was actually made in 0.18. So since then, the combiner
is called 0, 1, or many times on each
On Apr 10, 2009, at 11:12 AM, Sagar Naik wrote:
Hi,
I would like to implement a Multi-threaded reducer.
As per my understanding, the system does not have one because we
expect the output to be sorted.
However, in my case I don't need the output sorted.
You'd probably want to make a blocking
On Thu, Apr 9, 2009 at 9:30 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
On Apr 9, 2009, at 5:45 PM, Stas Oskin wrote:
Hi.
I have 2 questions about HDFS performance:
1) How fast are the read and write operations over the network, in
Mbps?
Depends. What hardware? How much
On Apr 10, 2009, at 9:07 AM, Stas Oskin wrote:
From your experience, how RAM-hungry is HDFS? Meaning, would an
additional 4GB of RAM (to make it 8GB as in your case) really change
anything?
I don't think the 4 to 8GB would matter much for HDFS. For Map/Reduce,
it is very important.
-- Owen
On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:
Owen,
Is binary streaming actually readily available?
https://issues.apache.org/jira/browse/HADOOP-1722
This was discussed over on:
https://issues.apache.org/jira/browse/HADOOP-5203
Doug uploaded a patch, but no one seems to be working on it.
-- Owen
On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
Hey all
I'm trying to connect two separate Hadoop clusters. Is it possible
to do so?
I need data to be shuttled back and forth between the two clusters.
Any
suggestions?
You should use hadoop distcp. It is a map/reduce program that
We always share the drives.
-- Owen
On Apr 5, 2009, at 0:52, zsongbo zson...@gmail.com wrote:
I usually set mapred.local.dir to share the disk space with DFS,
since some mapreduce jobs need big temp space.
On Fri, Apr 3, 2009 at 8:36 PM, Craig Macdonald
cra...@dcs.gla.ac.ukwrote:
On Apr 5, 2009, at 3:30 AM, Foss User wrote:
However, I see that anyone can easily impersonate as fossist
Yes, it is easy to work around the security in Hadoop. It is only
intended to prevent accidents, such as the time a student accidentally
wiped out the entire class' home directories.
On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:
Has anyone benchmarked the performance difference of using Hadoop?
1) Java vs C++
2) Java vs Streaming
Yes, a while ago. When I tested it using sort, Java and C++ were
roughly equal and streaming was 10-20% slower. Most of the cost with
On Apr 3, 2009, at 10:35 AM, Ricky Ho wrote:
I assume that the key is still sorted, right? That means I will get
all the (key1, valueX) entries before getting any of the (key2,
valueY) entries, and key2 is always bigger than key1.
Yes.
-- Owen
On Apr 3, 2009, at 1:41 PM, Tu, Tiankai wrote:
By the way, what is the largest size---in terms of total bytes, number
of files, and number of nodes---in your applications? Thanks.
The largest Hadoop application that has been documented is the Yahoo
Webmap.
10,000 cores
500 TB shuffle
300
On Apr 1, 2009, at 12:58 AM, Sandhya E wrote:
Hi
When a host is listed in the slaves file, both a DataNode and a
TaskTracker are started on that host. Is there a way in which we can
configure a node to be a datanode and not a tasktracker?
If you use hadoop-daemons.sh, you can pass a host list. So do:
What happened is that we added fast start (HADOOP-3136), which
launches more than one task per heartbeat. Previously, if your maps
didn't take very long, they finished before the heartbeat and the task
tracker was assigned a new map task. A side effect was that no reduce
tasks were launched until
includes the Hadoop track. The
video of the keynote and lunch talks are free, but there is a charge
for the Hadoop track.
The Hadoop track this year includes:
* Opening Keynote - Data Management in the Cloud - Raghu
Ramakrishnan
* Introduction to Hadoop - Owen O'Malley
* Hadoop Map
On Mar 23, 2009, at 9:15 AM, tim robertson wrote:
If Akira were to write his/her own Mappers, using types like
IntWritable would result in it being numerically sorted, right?
Yes.
Or they can use the KeyFieldBasedComparator. I think if you put the
following in your job conf, you'll get the
On Mar 18, 2009, at 10:26 AM, Stuart White wrote:
I'd like to implement some coordination between Mapper tasks running
on the same node. I was thinking of using ZooKeeper to provide this
coordination.
This is a very bad idea in the general case. It can be made to work,
but you need to
On Mar 17, 2009, at 9:18 AM, Richa Khandelwal wrote:
I was going through FAQs on Hadoop to optimize the performance of
map/reduce. There is a suggestion to set the number of reducers to a
prime
number closest to the number of nodes and number of mappers a prime
number
closest to several
On Mar 17, 2009, at 5:53 AM, Stuart White wrote:
Yeah, I thought of that, but I was concerned that, even if it did
work, it wasn't guaranteed behavior and might stop working in a
future release. I'll go ahead and give that a try.
No, it will continue to work. It is required for both
On Mar 17, 2009, at 7:44 PM, Bryan Duxbury wrote:
There is no compression in the mix for us, so that's not the culprit.
I'd be sort of willing to believe that spilling and sorting play a
role in this, but, wow, over 10x read and write? That seems like a
big problem.
It happened recently
Just use the current working directory. Each task gets a unique
directory that is erased when the task finishes.
-- Owen
On Mar 15, 2009, at 16:08, Mark Kerzner markkerz...@gmail.com wrote:
Hi,
what would be the best place to put temporary files for a reducer? I
believe
that since
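A minimal sketch of that advice (plain Java; the prefix and helper name are illustrative): create scratch files relative to the current working directory, which the framework gives each task privately and cleans up afterwards.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch: put reducer scratch files in the task's current working
// directory ("."), which is private to the task and erased when it ends.
public class TaskScratch {
    public static File create(String prefix) {
        try {
            return File.createTempFile(prefix, ".tmp", new File("."));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```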
On Mar 13, 2009, at 3:56 PM, Richa Khandelwal wrote:
You can initialize IntWritable with an empty constructor.
IntWritable i=new IntWritable();
NullWritable is better for that application than IntWritable. It
doesn't consume any space when serialized. *smile*
-- Owen
On Mar 7, 2009, at 10:56 PM, pavelkolo...@gmail.com wrote:
Do hadoop-default.xml and hadoop-site.xml on the master host matter
for the whole job, or do they matter for each node independently?
Please never modify hadoop-default. That is for the system defaults.
Please use hadoop-site for your
On Mar 2, 2009, at 11:46 PM, 鞠適存 wrote:
I wonder how to make the hadoop version number.
Each of 0.18, 0.19, and 0.20 has its own branch. The first release on
each branch is 0.X.0, then 0.X.1, and so on. New features are only
put into trunk and only important bug fixes are put into the
On Mar 9, 2009, at 1:00 PM, james warren wrote:
Speculative execution has existed far before 0.19.x, but AFAIK the
100%
issue has appeared (at least with greater frequency) since 0.19.0
came out.
Are you saying there are changes in how task progress is being
tracked?
In the past, it
On Mar 2, 2009, at 3:03 AM, Tom White wrote:
I believe the static singleton approach outlined by Scott will work
since the map classes are in a single classloader (but I haven't
actually tried this).
Even easier, you should just be able to do it with static
initialization in the Mapper
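A plain-Java sketch of the static-singleton idea (names are illustrative): because all map tasks running in one child JVM share a classloader, a static field is initialized once and every task sees the same instance.

```java
// Sketch: per-JVM shared state via static initialization. Every task
// that runs in this JVM's classloader sees the single INSTANCE.
public class SharedState {
    private static final SharedState INSTANCE = new SharedState();
    private int uses = 0;

    private SharedState() {}                    // only the static init constructs it

    public static SharedState get() { return INSTANCE; }

    public synchronized int use() { return ++uses; }
}
```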
On Feb 26, 2009, at 2:03 PM, Nathan Marz wrote:
Do the reducers batch copy map outputs from a machine? That is, if a
machine M has 15 intermediate map outputs destined for machine R,
will machine R copy the intermediate outputs one at a time or all at
once?
Currently, one at a time. In
On Mon, Feb 23, 2009 at 12:06 PM, Jimmy Wan ji...@indeed.com wrote:
part of my map/reduce process could be greatly sped up by mapping
key/value pairs in batches instead of mapping them one by one.
Can I safely hang onto my OutputCollector and Reporter from calls to map?
Yes. You can even use
On Feb 23, 2009, at 2:19 PM, Jimmy Wan wrote:
I'm not sure if this is
possible, but it would certainly be nice to either:
1) pass the OutputCollector and Reporter to the close() method.
2) Provide accessors to the OutputCollector and the Reporter.
If you look at the 0.20 branch, which
On Feb 10, 2009, at 5:20 PM, Matei Zaharia wrote:
I'd like to write a combiner that shares a lot of code with a reducer,
except that the reducer updates an external database at the end.
The right way to do this is to either do the update in the output
format or do something like:
class
On Feb 7, 2009, at 11:52 PM, Nick Cen wrote:
Hi,
I have a hadoop cluster with 4 PCs, and I want to integrate hadoop and
lucene together, so I copied some of the source code from nutch's
Indexer class. But when I run my job, I found that there is only 1
reducer running on 1 PC, so the
On Feb 6, 2009, at 8:52 AM, Bhupesh Bansal wrote:
Hey Tom,
I also got burned by this. Why does BytesWritable.getBytes() return
non-valid bytes? Or should we add a BytesWritable.getValidBytes()
kind of function?
It does it because continually resizing the array to the valid
length
There is a bug that when we restart the TaskTrackers they get counted twice.
The problem is the name is generated from the hostname and port number. When
TaskTrackers restart they get a new port number and get counted again. The
problem goes away when the old TaskTrackers time out in 10 minutes or
On Jan 22, 2009, at 7:25 AM, Brian MacKay wrote:
Is there a way to set the order of the keys in reduce as shown below,
no matter what order the collection in map occurs in?
The keys to reduce are *always* sorted. If the default order is not
correct, you can change the compare function.
On Jan 17, 2009, at 5:53 PM, Chanwit Kaewkasi wrote:
I would like to integrate Hadoop to my project using Ivy.
Is there any maven repository containing Hadoop jars that I can point
my configuration to?
Not yet, but soon. We recently introduced ivy into Hadoop, so I
believe we'll upload the
On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:
Jim,
As far as I know, there is no operation done after Reducer.
Correct, other than output promotion, which moves the output file to
the final filename.
But if you are a little experienced, you already know these.
Ordered list means one
On Jan 11, 2009, at 5:50 AM, tienduc_dinh wrote:
- Does Map-Reduce support parallel writing/reading ?
Yes. The maps all read in parallel with each other and the reduces all
write in parallel with each other.
- What happens after the Map-Reduce operation ?
The OutputFormat usually
On Jan 11, 2009, at 7:05 PM, tienduc_dinh wrote:
Is there any article which describes it ?
Please read the map/reduce tutorial:
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
-- Owen
On Jan 8, 2009, at 4:13 AM, Sandhya E wrote:
Does Hadoop 0.18 have an output collector that can print out only the
keys and suppress the values? In our case we need only the keys, and
hence we do output.collect(key, blank) and then reprocess the entire
output file to remove trailing tabs.
If
This is exactly what the setOutputValueGroupingComparator is for. Take
a look at HADOOP-4545, for an example using the secondary sort. If you
are using trunk or 0.20, look at
src/examples/org/apache/hadoop/examples/SecondarySort.java. The
checked-in example uses the new map/reduce api that
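The effect of the grouping comparator can be sketched in plain Java (no Hadoop; data shapes are illustrative): sort on the full (first, second) pair, but group the "reduce calls" by the first field alone, so each group receives its second values already in sorted order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the secondary-sort contract: the sort comparator orders on
// (first, second); the grouping comparator groups the reduce on first only.
public class SecondarySortSketch {
    public static Map<Integer, List<Integer>> group(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.<int[]>comparingInt(p -> p[0])
                              .thenComparingInt(p -> p[1]));
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] p : sorted) {
            groups.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        }
        return groups;
    }
}
```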
Yeah, I've seen this too. I just filed HADOOP-4983.
-- Owen
On Dec 16, 2008, at 9:30 AM, David Coe wrote:
Since the SequenceFileOutputFormat doesn't like nulls, how would I use
NullWritable? Obviously output.collect(key, null) isn't working.
If I
change it to output.collect(key, new IntWritable()) I get the result I
want (plus an int that I
On Dec 16, 2008, at 8:28 AM, David Coe wrote:
Is there a way to output the write() instead?
Use SequenceFileOutputFormat. It writes binary files using the write.
The reverse is SequenceFileInputFormat, which reads the sequence files
using readFields.
-- Owen
On Dec 16, 2008, at 12:36 AM, Stefan Groschupf wrote:
It is a neat way of visualizing who is behind the Hadoop source code
and how the project code base grew over the years.
It is interesting, but it would be more interesting to track the
authors of the patches rather than the committers.
On Dec 16, 2008, at 9:14 AM, David Coe wrote:
Does the SequenceFileOutputFormat work with NullWritable as the value?
Yes.
On Dec 15, 2008, at 8:08 AM, Sandhya E wrote:
I have a scenario where, while a map/reduce is working on a file, the
input file may get deleted and copied with a new version of the file.
All my files are compressed and hence each file is worked on by a
single node. I tried simulating the
On Dec 13, 2008, at 6:32 PM, Stuart White wrote:
First question: would Hadoop be an appropriate tool for something
like this?
Very
What is the best way to model this type of work in Hadoop?
As a map-only job with number of reduces = 0.
I'm thinking my mappers will accept a Long key
On Dec 14, 2008, at 4:47 PM, Ricky Ho wrote:
Yes, I am referring to the key INPUT INTO the map() function and
the key EMITTED FROM the reduce() function. Can someone explain
why we need a key in these cases and what the proper use of it is?
It was a design choice and could have
On Dec 9, 2008, at 2:22 AM, Devaraj Das wrote:
I know that the tasktracker/jobtracker doesn't have any command for
re-reading the configuration. There is built-in support for
restart/shut-down but those are via external scripts that internally
do a
kill/start.
Actually, HADOOP-4348 does
On Dec 9, 2008, at 11:35 AM, Songting Chen wrote:
Is there a way for the Map process to know it's the end of records?
I need to flush some additional data at the end of the Map process,
but wondering where I should put that code.
The close() method is called at the end of the map.
--
On Dec 9, 2008, at 7:34 PM, Aaron Kimball wrote:
That's true, but you should be aware that you no longer have an
OutputCollector available in the close() method.
True, but in practice you can keep a handle to it from the map method
and it will work perfectly. This is required for both
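The pattern described can be sketched in plain Java with simplified stand-ins for the Hadoop interfaces (names are illustrative, not the real API): cache the collector handed to map() and reuse it in close() to emit trailing records, since close() takes no arguments in the old API.

```java
// Sketch: save a handle to the collector from map() so close() can
// flush a final record (here, a count of the records seen).
public class FlushingMapper {
    public interface Collector { void collect(String key, String value); }

    private Collector cached;     // handle saved from the last map() call
    private int records = 0;

    public void map(String key, String value, Collector out) {
        cached = out;             // keep the handle for close()
        records++;
        out.collect(key, value);
    }

    public void close() {         // no collector parameter in the old API
        if (cached != null) {
            cached.collect("__COUNT__", Integer.toString(records));
        }
    }
}
```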
On Dec 8, 2008, at 8:02 AM, Christian Kunz wrote:
Comparing hadoop-default.xml of hadoop-0.18 with hadoop-0.17, didn't
map.sort.class change from
org.apache.hadoop.mapred.MergeSorter to
org.apache.hadoop.util.QuickSort?
Yes, but the quick sort is only used in the mapper. The reducer
already
On Dec 4, 2008, at 9:19 PM, abhinit wrote:
I have set some variables using the JobConf object:
jobConf.set("Operator", operator), etc.
How can I get an instance of the Configuration/JobConf object inside
a map method so that I can retrieve these variables?
In your Mapper class,
On Nov 28, 2008, at 9:45 AM, Ricky Ho wrote:
[Ricky] What exactly does the job.split contains ? I assume it
contains the specification for each split (but not its data), such
as what is the corresponding file and the byte range within that
file. Correct ?
Yes
[Ricky] I am curious
On Nov 26, 2008, at 4:35 AM, Jürgen Broß wrote:
I'm not sure how to let a Reducer know in its configure() method
which partition it will get from the Partitioner,
From:
http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Task+JVM+Reuse
look for mapred.task.partition, which is
On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote:
Hi Dennis,
I don't think that is possible to do.
No, it is not possible.
The block placement is determined
by HDFS internally (which is local, rack local and off rack).
Actually, it was changed in 0.17 or so to be node-local, off-rack,
On Sun, Nov 16, 2008 at 2:18 PM, Saptarshi Guha [EMAIL PROTECTED]wrote:
Hello,
If my understanding is correct, the combiner will read in values for
a given key, process it, output it and then **all** values for a key are
given to the reducer.
Not quite. The flow looks like
On Nov 13, 2008, at 9:38 PM, Sunil Jagadish wrote:
I have a mapper which needs to write output into two different kinds
of
files (output.collect()).
For my purpose, I do not need any reducers.
Set the number of reduces to 0.
Open a sequence file in the mapper and write the second stream to
On Nov 7, 2008, at 12:35 PM, Elia Mazzawi wrote:
I have 2 hadoop map/reduce programs that have the same map, but
different reduce methods.
Can I run them in a way so that the map only happens once?
If the input to the reduces is the same, you can put the two reduces
together and use
On Nov 6, 2008, at 11:29 AM, Ricky Ho wrote:
Disk I/O overhead
==
- The output of a Map task is written to a local disk and then later
uploaded to the Reduce task. While this enables a simple recovery
strategy when the map task fails, it incurs additional disk I/O
On Nov 6, 2008, at 2:30 PM, Ricky Ho wrote:
Hmmm, sounds like the combiner is invoked after the map() process
completed for the file split.
No. The data path is complex, but the combiner is called when the map
outputs are being spilled to disk. So roughly, the map will output
key, value
On Nov 5, 2008, at 2:21 PM, Tarandeep Singh wrote:
I want to know whether the key,values received by a particular
reducer at a
node are stored locally on that node or are stored on DFS (and hence
replicated over cluster according to replication factor set by user)
Map outputs (and reduce
If you don't need a sort, which is what it sounds like, Hadoop
supports that by turning off the reduce. That is done by setting the
number of reduces to 0. This typically is much faster than if you need
the sort. It also sounds like you may need/want the library that does
map-side joins.
of talks about Hadoop
o Introduction to Hadoop by Owen O'Malley
o A Tour of Apache Hadoop by Tom White
o Programming Hadoop Map/Reduce by Arun Murthy
o Hadoop at Yahoo! by Eric Baldeschwieler
o Hadoop Futures Panel
o Using Hadoop
On Oct 31, 2008, at 3:15 PM, Bhupesh Bansal wrote:
Why do we need these setters in JobConf?
jobConf.setMapOutputKeyClass(String.class);
jobConf.setMapOutputValueClass(LongWritable.class);
Just historical. The Mapper and Reducer interfaces didn't use to be
generic. (Hadoop used to run
On Oct 30, 2008, at 9:03 AM, Joel Welling wrote:
I'm writing a Hadoop Pipes application, and I need to generate a
bunch of integers that are unique across all map tasks. If each map
task has a unique integer ID, I can make sure my integers are unique
by including that integer ID. I
I uploaded a patch that does a secondary sort. Take a look at:
https://issues.apache.org/jira/browse/HADOOP-4545
It reads input with two numbers per line, such as:
-1 -4
-3 23
5 10
-1 -2
-1 300
-1 10
4 1
4 2
4 10
4 -1
4 -10
10 20
10 30
10 25
And produces output like (with 2 reduces):
On Oct 28, 2008, at 6:29 AM, Malcolm Matalka wrote:
I am trying to write an InputFormat and I am having some trouble
understanding how my data is being broken up. My input is a previous
hadoop job and I have added code to my record reader to print out the
FileSplit's start and end position,
On Oct 28, 2008, at 7:53 AM, David M. Coe wrote:
My mapper is Mapper<LongWritable, Text, IntWritable, IntWritable>
and my reducer is the identity. I configure the program using:
conf.setOutputKeyClass(IntWritable.class);
conf.setOutputValueClass(IntWritable.class);
If your application that you are drawing from is doing some sort of
web crawl that connects to lots of random servers, you may want to use
MultiThreadedMapRunner and do the remote connections in the map. If
you are just connecting to a small set of servers for each map, you
should put it
On Oct 26, 2008, at 8:38 AM, chaitanya krishna wrote:
I forgot to mention that although the number of map tasks is set in
the code as I mentioned before, the actual number of map tasks is not
necessarily the same, but is very close to that number.
The number of reduces is precisely
On Oct 16, 2008, at 1:40 PM, Zhengguo 'Mike' SUN wrote:
I was trying to write an application using the pipes api, but it
seemed the serialization part is not working correctly. More
specifically, I can't deserialize a string from a StringInStream
constructed from context.getInputSplit().
On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
We at LinkedIn are trying to run some large graph analysis problems
on Hadoop. The fastest way to run would be to keep a copy of the
whole graph in RAM at all mappers (graph size is about 8G in RAM). We
have a cluster of 8-core machines with 8G
On Oct 16, 2008, at 3:09 PM, Bhupesh Bansal wrote:
Let's say I want to implement a DFS in my graph. I am not able to
picture implementing it with the graph in pieces without putting a
depth bound (3-4) on it. Let's say we have 200M (4GB) edges to start
with
Start by watching the lecture
On Oct 14, 2008, at 8:37 PM, Bert Schmidt wrote:
I'm trying to think of how I might use it yet all the examples I
find are variations of word count.
Look in the src/examples directory.
PiEstimator - estimates the value of pi using distributed brute force
Pentomino - solves Pentomino
On Oct 3, 2008, at 7:49 AM, Devajyoti Sarkar wrote:
Briefly going through the DistributedCache information, it seems to
be a way
to distribute files to mappers/reducers.
Sure, but it handles the distribution problem for you.
One still needs to read the
contents into each map/reduce task