Hi all,
I am currently processing a lot of raw CSV data and producing a
summary text file which I load into MySQL. On top of this I have a
PHP application that generates tiles for Google Maps (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev
On Apr 14, 2009, at 9:11 AM, Jothi Padmanabhan wrote:
2. Framework kills the task because it did not progress enough
That should count as a 'failed' task, not 'killed' - it is a bug if we
are not counting timed-out tasks against the job...
Arun
Hey Tim,
Why don't you put the PNGs in a SequenceFile in the output of your
reduce task? You could then have a post-processing step that unpacks
the PNG and places it onto S3. (If my numbers are correct, you're
looking at around 3TB of data; is this right? With that much, you
might
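Roughly, something along these lines could work - just a sketch, and the key/value names (tile id, PNG bytes) are placeholders rather than anything from your job:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Reduce output becomes a SequenceFile of <tile id, PNG bytes>, which a
// later step can read back and push to S3.
public class TileReducer extends MapReduceBase
    implements Reducer<Text, BytesWritable, Text, BytesWritable> {

  public void reduce(Text tileId, Iterator<BytesWritable> pngs,
      OutputCollector<Text, BytesWritable> out, Reporter reporter)
      throws IOException {
    // Assumes each tile id maps to exactly one rendered PNG.
    if (pngs.hasNext()) {
      out.collect(tileId, pngs.next());
    }
  }

  public static void configureOutput(JobConf conf) {
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(BytesWritable.class);
  }
}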
Thanks Brian,
This is pretty much what I was looking for.
Your calculations are correct, but based on the assumption that we
will need all tiles generated at all zoom levels. Given the sparsity
of the data, it actually results in only a few hundred GBs. I'll run a
second MR job with the map pushing to
Hey,
I am trying complex queries on Hadoop, for which I require more than one
job to run to get the final result. The results of job one capture a few
joins of the query, and I want to pass those results as input to a second
job and process them again so that I can get the final results. The
queries are such that I
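What I have in mind for the driver is roughly this (just a sketch - the class, job and path names are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
    Path output = new Path(args[2]);

    JobConf job1 = new JobConf(TwoStageDriver.class);
    job1.setJobName("stage-1-joins");
    FileInputFormat.setInputPaths(job1, input);
    FileOutputFormat.setOutputPath(job1, intermediate);
    // ... mapper/reducer for the joins go here ...
    JobClient.runJob(job1);                  // blocks until job 1 completes

    JobConf job2 = new JobConf(TwoStageDriver.class);
    job2.setJobName("stage-2-final");
    FileInputFormat.setInputPaths(job2, intermediate);
    FileOutputFormat.setOutputPath(job2, output);
    // ... mapper/reducer for the final aggregation go here ...
    JobClient.runJob(job2);
  }
}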
Hi.
Has anyone succeeded in running a web server from HDFS?
I mean, serving websites and applications directly from HDFS, perhaps via
FUSE/WebDAV?
Regards.
Sorry Brian, can I just ask please...
I have the PNGs in the Sequence file for my sample set. If I use a
second MR job and push to S3 in the map, surely I run into the
scenario where multiple tasks are running on the same section of the
sequence file and thus pushing the same data to S3. Am I
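My guess is the duplicates would come from speculative execution or re-run attempts; if that's right, something like this should keep extra attempts from writing (untested sketch, placeholder class name):

import org.apache.hadoop.mapred.JobConf;

public class S3PushJobSetup {
  public static JobConf createConf() {
    JobConf conf = new JobConf(S3PushJobSetup.class);
    conf.setSpeculativeExecution(false);  // covers both map and reduce tasks
    // the equivalent per-phase properties (0.19-era names):
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    return conf;
  }
}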
(Hadoop is used in the benchmarks)
http://database.cs.brown.edu/sigmod09/
There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems
Oh I agree, caching is wonderful when you plan to re-use the data in the
near term.
Solaris has an interesting feature: if the application writes enough
contiguous data in a short time window (tunable in later Nevada builds),
Solaris bypasses the buffer cache for the writes.
For reasons I have
Thanks for sharing this - I find these comparisons really interesting.
I have a small comment after skimming this very quickly.
[Please accept my apologies for commenting on such a trivial thing,
but personal experience has shown this really influences performance]
One thing not touched on in
I thought it a conspicuous omission not to discuss the cost of the
various approaches. Hadoop is free, though you have to spend
developer time; how much does Vertica cost on 100 nodes?
-Bryan
On Apr 14, 2009, at 7:16 AM, Guilherme Germoglio wrote:
(Hadoop is used in the benchmarks)
I've drawn a blank here! Can't figure out what's wrong with the ports. I can
ssh between the nodes but can't access the DFS from the slaves - it says Bad
connection to DFS. The master seems to be fine.
Mithila
On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra mnage...@asu.edu wrote:
Yes I can..
On
Is there a way for all the reducers to have access to the total number of
records that were processed in the Map phase?
For example, I'm trying to perform a simple document frequency calculation.
During the map phase, I emit word, 1 pairs for every unique word in every
document. During the
Hello,
Suppose I have a Hadoop job and have set my combiner to the Reducer class.
Do the map function and the combiner function run in the same JVM in
different threads, or in different JVMs?
I ask because I have to load a native library and if they are in the same
JVM then the native library is
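In case it matters, this is roughly how I'd guard the load so it only happens once per task JVM (sketch only; "mynative" is a placeholder library name):

public final class NativeLib {
  private static boolean loaded = false;

  private NativeLib() {}

  // Call from both the mapper's and the combiner's configure() methods;
  // the synchronized check makes the load happen at most once per JVM.
  public static synchronized void ensureLoaded() {
    if (!loaded) {
      System.loadLibrary("mynative");
      loaded = true;
    }
  }
}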
Hey Guilherme,
It's good to see comparisons, especially as it helps folks understand
better what tool is the best for their problem. As you show in your
paper, a MapReduce system is hideously bad at performing tasks that
column-store databases were designed for (selecting a single value
Hi Brian,
I'm sorry but it is not my paper. :-) I've posted the link here because
we're always looking for comparison data -- so, I thought this benchmark
would be welcome.
Also, I won't attend the conference. However, it would be a good idea for
someone who will attend to ask the authors directly
Hi.
2009/4/14 Michael Bieniosek micb...@microsoft.com
WebDAV server - https://issues.apache.org/jira/browse/HADOOP-496
There's a fuse issue somewhere too, but I never managed to get it working.
As far as serving websites directly from HDFS goes, I would say you'd
probably have better luck
They're in the same JVM, and I believe in the same thread.
- Aaron
On Tue, Apr 14, 2009 at 10:25 AM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
Hello,
Suppose I have a Hadoop job and have set my combiner to the Reducer class.
Does the map function and the combiner function run in the same
Are there any error messages in the log files on those nodes?
- Aaron
On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra mnage...@asu.edu wrote:
I've drawn a blank here! Can't figure out what's wrong with the ports. I can
ssh between the nodes but can't access the DFS from the slaves - says Bad
On Apr 14, 2009, at 12:47 PM, Guilherme Germoglio wrote:
Hi Brian,
I'm sorry but it is not my paper. :-) I've posted the link here because
we're always looking for comparison data -- so, I thought this benchmark
would be welcome.
Ah, sorry, I guess I was being dense when looking at
Thanks. I am using 0.19, and to confirm, the map and combiner (in the map
JVM) are run in *different* threads at the same time?
My native library is not thread safe, so I would have to implement locks.
Aaron's email gave me hope (since the map and combiner would then be running
sequentially), but
Hi,
We have released version 1.3 of CloudBase on SourceForge -
http://cloudbase.sourceforge.net/
[ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and is
Hello,
I found another solution for this. I just copied all the required jar files
into the lib folder of each Hadoop node. This way the job jar is not too big
and takes less time to distribute across the cluster.
Thanks,
Farhan
On Mon, Apr 13, 2009 at 7:22 PM, Nick Cen cenyo...@gmail.com wrote:
On Apr 14, 2009, at 11:10 AM, Saptarshi Guha wrote:
Thanks. I am using 0.19, and to confirm, the map and combiner (in
the map jvm) are run in *different* threads at the same time?
And the change was actually made in 0.18. So since then, the combiner
is called 0, 1, or many times on each
Aaron: Which log file do I look into - there are a lot of them. Here's what
the error looks like:
[mith...@node19:~]$ cd hadoop
[mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/192.168.0.18:54310. Already tried 0 time(s).
09/04/14
Also, would the way the port is accessed change if all these nodes are
connected through a gateway? I mean, in the hadoop-site.xml file? The Ubuntu
systems we worked with earlier didn't have a gateway.
Mithila
On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra mnage...@asu.edu wrote:
Aaron: Which
Hi Andy,
Take a look at this piece of code:
Counters counters = job.getCounters();
counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
    "REDUCE_INPUT_RECORDS").getCounter();
This is for reduce input records but I believe there is also a counter for
reduce output records. You should dig into
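If you need the reducers of a follow-up job to see such a total, one way (untested sketch; the "total.map.records" property name is made up) is to read the counter in the driver after the first job finishes and stash it in the second job's configuration:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class CounterHandoff {
  public static void run(JobConf first, JobConf second) throws Exception {
    RunningJob job = JobClient.runJob(first);        // blocks until job 1 is done
    Counters counters = job.getCounters();
    long mapRecords = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter",
        "MAP_INPUT_RECORDS").getCounter();           // or MAP_OUTPUT_RECORDS
    second.setLong("total.map.records", mapRecords); // reducers read this in configure()
    JobClient.runJob(second);
  }
}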
REMINDER
The DC area Hadoop User Group is meeting tomorrow. Full details at:
http://www.meetup.com/Hadoop-DC/calendar/10073493/
Christophe Bisciglia and Dr. Jimmy Lin will be speaking.
Cloudera's Founder, Christophe Bisciglia, will give a talk about
simplifying Hadoop configuration,
Hello everyone;
I want to write a distributed agent program, but I can't understand one
thing: what is the difference between a client-server program and an agent
program? Please help me...
Burak ISIKLI
Dumlupinar University
Electric
Everything gets dumped into the same files.
We don't assume anything at all about the format of the input data; it
gets dumped into Hadoop sequence files, tagged with some metadata to
say what machine and app it came from, and where it was in the
original stream.
There is a slight penalty from
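Very roughly, the writing side looks something like this (an illustrative sketch rather than our exact code; the key layout is an assumption):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RawDumpWriter {
  // Writes raw chunks to a SequenceFile, with a key that records which
  // machine and app the data came from and its offset in the original stream.
  public static void dump(Configuration conf, Path file, String machine,
      String app, long offset, byte[] chunk) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, file, Text.class, BytesWritable.class);
    try {
      writer.append(new Text(machine + "/" + app + "@" + offset),
                    new BytesWritable(chunk));
    } finally {
      writer.close();
    }
  }
}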
They are comparing an indexed system with one that isn't. Why is
Hadoop faster at loading than the others? Surely no one would be
surprised that it would be slower - I'm surprised at how well Hadoop
does. Who wants to write a paper for next year, grep vs reverse
index?
2009/4/15 Guilherme
Hey guys,
I'm trying my hand at outputting into a MySQL table but I'm running into
a Communications link failure during the reduce (in the
getRecordReader() method of DBOutputFormat to be more specific). Google
tells me this seems to happen when a SQL server drops the client
(usually after a
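For reference, this is roughly how the wiring looks (a sketch with placeholder table and column names - note the JDBC URL here names the database host explicitly rather than localhost):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;

public class MySqlOutputSetup {
  public static void configure(JobConf conf) {
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost.example.com:3306/mydb",  // a host every task node can reach
        "dbuser", "dbpassword");
    // Writes reduce output rows into the given table/columns.
    DBOutputFormat.setOutput(conf, "word_counts", "word", "count");
  }
}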
Ah, I didn't realize you were using HBase. It could definitely be the case
that HBase is explicitly setting file replication to 1 for certain files.
Unfortunately I don't know enough about HBase to know if or why certain
files are set to have no replication. This might be a good question for the
First, did you bounce the Hadoop daemons after you changed the configuration
files? I think you'll have to do this.
Second, I believe 0.19.1 has hadoop-default.xml baked into the jar. Try
setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives. For
whatever reason your
I think there is one important comparison missing from the paper - cost. The
paper does mention in the conclusion that Hadoop comes for free, but it didn't
give any details of how much it would cost to get a license for Vertica or
DBMS X to run on 100 nodes.
Further, with data warehouse products like
Responding to this for archiving purposes...
After being stuck for a couple of hours I realized that localhost meant a
different machine as it ran on different reducers :). Thus, replacing it
with the IP address did the trick.
-Original Message-
From: Streckfus, William [USA]
Aseem,
Regarding over-replication, it is mostly an app-related issue, as Alex mentioned.
But if you are concerned about under-replicated blocks in the fsck output:
these blocks should not stay under-replicated if you have enough nodes
and enough space on them (check the NameNode web UI).
Try grep-ing for
I have actually built an add-on class on top of ClusterMapReduceDelegate
that just runs a virtual cluster that persists for running tests on. It is
very nice, as you can interact with it via the web UI.
Especially since the virtual cluster stuff is somewhat flaky under Windows.
I have a question in to
Hello Everyone,
At times I get the following error when I restart my cluster desktops (before
that I shut down mapred and dfs properly, though).
The temp folder contains the directory it's looking for; still I get this
error.
The only solution I found to get rid of this error is to format my dfs
Hi,
I tried to use Hadoop Pipes for C++. The program was copied from the Hadoop
Wiki: http://wiki.apache.org/hadoop/C++WordCount.
But I am confused about the file name and the path. Should the program be
named "examples"? And where should I put this code so it gets compiled by
ant? I used the
BTW, that stack trace looks like the hadoop.log.dir issue.
This is the code from the init method in JobHistory:
LOG_DIR = conf.get("hadoop.job.history.location",
    "file:///" + new File(
        System.getProperty("hadoop.log.dir")).getAbsolutePath()
    + File.separator + "history");
looks
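A quick workaround sketch if you hit this outside bin/hadoop (the log path is just an example): set the system property before any of the job/history code runs, e.g.

public class LogDirWorkaround {
  public static void main(String[] args) {
    // Without this, System.getProperty("hadoop.log.dir") is null and the
    // fallback above blows up constructing the File.
    System.setProperty("hadoop.log.dir", "/tmp/hadoop-logs");
    // ... then build the JobConf and submit the job as usual ...
  }
}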