Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Hi all, I am currently processing a lot of raw CSV data and producing a summary text file which I load into MySQL. On top of this I have a PHP application that generates tiles for Google Maps (sample tile: http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800). Here is a (dev
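The sample URL above suggests a quad-tree style tile addressing scheme. As a hedged illustration only (the actual getEolTile.php parameters are not documented here), the standard "slippy map" conversion from a longitude/latitude to tile indices at a given zoom level looks like:

```java
// Sketch of standard web-map tile math; this is an assumption about the
// scheme tim's PHP tile server uses, not taken from his code.
public class TileMath {
    // x index of the tile containing the given longitude at zoom z
    static int lonToTileX(double lon, int z) {
        return (int) Math.floor((lon + 180.0) / 360.0 * (1 << z));
    }
    // y index of the tile containing the given latitude at zoom z (Mercator)
    static int latToTileY(double lat, int z) {
        double rad = Math.toRadians(lat);
        return (int) Math.floor(
            (1 - Math.log(Math.tan(rad) + 1 / Math.cos(rad)) / Math.PI) / 2 * (1 << z));
    }
    public static void main(String[] args) {
        // the equator/prime-meridian point falls in tile (1, 1) at zoom 1
        System.out.println(lonToTileX(0.0, 1) + "," + latToTileY(0.0, 1));
    }
}
```

At zoom z there are 4^z such tiles, which is why sparsity matters so much for the total output size discussed later in the thread.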

Re: Reduce task attempt retry strategy

2009-04-14 Thread Arun C Murthy
On Apr 14, 2009, at 9:11 AM, Jothi Padmanabhan wrote: 2. Framework kills the task because it did not progress enough That should count as a 'failed' task, not 'killed' - it is a bug if we are not counting timed-out tasks against the job... Arun

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread Brian Bockelman
Hey Tim, Why don't you put the PNGs in a SequenceFile in the output of your reduce task? You could then have a post-processing step that unpacks the PNGs and places them onto S3. (If my numbers are correct, you're looking at around 3TB of data; is this right? With that much, you might
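Brian's pack-then-unpack suggestion can be sketched without Hadoop on the classpath. A real job would use SequenceFile.Writer with Text keys and BytesWritable values; the DataStream stand-in below only illustrates the round trip, and every name in it is hypothetical:

```java
import java.io.*;
import java.util.*;

// Stand-in for the SequenceFile idea: many small (tile name, PNG bytes)
// records packed into one large blob, then unpacked for upload to S3.
public class TilePack {
    static byte[] pack(Map<String, byte[]> tiles) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            for (Map.Entry<String, byte[]> e : tiles.entrySet()) {
                out.writeUTF(e.getKey());           // tile name (the key)
                out.writeInt(e.getValue().length);  // PNG byte length
                out.write(e.getValue());            // PNG bytes (the value)
            }
            out.close();
            return bos.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    static Map<String, byte[]> unpack(byte[] packed) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
            Map<String, byte[]> tiles = new LinkedHashMap<>();
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] png = new byte[in.readInt()];
                in.readFully(png);
                tiles.put(name, png);
            }
            return tiles;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    public static void main(String[] args) {
        Map<String, byte[]> tiles = new LinkedHashMap<>();
        tiles.put("0_0_0.png", new byte[]{(byte) 0x89, 'P', 'N', 'G'});
        System.out.println(unpack(pack(tiles)).keySet());
    }
}
```

The payoff of the real SequenceFile version is the same as here: one large file instead of millions of tiny ones, which HDFS and the post-processing uploader both handle far better.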

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Thanks Brian, This is pretty much what I was looking for. Your calculations are correct but based on the assumption that at all zoom levels we will need all tiles generated. Given the sparsity of data, it actually results in only a few hundred GBs. I'll run a second MR job with the map pushing to

Re: Modeling WordCount in a different way

2009-04-14 Thread Pankil Doshi
Hey, I am trying complex queries on Hadoop, and I require more than one job to get the final result. The results of job one capture a few joins of the query, and I want to pass those results as input to the 2nd job and do further processing to get the final results. The queries are such that I

HDFS and web server

2009-04-14 Thread Stas Oskin
Hi. Has anyone succeeded in running a web server from HDFS? I mean, serving websites and applications directly from HDFS, perhaps via FUSE/WebDAV? Regards.

Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Sorry Brian, can I just ask please... I have the PNGs in the Sequence file for my sample set. If I use a second MR job and push to S3 in the map, surely I run into the scenario where multiple tasks are running on the same section of the sequence file and thus pushing the same data to S3. Am I
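One hedged answer to the duplicate-push worry (each map task actually gets a disjoint input split, but speculative or retried tasks can still run over the same split): key each S3 object deterministically by its tile id, so a repeated push simply overwrites identical bytes. A toy sketch with a HashMap standing in for the bucket, no AWS SDK involved, all names hypothetical:

```java
import java.util.*;

// Why deterministic object keys make duplicate uploads harmless:
// a second put of the same tile lands on the same key.
public class IdempotentPut {
    static Map<String, byte[]> bucket = new HashMap<>(); // mock S3 bucket
    static void putTile(String tileId, byte[] png) {
        bucket.put("tiles/" + tileId + ".png", png);     // same tile -> same key
    }
    public static void main(String[] args) {
        byte[] png = {1, 2, 3};
        putTile("0_0_0", png);
        putTile("0_0_0", png); // duplicate push from a speculative/retried task
        System.out.println(bucket.size()); // still one object
    }
}
```

With idempotent keys, turning off speculative execution becomes an optimization (avoiding wasted uploads) rather than a correctness requirement.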

fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Guilherme Germoglio
(Hadoop is used in the benchmarks) http://database.cs.brown.edu/sigmod09/ There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems

Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-14 Thread jason hadoop
Oh I agree, caching is wonderful when you plan to re-use the data in the near term. Solaris has an interesting feature: if the application writes enough contiguous data in a short time window (tunable in later Nevada builds), Solaris bypasses the buffer cache for the writes. For reasons I have

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread tim robertson
Thanks for sharing this - I find these comparisons really interesting. I have a small comment after skimming this very quickly. [Please accept my apologies for commenting on such a trivial thing, but personal experience has shown this really influences performance] One thing not touched on in

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Bryan Duxbury
I thought it a conspicuous omission to not discuss the cost of various approaches. Hadoop is free, though you have to spend developer time; how much does Vertica cost on 100 nodes? -Bryan On Apr 14, 2009, at 7:16 AM, Guilherme Germoglio wrote: (Hadoop is used in the benchmarks)

Re: Map-Reduce Slow Down

2009-04-14 Thread Mithila Nagendra
I've drawn a blank here! Can't figure out what's wrong with the ports. I can ssh between the nodes but can't access the DFS from the slaves - it says Bad connection to DFS. Master seems to be fine. Mithila On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra mnage...@asu.edu wrote: Yes I can.. On

Total number of records processed in mapper

2009-04-14 Thread Andy Liu
Is there a way for all the reducers to have access to the total number of records that were processed in the map phase? For example, I'm trying to perform a simple document frequency calculation. During the map phase, I emit (word, 1) pairs for every unique word in every document. During the
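The calculation Andy describes needs the total document count N on the reduce side, which is exactly what a map-phase counter would record. A minimal plain-Java sketch of the idea (the names and data are illustrative, not Hadoop API):

```java
import java.util.*;

// Map phase: emit each unique word once per document and bump a shared
// counter of documents processed; reduce phase: divide by that total.
public class DocFreq {
    static long totalDocs = 0;                        // stand-in for a MapReduce counter
    static Map<String, Integer> df(String[] docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            totalDocs++;                              // the count reducers need to see
            for (String w : new HashSet<>(Arrays.asList(doc.split(" "))))
                counts.merge(w, 1, Integer::sum);     // emit (word, 1) once per document
        }
        return counts;
    }
    public static void main(String[] args) {
        Map<String, Integer> counts =
            df(new String[]{"the cat sat", "the dog sat", "a cat"});
        // document frequency of "cat": appears in 2 of totalDocs = 3 documents
        System.out.println(counts.get("cat") + "/" + totalDocs);
    }
}
```

In a real job the counter would be read via the Counters API after (or outside) the map phase, as Jim's reply further down the digest shows.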

Is combiner and map in same JVM?

2009-04-14 Thread Saptarshi Guha
Hello, Suppose I have a Hadoop job and have set my combiner to the Reducer class. Does the map function and the combiner function run in the same JVM in different threads? or in different JVMs? I ask because I have to load a native library and if they are in the same JVM then the native library is

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Brian Bockelman
Hey Guilherme, It's good to see comparisons, especially as it helps folks understand better what tool is the best for their problem. As you show in your paper, a MapReduce system is hideously bad at performing tasks that column-store databases were designed for (selecting a single value

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Guilherme Germoglio
Hi Brian, I'm sorry but it is not my paper. :-) I've posted the link here because we're always looking for comparison data -- so, I thought this benchmark would be welcome. Also, I won't attend the conference. However, it would be a good idea for someone who will attend to ask the authors

Re: HDFS and web server

2009-04-14 Thread Stas Oskin
Hi. 2009/4/14 Michael Bieniosek micb...@microsoft.com webdav server - https://issues.apache.org/jira/browse/HADOOP-496 There's a fuse issue somewhere too, but I never managed to get it working. As far as serving websites directly from HDFS goes, I would say you'd probably have better luck

Re: Is combiner and map in same JVM?

2009-04-14 Thread Aaron Kimball
They're in the same JVM, and I believe in the same thread. - Aaron On Tue, Apr 14, 2009 at 10:25 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, Suppose I have a Hadoop job and have set my combiner to the Reducer class. Does the map function and the combiner function run in the same

Re: Map-Reduce Slow Down

2009-04-14 Thread Aaron Kimball
Are there any error messages in the log files on those nodes? - Aaron On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra mnage...@asu.edu wrote: I ve drawn a blank here! Can't figure out what s wrong with the ports. I can ssh between the nodes but cant access the DFS from the slaves - says Bad

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Brian Bockelman
On Apr 14, 2009, at 12:47 PM, Guilherme Germoglio wrote: Hi Brian, I'm sorry but it is not my paper. :-) I've posted the link here because we're always looking for comparison data -- so, I thought this benchmark would be welcome. Ah, sorry, I guess I was being dense when looking at

Re: Is combiner and map in same JVM?

2009-04-14 Thread Saptarshi Guha
Thanks. I am using 0.19, and to confirm, the map and combiner (in the map jvm) are run in *different* threads at the same time? My native library is not thread safe, so I would have to implement locks. Aaron's email gave me hope (since the map and combiner would then be running sequentially), but

Announcing CloudBase-1.3 release

2009-04-14 Thread Tarandeep Singh
Hi, We have released 1.3 version of CloudBase on sourceforge- http://cloudbase.sourceforge.net/ [ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce architecture. It uses ANSI SQL as its query language and comes with a JDBC driver. It is developed by Business.com and is

Re: Using 3rd party Api in Map class

2009-04-14 Thread Farhan Husain
Hello, I got another solution for this. I just pasted all the required jar files into the lib folder of each Hadoop node. This way the job jar is not too big and requires less time to distribute across the cluster. Thanks, Farhan On Mon, Apr 13, 2009 at 7:22 PM, Nick Cen cenyo...@gmail.com wrote:

Re: Is combiner and map in same JVM?

2009-04-14 Thread Owen O'Malley
On Apr 14, 2009, at 11:10 AM, Saptarshi Guha wrote: Thanks. I am using 0.19, and to confirm, the map and combiner (in the map jvm) are run in *different* threads at the same time? And the change was actually made in 0.18. So since then, the combiner is called 0, 1, or many times on each
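Owen's point is that a combiner must be safe to apply 0, 1, or many times, i.e. the operation must be associative and commutative, which summing counts is. A small stand-alone sketch (not Hadoop API) showing the final result is unchanged however many times the combine step runs:

```java
import java.util.*;

// Summing is associative and commutative, so applying the combine step
// zero, one, or two times before the reduce gives the same answer.
public class CombinerRuns {
    static List<Integer> combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return List.of(sum);              // one partial sum replaces many values
    }
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }
    public static void main(String[] args) {
        List<Integer> emitted = List.of(1, 1, 1, 1); // map outputs for one key
        int zeroRuns = reduce(emitted);
        int oneRun   = reduce(combine(emitted));
        int twoRuns  = reduce(combine(combine(emitted)));
        System.out.println(zeroRuns + " " + oneRun + " " + twoRuns);
    }
}
```

An operation without this property (e.g. computing a mean by averaging partial averages) would silently give different answers depending on how many times the framework chose to run the combiner.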

Re: Map-Reduce Slow Down

2009-04-14 Thread Mithila Nagendra
Aaron: Which log file do I look into - there are a lot of them. Here's what the error looks like: [mith...@node19:~]$ cd hadoop [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/ 192.168.0.18:54310. Already tried 0 time(s). 09/04/14
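The "Retrying connect to server: node18/192.168.0.18:54310" line means the client cannot reach the NameNode at the address configured in fs.default.name. A hadoop-site.xml sketch, assuming node18:54310 (taken from the error message) is where the NameNode should run; the value must be identical on master and slaves, and the NameNode must actually be listening there:

```xml
<!-- hadoop-site.xml on every node; node18:54310 comes from the error
     message above, everything else here is an assumption -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://node18:54310</value>
</property>
```

If the value already matches, the usual suspects are the NameNode process not running on node18 or a firewall/gateway blocking port 54310, which connects to Mithila's gateway question below.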

Re: Map-Reduce Slow Down

2009-04-14 Thread Mithila Nagendra
Also, would the way the port is accessed change if all these nodes are connected through a gateway? I mean in the hadoop-site.xml file? The Ubuntu systems we worked with earlier didn't have a gateway. Mithila On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra mnage...@asu.edu wrote: Aaron: Which

Re: Total number of records processed in mapper

2009-04-14 Thread Jim Twensky
Hi Andy, Take a look at this piece of code: Counters counters = job.getCounters(); counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter() This is for reduce input records but I believe there is also a counter for reduce output records. You should dig into

Hadoop User Group - DC meeting tomorrow

2009-04-14 Thread Sullivan, Joshua [USA]
REMINDER The DC area Hadoop User Group is meeting tomorrow. Full details at: http://www.meetup.com/Hadoop-DC/calendar/10073493/ Christophe Bisciglia and Dr. Jimmy Lin will be speaking. Cloudera's Founder, Christophe Bisciglia, will give a talk about simplifying Hadoop configuration,

Distributed Agent

2009-04-14 Thread Burak ISIKLI
Hello everyone; I want to write a distributed agent program. But I can't understand one thing: what is the difference between a client-server program and an agent program? Please help me... Burak ISIKLI Dumlupinar University Electric

Re: HDFS as a logfile ??

2009-04-14 Thread Ariel Rabkin
Everything gets dumped into the same files. We don't assume anything at all about the format of the input data; it gets dumped into Hadoop sequence files, tagged with some metadata to say what machine and app it came from, and where it was in the original stream. There is a slight penalty from

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Andrew Newman
They are comparing an indexed system with one that isn't. Why is Hadoop faster at loading than the others? Surely no one would be surprised that it would be slower - I'm surprised at how well Hadoop does. Who wants to write a paper for next year, grep vs reverse index? 2009/4/15 Guilherme

DBOutputFormat - Communications link failure

2009-04-14 Thread Streckfus, William [USA]
Hey guys, I'm trying my hand at outputting into a MySQL table but I'm running into a Communications link failure during the reduce (in the getRecordReader() method of DBOutputFormat to be more specific). Google tells me this seems to happen when a SQL server drops the client (usually after a

Re: More Replication on dfs

2009-04-14 Thread Alex Loddengaard
Ah, I didn't realize you were using HBase. It could definitely be the case that HBase is explicitly setting file replication to 1 for certain files. Unfortunately I don't know enough about HBase to know if or why certain files are set to have no replication. This might be a good question for the

Re: getting DiskErrorException during map

2009-04-14 Thread Alex Loddengaard
First, did you bounce the Hadoop daemons after you changed the configuration files? I think you'll have to do this. Second, I believe 0.19.1 has hadoop-default.xml baked into the jar. Try setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives. For whatever reason your

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Tarandeep Singh
I think there is one important comparison missing in the paper - cost. The paper does mention that Hadoop comes for free in the conclusion, but didn't give any details of how much it would cost to get a license for Vertica or DBMS-X to run on 100 nodes. Further, with data warehouse products like

RE: DBOutputFormat - Communications link failure

2009-04-14 Thread Streckfus, William [USA]
Responding to this for archiving purposes... After being stuck for a couple of hours I realized that localhost referred to a different machine on each reducer :). Thus, replacing it with the IP address did the trick. -Original Message- From: Streckfus, William [USA]

Re: More Replication on dfs

2009-04-14 Thread Raghu Angadi
Aseem, Regarding over-replication, it is mostly an app-related issue, as Alex mentioned. But if you are concerned about under-replicated blocks in fsck output: these blocks should not stay under-replicated if you have enough nodes and enough space on them (check the NameNode web UI). Try grep-ing for

Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread jason hadoop
I have actually built an add-on class on top of ClusterMapReduceDelegate that runs a persistent virtual cluster for running tests against; it is very nice, as you can interact with it via the web UI. Especially since the virtual cluster stuff is somewhat flaky under Windows. I have a question in to

Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist

2009-04-14 Thread Pankil Doshi
Hello Everyone, At times I get the following error when I restart my cluster desktops (before that I shut down mapred and dfs properly, though). The temp folder contains the directory it's looking for; still I get this error. The only solution I've found to get rid of this error is to format my dfs
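A common cause of this error (an assumption here, since the full stack trace is not shown): the NameNode's storage directory defaults to a path under /tmp, which some systems clear on reboot, forcing the reformat cycle Pankil describes. Pointing dfs.name.dir at a persistent directory in hadoop-site.xml avoids it; the paths below are hypothetical:

```xml
<!-- hadoop-site.xml; the paths are hypothetical, the point is to keep
     NameNode metadata and DataNode blocks out of /tmp -->
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
</property>
```

After changing these, one final format is needed so the new directories are initialized, and the daemons must be restarted.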

hadoop pipes problem

2009-04-14 Thread stchu
Hi, I tried to use Hadoop Pipes for C++. The program was copied from the Hadoop Wiki: http://wiki.apache.org/hadoop/C++WordCount. But I am confused about the file name and the path. Should the program be named 'examples'? And where should I put this code for ant to compile it? I used the

Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread jason hadoop
btw that stack trace looks like the hadoop.log.dir issue. This is the code out of the init method in JobHistory: LOG_DIR = conf.get("hadoop.job.history.location", "file:///" + new File(System.getProperty("hadoop.log.dir")).getAbsolutePath() + File.separator + "history"); looks
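Since that JobHistory line reads the hadoop.log.dir system property (and NPEs if it is unset), test harnesses built on ClusterMapReduceTestCase typically set the property before the mini-cluster starts. A minimal sketch; the path is hypothetical:

```java
// Ensure hadoop.log.dir is set before the mini-cluster's JobHistory init
// runs; "build/test/logs" is just an illustrative default.
public class TestSetup {
    static String ensureLogDir() {
        if (System.getProperty("hadoop.log.dir") == null)
            System.setProperty("hadoop.log.dir", "build/test/logs");
        return System.getProperty("hadoop.log.dir");
    }
    public static void main(String[] args) {
        System.out.println(ensureLogDir());
    }
}
```

Calling something like ensureLogDir() from the test's setUp() method, before super.setUp() brings up the virtual cluster, is one way to avoid the stack trace Jason is referring to.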