Re: NN Memory Jumps every 1 1/2 hours

2012-12-27 Thread Edward Capriolo
would like to hear from you about whether it continued to grow. One instance of this I had seen in the past was related to weak references tied to socket objects. I do not see that happening here though. Sent from phone On Dec 23, 2012, at 10:34 AM, Edward Capriolo edlinuxg...@gmail.com wrote

Re: NN Memory Jumps every 1 1/2 hours

2012-12-27 Thread Edward Capriolo
releases (>= 0.20.204), several memory and startup optimizations have been done. It should help you as well. On Thu, Dec 27, 2012 at 1:48 PM, Edward Capriolo edlinuxg...@gmail.com wrote: So it turns out the issue was just the size of the filesystem. 2012-12-27 16:37:22,390 WARN

Re: NN Memory Jumps every 1 1/2 hours

2012-12-27 Thread Edward Capriolo
at 2:22 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I am not sure GC was a factor. Even when I forced a GC it cleared 0% of memory. One would think that since the entire NameNode image is stored in memory the heap would not need to grow beyond that, but that sure does not seem

Re: NN Memory Jumps every 1 1/2 hours

2012-12-23 Thread Edward Capriolo
Tried this.. NameNode is still Ruining my Xmas on its slow death march to OOM. http://imagebin.org/240453 On Sat, Dec 22, 2012 at 10:23 PM, Suresh Srinivas sur...@hortonworks.com wrote: -XX:NewSize=1G -XX:MaxNewSize=1G

Re: NN Memory Jumps every 1 1/2 hours

2012-12-22 Thread Edward Capriolo
this where GC kept falling behind and we either ran out of heap or would be in full gc. By reducing heap, we were forcing concurrent mark sweep to occur and avoided both full GC and running out of heap space as the JVM would collect objects more frequently. On Dec 21, 2012, at 8:24 PM, Edward
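A sketch of where settings like these typically land, via hadoop-env.sh (the flag values are illustrative, not a recommendation):

  export HADOOP_NAMENODE_OPTS="-XX:NewSize=1G -XX:MaxNewSize=1G \
    -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 \
    $HADOOP_NAMENODE_OPTS"

With CMS, lowering the initiating-occupancy fraction makes the collector start concurrent sweeps earlier, which is the "forcing concurrent mark sweep to occur" idea described above.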

Re: NN Memory Jumps every 1 1/2 hours

2012-12-22 Thread Edward Capriolo
said that... outside of MapR, have any of the distros certified themselves on 1.7 yet? On Dec 22, 2012, at 6:54 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I will give this a go. I have actually gone into JMX and manually triggered GC; no memory is returned. So I assumed something

Re: NN Memory Jumps every 1 1/2 hours

2012-12-22 Thread Edward Capriolo
also be the reason. Do you collect gc logs? Send those as well. Sent from a mobile device On Dec 22, 2012, at 9:51 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Newer 1.6 releases are getting close to 1.7, so I am not going to fear a number and fight the future. I have been at around 27

Re: NN Memory Jumps every 1 1/2 hours

2012-12-22 Thread Edward Capriolo
but not all, and then the line keeps rising. Delta is about 10-17 hours until the heap is exhausted. On Sat, Dec 22, 2012 at 7:03 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Blocks is ~26,000,000; Files is a bit higher, ~27,000,000. Currently running: [root@hnn217 ~]# java -version java version

NN Memory Jumps every 1 1/2 hours

2012-12-21 Thread Edward Capriolo
I have an old hadoop 0.20.2 cluster. Have not had any issues for a while (which is why I never bothered to upgrade). Suddenly it OOMed last week. Now the OOMs happen periodically. We have a fairly large NameNode heap, Xmx 17GB. It is a fairly large FS, about 27,000,000 files. So the strangest
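For scale, a commonly cited rule of thumb is on the order of 150-200 bytes of NameNode heap per namespace object (file, directory, or block); treat the exact figure as an assumption. Back-of-the-envelope for this cluster:

  ~27,000,000 files + ~26,000,000 blocks ≈ 53,000,000 objects
  53,000,000 × ~150 bytes ≈ 8 GB of steady-state heap

So a 17GB Xmx is not obviously undersized for the image itself, which is why the periodic jumps looked like a leak before the filesystem-size explanation that closes the thread.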

Re: Regarding DataJoin contrib jar for 1.0.3

2012-07-25 Thread Edward Capriolo
DataJoin is an example. Most people doing joins use Hive or Pig rather than code them up themselves. On Tue, Jul 24, 2012 at 5:19 PM, Abhinav M Kulkarni abhinavkulka...@gmail.com wrote: Hi, Do we not have any info on this? Join must be such a common scenario for most of the people out on

Re: hadoop FileSystem.close()

2012-07-24 Thread Edward Capriolo
In all my experience, you let FileSystem instances close themselves. On Tue, Jul 24, 2012 at 10:34 AM, Koert Kuipers ko...@tresata.com wrote: Since FileSystem is a Closeable I would expect code using it to be like this: FileSystem fs = path.getFileSystem(conf); try { // do something with
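The reason, as a minimal sketch (assuming the default FileSystem cache behavior):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);  // returns a cached instance
                                         // shared across the whole JVM
  // use fs normally, but do not close() it: closing the shared instance
  // breaks every other caller holding the same cached object; the cache
  // is torn down at JVM shutdown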

Re: Group mismatches?

2012-07-16 Thread Edward Capriolo
In all places I have found it only to be the primary group, not all the user's supplemental groups. On Mon, Jul 16, 2012 at 3:05 PM, Clay B. c...@clayb.net wrote: Hi all, I have a Hadoop cluster which uses Samba to map an Active Directory domain to my CentOS 5.7 Hadoop cluster. However, I

Re: stuck in safe mode after restarting dfs after found dead node

2012-07-14 Thread Edward Capriolo
me know what sort of details I can provide to help resolve this issue. Best, Juan On Fri, Jul 13, 2012 at 4:10 PM, Edward Capriolo edlinuxg...@gmail.com wrote: If the datanode is not coming back you have to explicitly tell hadoop to leave safemode. http://hadoop.apache.org/common/docs

Re: stuck in safe mode after restarting dfs after found dead node

2012-07-13 Thread Edward Capriolo
If the datanode is not coming back you have to explicitly tell hadoop to leave safemode. http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Safemode hadoop dfsadmin -safemode leave On Fri, Jul 13, 2012 at 9:35 AM, Juan Pino juancitomiguel...@gmail.com wrote: Hi, I can't get

Re: Setting number of mappers according to number of TextInput lines

2012-06-16 Thread Edward Capriolo
No. The number of lines is not known at planning time. All you know is the size of the blocks. You want to look at mapred.max.split.size. On Sat, Jun 16, 2012 at 5:31 AM, Ondřej Klimpera klimp...@fit.cvut.cz wrote: I tried this approach, but the job is not distributed among 10 mapper nodes.
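A sketch of using that knob to fan a job out over more mappers (the 16 MB cap is illustrative):

  import org.apache.hadoop.conf.Configuration;

  Configuration conf = new Configuration();
  // cap each input split at 16 MB so even a few large files
  // produce enough splits to occupy all ten nodes
  conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);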

Re: Ideal file size

2012-06-06 Thread Edward Capriolo
It does not matter what the file size is, because files are split into blocks, which are what the NN tracks. For larger deployments you can go with a large block size like 256MB or even 512MB. Generally the bigger the file the better, though split calculation is very input-format dependent.
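A sketch of choosing a larger block size at write time (dfs.block.size is the 0.20-era key; the 256 MB value and path are illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  Configuration conf = new Configuration();
  // files created with this conf get 256 MB blocks
  conf.setLong("dfs.block.size", 256L * 1024 * 1024);
  FileSystem fs = FileSystem.get(conf);
  FSDataOutputStream out = fs.create(new Path("/data/big-file"));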

Re: Hadoop on physical Machines compared to Amazon Ec2 / virtual machines

2012-05-31 Thread Edward Capriolo
We were actually in an Amazon/host-it-yourself debate with someone, which prompted us to do some calculations: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/myth_busters_ops_editition_is We calculated the cost for storage alone of 300 TB on ec2 as 585K a month! The cloud people hate

Re: Hadoop with Sharded MySql

2012-05-31 Thread Edward Capriolo
Maybe you can do some VIEWs or unions or merge tables on the mysql side to overcome the aspect of launching so many sqoop jobs. On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani hivehadooplearn...@gmail.com wrote: All, We are trying to implement sqoop in our environment which has 30 mysql

Re: Splunk + Hadoop

2012-05-22 Thread Edward Capriolo
So a while back there was an article: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data I recently did my own take on full text searching your logs with solandra, though I have prototyped using solr inside datastax enterprise as well.

Re: Problems with block compression using native codecs (Snappy, LZO) and MapFile.Reader.get()

2012-05-22 Thread Edward Capriolo
If you are getting a SIGSEGV it never hurts to try a more recent JVM; update 21 has many bug fixes at this point. On Tue, May 22, 2012 at 11:45 AM, Jason B urg...@gmail.com wrote: JIRA entry created: https://issues.apache.org/jira/browse/HADOOP-8423 On 5/21/12, Jason B urg...@gmail.com wrote:

Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Edward Capriolo
Honestly that is a hassle; going from 205 to cdh3u3 is probably more of a cross-grade than an upgrade or downgrade. I would just stick it out. But yes, like Michael said: two clusters on the same gear and distcp. If you are using RF=3 you could also lower your replication to rf=2 'hadoop dfs
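The command is cut off above; the standard form of that replication change is presumably something like (path and -w flag illustrative):

  hadoop dfs -setrep -w 2 -R /

which recursively drops the replication factor to 2 and waits for the change to apply, freeing roughly a third of the space used under RF=3.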

Re: Feedback on real world production experience with Flume

2012-04-22 Thread Edward Capriolo
, Alexander Lorenz wget.n...@googlemail.com wrote: no. That is the Flume Open Source Mailinglist. Not a vendor list. NFS logging has nothing to do with decentralized collectors like Flume, JMS or Scribe. sent via my mobile device On Apr 22, 2012, at 12:23 AM, Edward Capriolo edlinuxg

Re: hadoop.tmp.dir with multiple disks

2012-04-22 Thread Edward Capriolo
Since each hadoop task is isolated from the others, having more tmp directories allows you to isolate that disk bandwidth as well. By listing the disks you give more firepower to the shuffle-sort and merge processes. Edward On Sun, Apr 22, 2012 at 10:02 AM, Jay Vyas jayunit...@gmail.com wrote: I
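A sketch of what listing the disks looks like (paths illustrative):

  import org.apache.hadoop.conf.Configuration;

  Configuration conf = new Configuration();
  // one local dir per physical disk; tasks round-robin across them,
  // so spill and merge I/O spreads over all spindles
  conf.set("mapred.local.dir",
      "/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local");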

Re: Feedback on real world production experience with Flume

2012-04-21 Thread Edward Capriolo
It seems pretty relevant. If you can directly log via NFS that is a viable alternative. On Sat, Apr 21, 2012 at 11:42 AM, alo alt wget.n...@googlemail.com wrote: We decided NO product and vendor advertising on apache mailing lists! I do not understand why you'll put that closed source stuff

Re: Multiple data centre in Hadoop

2012-04-19 Thread Edward Capriolo
Hive is beginning to implement Region support where one metastore will manage multiple filesystems and jobtrackers. When a query creates a table it will then be copied to one or more datacenters. In addition the query planner will intelligently attempt to run queries in regions only where all the

Re: Hive Thrift help

2012-04-16 Thread Edward Capriolo
You can NOT connect to hive thrift to confirm its status; Thrift is Thrift, not HTTP. But you are right to say HiveServer does not produce any output by default. If netstat -nl | grep 1 shows the listener, it is up. On Mon, Apr 16, 2012 at 5:18 PM, Rahul Jain rja...@gmail.com wrote: I am

Re: Issue with loading the Snappy Codec

2012-04-15 Thread Edward Capriolo
You need three things. 1) Install snappy in a place the system can pick it up automatically, or add it to your java.library.path. Then add the full name of the codec to io.compression.codecs. hive> set io.compression.codecs;
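A sketch of the codec registration step (class list illustrative; verify the Snappy codec's package name against your distro's build):

  import org.apache.hadoop.conf.Configuration;

  Configuration conf = new Configuration();
  conf.set("io.compression.codecs",
      "org.apache.hadoop.io.compress.DefaultCodec,"
    + "org.apache.hadoop.io.compress.GzipCodec,"
    + "org.apache.hadoop.io.compress.SnappyCodec");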

Re: Accessing HDFS files from an servlet

2012-04-13 Thread Edward Capriolo
http://www.edwardcapriolo.com/wiki/en/Tomcat_Hadoop Have all the hadoop jars and conf files in your classpath --or-- construct your own conf and URI programmatically: Configuration conf = new Configuration(); URI i = URI.create("hdfs://192.168.220.200:54310"); FileSystem fs = FileSystem.get(i, conf); On Fri, Apr 13, 2012 at 7:40 AM, Jessica

Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Edward Capriolo
Nathan put the steps together on this blog: http://blog.milford.io/2012/01/kicking-the-tires-on-hadoop-0-23-pseudo-distributed-mode/ It fills out the missing details such as <property> <name>yarn.nodemanager.local-dirs</name> <value></value> <description>the local directories used

Re: activity on IRC .

2012-03-29 Thread Edward Capriolo
You are better off on the ML. Hadoop is designed for high-throughput, not low-latency, operations. This carries over to the IRC room :) JK I feel most hadoop questions are harder to ask and answer on IRC (large code segments, deep questions) and as a result the mailing list is more natural for

Re: state of HOD

2012-03-09 Thread Edward Capriolo
It has been in a quasi-defunct state for a while now. It seems like hadoop.next and YARN help achieve a similar effect to HOD. Plus it has this new hotness factor. On Fri, Mar 9, 2012 at 2:41 AM, Stijn De Weirdt stijn.dewei...@ugent.be wrote: (my apologies for those who have received this

Re: Should splittable Gzip be a core hadoop feature?

2012-02-29 Thread Edward Capriolo
Mike, Snappy is cool and all, but I was not overly impressed with it. GZ zips much better than Snappy. Last time I checked for our log files, gzip took them down from 100MB to 40MB, while snappy compressed them from 100MB to 55MB. That was only with sequence files. But still that is pretty significant
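For reference, a sketch of the block-compressed SequenceFile output setup those numbers would come from (old mapred API; MyJob is illustrative):

  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;

  JobConf conf = new JobConf(MyJob.class);
  conf.setOutputFormat(SequenceFileOutputFormat.class);
  FileOutputFormat.setCompressOutput(conf, true);
  // swap GzipCodec for SnappyCodec to reproduce the comparison
  FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
  SequenceFileOutputFormat.setOutputCompressionType(conf,
      SequenceFile.CompressionType.BLOCK);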

Re: LZO with sequenceFile

2012-02-26 Thread Edward Capriolo
On Sun, Feb 26, 2012 at 1:49 PM, Harsh J ha...@cloudera.com wrote: Hi Mohit, On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! Some questions I have is: 1. Would it work with sequence files? I am using SequenceFileAsTextInputStream Yes, you just need

Re: Writing small files to one big file in hdfs

2012-02-21 Thread Edward Capriolo
On Tue, Feb 21, 2012 at 7:50 PM, Mohit Anchlia mohitanch...@gmail.com wrote: It looks like in mapper values are coming as binary instead of Text. Is this expected from sequence file? I initially wrote SequenceFile with Text values. On Tue, Feb 21, 2012 at 4:13 PM, Mohit Anchlia

Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-17 Thread Edward Capriolo
I would almost agree with that perspective. But there is a problem with the 'java is slow' theory. The reason is that in a 100 percent write workload gc might be a factor. But in the real world people have to read data, and reads become disk bound as your data gets larger than memory. Unless C++ can make

Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-17 Thread Edward Capriolo
for HBase to load 1/2 trillion cells. That makes HBase 10X more expensive in terms of hardware, power consumption, and data center real estate. - Doug On Fri, Feb 17, 2012 at 3:58 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I would almost agree with that perspective. But there is a problem

Re: Addendum to Hypertable vs. HBase Performance Test (w/ mslab enabled)

2012-02-16 Thread Edward Capriolo
You ain't gotta like me, you just mad Cause I tell it how it is, and you tell it how it might be -Attributed to Puff Daddy Now apparently T. Lipcon On Mon, Feb 13, 2012 at 2:33 PM, Todd Lipcon t...@cloudera.com wrote: Hey Doug, Want to also run a comparison test with inter-cluster

Re: Brisk vs Cloudera Distribution

2012-02-08 Thread Edward Capriolo
Hadoop can work on a number of filesystems: hdfs, s3, local files. Brisk's file system is known as CFS. CFS stores all block and metadata in cassandra, and thus does not use a name node. Brisk fires up a jobtracker automatically as well. Brisk also has a hive metastore backed by cassandra so takes

Re: Checking Which Filesystem Being Used?

2012-02-07 Thread Edward Capriolo
On Tue, Feb 7, 2012 at 5:24 PM, Eli Finkelshteyn iefin...@gmail.com wrote: Hi Folks, This might be a stupid question, but I'm new to Java and Hadoop, so... Anyway, if I want to check what FileSystem is currently being used at some point (i.e. evaluating FileSystem.get(conf)), what would be

Re: jobtracker url(Critical)

2012-01-27 Thread Edward Capriolo
Task trackers sometimes do not clean up their mapred temp directories well; if that is the case the TT can spend many minutes deleting files on startup. I use find to delete files older than a couple of days. On Friday, January 27, 2012, hadoop hive hadooph...@gmail.com wrote: Hey Harsh, but
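A sketch of that kind of cleanup (the path and two-day retention are illustrative; point it at your own mapred.local.dir layout):

  # remove task-attempt litter older than two days
  find /data/mapred/local -name 'attempt_*' -mtime +2 -exec rm -rf {} \;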

Re: NameNode per-block memory usage?

2012-01-17 Thread Edward Capriolo
On Tue, Jan 17, 2012 at 10:08 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello, How much memory/JVM heap does NameNode use for each block? I've tried locating this in the FAQ and on search-hadoop.com, but couldn't find a ton of concrete numbers, just these two:

Re: hadoop filesystem cache

2012-01-16 Thread Edward Capriolo
The challenge of this design is that accessing the same data over and over again is the uncommon use case for hadoop. Hadoop's bread and butter is streaming through large datasets that do not fit in memory. Also, your shuffle-sort-spill is going to play havoc on any file-system-based

Re: desperate question about NameNode startup sequence

2011-12-17 Thread Edward Capriolo
The problem with the checkpoint/2NN is that it happily runs and has no outward indication that it is unable to connect. Because you have a large edits file your startup will complete, however with that size it could take hours. It logs nothing while this is going on but as long as the CPU is working

Re: Analysing Completed Job info programmatically apart from Jobtracker GUI

2011-12-14 Thread Edward Capriolo
I would check out hitune. I have a github project that connects to the JobTracker and stores counters, job times and other stats into Cassandra. https://github.com/edwardcapriolo/hadoop_cluster_profiler Worth checking out as discovering how to connect and mine information from the JobTracker was

Re: Matrix multiplication in Hadoop

2011-11-19 Thread Edward Capriolo
Sounds like a job for next-gen map reduce, native libraries, and GPUs. A modern-day Dr. Frankenstein for sure. On Saturday, November 19, 2011, Tim Broberg tim.brob...@exar.com wrote: Perhaps this is a good candidate for a native library, then? From: Mike

Re: Matrix multiplication in Hadoop

2011-11-18 Thread Edward Capriolo
A problem with matrix multiplication in hadoop is that hadoop is row oriented for the most part. I have thought about this use case however and you can theoretically turn a 2D matrix into a 1D matrix and then that fits into the row oriented nature of hadoop. Also being that the typical mapper can

Re: pointing mapred.local.dir to a ramdisk

2011-10-03 Thread Edward Capriolo
This directory can get very large; in many cases I doubt it would fit on a RAM disk. Also, RAM disks tend to help most with random read/write; since hadoop is doing mostly linear IO you may not see a great benefit from the RAM disk. On Mon, Oct 3, 2011 at 12:07 PM, Vinod Kumar Vavilapalli

Re: linux containers with Hadoop

2011-09-30 Thread Edward Capriolo
On Fri, Sep 30, 2011 at 9:03 AM, bikash sharma sharmabiks...@gmail.com wrote: Hi, Does anyone know if Linux containers (which are a kernel-supported virtualization technique for providing resource isolation across processes/applications) have ever been used with Hadoop to provide resource

Re: formatting hdfs without user interaction

2011-09-23 Thread Edward Capriolo
On Fri, Sep 23, 2011 at 11:52 AM, ivan.nov...@emc.com wrote: Hi Harsh, On 9/22/11 8:48 PM, Harsh J ha...@cloudera.com wrote: Ivan, Writing your own program was overkill. The 'yes' coreutil is pretty silly, but nifty at the same time. It accepts an argument, which it would repeat
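A sketch of the trick under discussion, for the common case of auto-answering the format prompt (the expected answer varies by version, so verify what yours asks for):

  # 'yes Y' prints Y forever; the format prompt consumes one of them
  yes Y | hadoop namenode -format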

Re: do HDFS files starting with _ (underscore) have special properties?

2011-09-02 Thread Edward Capriolo
On Fri, Sep 2, 2011 at 4:04 PM, Meng Mao meng...@gmail.com wrote: We have a compression utility that tries to grab all subdirs to a directory on HDFS. It makes a call like this: FileStatus[] subdirs = fs.globStatus(new Path(inputdir, *)); and handles files vs dirs accordingly. We tried to

Re: Help - Rack Topology Script - Hadoop 0.20 (CDH3u1)

2011-08-21 Thread Edward Capriolo
On Sun, Aug 21, 2011 at 10:22 AM, Joey Echeverria j...@cloudera.com wrote: Not that I know of. -Joey On Fri, Aug 19, 2011 at 1:16 PM, modemide modem...@gmail.com wrote: Ha, what a silly mistake. Thank you Joey. Do you also happen to know of an easier way to tell which racks the

Re: Why hadoop should be built on JAVA?

2011-08-16 Thread Edward Capriolo
This should explain it http://jz10.java.no/java-4-ever-trailer.html . On Tue, Aug 16, 2011 at 1:17 PM, Adi adi.pan...@gmail.com wrote: On Mon, Aug 15, 2011 at 9:00 PM, Chris Song sjh...@gmail.com wrote: Why hadoop should be built in JAVA? For integrity and stability, it is

Re: YCSB Benchmarking for HBase

2011-08-03 Thread Edward Capriolo
On Wed, Aug 3, 2011 at 6:10 AM, praveenesh kumar praveen...@gmail.com wrote: Hi, Anyone working on YCSB (Yahoo Cloud Service Benchmarking) for HBase? I am trying to run it, and it's giving me an error: $ java -cp build/ycsb.jar com.yahoo.ycsb.CommandLine -db com.yahoo.ycsb.db.HBaseClient YCSB

Re: One file per mapper

2011-07-06 Thread Edward Capriolo
On Tue, Jul 5, 2011 at 5:28 PM, Jim Falgout jim.falg...@pervasive.com wrote: I've done this before by placing the name of each file to process into a single file (newline separated) and using the NLineInputFormat class as the input format. Run your job with the single file with all of the file
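A sketch of that setup with the old mapred API (class name and path are illustrative):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.NLineInputFormat;

  JobConf conf = new JobConf(MyJob.class);
  conf.setInputFormat(NLineInputFormat.class);
  // one line of the listing file (i.e. one file name) per mapper
  conf.setInt("mapred.line.input.format.linespermap", 1);
  FileInputFormat.setInputPaths(conf, new Path("/lists/files-to-process.txt"));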

Re: Jobs are still in running state after executing hadoop job -kill jobId

2011-07-05 Thread Edward Capriolo
On Tue, Jul 5, 2011 at 10:05 AM, jeff.schm...@shell.com wrote: Um kill -9 pid ? -Original Message- From: Juwei Shi [mailto:shiju...@gmail.com] Sent: Friday, July 01, 2011 10:53 AM To: common-user@hadoop.apache.org; mapreduce-u...@hadoop.apache.org Subject: Jobs are still in

Re: Jobs are still in running state after executing hadoop job -kill jobId

2011-07-05 Thread Edward Capriolo
On Tue, Jul 5, 2011 at 11:45 AM, Juwei Shi shiju...@gmail.com wrote: We sometimes have hundreds of map or reduce tasks for a job. I think it is hard to find all of them and kill the corresponding jvm processes. If we do not want to restart hadoop, is there any automatic methods? 2011/7/5

Re: hadoop 0.20.203.0 Java Runtime Environment Error

2011-07-01 Thread Edward Capriolo
That looks like an ancient version of java. Get 1.6.0_u24 or 25 from oracle. Upgrade to a recent java and possibly update your c libs. Edward On Fri, Jul 1, 2011 at 7:24 PM, Shi Yu sh...@uchicago.edu wrote: I had difficulty upgrading applications from Hadoop 0.20.2 to Hadoop 0.20.203.0.

Re: extremely imbalance in the hdfs cluster

2011-06-29 Thread Edward Capriolo
We have run into this issue as well. Since hadoop round-robins its writes, different-size disks really screw things up royally, especially if you are running at high capacity. We have found that decommissioning hosts for stretches of time is more effective than the balancer in extreme situations. Another
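A sketch of that decommission cycle (file path is illustrative; it must match what dfs.hosts.exclude points at in hdfs-site.xml):

  # add the overfull node to the exclude file, then tell the NN
  echo full-node1.example.com >> /etc/hadoop/conf/dfs.exclude
  hadoop dfsadmin -refreshNodes
  # wait for its blocks to re-replicate elsewhere, then remove the
  # entry and refresh again to bring the (now empty) node back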

Re: Verbose screen logging on hadoop-0.20.203.0

2011-06-05 Thread Edward Capriolo
On Sun, Jun 5, 2011 at 1:04 PM, Shi Yu sh...@uchicago.edu wrote: We just upgraded from 0.20.2 to hadoop-0.20.203.0 Running the same code ends up with a massive amount of debug information on the screen output. Normally this type of information is written to the logs/userlogs directory. However,

Hadoop Filecrusher! V2 Released!

2011-06-01 Thread Edward Capriolo
All, You know the story: You have data files that are created every 5 minutes. You have hundreds of servers. You want to put those files in hadoop. Eventually: You get lots of files and blocks. Your namenode and secondary name node need more memory (BTW JVMs have issues at large Xmx values).

Re: Why don't my jobs get preempted?

2011-05-31 Thread Edward Capriolo
On Tue, May 31, 2011 at 2:50 PM, W.P. McNeill bill...@gmail.com wrote: I'm launching long-running tasks on a cluster running the Fair Scheduler. As I understand it, the Fair Scheduler is preemptive. What I expect to see is that my long-running jobs sometimes get killed to make room for other

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sat, May 21, 2011 at 4:13 PM, highpointe highpoint...@gmail.com wrote: Does this copy text bother anyone else? Sure winning any award is great but does hadoop want to be associated with innovation like WikiLeaks? [Only] through the free distribution of information, the guaranteed

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sun, May 22, 2011 at 7:29 PM, Todd Lipcon t...@cloudera.com wrote: C'mon guys -- while this is of course an interesting debate, can we please keep it off common-user? -Todd On Sun, May 22, 2011 at 3:30 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Sat, May 21, 2011 at 4:13 PM

Re: Hadoop and WikiLeaks

2011-05-22 Thread Edward Capriolo
On Sun, May 22, 2011 at 8:44 PM, Todd Lipcon t...@cloudera.com wrote: On Sun, May 22, 2011 at 5:10 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Correct. But it is a place to discuss changing the content of http://hadoop.apache.org which is what I am advocating. Fair enough

Re: Using df instead of du to calculate datanode space

2011-05-21 Thread Edward Capriolo
Good job. I brought this up in another thread, but was told it was not a problem. Good thing I'm not crazy. On Sat, May 21, 2011 at 12:42 AM, Joe Stein charmal...@allthingshadoop.com wrote: I came up with a nice little hack to trick hadoop into calculating disk usage with df instead of du
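The hack, as a minimal sketch (entirely illustrative; see Joe's write-up for the real version): a du shim placed ahead of the real du on the datanode's PATH, answering Hadoop's du -sk <dir> call with numbers taken from df:

  #!/bin/sh
  # fake du: answer 'du -sk <dir>' using the volume's df numbers
  DIR=$2
  USED_KB=$(df -k "$DIR" | tail -1 | awk '{print $3}')
  printf '%s\t%s\n' "$USED_KB" "$DIR"

This is only sound when each data directory owns its whole volume, which is exactly the case where the du walk wastes I/O.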

Re: Hadoop and WikiLeaks

2011-05-19 Thread Edward Capriolo
On Thu, May 19, 2011 at 11:54 AM, Ted Dunning tdunn...@maprtech.com wrote: ZK started as sub-project of Hadoop. On Thu, May 19, 2011 at 7:27 AM, M. C. Srivas mcsri...@gmail.com wrote: Interesting to note that Cassandra and ZK are now considered Hadoop projects. There were independent

Hadoop and WikiLeaks

2011-05-18 Thread Edward Capriolo
http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation Awards The Hadoop project won the innovator of the year award from the UK's Guardian newspaper, where it was described as having the potential as a greater

Re: Memory mapped resources

2011-04-11 Thread Edward Capriolo
On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Yes you can however it will require customization of HDFS.  Take a look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch.  I have been altering it for use with HBASE-3529.  Note that the patch

Re: How is hadoop going to handle the next generation disks?

2011-04-08 Thread Edward Capriolo
always cached so calculating the 'du -sk' on a host even with hundreds of thousands of files the du -sk generally uses high i/o for a couple of seconds. I am using 2TB disks too.  Sridhar On Fri, Apr 8, 2011 at 12:15 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I have a 0.20.2 cluster. I

How is hadoop going to handle the next generation disks?

2011-04-07 Thread Edward Capriolo
I have a 0.20.2 cluster. I notice that our nodes with 2 TB disks waste tons of disk io doing a 'du -sk' of each data directory. Instead of 'du -sk' why not just do this with java.io.File? How is this going to work with 4TB, 8TB disks and up? It seems like calculating used and free disk space could
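A sketch of the java.io.File alternative being suggested (Java 6+; path illustrative). Note these calls report per-volume, not per-directory, numbers, so they are only a drop-in replacement when the data dir owns the whole volume:

  import java.io.File;

  File dataDir = new File("/data/dfs/data");
  long total  = dataDir.getTotalSpace();   // volume size, bytes
  long usable = dataDir.getUsableSpace();  // free bytes available to us
  long used   = total - usable;            // no directory walk required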

Re: Is anyone running Hadoop 0.21.0 on Solaris 10 X64?

2011-03-31 Thread Edward Capriolo
On Thu, Mar 31, 2011 at 10:43 AM, XiaoboGu guxiaobo1...@gmail.com wrote: I have trouble browsing the file system via the namenode web interface, the namenode saying in the log file that the -G option is invalid for getting the groups for the user. I thought this was not the case any more but hadoop forks to

Re: check if a sequenceFile is corrupted

2011-03-17 Thread Edward Capriolo
On Thursday, March 17, 2011, Marc Sturlese marc.sturl...@gmail.com wrote: Is there any way to check if a seqfile is corrupted without iterate over all its keys/values till it crashes? I've seen that I can get an IOException when opening it or an IOException reading the X key/value (depending

Re: how to get rid of attempt_201101170925_****_m_**** directories safely?

2011-03-17 Thread Edward Capriolo
On Thu, Mar 17, 2011 at 1:20 PM, jigar shah js...@pandora.com wrote: Hi, we are running a 50 node hadoop cluster and have a problem with these attempt directories piling up (for eg attempt_201101170925_126956_m_000232_0) and taking a lot of space. When I restart the tasktracker daemon these

Re: Anyone knows how to attach a figure on Hadoop Wiki page?

2011-03-14 Thread Edward Capriolo
On Mon, Mar 14, 2011 at 1:23 PM, He Chen airb...@gmail.com wrote: Hi all Any suggestions? Bests Chen Images have been banned.

Re: Reason of Formatting Namenode

2011-03-10 Thread Edward Capriolo
On Thu, Mar 10, 2011 at 12:48 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Thanks Harsh, i.e. why, if we format the namenode again after loading some data, the INCOMPATIBLE NAMESPACE IDs error occurs. Best Regards, Adarsh Sharma Harsh J wrote: Formatting the NameNode initializes the

Re: Hadoop and image processing?

2011-03-03 Thread Edward Capriolo
On Thu, Mar 3, 2011 at 10:00 AM, Tom Deutsch tdeut...@us.ibm.com wrote: Along with Brian I'd also suggest it depends on what you are doing with the images, but we used Hadoop specifically for this purpose in several solutions we build to do advanced imaging processing. Both scale out ability

Re: recommendation on HDDs

2011-02-12 Thread Edward Capriolo
On Fri, Feb 11, 2011 at 7:14 PM, Ted Dunning tdunn...@maprtech.com wrote: Bandwidth is definitely better with more active spindles. I would recommend several larger disks. The cost is very nearly the same. On Fri, Feb 11, 2011 at 3:52 PM, Shrinivas Joshi jshrini...@gmail.com wrote: Thanks

Re: Hadoop is for whom? Data architect or Java Architect or All

2011-01-27 Thread Edward Capriolo
On Thu, Jan 27, 2011 at 5:42 AM, Steve Loughran ste...@apache.org wrote: On 27/01/11 07:28, Manuel Meßner wrote: Hi, you may want to take a look into the streaming api, which allows users to write their map-reduce jobs in any language that is capable of writing to stdout and reading

Re: How to get metrics information?

2011-01-23 Thread Edward Capriolo
On Sat, Jan 22, 2011 at 9:59 PM, Ted Yu yuzhih...@gmail.com wrote: In the test code, JobTracker is returned from:        mr = new MiniMRCluster(0, 0, 0, file:///, 1, null, null, null, conf);        jobTracker = mr.getJobTrackerRunner().getJobTracker(); I guess it is not exposed in non-test

Re: Hive rc location

2011-01-21 Thread Edward Capriolo
On Fri, Jan 21, 2011 at 9:56 AM, abhatna...@vantage.com abhatna...@vantage.com wrote: Where is this file located? Also, does anyone have a sample? -- View this message in context: http://lucene.472066.n3.nabble.com/Hive-rc-tp2296028p2302262.html Sent from the Hadoop lucene-users mailing list

Re: Why Hadoop is slow in Cloud

2011-01-19 Thread Edward Capriolo
On Wed, Jan 19, 2011 at 1:32 PM, Marc Farnum Rendino mvg...@gmail.com wrote: On Tue, Jan 18, 2011 at 8:59 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote: I want to know *AT WHAT COSTS  *it comes. 10-15% is tolerable but at this rate, it needs some work. As Steve rightly suggest , I am in

Re: No locks available

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 8:13 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Harsh J wrote: Could you re-check your permissions on the $(dfs.data.dir)s for your failing DataNode versus the user that runs it? On Mon, Jan 17, 2011 at 6:33 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

Re: Why Hadoop is slow in Cloud

2011-01-17 Thread Edward Capriolo
On Mon, Jan 17, 2011 at 6:08 AM, Steve Loughran ste...@apache.org wrote: On 17/01/11 04:11, Adarsh Sharma wrote: Dear all, Yesterday I performed a kind of testing between *Hadoop in Standalone Servers* and *Hadoop in Cloud*. I established a Hadoop cluster of 4 nodes (standalone machines) in

Re: new mapreduce API and NLineInputFormat

2011-01-14 Thread Edward Capriolo
On Fri, Jan 14, 2011 at 5:05 PM, Attila Csordas attilacsor...@gmail.com wrote: Hi, what other jars should be added to the build path from 0.21.0 besides hadoop-common-0.21.0.jar in order to make 0.21.0 NLineInputFormat work in 0.20.2 as suggested below? Generally can somebody provide me a

Re: Topology : Script Based Mapping

2010-12-29 Thread Edward Capriolo
On Tue, Dec 28, 2010 at 11:36 PM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, On Tue, Dec 28, 2010 at 6:03 PM, Rajgopal Vaithiyanathan raja.f...@gmail.com wrote: I wrote a script to map the IPs to a rack. The script is as follows: for i in $* ; do topo=`echo $i | cut -d.

Re: HDFS and libhfds

2010-12-07 Thread Edward Capriolo
2010/12/7 Petrucci Andreas petrucci_2...@hotmail.com: hello there, I'm trying to compile libhdfs but there are some problems. According to http://wiki.apache.org/hadoop/MountableHDFS I have already installed fuse. With ant compile-c++-libhdfs -Dlibhdfs=1 the build is successful.

Re: small files and number of mappers

2010-11-30 Thread Edward Capriolo
On Tue, Nov 30, 2010 at 3:21 AM, Harsh J qwertyman...@gmail.com wrote: Hey, On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese marc.sturl...@gmail.com wrote: Hey there, I am doing some tests and wandering which are the best practices to deal with very small files which are continuously being

Re: 0.21 found interface but class was expected

2010-11-13 Thread Edward Capriolo
On Sat, Nov 13, 2010 at 9:50 PM, Todd Lipcon t...@cloudera.com wrote: We do have policies against breaking APIs between consecutive major versions except for very rare exceptions (eg UnixUserGroupInformation went away when security was added). We do *not* have any current policies that

Re: Caution using Hadoop 0.21

2010-11-13 Thread Edward Capriolo
On Sat, Nov 13, 2010 at 4:33 PM, Shi Yu sh...@uchicago.edu wrote: I agree with Steve. That's why I am still using 0.19.2 in my production. Shi On 2010-11-13 12:36, Steve Lewis wrote: Our group made a very poorly considered decision to build out cluster using Hadoop 0.21 We discovered

Re: hd fs -head?

2010-09-27 Thread Edward Capriolo
On Mon, Sep 27, 2010 at 3:23 AM, Keith Wiley kwi...@keithwiley.com wrote: Is there a particularly good reason for why the hadoop fs command supports -cat and -tail, but not -head? Keith Wiley    

Re: A new way to merge up those small files!

2010-09-27 Thread Edward Capriolo
number so that it can attempt to *detect* the type of the file. Cheers On Fri, Sep 24, 2010 at 11:41 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Many times a hadoop job produces a file per reducer and the job has many reducers. Or a map only job one output file per input file and you

Re: hd fs -head?

2010-09-27 Thread Edward Capriolo
On Mon, Sep 27, 2010 at 11:13 AM, Keith Wiley kwi...@keithwiley.com wrote: On 2010, Sep 27, at 7:02 AM, Edward Capriolo wrote: On Mon, Sep 27, 2010 at 3:23 AM, Keith Wiley kwi...@keithwiley.com wrote: Is there a particularly good reason for why the hadoop fs command supports -cat and -tail

A new way to merge up those small files!

2010-09-25 Thread Edward Capriolo
Many times a hadoop job produces a file per reducer and the job has many reducers. Or a map-only job produces one output file per input file and you have many input files. Or you just have many small files from some external process. Hadoop has sub-optimal handling of small files. There are some ways to

Re: How to disable secondary node

2010-09-09 Thread Edward Capriolo
It is a bad idea to permanently disable the 2NN. The edits file grows very, very large and will not be processed until the name node restarts. We had a 12GB edits file that took 40 minutes of downtime to process. On Thu, Sep 9, 2010 at 3:08 AM, Jeff Zhang zjf...@gmail.com wrote: then, do not start

Re: SequenceFile Header

2010-09-08 Thread Edward Capriolo
On Wed, Sep 8, 2010 at 1:06 PM, Matthew John tmatthewjohn1...@gmail.com wrote: Hi guys, I'm trying to run a sort on a metafile which has records consisting of a key (8 bytes) and a value (32 bytes). Sort will be with respect to the key. But my input file does not have a header. So in order to avail

Re: Re: namenode consume quite a lot of memory with only serveral hundredsof files in it

2010-09-07 Thread Edward Capriolo
The fact that the memory is high is not necessarily a bad thing. Faster garbage collection implies more CPU usage. I had some success following the tuning advice here, to make my memory usage less spikey http://blog.mikiobraun.de/2010/08/cassandra-gc-tuning.html Again, less spikes != better

Re: Why does Generic Options Parser only take the first -D option?

2010-09-03 Thread Edward Capriolo
) { String[] keyval = prop.split("=", 2); if (keyval.length == 2) { conf.set(keyval[0], keyval[1]); } } } You can add a log after the conf.set line to verify that all -D options are returned. On Thu, Sep 2, 2010 at 10:09 AM, Edward Capriolo edlinuxg

Why does Generic Options Parser only take the first -D option?

2010-09-02 Thread Edward Capriolo
This is 0.20.0. I have an eclipse run configuration passing these as arguments: -D hive2rdbms.jdbc.driver=com.mysql.jdbc.Driver -D hive2rdbms.connection.url=jdbc:mysql://localhost:3306/test -D hive2rdbms.data.query=SELECT id,name FROM name WHERE $CONDITIONS -D hive2rdbms.bounding.query=SELECT
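Worth noting: generic -D options are only honored when the program runs through GenericOptionsParser, normally via ToolRunner. A minimal sketch (class name illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class Hive2Rdbms extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      Configuration conf = getConf(); // -D hive2rdbms.* already applied here
      System.out.println(conf.get("hive2rdbms.jdbc.driver"));
      return 0;
    }
    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new Hive2Rdbms(), args));
    }
  }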

Re: accounts permission on hadoop

2010-08-31 Thread Edward Capriolo
On Tue, Aug 31, 2010 at 5:07 PM, Gang Luo lgpub...@yahoo.com.cn wrote: Hi all, I am the administrator of a hadoop cluster. I want to know how to specify a group a user belong to. Or hadoop just use the group/user information from the linux system it runs on? For example, if a user 'smith'

DataDrivenInputFormat setInput with boundingQuery

2010-08-31 Thread Edward Capriolo
I am working with DataDrivenDBInputFormat from trunk. None of the unit tests seem to test the bounding queries. Configuration conf = new Configuration(); Job job = new Job(conf); job.setJarByClass(TestZ.class);
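A sketch of the bounding-query wiring being tested, continuing the snippet above (NameRecord is an illustrative DBWritable; the table and column names come from the -D options in the earlier message):

  import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
  import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

  DBConfiguration.configureDB(job.getConfiguration(),
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/test");
  DataDrivenDBInputFormat.setInput(job, NameRecord.class,
      "SELECT id, name FROM name WHERE $CONDITIONS",   // data query
      "SELECT MIN(id), MAX(id) FROM name");            // bounding query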
