Text file character encoding
Hello, I'm using Hadoop 0.17.0 to analyze a large number of CSV files, and I need to read files in character encodings other than UTF-8, but I think TextInputFormat doesn't support such encodings. I guess the LineRecordReader class or the Text class should support an encoding setting like this: conf.set("io.file.defaultEncoding", "MS932"); Is there any plan to support different character encodings in TextInputFormat? Regards, -- NOMURA Yoshihide: Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan Tel: 044-754-2675 (Ext: 7112-6358) Fax: 044-754-2570 (Ext: 7112-3834) E-Mail: [EMAIL PROTECTED]
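[A common workaround sketch in the meantime, not a TextInputFormat feature: since the record reader hands the mapper the raw line bytes inside a Text object, you can decode those bytes yourself with the charset you need. The MS932 name is taken from the message above; the class below is purely illustrative.]

```java
import java.nio.charset.Charset;

public class DecodeLine {
    // Workaround sketch: reinterpret a line's raw bytes as MS932/Shift_JIS
    // instead of UTF-8 before processing. In a mapper you would pass
    // value.getBytes() and value.getLength() from the Text key/value.
    public static String decode(byte[] raw, int length) {
        return new String(raw, 0, length, Charset.forName("MS932"));
    }

    public static void main(String[] args) {
        byte[] sjis = {(byte) 0x93, (byte) 0xFA}; // the character U+65E5 in Shift_JIS
        System.out.println(decode(sjis, sjis.length).equals("\u65E5"));
    }
}
```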
Using hadoop to store large backups
I'm starting to use Hadoop as a simple storage pool to store backups of large things (currently Oracle database backups). My Hadoop usage is at a pretty primitive level so far and I am really only scratching the surface of what it can do. I haven't used map/reduce at all -- so far it's just been big file goes in, big file comes out. My first concern is to increase throughput so that I can put files in faster. Right now I'm doing the equivalent of

tar -cf - /volume1/data | hadoop fs -put - /backups/vol1.tar

and I've got 4 of those running in parallel. Throughput is acceptable, but not great: I am able to utilize about 0.5 Gbit of my 10G network connection. Here's what I've observed so far:

* hadoop fs -put processes have a large virtual size (1000m) but only use a small amount of resident RAM (around 50M). This is smaller than the configured chunk size (64m).
* The tar processes also have a much larger virtual size (50m) than resident size (800k). Not sure what that means; maybe I could force tar's output and hadoop's input to use larger buffers?
* I don't think CPU is the constraint: there are 4 CPUs and they are 60% idle.
* The network is not really breaking a sweat; I have a 10G connection on the sender and the 16 hadoop nodes have 1G connections for receiving.
* I don't think disk IO is limiting me either; a quick test sending cat /dev/zero to hadoop fs -put instead of tar gives about the same throughput.

I haven't changed much over the fresh-install defaults, so I'm hoping some tuning will help boost the throughput. I'll probably be trying various things, but some pointers on where to look first would be welcome. As I get further down this road, I can imagine I'd like to do some cooler things with my cluster than just tar in, tar out.
Eventually I would like to do things like:

* dump uncompressed data in and have hadoop compress the stream later using its nodes, maybe depending on age
* record a checksum/md5/sha1 on the original files, then later ask map/reduce to give me a file list with checksums from the saved file
* regularly read the files back again to make sure the checksums still match

If I'm understanding correctly, these tasks are probably difficult or impossible using tar and gzip, but would probably be quite easy using a padded/splittable archive format (e.g. zip?) and chunkable bzip compression. Has this been done before, and/or is anyone working on something similar?

Right now I'm only using tar because the directory I'm reading from has symlinks in it; otherwise I would probably just crawl the directory and put one file into hadoop for each file on the source. This would probably take longer to write, though maybe not much. Maybe I'll switch later to putting files in as-is (along with a tarfile of symlinks, device special files, or whatever else tar handles and hadoop doesn't). At least then I can do stuff like compress and checksum in distributed fashion. In this case the files aren't small, but I might have another application later to store backups of JPG files, and in that case I would definitely need some sort of many-to-one step that sequences many input files into one output file. Though this second part of the message is less important right now, I'd be interested in any comments/feedback here, so I can start to read up and get an idea of what's ahead of me as I head down this path. Thanks gregc
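[For the checksum idea above, the per-file digest itself needs nothing Hadoop-specific; a minimal stand-alone sketch using only java.security (how to wire it into a map task is left open, and the class name here is made up):]

```java
import java.security.MessageDigest;

public class StreamChecksum {
    // Sketch: compute a SHA-1 digest over a file's bytes, as one might record
    // per-file before archiving and later re-verify with a map/reduce job.
    public static String sha1Hex(byte[] data) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Well-known SHA-1 of the ASCII string "hello"
        System.out.println(sha1Hex("hello".getBytes("UTF-8")));
    }
}
```

For large files you would feed the digest incrementally with md.update(buffer, 0, n) inside the read loop rather than holding the whole file in memory.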
MetricsIntValue/MetricsLongValue publish once
Hi, The javadoc for MetricsIntValue and MetricsLongValue says: "Each time its value is set, it is published only *once* at the next update call." Looking at those classes, it is true that they push the data into the MetricsRecord only once, but digging deeper into AbstractMetricsContext, the MetricsRecord data is published on every update, because it is not cleared once it is published. (I think it is normal for it to be cleared, because MetricsRecord only records instructions for changing the bufferedData of the MetricsContext; maybe I'm wrong, but that's what the code does.) Ion
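[To make the intended "publish once" semantics concrete, here is a tiny illustrative class mirroring what the javadoc describes; it is not the Hadoop implementation, just a sketch of the contract being discussed:]

```java
public class PublishOnceValue {
    // Publish-once semantics: the value is pushed only on the first update
    // call after it was set; subsequent updates push nothing until the next set.
    private int value;
    private boolean changed;

    public synchronized void set(int v) {
        value = v;
        changed = true;
    }

    // Returns the value if it should be published this round, else null.
    public synchronized Integer push() {
        Integer out = changed ? value : null;
        changed = false;
        return out;
    }

    public static void main(String[] args) {
        PublishOnceValue v = new PublishOnceValue();
        v.set(7);
        System.out.println(v.push()); // 7
        System.out.println(v.push()); // null: already published once
    }
}
```

Ion's point is that even with this guard at the value level, the context below it can still re-apply stale data if its own pending-updates map is never cleared.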
DataNode often self-stopped
Hi, I am simulating a 4-DataNode environment using VMWare. I found some data nodes often self-stopped after receiving a large file (or block). In fact, not so large; it is just smaller than 10MB. These are the error messages:

2008-05-27 16:40:54,727 INFO org.apache.hadoop.dfs.DataNode: Received block blk_3604066788791074317 of size 16777216 from /192.168.10.4
2008-05-27 16:40:54,727 INFO org.apache.hadoop.dfs.DataNode: PacketResponder 0 for block blk_3604066788791074317 terminating
2008-05-27 16:40:54,743 WARN org.apache.hadoop.dfs.DataNode: DataNode is shutting down: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.UnregisteredDatanodeException: Data node 192.168.10.7:50010 is attempting to report storage ID DS-1812686469-192.168.10.5-50010-1211793342121. Node 192.168.10.6:50010 is expected to serve this storage.
    at org.apache.hadoop.dfs.FSNamesystem.getDatanode(FSNamesystem.java:3594)
    at org.apache.hadoop.dfs.FSNamesystem.blockReceived(FSNamesystem.java:3102)
    at org.apache.hadoop.dfs.NameNode.blockReceived(NameNode.java:625)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
    at org.apache.hadoop.ipc.Client.call(Client.java:557)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
    at org.apache.hadoop.dfs.$Proxy4.blockReceived(Unknown Source)
    at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:652)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2667)
    at java.lang.Thread.run(Thread.java:619)
2008-05-27 16:40:54,745 INFO org.mortbay.util.ThreadedServer: Stopping Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50075]
2008-05-27 16:40:54,753 INFO org.mortbay.http.SocketListener:
Stopped SocketListener on 0.0.0.0:50075
2008-05-27 16:40:54,914 INFO org.mortbay.util.Container: Stopped HttpContext[/static,/static]
2008-05-27 16:40:54,950 INFO org.mortbay.util.Container: Stopped HttpContext[/logs,/logs]
2008-05-27 16:40:54,951 INFO org.mortbay.util.Container: Stopped [EMAIL PROTECTED]
2008-05-27 16:40:55,044 INFO org.mortbay.util.Container: Stopped WebApplicationContext[/,/]
2008-05-27 16:40:55,044 INFO org.mortbay.util.Container: Stopped [EMAIL PROTECTED]
2008-05-27 16:40:55,044 INFO org.apache.hadoop.dfs.DataNode: Waiting for threa
2008-05-27 16:40:55,046 INFO org.apache.hadoop.dfs.DataNode: 192.168.10.7:5001
2008-05-27 16:40:55,284 INFO org.apache.hadoop.dfs.DataBlockScanner: Exiting D
2008-05-27 16:40:56,047 INFO org.apache.hadoop.dfs.DataNode: Waiting for threa
2008-05-27 16:40:56,050 INFO org.apache.hadoop.dfs.DataNode: 192.168.10.7:5001
2008-05-27 16:40:56,113 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:

What is the problem?
Re: Realtime Map Reduce = Supercomputing for the Masses?
Christophe Taton wrote: Actually Hadoop could be made more friendly to such realtime Map/Reduce jobs. For instance, we could consider running all tasks inside the task tracker jvm as separate threads, which could be implemented as another personality of the TaskRunner. I have been looking into this a couple of weeks ago... Would you be interested in such a feature?

Why does that have benefits? So that you can share stuff via local data structures? Because you'd better be sharing classloaders if you are going to play that game. And that is very hard to get right (to the extent that I don't think any apache project other than Felix does it well)
Matrix Multiplication Problem
I downloaded the Matrix Multiplication code from: http://code.google.com/p/hama/source/browse/trunk/src/java/org/apache/hama/ but I do not know how to run it in the right way. Could you please give the steps to run the code? -- View this message in context: http://www.nabble.com/Matrix-Multiplication-Problem-tp17599862p17599862.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Realtime Map Reduce = Supercomputing for the Masses?
Hi Steve, On Mon, Jun 2, 2008 at 12:23 PM, Steve Loughran [EMAIL PROTECTED] wrote: Christophe Taton wrote: Actually Hadoop could be made more friendly to such realtime Map/Reduce jobs. For instance, we could consider running all tasks inside the task tracker jvm as separate threads, which could be implemented as another personality of the TaskRunner. I have been looking into this a couple of weeks ago... Would you be interested in such a feature? Why does that have benefits? So that you can share stuff via local data structures? Because you'd better be sharing classloaders if you are going to play that game. And that is very hard to get right (to the extent that I dont think any apache project other than Felix does it well)

The most obvious improvement to my mind concerns the memory footprint of the infrastructure. Running jobs leads to at least 3 JVMs per machine (the data node, the task tracker and the task), even if you forgo parallelism and accept running only one task per node at a time. This is problematic if you have machines with low memory capacity. That said, I agree with your concerns about classloading. I have actually been thinking that we might try to rely on OSGi to do the job, and package hadoop daemons, jobs and tasks as OSGi bundles and services; but I faced many tricky issues in doing that (the last one being the resolution of configuration files by the classloaders). To my mind, one short-term and minimal way of achieving this would be to use a URLClassLoader in conjunction with the hdfs URLStreamHandler, to let the task tracker run tasks directly... Cheers, Christophe T.
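[For reference, the short-term idea Christophe sketches might look roughly like this. This is a hedged sketch only: a file:// URL stands in for the hdfs:// URL, which would require registering the custom URLStreamHandler he mentions; the class and method names are made up.]

```java
import java.net.URL;
import java.net.URLClassLoader;

public class TaskLoaderSketch {
    // Give each task its own URLClassLoader over the job jar, so the task
    // tracker could run task code in-process instead of forking a JVM.
    public static ClassLoader forJob(URL jobJar, ClassLoader parent) {
        return new URLClassLoader(new URL[] { jobJar }, parent);
    }

    public static void main(String[] args) throws Exception {
        ClassLoader cl = forJob(new URL("file:///tmp/job.jar"),
                                TaskLoaderSketch.class.getClassLoader());
        // Classes not found in the job jar delegate to the parent loader,
        // so core classes resolve normally.
        System.out.println(cl.loadClass("java.lang.String") == String.class);
    }
}
```

The hard parts Steve and Alejandro raise (unloading the loader after the task, native libraries, security managers) all sit on top of this simple construction.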
Re: Realtime Map Reduce = Supercomputing for the Masses?
Yes you would have to do it with classloaders (not 'hello world' but not 'rocket science' either). You'll be limited in using native libraries: even if you use classloaders properly, native libs can be loaded only once. You will have to ensure you get rid of the task classloader once the task is over (thus removing all singleton stuff that may be in it). You will have to put in place a security manager for the code running out of the task classloader. You'll end up doing something similar to the servlet containers' webapp classloading model, with the extra burden of hot-loading for each task run. Which in the end may have an overhead similar to bootstrapping a JVM for the task; this should be measured to see what the time delta is and whether it is worth the effort. A

On Mon, Jun 2, 2008 at 3:53 PM, Steve Loughran [EMAIL PROTECTED] wrote: Christophe Taton wrote: Actually Hadoop could be made more friendly to such realtime Map/Reduce jobs. For instance, we could consider running all tasks inside the task tracker jvm as separate threads, which could be implemented as another personality of the TaskRunner. I have been looking into this a couple of weeks ago... Would you be interested in such a feature? Why does that have benefits? So that you can share stuff via local data structures? Because you'd better be sharing classloaders if you are going to play that game. And that is very hard to get right (to the extent that I dont think any apache project other than Felix does it well)
Re: About Metrics update
In MetricsIntValue, incrMetric() was being called in pushMetric() instead of setMetric(). This used to cause the values to be incremented periodically. Thanks, Lohit

- Original Message From: Ion Badita [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Saturday, May 31, 2008 4:11:00 AM Subject: Re: About Metrics update Hi, I'm using version 0.17. Do you know how it was fixed? Thanks Ion

lohit wrote: Hi Ion, Which version of Hadoop are you using? The problem you reported about safeModeTime and fsImageLoadTime growing was fixed in 0.18 (or trunk). Thanks, Lohit

- Original Message From: Ion Badita [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Friday, May 30, 2008 8:10:52 AM Subject: Re: About Metrics update Hi, I found (because of the Metrics behavior reported in the previous e-mail) some errors in the metrics reported by NameNodeMetrics: safeModeTime and fsImageLoadTime keep growing (they should stay the same over time). The mentioned metrics use MetricsIntValue for the values; in MetricsIntValue.pushMetric(), if the changed field is marked true then the value is published to the MetricsRecord, else the method does nothing.

public synchronized void pushMetric(final MetricsRecord mr) {
  if (changed)
    mr.incrMetric(name, value);
  changed = false;
}

The problem is in the AbstractMetricsContext.update() method, because the metricUpdates are not cleared after being merged into the record's internal data. Ion

Ion Badita wrote: Hi, I looked over the class org.apache.hadoop.metrics.spi.AbstractMetricsContext and I have a question: why, in the update(MetricsRecordImpl record) method, is the metricUpdates Map not cleared after the updates are merged into metricMap? Because of this, on every update() old increments are merged into metricMap again. Is this the right behavior? If I want to increment only one metric in the record, with the current implementation this is not possible without also modifying other metrics that are incremented rarely. Thanks Ion
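[The fix Ion describes can be illustrated with a toy version of the update path. This is a sketch, not the actual AbstractMetricsContext code: the field and method names below are simplified stand-ins, and the one-line clear() is the point of the example.]

```java
import java.util.HashMap;
import java.util.Map;

public class MetricsUpdateSketch {
    private final Map<String, Integer> metricMap = new HashMap<>();     // accumulated values
    private final Map<String, Integer> metricUpdates = new HashMap<>(); // pending increments

    public void incrMetric(String name, int delta) {
        metricUpdates.merge(name, delta, Integer::sum);
    }

    public void update() {
        // Merge pending increments into the accumulated map...
        for (Map.Entry<String, Integer> e : metricUpdates.entrySet()) {
            metricMap.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        // ...then clear them. Without this line, every later update() would
        // re-apply the same increments, which is exactly how safeModeTime
        // could appear to keep growing.
        metricUpdates.clear();
    }

    public int get(String name) {
        return metricMap.getOrDefault(name, 0);
    }

    public static void main(String[] args) {
        MetricsUpdateSketch ctx = new MetricsUpdateSketch();
        ctx.incrMetric("safeModeTime", 40);
        ctx.update();
        ctx.update(); // second update must not re-apply the increment
        System.out.println(ctx.get("safeModeTime")); // 40
    }
}
```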
Re: Realtime Map Reduce = Supercomputing for the Masses?
Alejandro Abdelnur wrote: Yes you would have to do it with classloaders (not 'hello world' but not 'rocket science' either).

That's where we differ. I do actually think that classloaders are incredibly hard to get right, and I say that as someone who has single-stepped through the Axis2 code in terror, and helped soak-test Ant so that it doesn't leak even over extended builds; but even there we draw the line at saying "this lets you run forever". In fact, I think apache should only allow people to write classloader code who have passed some special malicious-classloader competence test devised by all the teams.

You'll be limited in using native libraries: even if you use classloaders properly, native libs can be loaded only once.

Plus there's that mess called endorsed libraries, and you have to worry about native library leakage.

You will have to ensure you get rid of the task classloader once the task is over (thus removing all singleton stuff that may be in it).

Which you can only do by getting rid of every single instance of every single class... very, very hard to do.

You will have to put in place a security manager for the code running out of the task classloader. You'll end up doing something similar to the servlet containers' webapp classloading model, with the extra burden of hot-loading for each task run. Which in the end may have an overhead similar to bootstrapping a JVM for the task; this should be measured to see whether it is worth the effort.

It really comes down to start time and total memory footprint. If you can afford the startup delay and the memory, then separate processes give you the best isolation and robustness. FWIW, in SmartFrog, we let you deploy components (as of last week, hadoop server components) in their own processes, by declaring the process name to use.
These are pooled; you can deploy lots of things into a single process, and once they are all terminated, that process halts itself and all is well... there's a single root process on every box that can take others. Bringing up a nearly-empty child process is good for waiting for a faster deployment of other stuff. One issue here is always JVM options: should child processes have different parameters (like max heap size) from the root process? For long-lived apps, deploying things into child processes is the most robust; it keeps the root process lighter. Deploying into the root process is better for debugging (you just start that one process), and for simplifying liveness. When the root process dies, everything inside is guaranteed 100% dead, but the child processes have to wait to discover they have lost their links to a parent on that process (if they care about such things) and start timing out themselves. -steve
Re: hadoop on EC2
if you use the new scripts in 0.17.0, just run

hadoop-ec2 proxy cluster-name

this starts an SSH tunnel to your cluster. Installing FoxyProxy in Firefox then gives you whole-cluster visibility. Obviously this isn't the best solution if you need to let many semi-trusted users browse your cluster.

On May 28, 2008, at 1:22 PM, Andreas Kostyrka wrote: Hi! I just wondered what other people use to access the hadoop webservers when running on EC2? Ideas that I had: 1.) opening ports 50030 and so on = not good, data goes unprotected over the internet. Even if I could enable some form of authentication, it would still be plain HTTP. 2.) Some kind of tunneling solution. The problem on this side is that each of my cluster nodes is in a different subnet, plus the dualism between the internal and external addresses of the nodes. Any hints? TIA, Andreas

Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: hadoop on EC2
obviously this isn't the best solution if you need to let many semi trusted users browse your cluster. Actually, it would be much more secure if the tunnel service ran on a trusted server letting your users connect remotely via SOCKS and then browse the cluster. These users wouldn't need any AWS keys etc. Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
RE: Stack Overflow When Running Job
Hi, do you have a testcase that we can run to reproduce this? Thanks! -Original Message- From: jkupferman [mailto:[EMAIL PROTECTED]] Sent: Monday, June 02, 2008 9:22 AM To: core-user@hadoop.apache.org Subject: Stack Overflow When Running Job

Hi everyone, I have a job running that keeps failing with stack overflows and I really don't see how that is happening. The job runs for about 20-30 minutes before one task errors, then a few more error and it fails. I am running hadoop-17 and I've tried lowering these settings to no avail: io.sort.factor 50, io.seqfile.sorter.recordlimit 50

java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:594)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:576)
    at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
    at Group.write(Group.java:68)
    at GroupPair.write(GroupPair.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:434)
    at MyMapper.map(MyMapper.java:27)
    at MyMapper.map(MyMapper.java:10)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
Caused by: java.lang.StackOverflowError
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at Group.readFields(Group.java:62)
    at GroupPair.readFields(GroupPair.java:60)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:91)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
    at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
    (the above line repeated 200x)

I defined a WritableComparable called GroupPair which simply holds two Group objects, each of which contains two integers. I fail to see how QuickSort could recurse 200+ times, since that would require an insanely large number of entries, far more than the 500 million that had been output at that point. How is this even possible? And what can be done to fix this? -- View this message in context: http://www.nabble.com/Stack-Overflow-When-Running-Job-tp17593594p17593594.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
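[On the "how can QuickSort recurse 200+ times" question: recursion depth is only about log2(n) when partitions stay balanced. Skewed input, such as long runs of equal keys, can push a naive quicksort toward linear depth, so deep recursion does not require 2^200 entries. A toy demonstration (illustrative only; this is not Hadoop's QuickSort, and whether this is the poster's actual cause is an open question):]

```java
import java.util.Arrays;

public class QuickSortDepth {
    // A naive Lomuto-partition quicksort instrumented to record its maximum
    // recursion depth. On all-equal keys the partition is maximally skewed,
    // so depth grows linearly with n instead of logarithmically.
    static int maxDepth = 0;

    static void sort(int[] a, int lo, int hi, int depth) {
        if (lo >= hi) return;
        maxDepth = Math.max(maxDepth, depth);
        int p = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < p) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        sort(a, lo, i - 1, depth + 1);
        sort(a, i + 1, hi, depth + 1);
    }

    public static void main(String[] args) {
        int[] equal = new int[300];
        Arrays.fill(equal, 42); // 300 identical keys
        sort(equal, 0, equal.length - 1, 1);
        System.out.println(maxDepth); // prints 299: depth ~ n, not log2(n)
    }
}
```

Production sorts avoid this by recursing into the smaller partition first (bounding the stack at log2(n)) and by three-way partitioning of equal keys.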
Re: Hadoop installation folders in multiple nodes
Depending on your windows version, there is a dos command called subst which you could use to virtualize a drive letter on your third machine

On Fri, May 30, 2008 at 4:35 AM, Sridhar Raman [EMAIL PROTECTED] wrote: Should the installation paths be the same on all the nodes? Most documentation seems to suggest that it is _*recommended*_ to have the _*same*_ paths on all the nodes. But what is the workaround if, for some reason, one isn't able to have the same path? That's the problem we are facing right now. After making Hadoop work perfectly in a 2-node cluster, when we tried to accommodate a 3rd machine, we realised that this machine doesn't have an E:, which is where the hadoop installation is on the other 2 nodes. All our machines are Windows machines. The possible solutions are: 1) Move the installations on M1 and M2 to a drive that is present on M3. We will keep this as the last option. 2) Map a folder on M3's D: to E:. We used the subst command to do this. But when we tried to start DFS, it wasn't able to find the hadoop installation. Just to verify, we tried an ssh to the localhost, and were unable to find the mapped drive; it's only visible as a folder on D:. Whereas in the basic cygwin prompt we are able to view E:. 3) Partition M3's D: drive and create an E:. This carries the risk of data loss. So, what should we do? Is there any way we can specify in the NameNode the installation paths of hadoop on each of the remaining nodes? Or is there some environment variable that can be set, which can make the hadoop installation path specific to each machine? Thanks, Sridhar
Re: Hadoop installation folders in multiple nodes
Oops, missed the part where you already tried that. On Mon, Jun 2, 2008 at 3:23 PM, Michael Di Domenico [EMAIL PROTECTED] wrote: Depending on your windows version, there is a dos command called subst which you could use to virtualize a drive letter on your third machine On Fri, May 30, 2008 at 4:35 AM, Sridhar Raman [EMAIL PROTECTED] wrote: Should the installation paths be the same in all the nodes? Most documentation seems to suggest that it is _*recommended*_ to have the _*same *_ paths in all the nodes. But what is the workaround, if, because of some reason, one isn't able to have the same path? That's the problem we are facing right now. After making Hadoop work perfectly in a 2-node cluster, when we tried to accommodate a 3rd machine, we realised that this machine doesn't have a E:, which is where the installation of hadoop is in the other 2 nodes. All our machines are Windows machines. The possible solutions are: 1) Move the installations in M1 M2 to a drive that is present in M3. We will keep this as the last option. 2) Map a folder in M3's D: to E:. We used the subst command to do this. But when we tried to start DFS, it wasn't able to find the hadoop installation. Just to verify, we tried a ssh to the localhost, and were unable to find the mapped drive. It's only visible as a folder of D:. Whereas, in the basic cygwin prompt, we are able to view E:. 3) Partition M3's D drive and create an E. This carries the risk of loss of data. So, what should we do? Is there any way we can specify in the NameNode the installation paths of hadoop in each of the remaining nodes? Or is there some environment variable that can be set, which can make the hadoop installation path specific to each machine? Thanks, Sridhar
Re: DataNode often self-stopped
2008/6/3 Konstantin Shvachko [EMAIL PROTECTED]: Is it possible that your different data-nodes point to the same storage directory on the hard drive? If so, one of the data-nodes will be shut down. In general this is impossible because storage directories are locked once one of the nodes claims them under its authority. But I don't know whether this works in a VMWare environment.

No, they point to different storage directories. Could it be a network problem? The VMWare-simulated network sometimes times out on inter-node pings because of the host server's high load. Could the network being temporarily unavailable (for some seconds) cause the data-nodes to shut themselves down?
create less than 10G data/host with RandomWrite
Hello Hadoopers: I am running RandomWriter on an 8-node cluster. The default setting creates 1G/mapper with 10 mappers/host; considering replication, it essentially creates 30G/host. Each node in the cluster has at most 30G, so my cluster is full and cannot execute further commands. I created a new application configuration file specifying 1G/mapper, but it seems it is still creating 30G of data and still running out of disk on each node. Is this the right way to generate less than 10G of data with RandomWriter? Below is the command and the application configuration file I used:

bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter rand -conf randConfig.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>test.randomwriter.maps_per_host</name>
    <value>10</value>
    <description></description>
  </property>
  <property>
    <name>test.randomwrite.bytes_per_map</name>
    <value>103741824</value>
    <description></description>
  </property>
  <property>
    <name>test.randomwrite.min_key</name>
    <value>10</value>
    <description></description>
  </property>
  <property>
    <name>test.randomwrite.max_key</name>
    <value>1000</value>
    <description></description>
  </property>
  <property>
    <name>test.randomwrite.min_value</name>
    <value>0</value>
    <description>Default block replication.</description>
  </property>
  <property>
    <name>test.randomwrite.max_value</name>
    <value>2</value>
    <description>Default block replication.</description>
  </property>
</configuration>
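[A back-of-the-envelope sizing check for the scenario above, assuming the replication factor of 3 implied by "1G/map x 10 maps/host = 30G/host". One aside worth noting: the posted bytes_per_map value of 103741824 is about 99 MB, not 1 GB (which would be 1073741824). The class below is just arithmetic, not a Hadoop API.]

```java
public class RandomWriterSizing {
    // Total on-disk bytes per host = bytes per map x maps per host x replicas.
    static long perHostBytes(long bytesPerMap, int mapsPerHost, int replication) {
        return bytesPerMap * mapsPerHost * replication;
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        // Defaults from the message: 1 GB/map, 10 maps/host, replication 3.
        System.out.println(perHostBytes(oneGB, 10, 3) / oneGB);                       // 30 GB/host
        // With 100 MB maps instead, the footprint drops well under 10 GB/host.
        System.out.println(perHostBytes(100L * 1024 * 1024, 10, 3) / (1024 * 1024)); // 3000 MB/host
    }
}
```

So to stay under 10G of raw data per host at replication 3, either bytes_per_map or maps_per_host has to come down until the product is below the target.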