Re: Improving locality of table access...
Generate a patch and post it here: https://issues.apache.org/jira/browse/HBASE-675 -- Billy

Arthur van Hoff [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Hi, Below is some code for improving the read performance of large tables by processing each region on the host holding that region. We measured 50-60% lower network bandwidth. To use this class instead of the org.apache.hadoop.hbase.mapred.TableInputFormat class, use: jobconf.setInputFormat(ellerdale.mapreduce.TableInputFormatFix.class); Please send me feedback if you can think of better ways to do this. -- Arthur van Hoff - Grand Master of Alphabetical Order The Ellerdale Project, Menlo Park, CA [EMAIL PROTECTED], 650-283-0842

-- TableInputFormatFix.java --

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
// Author: Arthur van Hoff, [EMAIL PROTECTED]
package ellerdale.mapreduce;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapred.*;

import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.mapred.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.*;
import org.apache.hadoop.hbase.util.*;

//
// Attempt to fix the localized nature of table segments.
// Compute table splits so that they are processed locally.
// Combine multiple splits to avoid the number of splits exceeding numSplits.
// Sort the resulting splits so that the shortest ones are processed last.
// The resulting savings in network bandwidth are significant (we measured 60%).
//
public class TableInputFormatFix extends TableInputFormat {
    public static final int ORIGINAL  = 0;
    public static final int LOCALIZED = 1;
    public static final int OPTIMIZED = 2; // not yet functional

    //
    // A table split with a location.
    //
    static class LocationTableSplit extends TableSplit implements Comparable {
        String location;

        public LocationTableSplit() {
        }

        public LocationTableSplit(byte[] tableName, byte[] startRow, byte[] endRow, String location) {
            super(tableName, startRow, endRow);
            this.location = location;
        }

        public String[] getLocations() {
            return new String[] {location};
        }

        public void readFields(DataInput in) throws IOException {
            super.readFields(in);
            this.location = Bytes.toString(Bytes.readByteArray(in));
        }

        public void write(DataOutput out) throws IOException {
            super.write(out);
            Bytes.writeByteArray(out, Bytes.toBytes(location));
        }

        public int compareTo(Object other) {
            LocationTableSplit otherSplit = (LocationTableSplit) other;
            int result = Bytes.compareTo(getStartRow(), otherSplit.getStartRow());
            return result;
        }

        public String toString() {
            return location.substring(0, location.indexOf('.')) + ":"
                + Bytes.toString(getStartRow()) + "-" + Bytes.toString(getEndRow());
        }
    }

    //
    // A table split with a location that covers multiple regions.
    //
    static class MultiRegionTableSplit extends LocationTableSplit {
        byte[][] regions;

        public MultiRegionTableSplit() {
        }

        public MultiRegionTableSplit(byte[] tableName, String location, byte[][] regions) throws IOException {
            super(tableName, regions[0], regions[regions.length - 1], location);
            this.location = location;
            this.regions = regions;
        }

        public void readFields(DataInput in) throws IOException {
            super.readFields(in);
            int n = in.readInt();
            regions = new byte[n][];
            for (int i = 0; i < n; i++) {
                regions[i] = Bytes.readByteArray(in);
            }
        }

        public void write(DataOutput out) throws IOException {
            super.write(out);
            out.writeInt(regions.length);
            for (int i = 0; i < regions.length; i++) {
                Bytes.writeByteArray(out, regions[i]);
            }
        }

        public String toString() {
            String str = location.substring(0, location.indexOf('.')) + ": ";
            for (int i = 0; i < regions.length; i += 2) {
                if (i > 0) {
                    str += ", ";
                }
                str += Bytes.toString(regions[i]) + "-" +
Re: distcp port for 0.17.2
Hi, The dfs.http.address is for human use, not program interoperability. You can visit http://whatever.address.your.namenode.has:50070 in a web browser and see statistics about your filesystem. The address of cluster 2 is in its fs.default.name. This should be set to something like hdfs://cluster2.master.name:9000/ The file:// protocol only refers to paths on the current machine in its real (non-DFS) filesystem. - Aaron On Wed, Oct 22, 2008 at 3:47 PM, bzheng [EMAIL PROTECTED] wrote: Thanks. The fs.default.name is file:/// and dfs.http.address is 0.0.0.0:50070. I tried: hadoop dfs -ls /path/file to make sure file exists on cluster1 hadoop distcp file:///cluster1_master_node_ip:50070/path/file file:///cluster2_master_node_ip:50070/path/file It gives this error message: 08/10/22 15:43:47 INFO util.CopyFiles: srcPaths=[file:/cluster1_master_node_ip:50070/path/file] 08/10/22 15:43:47 INFO util.CopyFiles: destPath=file:/cluster2_master_node_ip:50070/path/file With failures, global counters are inaccurate; consider running with -i Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source file:/cluster1_master_node_ip:50070/path/file does not exist. at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:578) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763) If I use hdfs:// instead of file:///, I get: Copy failed: java.net.SocketTimeoutException: timed out waiting for rpc response at org.apache.hadoop.ipc.Client.call(Client.java:559) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313) at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102) at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178) at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280) at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:572) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763) s29752-hadoopuser wrote: Hi, There is no such thing called distcp port. distcp uses (generic) file system API and so it does not care about the file system implementation details like port number. It is common to use distcp with HDFS or HFTP. The urls will look like hdfs://namenode:port/path and hftp://namenode:port/path for HDFS and HFTP, respectively. The HDFS and HFTP ports are specified by fs.default.name and dfs.http.address, respectively. Nicholas Sze - Original Message From: bzheng [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, October 22, 2008 11:57:43 AM Subject: distcp port for 0.17.2 What's the port number for distcp in 0.17.2? 
I can't find any documentation on distcp for version 0.17.2. For version 0.18, the documentation says it's 8020. I'm using a standard install and the only open ports associated with hadoop are 50030, 50070, and 50090. None of them work with distcp. So, how do you use distcp in 0.17.2? are there any extra setup/configuration needed? Thanks in advance for your help.
Re: Passing Constants from One Job to the Next
See Configuration.setInt() in the API (JobConf inherits from Configuration). You can read it back in the configure() method of your mappers/reducers. - Aaron On Wed, Oct 22, 2008 at 3:03 PM, Yih Sun Khoo [EMAIL PROTECTED] wrote: Are you saying that I can pass, say, a single integer constant with either of these three: JobConf? An HDFS file? DistributedCache? Or are you asking if I can pass it given the context of: JobConf? An HDFS file? DistributedCache? I'm thinking of how to pass a single int from one JobConf to the next. On Wed, Oct 22, 2008 at 2:57 PM, Arun C Murthy [EMAIL PROTECTED] wrote: On Oct 22, 2008, at 2:52 PM, Yih Sun Khoo wrote: I'd like to hear some good ways of passing constants from one job to the next. Unless I'm missing something: JobConf? An HDFS file? DistributedCache? Arun These are some ways that I can think of: 1) The obvious solution is to carry the constant as part of your value from one job to the next, but that would mean every value would hold that constant 2) Use the reporter as a hack so that you can set the status message and then get the status message back when you need the constant Any other ideas? (Also please do not include code)
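For readers who want that spelled out, here is a minimal sketch of the JobConf approach (the property key "myapp.magic.number" and the class names are invented for illustration): set the value on the JobConf when submitting the job, then read it back in configure() before any records arrive.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class PassConstantExample {

    // The framework calls configure(JobConf) on each task before the first map() call,
    // so the constant can be cached in a field.
    public static class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private int magic;

        public void configure(JobConf job) {
            magic = job.getInt("myapp.magic.number", -1); // -1 is the fallback if the key is unset
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            out.collect(value, new IntWritable(magic));   // use the constant however you need
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(PassConstantExample.class);
        conf.setInt("myapp.magic.number", 42);            // JobConf inherits setInt() from Configuration
        conf.setMapperClass(MyMapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The same getInt() call works in a reducer's configure() method, so the value is visible on both sides of the shuffle without carrying it inside every record.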
Re: Is it possible to change parameters using org.apache.hadoop.conf.Configuration API?
Alex Loddengaard wrote: Just to be clear, you want to persist a configuration change to your entire cluster without bringing it down, and you're hoping to use the Configuration API to do so. Did I get your question right? I don't know of a way to do this without restarting the cluster, because I'm pretty sure Configuration changes will only affect the current job. Does anyone else have a suggestion? Alex It's not doable with the current architecture: confs never get reread, and they aren't persisted. If you did persist them, it would get into a mess if every node in the cluster was reading the same shared file from an NFS mount; the last node to start up and set a value would set it for everything else. Better to have a persistent central configuration management (CM) infrastructure and drive Hadoop that way. Even with CM tooling, the idea of persisting all changes back from the apps is generally considered a bad thing. What is preferred is a GUI for managing system state that lets you view, roll back, and compare configuration changes, because having the app do it means you don't know what's going on, and your app can leave your system in a mess, which is exactly what CM tools try to prevent. -steve
Auto-shutdown for EC2 clusters
Hi folks, Anybody tried scripting Hadoop on EC2 to... 1. Launch a cluster 2. Pull data from S3 3. Run a job 4. Copy results to S3 5. Terminate the cluster ... without any user interaction? -Stuart
Re: Auto-shutdown for EC2 clusters
Hey Stuart, I did that for a client using Cascading events and SQS. When jobs completed, they dropped a message on SQS, where a listener picked up new jobs and ran with them, or decided to kill off the cluster. The currently shipping EC2 scripts are suitable for having multiple simultaneous clusters for this purpose. Cascading has always supported raw file access on S3, and now Hadoop does too (thanks Tom), so this is quite natural. This is the best approach, as data is pulled directly into the Mapper instead of onto HDFS first and then read into the Mapper from HDFS. YMMV chris On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote: Hi folks, Anybody tried scripting Hadoop on EC2 to... 1. Launch a cluster 2. Pull data from S3 3. Run a job 4. Copy results to S3 5. Terminate the cluster ... without any user interaction? -Stuart -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Auto-shutdown for EC2 clusters
We're doing the same thing, but doing the scheduling just with shell scripts running on a machine outside of the Hadoop cluster. It works, but we're getting into a bit of scripting hell as things get more complex. We're using distcp to first copy the files the jobs need from S3 to HDFS, and it works nicely. When going the other direction we have to pull the data down from HDFS to one of the EC2 machines and then push it back up to S3. If I understand things right, the support for that will be better in 0.19. / Per On Thu, Oct 23, 2008 at 8:52 AM, Chris K Wensel [EMAIL PROTECTED] wrote: Hey Stuart, I did that for a client using Cascading events and SQS. When jobs completed, they dropped a message on SQS, where a listener picked up new jobs and ran with them, or decided to kill off the cluster. The currently shipping EC2 scripts are suitable for having multiple simultaneous clusters for this purpose. Cascading has always supported raw file access on S3, and now Hadoop does too (thanks Tom), so this is quite natural. This is the best approach, as data is pulled directly into the Mapper instead of onto HDFS first and then read into the Mapper from HDFS. YMMV chris On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote: Hi folks, Anybody tried scripting Hadoop on EC2 to... 1. Launch a cluster 2. Pull data from S3 3. Run a job 4. Copy results to S3 5. Terminate the cluster ... without any user interaction? -Stuart -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Auto-shutdown for EC2 clusters
Hi Stuart, Yes, we do that. Ditto on most of what Chris described. We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from S3 when it launches. That controls the versions for tools/frameworks, instead of redoing an AMI each time a tool has an update. A remote server -- in our data center -- acts as a controller, to launch and manage the cluster. FWIW, an engineer here wrote those scripts in Python using boto. We had to patch boto; the patch was submitted back. The mapper for the first MR job in the workflow streams in data from S3. Reducers in subsequent jobs have the option to write output to S3 (as Chris mentioned). After the last MR job in the workflow completes, it pushes a message into SQS. The remote server polls SQS, then performs a shutdown of the cluster. We may replace the use of SQS with RabbitMQ, which is more flexible for brokering other kinds of messages between Hadoop on AWS and the controller/consumer of results back in our data center. This workflow could be initiated from a crontab -- totally automated. However, we still see occasional failures of the cluster and must restart manually, but not often. Stability for that has improved a lot since the 0.18 release. For us, it's getting closer to total automation. FWIW, that's running on EC2 m1.xl instances. Paco On Thu, Oct 23, 2008 at 9:47 AM, Stuart Sierra [EMAIL PROTECTED] wrote: Hi folks, Anybody tried scripting Hadoop on EC2 to... 1. Launch a cluster 2. Pull data from S3 3. Run a job 4. Copy results to S3 5. Terminate the cluster ... without any user interaction? -Stuart
[Help needed] Is there a way to know the input filename at Hadoop Streaming?
Sorry for the email. Thanks for any help or hint. I am using Hadoop Streaming. The input is multiple files. Is there a way to get the current filename in the mapper? For example:

$HADOOP_HOME/bin/hadoop \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -input file2 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer

In the mapper:

while (<STDIN>) {
    # how to tell whether the current line is from file1 or file2?
}
RE: Is there a way to know the input filename at Hadoop Streaming?
Thanks, Amogh. But my case is slightly different. The command line inputs are 2 files: file1 and file2. I need to tell in the mapper which line is from which file:

# In the mapper
while (<STDIN>) {
    # how to tell whether the current line is from file1 or file2?
}

The jobconf's map.input.file param does not help in this case because file1 and file2 are both inputs. -Steve --- On Thu, 10/23/08, Amogh Vasekar [EMAIL PROTECTED] wrote: From: Amogh Vasekar [EMAIL PROTECTED] Subject: RE: Is there a way to know the input filename at Hadoop Streaming? To: [EMAIL PROTECTED] Date: Thursday, October 23, 2008, 12:11 AM Personally I haven't worked with streaming, but I guess your jobconf's map.input.file param should do it for you. -Original Message- From: Steve Gao [mailto:[EMAIL PROTECTED] Sent: Thursday, October 23, 2008 7:26 AM To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Subject: Is there a way to know the input filename at Hadoop Streaming? I am using Hadoop Streaming. The input is multiple files. Is there a way to get the current filename in the mapper? For example:

$HADOOP_HOME/bin/hadoop \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -input file2 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer

In the mapper:

while (<STDIN>) {
    # how to tell whether the current line is from file1 or file2?
}
Re: [Help needed] Is there a way to know the input filename at Hadoop Streaming?
I guess one trick you can do without the help of Hadoop is to encode the file identifier inside the file itself. For example, each line of file1 could start with "1", then a space, then the content of the original line. - Original Message From: Steve Gao [EMAIL PROTECTED] To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Sent: Thursday, October 23, 2008 1:48:11 PM Subject: [Help needed] Is there a way to know the input filename at Hadoop Streaming? Sorry for the email. Thanks for any help or hint. I am using Hadoop Streaming. The input is multiple files. Is there a way to get the current filename in the mapper? For example:

$HADOOP_HOME/bin/hadoop \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -input file2 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer

In the mapper:

while (<STDIN>) {
    # how to tell whether the current line is from file1 or file2?
}
Re: Is there a way to know the input filename at Hadoop Streaming?
On Wed, Oct 22, 2008 at 18:55, Steve Gao [EMAIL PROTECTED] wrote: I am using Hadoop Streaming. The input is multiple files. Is there a way to get the current filename in the mapper? Streaming map tasks should have a map_input_file environment variable, like the following: map_input_file=hdfs://HOST/path/to/file rick For example:

$HADOOP_HOME/bin/hadoop \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -input file2 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer

In the mapper:

while (<STDIN>) {
    # how to tell whether the current line is from file1 or file2?
}
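To spell out Rick's answer: streaming exports job configuration properties into each task's environment with non-alphanumeric characters replaced by underscores, so map.input.file appears to the mapper process as map_input_file. Below is a small sketch of a mapper that uses it, written in Java purely for concreteness (the class name is made up; any language that can read stdin and environment variables works the same way, e.g. $ENV{'map_input_file'} in Perl or $map_input_file in a shell script):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical streaming mapper: prefixes every line with the file it came from,
// so later stages can tell file1 lines from file2 lines.
public class TagWithInputFile {
    public static void main(String[] args) throws IOException {
        String inputFile = System.getenv("map_input_file"); // e.g. hdfs://HOST/path/to/file1
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(inputFile + "\t" + line);
        }
    }
}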
Re: distcp port for 0.17.2
It's working for me now. Turns out the cluster have multiple network interfaces and I was using the wrong one. Thanks. Aaron Kimball-3 wrote: Hi, The dfs.http.address is for human use, not program interoperability. You can visit http://whatever.address.your.namenode.has:50070 in a web browser and see statistics about your filesystem. The address of cluster 2 is in its fs.default.name. This should be set to something like hdfs://cluster2.master.name:9000/ The file:// protocol only refers to paths on the current machine in its real (non-DFS) filesystem. - Aaron On Wed, Oct 22, 2008 at 3:47 PM, bzheng [EMAIL PROTECTED] wrote: Thanks. The fs.default.name is file:/// and dfs.http.address is 0.0.0.0:50070. I tried: hadoop dfs -ls /path/file to make sure file exists on cluster1 hadoop distcp file:///cluster1_master_node_ip:50070/path/file file:///cluster2_master_node_ip:50070/path/file It gives this error message: 08/10/22 15:43:47 INFO util.CopyFiles: srcPaths=[file:/cluster1_master_node_ip:50070/path/file] 08/10/22 15:43:47 INFO util.CopyFiles: destPath=file:/cluster2_master_node_ip:50070/path/file With failures, global counters are inaccurate; consider running with -i Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source file:/cluster1_master_node_ip:50070/path/file does not exist. at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:578) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763) If I use hdfs:// instead of file:///, I get: Copy failed: java.net.SocketTimeoutException: timed out waiting for rpc response at org.apache.hadoop.ipc.Client.call(Client.java:559) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313) at org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102) at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178) at org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280) at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175) at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:572) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763) s29752-hadoopuser wrote: Hi, There is no such thing called distcp port. distcp uses (generic) file system API and so it does not care about the file system implementation details like port number. It is common to use distcp with HDFS or HFTP. The urls will look like hdfs://namenode:port/path and hftp://namenode:port/path for HDFS and HFTP, respectively. The HDFS and HFTP ports are specified by fs.default.name and dfs.http.address, respectively. 
Nicholas Sze - Original Message From: bzheng [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, October 22, 2008 11:57:43 AM Subject: distcp port for 0.17.2 What's the port number for distcp in 0.17.2? I can't find any documentation on distcp for version 0.17.2. For version 0.18, the documentation says it's 8020. I'm using a standard install and the only open ports associated with hadoop are 50030, 50070, and 50090. None of them work with distcp. So, how do you use distcp in 0.17.2? are there any extra setup/configuration needed? Thanks in advance for your help.
Re: LHadoop Server simple Hadoop input and output
Hey Edward, The Thrift interface to HDFS allows clients to be developed in any Thrift-supported language: http://wiki.apache.org/hadoop/HDFS-APIs. Regards, Jeff On Thu, Oct 23, 2008 at 1:04 PM, Edward Capriolo [EMAIL PROTECTED] wrote: One of my first questions about Hadoop was: how do systems outside the cluster interact with the file system? I read several documents that described streaming data into Hadoop for processing, but I had trouble finding examples. The goal of LHadoop Server (the L stands for Lightweight) is to provide a VERY simple interface that allows streaming READ and WRITE access to Hadoop. The client side of the connection interacts using a simple text-based protocol. Any type of client (Perl, C++, telnet) can interact with Hadoop; there is no need to have Java on the client. The protocol works like this:

bash-3.2# nc localhost 9090
AUTH ecapriolo password
serverOK:AUTH
READ /letsgo
serverOK.
OMG. Is this going to work
Lets see
^C

Site: http://www.jointhegrid.com/jtgweb/lhadoopserver/ SVN: http://www.jointhegrid.com/jtgwebrepo/jtglhadoopserver I know several other methods exist to get access to Hadoop, including FUSE. Again, I could not find anyone doing something like this. Does anyone have any ideas, or think this is useful? Thank you,
Re: LHadoop Server simple Hadoop input and output
I had downloaded thrift and ran the example applications after the Hive meet up. It is very cool stuff. The thriftfs interface is more elegant than what I was trying to do, and that implementation is more complete. Still, someone might be interested in what I did if they want a super-light API :) I will link to http://wiki.apache.org/hadoop/HDFS-APIs from my page so people know the options.
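For completeness, the other common option, for clients that can run Java, is the FileSystem API itself, which talks to the namenode over Hadoop RPC. A minimal write sketch (the hostname, port, and path below are placeholders, not values from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000"); // placeholder namenode address
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"));
        out.writeBytes("streamed in from outside the cluster\n");
        out.close();
        fs.close();
    }
}

Reading is symmetric via fs.open(new Path(...)); the gateway approaches above matter mostly for clients that cannot run Java.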
Seeking Someone to Review Hadoop Article
Each month the developers at my company write a short article about a Java technology we find exciting. I've just finished one about Hadoop for November and am seeking a volunteer knowledgeable about Hadoop to look it over to help ensure it's both clear and technically accurate. If you're interested in helping me, please contact me offlist and I will send you the draft. Meanwhile, you can get a feel for the length and general style of the articles from our archives: http://www.ociweb.com/articles/publications/jnb.html Thanks in advance, Tom Wheeler
Re: Seeking Someone to Review Hadoop Article
I'm interested in it. On Fri, Oct 24, 2008 at 6:31 AM, Tom Wheeler [EMAIL PROTECTED] wrote: Each month the developers at my company write a short article about a Java technology we find exciting. I've just finished one about Hadoop for November and am seeking a volunteer knowledgeable about Hadoop to look it over to help ensure it's both clear and technically accurate. If you're interested in helping me, please contact me offlist and I will send you the draft. Meanwhile, you can get a feel for the length and general style of the articles from our archives: http://www.ociweb.com/articles/publications/jnb.html Thanks in advance, Tom Wheeler -- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.