Re: Improving locality of table access...

2008-10-23 Thread Billy Pearson

Generate a patch and post it here:
https://issues.apache.org/jira/browse/HBASE-675

Billy

Arthur van Hoff [EMAIL PROTECTED] wrote in 
message news:[EMAIL PROTECTED]

Hi,

Below is some code for improving the read performance of large tables by
processing each region on the host holding that region. We measured 50-60%
lower network bandwidth.

To use this class instead of org.apache.hadoop.hbase.mapred.TableInputFormat,
use:

   jobconf.setInputFormat(ellerdale.mapreduce.TableInputFormatFix.class);
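
For context, a minimal driver setup might look something like the sketch
below. This is only an illustrative sketch, not part of the patch: it assumes
the 0.18-era org.apache.hadoop.hbase.mapred API (table name passed as the
input path, scanned columns via TableInputFormat.COLUMN_LIST), and the job
class, table name, and column family are placeholders.

   JobConf jobconf = new JobConf(MyTableJob.class);
   // Table name is read from the input path; columns to scan from COLUMN_LIST.
   FileInputFormat.setInputPaths(jobconf, new Path("my_table"));
   jobconf.set(TableInputFormat.COLUMN_LIST, "contents:");
   jobconf.setInputFormat(ellerdale.mapreduce.TableInputFormatFix.class);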

Please send me feedback if you can think of better ways to do this.

--
Arthur van Hoff - Grand Master of Alphabetical Order
The Ellerdale Project, Menlo Park, CA
[EMAIL PROTECTED], 650-283-0842


-- TableInputFormatFix.java --

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements.  See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership.  The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License.  You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// Author: Arthur van Hoff, [EMAIL PROTECTED]

package ellerdale.mapreduce;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapred.*;

import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.mapred.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.*;
import org.apache.hadoop.hbase.util.*;

//
// Attempt to fix the localized nature of table segments.
// Compute table splits so that they are processed locally.
// Combine multiple splits to avoid the number of splits exceeding numSplits.
// Sort the resulting splits so that the shortest ones are processed last.
// The resulting savings in network bandwidth are significant (we measured 60%).
//
public class TableInputFormatFix extends TableInputFormat
{
    public static final int ORIGINAL  = 0;
    public static final int LOCALIZED = 1;
    public static final int OPTIMIZED = 2;    // not yet functional

    //
    // A table split with a location.
    //
    static class LocationTableSplit extends TableSplit implements Comparable
    {
        String location;

        public LocationTableSplit()
        {
        }
        public LocationTableSplit(byte[] tableName, byte[] startRow, byte[] endRow, String location)
        {
            super(tableName, startRow, endRow);
            this.location = location;
        }
        public String[] getLocations()
        {
            // Report the region's host so the split is scheduled on that node.
            return new String[] {location};
        }
        public void readFields(DataInput in) throws IOException
        {
            super.readFields(in);
            this.location = Bytes.toString(Bytes.readByteArray(in));
        }
        public void write(DataOutput out) throws IOException
        {
            super.write(out);
            Bytes.writeByteArray(out, Bytes.toBytes(location));
        }
        public int compareTo(Object other)
        {
            // Order splits by start row.
            LocationTableSplit otherSplit = (LocationTableSplit)other;
            return Bytes.compareTo(getStartRow(), otherSplit.getStartRow());
        }
        public String toString()
        {
            return location.substring(0, location.indexOf('.')) + ": " +
                Bytes.toString(getStartRow()) + "-" + Bytes.toString(getEndRow());
        }
    }

    //
    // A table split with a location that covers multiple regions.
    //
    static class MultiRegionTableSplit extends LocationTableSplit
    {
        byte[][] regions;    // alternating start/end rows of the covered regions

        public MultiRegionTableSplit()
        {
        }
        public MultiRegionTableSplit(byte[] tableName, String location, byte[][] regions) throws IOException
        {
            super(tableName, regions[0], regions[regions.length - 1], location);
            this.location = location;
            this.regions = regions;
        }
        public void readFields(DataInput in) throws IOException
        {
            super.readFields(in);
            int n = in.readInt();
            regions = new byte[n][];
            for (int i = 0; i < n; i++) {
                regions[i] = Bytes.readByteArray(in);
            }
        }
        public void write(DataOutput out) throws IOException
        {
            super.write(out);
            out.writeInt(regions.length);
            for (int i = 0; i < regions.length; i++) {
                Bytes.writeByteArray(out, regions[i]);
            }
        }
        public String toString()
        {
            String str = location.substring(0, location.indexOf('.')) + ": ";
            for (int i = 0; i < regions.length; i += 2) {
                if (i > 0) {
                    str += ", ";
                }
                str += Bytes.toString(regions[i]) + "-" +

Re: distcp port for 0.17.2

2008-10-23 Thread Aaron Kimball
Hi,

The dfs.http.address is for human use, not program interoperability. You can
visit http://whatever.address.your.namenode.has:50070 in a web browser and
see statistics about your filesystem.

The address of cluster 2 is in its fs.default.name. This should be set to
something like hdfs://cluster2.master.name:9000/
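
So, assuming both clusters use port 9000 in their fs.default.name (the
hostnames below are placeholders), the copy would look something like:

 hadoop distcp hdfs://cluster1.master.name:9000/path/file hdfs://cluster2.master.name:9000/path/file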

The file:// protocol only refers to paths on the current machine in its
real (non-DFS) filesystem.
- Aaron

On Wed, Oct 22, 2008 at 3:47 PM, bzheng [EMAIL PROTECTED] wrote:


 Thanks.  The fs.default.name is file:/// and dfs.http.address is
 0.0.0.0:50070.  I tried:

 hadoop dfs -ls /path/file to make sure file exists on cluster1
 hadoop distcp file:///cluster1_master_node_ip:50070/path/file
 file:///cluster2_master_node_ip:50070/path/file

 It gives this error message:
 08/10/22 15:43:47 INFO util.CopyFiles:
 srcPaths=[file:/cluster1_master_node_ip:50070/path/file]
 08/10/22 15:43:47 INFO util.CopyFiles:
 destPath=file:/cluster2_master_node_ip:50070/path/file
 With failures, global counters are inaccurate; consider running with -i
 Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source
 file:/cluster1_master_node_ip:50070/path/file does not exist.
at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:578)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)


 If I use hdfs:// instead of file:///, I get:
 Copy failed: java.net.SocketTimeoutException: timed out waiting for rpc
 response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
at
 org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178)
at

 org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:572)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)



 s29752-hadoopuser wrote:
 
  Hi,
 
  There is no such thing called distcp port.  distcp uses (generic) file
  system API and so it does not care about the file system implementation
  details like port number.
 
  It is common to use distcp with HDFS or HFTP.  The urls will look like
  hdfs://namenode:port/path and hftp://namenode:port/path for HDFS and
 HFTP,
  respectively.   The HDFS and HFTP ports are specified by fs.default.name
  and dfs.http.address, respectively.
 
  Nicholas Sze
 
 
 
 
  - Original Message 
  From: bzheng [EMAIL PROTECTED]
  To: core-user@hadoop.apache.org
  Sent: Wednesday, October 22, 2008 11:57:43 AM
  Subject: distcp port for 0.17.2
 
 
  What's the port number for distcp in 0.17.2?  I can't find any
  documentation
  on distcp for version 0.17.2.  For version 0.18, the documentation says
  it's
  8020.
 
  I'm using a standard install and the only open ports associated with
  hadoop
  are 50030, 50070, and 50090.  None of them work with distcp.  So, how do
  you
  use distcp in 0.17.2?  are there any extra setup/configuration needed?
 
  Thanks in advance for your help.
  --
  View this message in context:
  http://www.nabble.com/distcp-port-for-0.17.2-tp20117463p20117463.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 

 --
 View this message in context:
 http://www.nabble.com/distcp-port-for-0.17.2-tp20117463p20121246.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Passing Constants from One Job to the Next

2008-10-23 Thread Aaron Kimball
See Configuration.setInt() in the API (JobConf inherits from
Configuration). You can read it back in the configure() method of your
mappers/reducers.
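
A minimal sketch of that pattern with the old mapred API (the property name
and classes here are just placeholders, not anything Hadoop defines):

    // In the driver: stash the constant in the job's configuration.
    // (imports: org.apache.hadoop.mapred.*, org.apache.hadoop.io.*, java.io.IOException)
    JobConf job = new JobConf(MyJob.class);
    job.setInt("myapp.magic.number", 42);

    // In the mapper (same idea for a reducer): read it back in configure().
    public static class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private int magic;

      public void configure(JobConf conf) {
        magic = conf.getInt("myapp.magic.number", -1);  // -1 if the property is unset
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // use 'magic' here when emitting output
      }
    }
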
- Aaron

On Wed, Oct 22, 2008 at 3:03 PM, Yih Sun Khoo [EMAIL PROTECTED] wrote:

 Are you saying that I can pass, say, a single integer constant with either
 of these three: JobConf? A HDFS file? DistributedCache?
 Or are you asking if I can pass given the context of: JobConf? A HDFS file?
 DistributedCache?
 I'm thinking of how to pass a single int from one JobConf to the next.

 On Wed, Oct 22, 2008 at 2:57 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

 
  On Oct 22, 2008, at 2:52 PM, Yih Sun Khoo wrote:
 
   I like to hear some good ways of passing constants from one job to the
  next.
 
 
  Unless I'm missing something: JobConf? A HDFS file? DistributedCache?
 
  Arun
 
 
 
  These are some ways that I can think of:
  1)  The obvious solution is to carry the constant as part of your value
  from
  one job to the next, but that would mean every value would hold that
  constant
  2)  Use the reporter as a hack so that you can set the status message
 and
  then get the status message back when you need the constant
 
  Any other ideas?  (Also please do not include code)
 
 
 



Re: Is it possible to change parameters using org.apache.hadoop.conf.Configuration API?

2008-10-23 Thread Steve Loughran

Alex Loddengaard wrote:

Just to be clear: you want to persist a configuration change to your entire
cluster without bringing it down, and you're hoping to use the Configuration
API to do so. Did I get your question right?

I don't know of a way to do this without restarting the cluster, because I'm
pretty sure Configuration changes will only affect the current job.  Does
anyone else have a suggestion?

Alex



It's not doable with the current architecture:
 - confs never get reread
 - they aren't persisted

If you did persist them, it would get into a mess if every node in the 
cluster was reading the same shared file from an NFS mount; the last 
node to start up and set a value would set it for everything else. 
Better to have a persistent central configuration management (CM) 
infrastructure and drive hadoop that way.


Even with CM tooling, the idea of persisting all changes back from the
apps is generally considered a bad thing. What is preferred is a GUI for
managing system state that lets you view, roll back, and compare
configuration changes, as having the app do it means you don't know
what's going on - and your app can leave your system in a mess, which is
exactly what CM tools try to prevent.


-steve


Auto-shutdown for EC2 clusters

2008-10-23 Thread Stuart Sierra
Hi folks,
Anybody tried scripting Hadoop on EC2 to...
1. Launch a cluster
2. Pull data from S3
3. Run a job
4. Copy results to S3
5. Terminate the cluster
... without any user interaction?

-Stuart


Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Chris K Wensel

Hey Stuart

I did that for a client using Cascading events and SQS.

When jobs completed, they dropped a message on SQS where a listener  
picked up new jobs and ran with them, or decided to kill off the  
cluster. The currently shipping EC2 scripts are suitable for having  
multiple simultaneous clusters for this purpose.


Cascading has always supported, and Hadoop now supports (thanks Tom), raw
file access on S3, so this is quite natural. This is the best approach, as
data is pulled directly into the Mapper instead of being copied onto HDFS
first and then read into the Mapper from HDFS.


YMMV

chris

On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:


Hi folks,
Anybody tried scripting Hadoop on EC2 to...
1. Launch a cluster
2. Pull data from S3
3. Run a job
4. Copy results to S3
5. Terminate the cluster
... without any user interaction?

-Stuart


--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Per Jacobsson
We're doing the same thing, but doing the scheduling just with shell scripts
running on a machine outside of the Hadoop cluster. It works, but we're
getting into a bit of scripting hell as things get more complex.

We're using distcp to first copy the files the jobs need from S3 to HDFS, and
it works nicely. When going the other direction we have to pull the data
down from HDFS to one of the EC2 machines and then push it back up to S3. If
I understand things right, support for that will be better in 0.19.
/ Per

On Thu, Oct 23, 2008 at 8:52 AM, Chris K Wensel [EMAIL PROTECTED] wrote:

 Hey Stuart

 I did that for a client using Cascading events and SQS.

 When jobs completed, they dropped a message on SQS where a listener picked
 up new jobs and ran with them, or decided to kill off the cluster. The
 currently shipping EC2 scripts are suitable for having multiple simultaneous
 clusters for this purpose.

 Cascading has always and now Hadoop supports (thanks Tom) raw file access
 on S3, so this is quite natural. This is the best approach as data is pulled
 directly into the Mapper, instead of onto HDFS first, then read into the
 Mapper from HDFS.

 YMMV

 chris


 On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:

  Hi folks,
 Anybody tried scripting Hadoop on EC2 to...
 1. Launch a cluster
 2. Pull data from S3
 3. Run a job
 4. Copy results to S3
 5. Terminate the cluster
 ... without any user interaction?

 -Stuart


 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/




Re: Auto-shutdown for EC2 clusters

2008-10-23 Thread Paco NATHAN
Hi Stuart,

Yes, we do that.  Ditto on most of what Chris described.

We use an AMI which pulls tarballs for Ant, Java, Hadoop, etc., from
S3 when it launches. That controls the versions for tools/frameworks,
instead of redoing an AMI each time a tool has an update.

A remote server -- in our data center -- acts as a controller, to
launch and manage the cluster.  FWIW, an engineer here wrote those
scripts in Python using boto.  Had to patch boto, which was
submitted.

The mapper for the first MR job in the workflow streams in data from
S3.  Reducers in subsequent jobs have the option to write output to S3
(as Chris mentioned).

After the last MR job in the workflow completes, it pushes a message
into SQS.  The remote server polls SQS, then performs a shutdown of
the cluster.

We may replace use of SQS with RabbitMQ -- more flexible to broker
other kinds of messages between Hadoop on AWS and the
controller/consumer of results back in our data center.


This workflow could be initiated from a crontab -- totally automated.
However, we still see occasional failures of the cluster, and must
restart manually, but not often.  Stability for that has improved much
since the 0.18 release.  For us, it's getting closer to total
automation.

FWIW, that's running on EC2 m1.xl instances.

Paco



On Thu, Oct 23, 2008 at 9:47 AM, Stuart Sierra [EMAIL PROTECTED] wrote:
 Hi folks,
 Anybody tried scripting Hadoop on EC2 to...
 1. Launch a cluster
 2. Pull data from S3
 3. Run a job
 4. Copy results to S3
 5. Terminate the cluster
 ... without any user interaction?

 -Stuart



[Help needed] Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Steve Gao
Sorry for the email. Thanks for any help or hint.

    I am using Hadoop Streaming. The input are multiple files.
    Is there a way to get the current filename in mapper?

    For example:
    $HADOOP_HOME/bin/hadoop  \
    jar $HADOOP_HOME/hadoop-streaming.jar \
    -input file1 \
    -input file2 \
    -output myOutputDir \
    -mapper mapper \
    -reducer reducer

    In mapper:
    while (STDIN){
  //how to tell the current line is from file1 or file2?
    }



  

RE: Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Steve Gao
Thanks, Amogh. But my case is slightly different. The command line inputs are 2 
files: file1 and file2. I need to tell in the mapper which line is from which 
file:
#In mapper
while (STDIN){
  //how to tell the current line is from file1 or file2?
}

The jobconf's map.input.file param does not help in this case
because file1 and file2 are both inputs.

-Steve

--- On Thu, 10/23/08, Amogh Vasekar [EMAIL PROTECTED] wrote:
From: Amogh Vasekar [EMAIL PROTECTED]
Subject: RE: Is there a way to know the input filename at Hadoop Streaming?
To: [EMAIL PROTECTED]
Date: Thursday, October 23, 2008, 12:11 AM

Personally I haven't worked with streaming, but I guess your jobconf's
map.input.file param should do it for you.
-Original Message-
From: Steve Gao [mailto:[EMAIL PROTECTED] 
Sent: Thursday, October 23, 2008 7:26 AM
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Subject: Is there a way to know the input filename at Hadoop Streaming?

I am using Hadoop Streaming. The input are multiple files.
Is there a way to get the current filename in mapper?

For example:
$HADOOP_HOME/bin/hadoop  \
jar $HADOOP_HOME/hadoop-streaming.jar \
-input file1 \
-input file2 \
-output myOutputDir \
-mapper mapper \
-reducer reducer

In mapper:
while (STDIN){
  //how to tell the current line is from file1 or file2?
}




  



  

Re: [Help needed] Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Zhengguo 'Mike' SUN
I guess one trick you can do without the help of Hadoop is to encode the file
identifier inside the file itself. For example, each line of file1 could start
with "1", then a space, then the content of the original line.



- Original Message 
From: Steve Gao [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Cc: [EMAIL PROTECTED]
Sent: Thursday, October 23, 2008 1:48:11 PM
Subject: [Help needed] Is there a way to know the input filename at Hadoop 
Streaming?

Sorry for the email. Thanks for any help or hint.

I am using Hadoop Streaming. The input are multiple files.
Is there a way to get the current filename in mapper?

For example:
$HADOOP_HOME/bin/hadoop  \
jar $HADOOP_HOME/hadoop-streaming.jar \
-input file1 \
-input file2 \
-output myOutputDir \
-mapper mapper \
-reducer reducer

In mapper:
while (STDIN){
  //how to tell the current line is from file1 or file2?
}


  

Re: Is there a way to know the input filename at Hadoop Streaming?

2008-10-23 Thread Rick Cox
On Wed, Oct 22, 2008 at 18:55, Steve Gao [EMAIL PROTECTED] wrote:
 I am using Hadoop Streaming. The input are multiple files.
 Is there a way to get the current filename in mapper?


Streaming map tasks should have a map_input_file environment
variable like the following:

map_input_file=hdfs://HOST/path/to/file
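(Streaming exports job configuration properties to the task's environment
with dots replaced by underscores, so a mapper script can read it there,
e.g. $ENV{"map_input_file"} in Perl or os.environ["map_input_file"] in
Python, and branch on the filename.)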

rick

 For example:
 $HADOOP_HOME/bin/hadoop  \
 jar $HADOOP_HOME/hadoop-streaming.jar \
-input file1 \
-input file2 \
-output myOutputDir \
-mapper mapper \
-reducer reducer

 In mapper:
 while (STDIN){
  //how to tell the current line is from file1 or file2?
 }







Re: distcp port for 0.17.2

2008-10-23 Thread bzheng

It's working for me now.  Turns out the cluster has multiple network
interfaces and I was using the wrong one.  Thanks.



Aaron Kimball-3 wrote:
 
 Hi,
 
 The dfs.http.address is for human use, not program interoperability. You
 can
 visit http://whatever.address.your.namenode.has:50070 in a web browser and
 see statistics about your filesystem.
 
 The address of cluster 2 is in its fs.default.name. This should be set to
 something like hdfs://cluster2.master.name:9000/
 
 The file:// protocol only refers to paths on the current machine in its
 real (non-DFS) filesystem.
 - Aaron
 
 On Wed, Oct 22, 2008 at 3:47 PM, bzheng [EMAIL PROTECTED] wrote:
 

 Thanks.  The fs.default.name is file:/// and dfs.http.address is
 0.0.0.0:50070.  I tried:

 hadoop dfs -ls /path/file to make sure file exists on cluster1
 hadoop distcp file:///cluster1_master_node_ip:50070/path/file
 file:///cluster2_master_node_ip:50070/path/file

 It gives this error message:
 08/10/22 15:43:47 INFO util.CopyFiles:
 srcPaths=[file:/cluster1_master_node_ip:50070/path/file]
 08/10/22 15:43:47 INFO util.CopyFiles:
 destPath=file:/cluster2_master_node_ip:50070/path/file
 With failures, global counters are inaccurate; consider running with -i
 Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source
 file:/cluster1_master_node_ip:50070/path/file does not exist.
at
 org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:578)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)


 If I use hdfs:// instead of file:///, I get:
 Copy failed: java.net.SocketTimeoutException: timed out waiting for rpc
 response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.dfs.$Proxy0.getProtocolVersion(Unknown
 Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:313)
at
 org.apache.hadoop.dfs.DFSClient.createRPCNamenode(DFSClient.java:102)
at org.apache.hadoop.dfs.DFSClient.init(DFSClient.java:178)
at

 org.apache.hadoop.dfs.DistributedFileSystem.initialize(DistributedFileSystem.java:68)
at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1280)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:56)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1291)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:203)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
at
 org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:572)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)



 s29752-hadoopuser wrote:
 
  Hi,
 
  There is no such thing called distcp port.  distcp uses (generic) file
  system API and so it does not care about the file system implementation
  details like port number.
 
  It is common to use distcp with HDFS or HFTP.  The urls will look like
  hdfs://namenode:port/path and hftp://namenode:port/path for HDFS and
 HFTP,
  respectively.   The HDFS and HFTP ports are specified by
 fs.default.name
  and dfs.http.address, respectively.
 
  Nicholas Sze
 
 
 
 
  - Original Message 
  From: bzheng [EMAIL PROTECTED]
  To: core-user@hadoop.apache.org
  Sent: Wednesday, October 22, 2008 11:57:43 AM
  Subject: distcp port for 0.17.2
 
 
  What's the port number for distcp in 0.17.2?  I can't find any
  documentation
  on distcp for version 0.17.2.  For version 0.18, the documentation
 says
  it's
  8020.
 
  I'm using a standard install and the only open ports associated with
  hadoop
  are 50030, 50070, and 50090.  None of them work with distcp.  So, how
 do
  you
  use distcp in 0.17.2?  are there any extra setup/configuration needed?
 
  Thanks in advance for your help.
  --
  View this message in context:
  http://www.nabble.com/distcp-port-for-0.17.2-tp20117463p20117463.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 

 --
 View this message in context:
 http://www.nabble.com/distcp-port-for-0.17.2-tp20117463p20121246.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/distcp-port-for-0.17.2-tp20117463p20137577.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: LHadoop Server simple Hadoop input and output

2008-10-23 Thread Jeff Hammerbacher
Hey Edward,

The Thrift interface to HDFS allows clients to be developed in any
Thrift-supported language: http://wiki.apache.org/hadoop/HDFS-APIs.

Regards,
Jeff

On Thu, Oct 23, 2008 at 1:04 PM, Edward Capriolo [EMAIL PROTECTED] wrote:
 One of my first questions about Hadoop was, "How do systems outside
 the cluster interact with the file system?" I read several documents
 that described streaming data into Hadoop for processing, but I had
 trouble finding examples.

 The goal of LHadoop Server (the L stands for Lightweight) is to produce a
 VERY simple interface to allow streaming READ and WRITE access to
 Hadoop. The client side of the connection interacts using a simple
 text-based protocol. Any type of client - perl, c++, telnet - can
 interact with Hadoop. There is no need to have Java on the client.

 The protocol works like this:

 bash-3.2# nc localhost 9090
 AUTH ecapriolo password
 serverOK:AUTH
 READ /letsgo
 serverOK.
 OMG.
 Is this going to work
 Lets see
 ^C

 Site:
 http://www.jointhegrid.com/jtgweb/lhadoopserver/
 SVN:
 http://www.jointhegrid.com/jtgwebrepo/jtglhadoopserver

 I know several other methods exist to get access to Hadoop, including
 FUSE. Again, I could not find anyone doing something like this. Does
 anyone have any ideas or think this is useful?

 Thank you,



Re: LHadoop Server simple Hadoop input and output

2008-10-23 Thread Edward Capriolo
I downloaded Thrift and ran the example applications after the
Hive meetup. It is very cool stuff. The thriftfs interface is more
elegant than what I was trying to do, and that implementation is more
complete.

Still, someone might be interested in what I did if they want a
super-light API :)

I will link to http://wiki.apache.org/hadoop/HDFS-APIs from my page so
people know the options.


Seeking Someone to Review Hadoop Article

2008-10-23 Thread Tom Wheeler
Each month the developers at my company write a short article about a
Java technology we find exciting. I've just finished one about Hadoop
for November and am seeking a volunteer knowledgeable about Hadoop to
look it over to help ensure it's both clear and technically accurate.

If you're interested in helping me, please contact me offlist and I
will send you the draft.  Meanwhile, you can get a feel for the length
and general style of the articles from our archives:

   http://www.ociweb.com/articles/publications/jnb.html

Thanks in advance,

Tom Wheeler


Re: Seeking Someone to Review Hadoop Article

2008-10-23 Thread Mafish Liu
I'm interested in it.

On Fri, Oct 24, 2008 at 6:31 AM, Tom Wheeler [EMAIL PROTECTED] wrote:

 Each month the developers at my company write a short article about a
 Java technology we find exciting. I've just finished one about Hadoop
 for November and am seeking a volunteer knowledgeable about Hadoop to
 look it over to help ensure it's both clear and technically accurate.

 If you're interested in helping me, please contact me offlist and I
 will send you the draft.  Meanwhile, you can get a feel for the length
 and general style of the articles from our archives:

   http://www.ociweb.com/articles/publications/jnb.html

 Thanks in advance,

 Tom Wheeler




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.