I plan to use Hadoop for some log processing, and I'm working on a way to load the log files (probably nightly) into HDFS. My plan is to run a web server on each machine that has logs, serving up its log directories. I would then give distcp a list of HTTP URLs for the log files and have it copy the files in.
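To make the plan concrete, here is a rough sketch of how the nightly URL list might be generated. The host list, the default port, and the log.YYYYMMDD naming are assumptions taken from my test URL below, and log_urls and write_url_list are just hypothetical helper names. As far as I understand it, distcp's -f option reads a list of source URIs from a file, so the idea would be to write one URL per line and point -f at that file.

```python
from datetime import date


def log_urls(hosts, day, port=7274, prefix="/logs/log."):
    """Build one HTTP URL per host for the log file of the given day.

    The port and the 'log.YYYYMMDD' naming are assumptions based on the
    test URL in this post; adjust them to match the real web servers.
    """
    stamp = day.strftime("%Y%m%d")
    return [f"http://{host}:{port}{prefix}{stamp}" for host in hosts]


def write_url_list(urls, path):
    """Write the URLs one per line, as a source list for distcp -f."""
    with open(path, "w") as fh:
        for url in urls:
            fh.write(url + "\n")


if __name__ == "__main__":
    # 'core' is the only host in my test; more hosts would be added here.
    for url in log_urls(["core"], date(2009, 1, 21)):
        print(url)
```

Whether distcp will actually accept the http scheme in those URLs is exactly what I'm asking about.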

Reading http://issues.apache.org/jira/browse/HADOOP-341, it sounds like this should be supported, but HTTP URLs are not working for me. Are HTTP source URLs still supported?

I tried a simple test with an http source URL (using Hadoop 0.19):

hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs

This fails:

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: No FileSystem for scheme: http
   at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
   at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
   at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
   at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
