I plan to use Hadoop for some log processing, and I'm working on a
way to load the files (probably nightly) into HDFS. My plan is to
run a web server on each machine that has logs and have it serve up
the log directories. Then I would give distcp a list of HTTP URLs for
the log files and have it copy them in.
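To make that concrete, here is a rough sketch (just illustrative Java on my
part, with the hostname and paths borrowed from my test below) of what I'm
hoping distcp will do for me for each log file:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HttpLogLoader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cluster's default filesystem, i.e. HDFS in my setup.
        FileSystem hdfs = FileSystem.get(conf);

        // One log file served by the per-machine web server (placeholder URL).
        InputStream in = new URL("http://core:7274/logs/log.20090121").openStream();

        // Where that file should land in HDFS (placeholder path).
        Path dst = new Path("/user/dyoung/mylogs/log.20090121");

        // Stream the bytes across; close both streams when done.
        IOUtils.copyBytes(in, hdfs.create(dst), 4096, true);
    }
}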
Reading http://issues.apache.org/jira/browse/HADOOP-341, it sounds like
this should be supported, but HTTP URLs are not working for me. Are
HTTP source URLs still supported?
I tried a simple test with an HTTP source URL (using Hadoop 0.19):
hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs
This fails:
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: No FileSystem for scheme: http
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
        at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
        at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
        at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
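From the trace it looks like distcp is calling Path.getFileSystem() on the
URI I passed, so I'd expect a tiny standalone check like this (my own test
code, not anything from distcp itself) to hit the same exception:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeCheck {
    public static void main(String[] args) throws Exception {
        Path p = new Path("http://core:7274/logs/log.20090121");
        // Presumably throws java.io.IOException: No FileSystem for scheme: http
        // unless a FileSystem implementation is registered for the http
        // scheme in the configuration.
        FileSystem fs = p.getFileSystem(new Configuration());
        System.out.println("Resolved to " + fs.getClass().getName());
    }
}

If that's right, maybe I'm just missing a configuration setting that
registers a filesystem for the http scheme?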