[
https://issues.apache.org/jira/browse/MAPREDUCE-5050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Albert Chu updated MAPREDUCE-5050:
----------------------------------
Description:
I'm trying to simulate running Hadoop on Lustre by configuring it to use the
local file system on a single Cloudera VM (cdh3u4).
I can generate the data just fine, but when running the sorting portion of the
program, I get an error that the _partition.lst file cannot be found, even
though it exists in the generated data directory.
Looking through the TeraSort code, I see that the main method creates a Path
for partition.lst that is qualified with the parent (input) directory.
{noformat}
  public int run(String[] args) throws Exception {
    LOG.info("starting");
    JobConf job = (JobConf) getConf();
>>  Path inputDir = new Path(args[0]);
>>  inputDir = inputDir.makeQualified(inputDir.getFileSystem(job));
>>  Path partitionFile = new Path(inputDir, TeraInputFormat.PARTITION_FILENAME);
    URI partitionUri = new URI(partitionFile.toString() +
                               "#" + TeraInputFormat.PARTITION_FILENAME);
    TeraInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setJobName("TeraSort");
    job.setJarByClass(TeraSort.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormat(TeraInputFormat.class);
    job.setOutputFormat(TeraOutputFormat.class);
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TeraInputFormat.writePartitionFile(job, partitionFile);
    DistributedCache.addCacheFile(partitionUri, job);
    DistributedCache.createSymlink(job);
    job.setInt("dfs.replication", 1);
    TeraOutputFormat.setFinalSync(job, true);
    JobClient.runJob(job);
    LOG.info("done");
    return 0;
  }
{noformat}
But in the configure method, the Path isn't created with the parent directory
reference.
{noformat}
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.getLocal(job);
>>    Path partFile = new Path(TeraInputFormat.PARTITION_FILENAME);
      splitPoints = readPartitions(fs, partFile, job);
      trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
    } catch (IOException ie) {
      throw new IllegalArgumentException("can't read paritions file", ie);
    }
  }
{noformat}
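A note on why the bare filename works on HDFS at all (my understanding, not something the stack trace shows): run() appends "#" + PARTITION_FILENAME to the cache URI, and with DistributedCache.createSymlink() enabled, each task gets a symlink in its working directory named after that URI fragment, so the unqualified name resolves there. Under the local file system that symlink isn't there. A minimal plain-Java sketch of the URI construction (the directory below is made up):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PartitionUriDemo {
    public static void main(String[] args) throws URISyntaxException {
        // Hypothetical qualified partition file path, standing in for what
        // run() computes from the input directory.
        String partitionFile = "file:/data/terasort-in/_partition.lst";

        // run() appends "#" + PARTITION_FILENAME; DistributedCache uses the
        // URI fragment as the name of the symlink it creates in the task's
        // working directory (when createSymlink is enabled).
        URI partitionUri = new URI(partitionFile + "#_partition.lst");

        System.out.println(partitionUri.getPath());     // /data/terasort-in/_partition.lst
        System.out.println(partitionUri.getFragment()); // _partition.lst
    }
}
```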
I modified the code as follows, and the sorting portion of the TeraSort test
now works on the general (local) file system. I believe the code above is a
bug.
{noformat}
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.getLocal(job);
>>    Path[] inputPaths = TeraInputFormat.getInputPaths(job);
>>    Path partFile = new Path(inputPaths[0],
>>                             TeraInputFormat.PARTITION_FILENAME);
      splitPoints = readPartitions(fs, partFile, job);
      trie = buildTrie(splitPoints, 0, splitPoints.length, new Text(), 2);
    } catch (IOException ie) {
      throw new IllegalArgumentException("can't read paritions file", ie);
    }
  }
{noformat}
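The underlying path behavior can be seen without Hadoop at all: a bare relative name resolves against the process's working directory, while a parent-qualified path always points into the input directory. A small java.nio analogy (the directory name is made up; Hadoop's Path resolves the same way for this case):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathResolutionDemo {
    public static void main(String[] args) {
        // Relative path: resolves against whatever the working directory
        // happens to be, which is where the original configure() goes wrong
        // on a plain local file system (no DistributedCache symlink present).
        Path relative = Paths.get("_partition.lst");

        // Parent-qualified path: always points into the input directory,
        // mirroring what run() does and what the fixed configure() now does.
        Path inputDir = Paths.get("/data/terasort-in"); // hypothetical input dir
        Path qualified = inputDir.resolve("_partition.lst");

        System.out.println(relative);  // _partition.lst
        System.out.println(qualified); // /data/terasort-in/_partition.lst
    }
}
```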
> Cannot find partition.lst in Terasort on Hadoop/Local File System
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-5050
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5050
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: examples
> Affects Versions: 0.20.2
> Environment: Cloudera VM CDH3u4, VMWare, Linux, Java SE 1.6.0_31-b04
> Reporter: Matt Parker
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)