[ https://issues.apache.org/jira/browse/CRUNCH-227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Micah Whitacre updated CRUNCH-227:
----------------------------------

    Attachment: 0001-CRUNCH-227-Added-test-that-shows-ToolRunner-does-not.patch

We have a situation where we are trying to write a companion text file alongside our pipeline output; the success file shows up, but there are no contents. It was brought up that this issue might be related, so I quickly wrote up some tests to see whether ToolRunner + TextFileSource suffers from the same problem. To make this truly match our conditions I need to adjust the test a little more, specifically changing it to use a MiniDFSCluster like we do (to see if there is a difference). For this specific issue I'll also work on adding a test for sequence files, since that is the format for which this issue was originally logged.
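As a rough sketch (not the attached patch), a ToolRunner-driven job along these lines might look as follows: read a text file, write it to an explicit target path, then check whether anything actually landed there. The class name, argument handling, and output check are illustrative placeholders, and the text-file target stands in for the sequence-file case described below.

---
// Illustrative sketch only: class name, argument handling, and paths are placeholders.
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TextTargetRepro extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Path output = new Path(args[1]);

    Pipeline pipeline = new MRPipeline(TextTargetRepro.class);
    pipeline.setConfiguration(getConf());

    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // Write to the explicit target path; per this issue, the files may
    // instead land under the Crunch working directory.
    lines.write(To.textFile(output));

    // Using run() rather than done() so the misplaced output (if any) is
    // still visible in the working directory afterwards.
    PipelineResult res = pipeline.run();

    // Check whether anything actually arrived at the requested path.
    FileSystem fs = output.getFileSystem(getConf());
    boolean hasOutput = fs.exists(output) && fs.listStatus(output).length > 0;
    System.out.println("Output present at " + output + ": " + hasOutput);

    return res.succeeded() && hasOutput ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new TextTargetRepro(), args));
  }
}
---

Comparing that check against the working-directory paths reported in the job logs shows whether the data ended up at the requested target or under /tmp/crunch-*.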
> Write to sequence file ignores destination path.
> ------------------------------------------------
>
>                 Key: CRUNCH-227
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-227
>             Project: Crunch
>          Issue Type: Bug
>          Components: IO
>    Affects Versions: 0.6.0, 0.7.0
>         Environment: Hadoop 1.0.3
>            Reporter: Florian Laws
>         Attachments: 0001-CRUNCH-227-Added-test-that-shows-ToolRunner-does-not.patch, CRUNCH-227.patch
>
>
> I'm trying to write a simple Crunch job that outputs a sequence file consisting of a custom Writable.
> The job runs successfully, but the output is not written to the path that I specify in To.sequenceFile(), but instead to a Crunch working directory.
> This happens when running the job both locally and on my 1-node Hadoop test cluster, and it happens both with Crunch 0.6.0 and 0.7.0-SNAPSHOT as of today (38a97e5).
> When using pipeline.done() instead of pipeline.run(), the Crunch working directory gets removed after execution; in that case, the output is not retained at all.
> Code snippet:
> ---
> public int run(String[] args) throws IOException {
>   CommandLine cl = parseCommandLine(args);
>   Path output = new Path((String) cl.getValue(OUTPUT_OPTION));
>   int docIdIndex = getColumnIndex(cl, "DocID");
>   int ldaIndex = getColumnIndex(cl, "LDA");
>   Pipeline pipeline = new MRPipeline(DbDumpToSeqFile.class);
>   pipeline.setConfiguration(getConf());
>   PCollection<String> lines = pipeline.readTextFile((String) cl.getValue(INPUT_OPTION));
>   PTable<String, NamedQuantizedVecWritable> vectors = lines.parallelDo(
>       new ConvertToSeqFileDoFn(docIdIndex, ldaIndex),
>       tableOf(strings(), writables(NamedQuantizedVecWritable.class)));
>   vectors.write(To.sequenceFile(output));
>   PipelineResult res = pipeline.run();
>   return res.succeeded() ? 0 : 1;
> }
> ---
> Log output from a local run.
> Note how the intended output path "/tmp/foo.seq" is reported in the execution plan, but is not actually used.
> ---
> 2013-06-25 16:19:44.250 java[10755:1203] Unable to load realm info from SCDynamicStore
> 2013-06-25 16:19:44 HadoopUtil:185 [INFO] Deleting /tmp/foo.seq
> 2013-06-25 16:19:44 FileTargetImpl:224 [INFO] Will write output files to new path: /tmp/foo.seq
> 2013-06-25 16:19:45 JobClient:741 [WARN] No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 2013-06-25 16:19:45 FileInputFormat:237 [INFO] Total input paths to process : 1
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:407 [INFO] Creating MAP in /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1-work--1596891011522800122 with rwxr-xr-x
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:447 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
> 2013-06-25 16:19:45 TrackerDistributedCacheManager:470 [INFO] Cached /tmp/crunch-1128974463/p1/MAP as /tmp/hadoop-florian/mapred/local/archive/4100035173370108016_-456151549_2075417214/file/tmp/crunch-1128974463/p1/MAP
> 2013-06-25 16:19:45 CrunchControlledJob:303 [INFO] Running job "com.issuu.mahout.utils.DbDumpToSeqFile: Text(/Users/florian/data/docdb.first20.txt)+S0+SeqFile(/tmp/foo.seq)"
> 2013-06-25 16:19:45 CrunchControlledJob:304 [INFO] Job status available at: http://localhost:8080/
> 2013-06-25 16:19:45 Task:792 [INFO] Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
> 2013-06-25 16:19:45 LocalJobRunner:321 [INFO]
> 2013-06-25 16:19:45 Task:945 [INFO] Task attempt_local_0001_m_000000_0 is allowed to commit now
> 2013-06-25 16:19:45 FileOutputCommitter:173 [INFO] Saved output of task 'attempt_local_0001_m_000000_0' to /tmp/crunch-1128974463/p1/output
> 2013-06-25 16:19:48 LocalJobRunner:321 [INFO]
> 2013-06-25 16:19:48 Task:904 [INFO] Task 'attempt_local_0001_m_000000_0' done.
> ---
> This crude patch makes the output end up at the right place, but breaks a lot of other tests.
> ---
> --- a/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
> +++ b/crunch-core/src/main/java/org/apache/crunch/io/impl/FileTargetImpl.java
> @@ -66,7 +66,7 @@ public class FileTargetImpl implements PathTarget {
>    protected void configureForMapReduce(Job job, Class keyClass, Class valueClass,
>        Class outputFormatClass, Path outputPath, String name) {
>      try {
> -      FileOutputFormat.setOutputPath(job, outputPath);
> +      FileOutputFormat.setOutputPath(job, path);
>      } catch (Exception e) {
>        throw new RuntimeException(e);
>      }
> ---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)