IndexMerger now deletes entire <workingdir> after completing
------------------------------------------------------------

                 Key: NUTCH-341
                 URL: http://issues.apache.org/jira/browse/NUTCH-341
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 0.8
            Reporter: Chris Schneider
            Priority: Critical


Change 383304 deleted the following line near Line 117 (see 
<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h>
 for details):

workDir = new File(workDir, "indexmerger-workingdir");

Previously, if no -workingdir <workingdir> parameter was specified, 
IndexMerger.main() would place an "indexmerger-workingdir" directory into the 
default directory and then delete that directory after completing. Now, 
IndexMerger.main() defaults its workDir to "indexmerger" within the default 
directory and deletes this workDir afterward.

However, if -workingdir <workingdir> _is_ specified, IndexMerger.main() will 
now set workDir to _this_ path and delete the _entire_ <workingdir> afterward. 
Previously, IndexMerger.main() would only delete 
<workingdir>/"indexmerger-workingdir", without deleting <workingdir> itself, 
because the line mentioned above always appended "indexmerger-workingdir" to 
workDir.

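For illustration, here is a minimal, self-contained sketch of the behavior 
being described; it is not the actual IndexMerger code, and the argument 
parsing and fullyDelete helper below are simplified assumptions. The point is 
that re-appending the deleted line makes the cleanup operate only on a 
private subdirectory, never on the caller's <workingdir> itself:

import java.io.File;

public class WorkDirSketch {
  public static void main(String[] args) {
    // Default scratch location when no -workingdir is given.
    File workDir = new File("indexmerger");
    for (int i = 0; i < args.length; i++) {
      if ("-workingdir".equals(args[i])) {
        workDir = new File(args[++i]);
      }
    }

    // The deleted line: always descend into a private subdirectory, so the
    // cleanup below can only ever remove that subdirectory, never the
    // user-supplied <workingdir> itself.
    workDir = new File(workDir, "indexmerger-workingdir");

    try {
      // ... perform the merge, using workDir as scratch space ...
    } finally {
      fullyDelete(workDir);
    }
  }

  // Simplified stand-in for the recursive delete used during cleanup.
  private static void fullyDelete(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (int i = 0; i < children.length; i++) {
        fullyDelete(children[i]);
      }
    }
    f.delete();
  }
}
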
Our hardware configuration on the jobtracker/namenode box keeps all large 
datasets on a separate, large hard drive. Accordingly, we were keeping 
dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir on this 
drive. Unfortunately, we were passing the directory containing these 
directories as the <workingdir> parameter to the IndexMerger. As a result, the 
first time we ran the IndexMerger, we ended up trashing our entire DFS!

Perhaps the way that the IndexMerger handles its <workingdir> parameter now is 
an acceptable design. However, given the way it handled this parameter in the 
past, I feel that the current implementation is unacceptably dangerous.

More importantly, perhaps there's some way that we could make Hadoop more 
robust in handling its critical data files. I plan to place a directory owned 
by root with "dr--------" permissions into each of these critical directories 
in order to prevent any of them from suffering the fate of our DFS. This could 
become part of a standard Hadoop installation.
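
As a rough, purely hypothetical illustration of the same idea at the code 
level (this is not an existing Hadoop or Nutch API), a recursive delete could 
refuse to proceed when it hits an entry it cannot read, which is exactly what 
a root-owned "dr--------" sentinel directory would trigger for a non-root 
process:

import java.io.File;
import java.io.IOException;

public class GuardedDelete {
  // Deletes dir recursively, but aborts with an exception instead of
  // silently skipping anything it cannot read or remove. A root-owned
  // "dr--------" directory dropped inside dfs.name.dir, dfs.data.dir, etc.
  // would therefore stop the deletion before real data is lost.
  public static void fullyDelete(File dir) throws IOException {
    if (dir.isDirectory()) {
      File[] children = dir.listFiles();
      if (children == null) {
        throw new IOException("Refusing to delete unreadable directory: " + dir);
      }
      for (int i = 0; i < children.length; i++) {
        fullyDelete(children[i]);
      }
    }
    if (!dir.delete()) {
      throw new IOException("Could not delete: " + dir);
    }
  }
}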

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
