IndexMerger now deletes entire <workingdir> after completing
------------------------------------------------------------
Key: NUTCH-341
URL: http://issues.apache.org/jira/browse/NUTCH-341
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 0.8
Reporter: Chris Schneider
Priority: Critical
Change 383304 deleted the following line near Line 117 (see
<http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h>
for details):
workDir = new File(workDir, "indexmerger-workingdir");
Previously, if no -workingdir <workingdir> parameter was specified,
IndexMerger.main() would create an "indexmerger-workingdir" directory inside
the default directory and then delete that directory after completing. Now,
IndexMerger.main() defaults its workDir to "indexmerger" within the default
directory and deletes this workDir afterward.
However, if -workingdir <workingdir> _is_ specified, IndexMerger.main() will
now set workDir to _this_ path and delete the _entire_ <workingdir> afterward.
Previously, IndexMerger.main() would only delete
<workingdir>/"indexmerger-workingdir", without deleting <workingdir> itself,
because the line quoted above always appended "indexmerger-workingdir" to
workDir (that older arrangement is sketched below).
Our configuration on the jobtracker/namenode box keeps all large datasets on
a separate, large hard drive. Accordingly, we were keeping dfs.name.dir,
dfs.data.dir, mapred.system.dir, and mapred.local.dir on this drive.
Unfortunately, we passed the directory containing these directories as the
<workingdir> parameter to the IndexMerger. As a result, the first time we ran
the IndexMerger, we ended up trashing our entire DFS!
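For context, a layout like the one described might look roughly like the
following in hadoop-site.xml (the /data1/hadoop paths are invented for
illustration); passing their common parent directory as <workingdir> is what
exposed the problem:

  <!-- hadoop-site.xml: illustrative values only -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data1/hadoop/dfs/data</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/data1/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data1/hadoop/mapred/local</value>
  </property>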
Perhaps the way that the IndexMerger handles its <workingdir> parameter now
is an acceptable design. However, given the way it handled this parameter in
the past, I feel that the current implementation is unacceptably dangerous.
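To make the difference concrete, here is a minimal sketch of the older,
safer arrangement (illustrative only: the WorkDirSketch class, its main()
scaffolding, and the deleteRecursively helper are hypothetical, not the
actual IndexMerger code). Because a private "indexmerger-workingdir"
subdirectory is always appended to whatever path the user supplies, the
post-merge cleanup can only ever remove that subdirectory, never the user's
<workingdir> itself:

  import java.io.File;
  import java.io.IOException;

  public class WorkDirSketch {
    public static void main(String[] args) throws IOException {
      // Path from -workingdir, or the current directory if the flag is omitted.
      File userDir = (args.length > 0) ? new File(args[0]) : new File(".");

      // Always work inside a private subdirectory, so the cleanup below
      // can never touch the directory the user handed us.
      File workDir = new File(userDir, "indexmerger-workingdir");
      if (!workDir.mkdirs() && !workDir.isDirectory()) {
        throw new IOException("Cannot create " + workDir);
      }
      try {
        // ... run the merge, using workDir for temporary files ...
      } finally {
        deleteRecursively(workDir);  // userDir itself survives
      }
    }

    // Hypothetical helper, standing in for whatever recursive delete the
    // indexer actually uses.
    private static void deleteRecursively(File f) {
      File[] children = f.listFiles();  // null unless f is a readable directory
      if (children != null) {
        for (int i = 0; i < children.length; i++) {
          deleteRecursively(children[i]);
        }
      }
      f.delete();
    }
  }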
More importantly, perhaps there's some way that we could make Hadoop more
robust in handling its critical data files. I plan to place a directory owned
by root with "dr--------" permissions into each of these critical directories
in order to prevent any of them from suffering the fate of our DFS. This
could become part of a standard Hadoop installation.