[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]
Stefan Groschupf updated NUTCH-341:
-----------------------------------
Attachment: doNotDeleteTmpIndexMergeDirV1.patch
+1.
I agree it makes no sense at all to require the user to create a tmp folder
manually, only for Nutch to delete it afterwards along with all of its content.
This is very dangerous if a user provides / as the tmp folder. The attached
patch restores the missing line, and I would like to ask a developer with
write access to commit it as soon as possible!
THANKS!
> IndexMerger now deletes entire <workingdir> after completing
> ------------------------------------------------------------
>
> Key: NUTCH-341
> URL: http://issues.apache.org/jira/browse/NUTCH-341
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 0.8
> Reporter: Chris Schneider
> Priority: Critical
> Attachments: doNotDeleteTmpIndexMergeDirV1.patch
>
>
> Change 383304 deleted the following line near line 117 (see
> <http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h>
> for details):
> workDir = new File(workDir, "indexmerger-workingdir");
> Previously, if no -workingdir <workingdir> parameter was specified,
> IndexMerger.main() would place an "indexmerger-workingdir" directory into the
> default directory and then delete the former after completing. Now,
> IndexMerger.main() defaults the value of its workDir to "indexmerger" within
> the default directory, and deletes this workDir afterward.
> However, if -workingdir <workingdir> _is_ specified, IndexMerger.main() will
> now set workDir to _this_ path and delete the _entire_ <workingdir>
> afterward. Previously, IndexMerger.main() would only delete
> <workingdir>/"indexmerger-workingdir", without deleting <workingdir> itself.
> This is because the line mentioned above always appended
> "indexmerger-workingdir" to workDir.
> Our hardware configuration on the jobtracker/namenode box attempts to keep
> all large datasets on a separate, large hard drive. Accordingly, we were
> keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir
> on this drive. Unfortunately, we were passing the folder containing these
> folders in the <workingdir> parameter to the IndexMerger. As a result, the
> first time we ran the IndexMerger, we ended up trashing our entire DFS!
> Perhaps the way that the IndexMerger handles its <workingdir> parameter now
> is an acceptable design. However, given the way it handled this parameter in
> the past, I feel that the current implementation is unacceptably dangerous.
> More importantly, perhaps there's some way that we could make hadoop more
> robust in handling its critical data files. I plan to place a directory owned
> by root with "dr--------" permissions into each of these critical directories
> in order to prevent any of them from suffering the fate of our DFS. This
> could become part of a standard hadoop installation.
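
For illustration, the older behavior described in the issue can be sketched
roughly as follows. This is a minimal sketch, not the actual IndexMerger code
and not the attached patch. The class and method names (WorkDirSketch,
scratchDir, runMerge, deleteRecursively) are hypothetical; only the
"indexmerger-workingdir" name comes from the issue. The idea is that the tool
always works in, and later deletes, a subdirectory of its own inside the
user-supplied <workingdir>, never <workingdir> itself.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class WorkDirSketch {

    // Hypothetical helper: the private scratch directory always lives
    // INSIDE the directory the user passed in, mirroring the line quoted
    // in the issue: workDir = new File(workDir, "indexmerger-workingdir");
    static File scratchDir(File userWorkDir) {
        return new File(userWorkDir, "indexmerger-workingdir");
    }

    // Delete a file or directory tree. Symbolic links are removed but not
    // followed, so a link pointing outside the tree is not descended into.
    static void deleteRecursively(File f) throws IOException {
        if (!Files.isSymbolicLink(f.toPath())) {
            File[] children = f.listFiles(); // null for regular files
            if (children != null) {
                for (File child : children) {
                    deleteRecursively(child);
                }
            }
        }
        Files.deleteIfExists(f.toPath());
    }

    // Stand-in for the merge step (assumption: the real work would write
    // its temporary files under the scratch directory).
    static void runMerge(File userWorkDir) throws IOException {
        File scratch = scratchDir(userWorkDir);
        if (!scratch.mkdirs() && !scratch.isDirectory()) {
            throw new IOException("Cannot create " + scratch);
        }
        try {
            Files.write(new File(scratch, "merge.tmp").toPath(), new byte[0]);
        } finally {
            // Only the scratch subdirectory is removed; everything else in
            // userWorkDir is left untouched.
            deleteRecursively(scratch);
        }
    }

    public static void main(String[] args) throws IOException {
        File userDir = Files.createTempDirectory("workingdir").toFile();
        File other = new File(userDir, "dfs-data"); // unrelated user data
        other.mkdir();
        runMerge(userDir);
        System.out.println("user data still present: " + other.isDirectory());
        System.out.println("scratch removed: " + !scratchDir(userDir).exists());
    }
}
```

With this layout, passing a directory that also holds other data as
<workingdir> costs at most the scratch subdirectory, which is the property the
deleted line provided.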
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers