[ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]

Stefan Groschupf updated NUTCH-341:
-----------------------------------

    Attachment: doNotDeleteTmpIndexMergeDirV1.patch

+1.
I agree it makes no sense at all to require the user to create a tmp folder 
manually when Nutch then deletes it afterwards, along with all of its content. 
This is very dangerous if a user provides / as the tmp folder. The attached 
patch restores the missing line, and I would like to ask a developer with 
write access to commit it as soon as possible.
Thanks!


> IndexMerger now deletes entire <workingdir> after completing
> ------------------------------------------------------------
>
>                 Key: NUTCH-341
>                 URL: http://issues.apache.org/jira/browse/NUTCH-341
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.8
>            Reporter: Chris Schneider
>            Priority: Critical
>         Attachments: doNotDeleteTmpIndexMergeDirV1.patch
>
>
> Change 383304 deleted the following line near Line 117 (see 
> <http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java?r1=383304&r2=405204&diff_format=h>
>  for details):
> workDir = new File(workDir, "indexmerger-workingdir");
> Previously, if no -workingdir <workingdir> parameter was specified, 
> IndexMerger.main() would place an "indexmerger-workingdir" directory into the 
> default directory and then delete the former after completing. Now, 
> IndexMerger.main() defaults the value of its workDir to "indexmerger" within 
> the default directory, and deletes this workDir afterward.
> However, if -workingdir <workingdir> _is_ specified, IndexMerger.main() will 
> now set workDir to _this_ path and delete the _entire_ <workingdir> 
> afterward. Previously, IndexMerger.main() would only delete 
> <workingDir>/"indexmerger-workingdir", without deleting <workingdir> itself. 
> This is because the line mentioned above always appended 
> "indexmerger-workingdir" to workDir.
> Our hardware configuration on the jobtracker/namenode box attempts to keep 
> all large datasets on a separate, large hard drive. Accordingly, we were 
> keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir 
> on this drive. Unfortunately, we were passing the folder containing these 
> folders in the <workingdir> parameter to the IndexMerger. As a result, the 
> first time we ran the IndexMerger, we ended up trashing our entire DFS!
> Perhaps the way that the IndexMerger handles its <workingdir> parameter now 
> is an acceptable design. However, given the way it handled this parameter in 
> the past, I feel that the current implementation is unacceptably dangerous.
> More importantly, perhaps there's some way that we could make hadoop more 
> robust in handling its critical data files. I plan to place a directory owned 
> by root with "dr--------" permissions into each of these critical directories 
> in order to prevent any of them from suffering the fate of our DFS. This 
> could become part of a standard hadoop installation.
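
To make the difference described in the report concrete, here is a minimal 
sketch in Java of which directory ends up being deleted when -workingdir is 
given. It is not the actual IndexMerger code: the class name WorkDirSketch, 
the two helper methods, and the example path /data/big-drive are hypothetical 
and exist only for illustration. Only the "indexmerger-workingdir" name and 
the new File(workDir, ...) line come from the report.

```java
import java.io.File;

// Hypothetical illustration only; not the real IndexMerger implementation.
public class WorkDirSketch {
    // Subdirectory name taken from the line quoted in the report.
    static final String SUBDIR = "indexmerger-workingdir";

    // Behaviour before the change, as described in the report: the
    // subdirectory is always appended, so only it is deleted afterwards.
    static File oldDeletionTarget(File workingDirArg) {
        return new File(workingDirArg, SUBDIR);
    }

    // Behaviour after the change, as described in the report: the
    // user-supplied directory itself becomes workDir and is deleted.
    static File newDeletionTarget(File workingDirArg) {
        return workingDirArg;
    }

    public static void main(String[] args) {
        // "/data/big-drive" is an assumed example path, not from the report.
        File userDir = new File(args.length > 0 ? args[0] : "/data/big-drive");
        System.out.println("old behaviour deletes: " + oldDeletionTarget(userDir));
        System.out.println("new behaviour deletes: " + newDeletionTarget(userDir));
    }
}
```

Under these assumptions, the old code would only ever remove a directory it 
created itself, while the new code removes whatever the user passed in, 
including any unrelated data stored there.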

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
