[ https://issues.apache.org/jira/browse/NUTCH-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889585#comment-17889585 ]

Sebastian Nagel commented on NUTCH-3078:
----------------------------------------

Hi [~hiranchaudhuri], thanks and good catch!

And thanks for taking the initiative to simplify the lock handling. It's 
indeed one of the cumbersome areas with many small bugs (such as this one) 
that make Nutch difficult to use.

However, there is one situation where an existing CrawlDb might be damaged if 
the lock is unconditionally removed:

- the exception happens in {{CrawlDb.install(job, crawldb)}} and
-- the folder {{current/}} was successfully moved to {{old/}}
-- but the new, temporary CrawlDb was not copied to the final location 
({{current/}})
-- or is copied only partially, in case the underlying filesystem does not 
support an atomic directory {{rename()}}. That's usually the case for cloud 
storage abstractions, see [S3A: Directories are 
mimicked|https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Warning_.231:_Directories_are_mimicked]

In this situation, it is better to keep the lock, so that no CrawlDb write 
operation can run until a manual cleanup. That allows users to analyze what 
happened and likely save the data. If the lock is removed, data may get lost.

So, that's the reason why the lock/unlock and cleanup code is so complex. It 
does a little more than just ensure that only one job reads or writes the 
CrawlDb at a time.

It's also the reason why try-catch blocks should be focused on errors 
happening while the job is running. They shouldn't include the 
{{CrawlDb.install(job, crawldb)}} call. Currently they do, which is wrong - in 
Injector but also in CrawlDbMerger - but that's a separate issue. See for 
comparison 
[CrawlDb.update(...)|https://github.com/apache/nutch/blob/4a61208f492613f2c5282741e64c036acabeb71e/src/java/org/apache/nutch/crawl/CrawlDb.java#L145]
 or 
[DeduplicationJob.run(...)|https://github.com/apache/nutch/blob/4a61208f492613f2c5282741e64c036acabeb71e/src/java/org/apache/nutch/crawl/DeduplicationJob.java].
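As a rough sketch of the scoping described above (plain {{java.nio.file}} standing in for Hadoop's FileSystem and {{LockUtil}}; class and method names are illustrative, not the actual Nutch code): the lock is released when the job itself fails, but deliberately kept when the install step fails, so the half-moved CrawlDb can be inspected manually.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the real Nutch implementation.
public class LockScopeSketch {

  static void createLockFile(Path lock) throws IOException {
    // createFile fails if the lock already exists,
    // similar to LockUtil.createLockFile
    Files.createFile(lock);
  }

  static void removeLockFile(Path lock) throws IOException {
    Files.deleteIfExists(lock);
  }

  /** Runs a CrawlDb-updating job with the recommended lock scope. */
  static void runWithLock(Path lock, Runnable job, Runnable install)
      throws IOException {
    createLockFile(lock);
    try {
      job.run();             // only the job itself is inside the try block
    } catch (RuntimeException e) {
      removeLockFile(lock);  // job failed: CrawlDb untouched, safe to unlock
      throw e;
    }
    install.run();           // if install() fails, the lock is kept on purpose
    removeLockFile(lock);    // success: release the lock
  }
}
```

With this scope, a failed job (like the missing seed directory in this report) cleans up its lock, while a failed install leaves {{.locked}} in place as a guard.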

Another way to simplify the cleanup of the lock file would be to obtain the 
lock later, shortly before running the job...
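For the "lock later" idea, one option is to validate the inputs before taking the lock at all, so an early failure like the missing {{urls}} directory in this report never creates a stale lock. A hypothetical sketch (again plain {{java.nio.file}}, illustrative names only):

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the real Injector code.
public class LateLockSketch {

  static void inject(Path crawlDb, Path seedDir) throws IOException {
    // fail fast, before any lock exists
    if (!Files.isDirectory(seedDir)) {
      throw new FileNotFoundException("File " + seedDir + " does not exist");
    }
    Files.createDirectories(crawlDb);
    Path lock = crawlDb.resolve(".locked");
    Files.createFile(lock);       // acquire the lock only now
    try {
      // ... run the inject job and install the result ...
      // (the keep-lock-on-install-failure handling from above still applies)
    } finally {
      Files.deleteIfExists(lock);
    }
  }
}
```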

> Database is not unlocked when injector fails
> --------------------------------------------
>
>                 Key: NUTCH-3078
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3078
>             Project: Nutch
>          Issue Type: Bug
>          Components: injector
>    Affects Versions: 1.21
>         Environment: Ubuntu 22 LTS
> $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode, 
> sharing)
>            Reporter: Hiran Chaudhuri
>            Priority: Major
>
> The injector locks the database but in case of failure does not unlock it. 
> This is a problem on the next invocation. To repeat this, start off with a 
> non-existing crawldb and non-existing seed directory:
> {{./local/bin/nutch inject crawl/crawldb urls}}
> The crawldb is created and locked, but then the injector fails with
> {{2024-10-14 07:43:20,091 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.FileNotFoundException: File urls does not exist}}
> {{    at 
> org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)}}
> {{    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)}}
> {{    at 
> org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:970)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:418)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> Well, the urls directory indeed does not exist. So let's run the same job 
> with the correct directory:
> {{./local/bin/nutch inject crawl/crawldb ../urls}}
> And despite we have the right directory, the Injector fails with
> {{2024-10-14 07:43:30,147 ERROR org.apache.nutch.crawl.Injector [main] 
> Injector: java.io.IOException: lock file crawl/crawldb/.locked already 
> exists.}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:50)}}
> {{    at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:80)}}
> {{    at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:193)}}
> {{    at org.apache.nutch.crawl.Injector.inject(Injector.java:404)}}
> {{    at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{    at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> I'd expect when Injector finishes (successful or not) the lock on the DB is 
> removed again.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)