[
https://issues.apache.org/jira/browse/NUTCH-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889729#comment-17889729
]
Hiran Chaudhuri edited comment on NUTCH-3078 at 10/15/24 3:52 PM:
------------------------------------------------------------------
I was not aware of the handling of a corrupt database. While it makes sense to
prevent modifications in that state, it sounds odd not to go for a simple
locking mechanism.
Therefore I suggest using two lock files: one to prevent parallel updates, the
other to indicate health status. So to LockUtil we would add code such that
* createLockFile would only succeed if the database can be locked and, once it
is locked, the 'health lock' is found to be ok
* removeLockFile would only unlock the database (but not touch the 'health
lock')
* we add a createHealthLock() method that indicates the corrupt state
So the suggested programming paradigm would remain, but code that identifies an
unhealthy state would just need to call createHealthLock(), like so:
{{Path lock = LockUtil.createLockFile(...);}}
{{try {}}
{{ CrawlDb.install(job, crawldb);}}
{{} catch (InterruptedWriteException e) {}}
{{ // this is the exception that indicates we have a strange status}}
{{ LockUtil.createHealthLock();}}
{{} finally {}}
{{ LockUtil.removeLockFile(...);}}
{{}}}
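To make the three rules above concrete, here is a minimal sketch of how the two lock files could interact. It is not the existing Nutch LockUtil: it uses plain java.nio.file instead of the Hadoop FileSystem API so that it runs on its own, and the class name TwoLockSketch, the marker file name ".unhealthy" and the single Path parameter are assumptions made purely for illustration. Only ".locked" is taken from the stack trace in the issue.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the proposed LockUtil behaviour; not the Nutch API. */
final class TwoLockSketch {
    // ".locked" is the file name seen in the reported stack trace.
    static final String UPDATE_LOCK = ".locked";
    // ".unhealthy" is an assumed name for the proposed 'health lock'.
    static final String HEALTH_LOCK = ".unhealthy";

    /** Takes the update lock first, then refuses to continue if the health marker exists. */
    static Path createLockFile(Path db) throws IOException {
        Files.createDirectories(db);
        Path lock = db.resolve(UPDATE_LOCK);
        try {
            // createFile is atomic: it fails if another process already holds the lock.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            throw new IOException("lock file " + lock + " already exists.", e);
        }
        if (Files.exists(db.resolve(HEALTH_LOCK))) {
            // Release the update lock again so an unhealthy db does not also stay locked.
            Files.delete(lock);
            throw new IOException("database " + db + " is marked as corrupt");
        }
        return lock;
    }

    /** Releases the update lock only; the health marker is deliberately left alone. */
    static void removeLockFile(Path db) throws IOException {
        Files.deleteIfExists(db.resolve(UPDATE_LOCK));
    }

    /** Marks the database as corrupt; calling it twice is harmless. */
    static void createHealthLock(Path db) throws IOException {
        try {
            Files.createFile(db.resolve(HEALTH_LOCK));
        } catch (FileAlreadyExistsException e) {
            // Already marked unhealthy, nothing more to do.
        }
    }
}
```

With this layout a failed run that hits the finally block always frees the update lock, while a run that called createHealthLock() leaves a marker behind that blocks later updates until someone repairs the database and deletes the marker by hand.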
> Database is not unlocked when injector fails
> --------------------------------------------
>
> Key: NUTCH-3078
> URL: https://issues.apache.org/jira/browse/NUTCH-3078
> Project: Nutch
> Issue Type: Bug
> Components: injector
> Affects Versions: 1.21
> Environment: Ubuntu 22 LTS
> $JAVA_HOME/bin/java -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Temurin-21.0.4+7 (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Temurin-21.0.4+7 (build 21.0.4+7-LTS, mixed mode,
> sharing)
> Reporter: Hiran Chaudhuri
> Priority: Major
> Fix For: 1.21
>
>
> The injector locks the database but in case of failure does not unlock it.
> This is a problem on the next invocation. To reproduce this, start off with a
> non-existing crawldb and non-existing seed directory:
> {{./local/bin/nutch inject crawl/crawldb urls}}
> The crawldb is created and locked, but then the injector fails with
> {{2024-10-14 07:43:20,091 ERROR org.apache.nutch.crawl.Injector [main] Injector: java.io.FileNotFoundException: File urls does not exist}}
> {{ at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:733)}}
> {{ at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2078)}}
> {{ at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2122)}}
> {{ at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:970)}}
> {{ at org.apache.nutch.crawl.Injector.inject(Injector.java:418)}}
> {{ at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> Well, the urls directory indeed does not exist. So let's run the same job
> with the correct directory:
> {{./local/bin/nutch inject crawl/crawldb ../urls}}
> And despite having the right directory, the Injector fails with
> {{2024-10-14 07:43:30,147 ERROR org.apache.nutch.crawl.Injector [main] Injector: java.io.IOException: lock file crawl/crawldb/.locked already exists.}}
> {{ at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:50)}}
> {{ at org.apache.nutch.util.LockUtil.createLockFile(LockUtil.java:80)}}
> {{ at org.apache.nutch.crawl.CrawlDb.lock(CrawlDb.java:193)}}
> {{ at org.apache.nutch.crawl.Injector.inject(Injector.java:404)}}
> {{ at org.apache.nutch.crawl.Injector.run(Injector.java:574)}}
> {{ at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)}}
> {{ at org.apache.nutch.crawl.Injector.main(Injector.java:538)}}
> I'd expect that when the Injector finishes (successfully or not) the lock on
> the DB is removed again.