[
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718439#comment-13718439
]
Riyaz Shaik commented on NUTCH-1457:
------------------------------------
Hi Ferdy/Lewis,
According to the SVN check-in logs and the mail archives, trunk appears to
contain the Nutch-1.4 code:
http://www.mail-archive.com/[email protected]/msg04348.html
I have created patches for the *Nutch-2.1* and *Nutch-2.2.1* branches.
The patches and a Zip of the modified source files are attached.
(on) The patch contains the following fixes in addition to NUTCH-1457:
(+) org.apache.nutch.crawl.AbstractFetchSchedule
* Fix to reset fetchTime to the current time when *??fetchTime - currTime >
maxInterval??*. Previously the *“shouldFetch”* method returned false even after
setting the new fetchTime on the page, so the new fetchTime was never available
to GeneratorReducer to persist in HBase.
(+) org.apache.nutch.parse.ParseUtil
* Moved the page signature calculation (one line of code).
The existing code computes the page signature without the parsed plain text
(e.g. from HTMLParser), so the signature is calculated over the entire page
content even when “org.apache.nutch.crawl.TextProfileSignature” is enabled.
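
The ordering issue can be sketched as follows. This is only an illustration under stated assumptions, not the ParseUtil code: the Page class, signature() and extractText() are hypothetical, a plain MD5 stands in for the real text-profile algorithm, and the fallback to raw content when no text is set is assumed from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class SignatureOrderSketch {
    /** Hypothetical stand-in for WebPage: raw content plus optional parsed text. */
    static final class Page {
        final String content;
        String text; // null until the parser has run
        Page(String content) { this.content = content; }
    }

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Text-based signature; falls back to raw content if no text is set (assumption). */
    static byte[] signature(Page page) {
        boolean noText = page.text == null || page.text.isEmpty();
        return md5(noText ? page.content : page.text);
    }

    /** Hypothetical parser: crude tag stripping, for illustration only. */
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        Page a = new Page("<html><body><p>Hello</p></body></html>");
        Page b = new Page("<html><body><div>Hello</div></body></html>");

        // Signature taken before the parsed text is set: markup differences leak in.
        boolean sameBefore = Arrays.equals(signature(a), signature(b));

        a.text = extractText(a.content);
        b.text = extractText(b.content);

        // Signature taken after the parsed text is set: only the text matters.
        boolean sameAfter = Arrays.equals(signature(a), signature(b));

        System.out.println(sameBefore + " " + sameAfter); // false true
    }
}
```

Two pages with the same visible text but different markup only receive the same signature once the calculation runs after the parsed text has been stored on the page.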
Could you please validate the changes?
Thanks
Riyaz
> Nutch2 Refactor the update process so that fetched items are only processed
> once
> --------------------------------------------------------------------------------
>
> Key: NUTCH-1457
> URL: https://issues.apache.org/jira/browse/NUTCH-1457
> Project: Nutch
> Issue Type: Improvement
> Reporter: Ferdy Galema
> Fix For: 2.4
>
> Attachments: CrawlStatus.java, DbUpdateReducer.java,
> GeneratorMapper.java, GeneratorReducer.java
>
>