[ 
https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718439#comment-13718439
 ] 

Riyaz Shaik commented on NUTCH-1457:
------------------------------------

Hi Ferdy/Lewis,

According to the SVN check-in logs and the mail archives, trunk seems to 
contain the Nutch-1.4 code:

http://www.mail-archive.com/[email protected]/msg04348.html


I have created patches for the branches *Nutch-2.1* and *Nutch-2.2.1*.

I have attached the modified source files as a zip archive, along with the 
patches.

(on) The patch contains the following fixes in addition to NUTCH-1457:

(+) org.apache.nutch.crawl.AbstractFetchSchedule
 * Fix for resetting fetchTime to the current time when 
*??fetchTime-currTime > maxInterval??*. Previously the *“shouldFetch”* method 
returned false even after setting the new fetchTime on the page, so the updated 
fetchTime never reached GeneratorReducer and was not persisted in HBase.

(+) org.apache.nutch.parse.ParseUtil
 * Moved the page signature calculation code (a single line). 
The existing code calculated the page signature before the parsed plain text 
(e.g. from HTMLParser) was available, so the signature was computed over the 
entire page content even with “org.apache.nutch.crawl.TextProfileSignature” 
enabled.
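
The effect of the ordering can be shown with a small, self-contained sketch. 
Everything in it is hypothetical: Page, extractText and signature are stand-ins 
written for this example, MD5 replaces the real text-profile algorithm, and the 
fallback to raw content when no text is set is an assumption about how a 
text-based signature behaves.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Illustrative sketch only: none of these names are the Nutch API.
class SignatureOrderSketch {

    static class Page {
        final String content; // raw page content
        String text;          // parsed plain text, null until parsing has run
        byte[] signature;

        Page(String content) {
            this.content = content;
        }
    }

    /** Text-based signature that falls back to raw content when no text is set. */
    static byte[] signature(Page page) {
        boolean hasText = page.text != null && !page.text.isEmpty();
        String input = hasText ? page.text : page.content;
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Crude stand-in for an HTML parser: strips tags, collapses whitespace. */
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Old order: signature first, so it sees only the raw content. */
    static void parseSignatureFirst(Page page) {
        page.signature = signature(page);
        page.text = extractText(page.content);
    }

    /** New order: text first, so the signature is computed over the text. */
    static void parseTextFirst(Page page) {
        page.text = extractText(page.content);
        page.signature = signature(page);
    }

    public static void main(String[] args) {
        Page p1 = new Page("<p>hello world</p>");
        Page p2 = new Page("<div>hello world</div>");
        parseSignatureFirst(p1);
        parseSignatureFirst(p2);
        System.out.println(Arrays.equals(p1.signature, p2.signature)); // false

        Page q1 = new Page("<p>hello world</p>");
        Page q2 = new Page("<div>hello world</div>");
        parseTextFirst(q1);
        parseTextFirst(q2);
        System.out.println(Arrays.equals(q1.signature, q2.signature)); // true
    }
}
```

Two pages with the same visible text but different markup get different 
signatures in the old order and the same signature in the new one.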

Could you please validate the changes?

Thanks
Riyaz

                
> Nutch2 Refactor the update process so that fetched items are only processed 
> once
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-1457
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1457
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Ferdy Galema
>             Fix For: 2.4
>
>         Attachments: CrawlStatus.java, DbUpdateReducer.java, 
> GeneratorMapper.java, GeneratorReducer.java
>
>


