[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465700 ]

Armel Nene commented on NUTCH-61:

I have attached a new patch, as the old one needed updating before it could be used with Nutch 0.8.1. It would be great if more people could test the feature, as I have encountered some issues with plugins such as parse-xml when they are used with this patch. Over the http protocol the patch works well when indexing text/xml/html. When it is used with a plugin such as parse-xml, the fetcher throws a java IllegalStateException. If anybody has seen this error and knows how to fix it, please share the fix with the rest of us. For now, I'm working on fixing this issue and hope to adapt the feature to the 0.9.0 version.

Adaptive re-fetch interval. Detecting umodified content

Key: NUTCH-61
URL: https://issues.apache.org/jira/browse/NUTCH-61
Project: Nutch
Issue Type: New Feature
Components: fetcher
Reporter: Andrzej Bialecki
Assigned To: Andrzej Bialecki
Attachments: 20050606.diff, 20051230.txt, 20060227.txt, nutch-61-417287.patch, nutch-61-492176.patch

Currently Nutch doesn't adjust automatically its re-fetch period, no matter if individual pages change seldom or frequently. The goal of these changes is to extend the current codebase to support various possible adjustments to re-fetch times and intervals, and specifically a re-fetch schedule which tries to adapt the period between consecutive fetches to the period of content changes. Also, these patches implement checking if the content has changed since last fetching; protocol plugins are also changed to make use of this information, so that if content is unmodified it doesn't have to be fetched and processed.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira

___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
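
As an illustration of the adaptive schedule described in the issue text above, here is a minimal Java sketch. It is not the code from the attached patches: the class name `AdaptiveIntervalSketch`, the rates and the bounds are all assumptions chosen for the example. The idea shown is only that the interval shrinks when a page was found modified and grows when it was not, within fixed limits.

```java
/** Hypothetical sketch of an adaptive re-fetch interval; NOT the code from the NUTCH-61 patches. */
final class AdaptiveIntervalSketch {
    // Assumed tuning values, picked only for illustration.
    static final double INC_RATE = 0.2;                     // grow by 20% when content is unmodified
    static final double DEC_RATE = 0.2;                     // shrink by 20% when content has changed
    static final long MIN_INTERVAL_SEC = 60L;               // lower bound: one minute
    static final long MAX_INTERVAL_SEC = 365L * 24 * 3600;  // upper bound: one year

    /** Returns the next re-fetch interval in seconds, clamped to the bounds above. */
    static long nextInterval(long currentIntervalSec, boolean modified) {
        double next = modified
                ? currentIntervalSec * (1.0 - DEC_RATE)   // page changes often: come back sooner
                : currentIntervalSec * (1.0 + INC_RATE);  // page is stable: come back later
        long rounded = Math.round(next);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, rounded));
    }
}
```

With these assumed values, a page fetched every 1000 seconds would move to 800 seconds after a detected change and to 1200 seconds after an unmodified fetch.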
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465493 ]

Sami Siren commented on NUTCH-61:

I haven't looked at the patch (tm). How would one manage segments once something like this gets included? Right now it's more or less safe to delete segments older than the configured refetch interval plus some margin, but once the lifetime of a page can vary, there is no longer such a simple way to manage fetched data.
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465517 ]

Andrzej Bialecki commented on NUTCH-61:

Actually, there is a way to do this, and this patch implements it. We define a maximum time to live for _any_ page, no matter when it was last fetched or what its re-fetch interval is. This is a system-wide setting. If the re-fetch interval is longer than this value, or the page hasn't been re-fetched for at least that long for other reasons (e.g. because it was unmodified, and we don't fetch unmodified content), such pages will be forcefully included in the fetchlist candidates as if they had DB_UNFETCHED status. This means we can be sure that any pages still present in segments older than this maximum TTL will have been refetched, and we can safely discard all segments older than the TTL.
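
The maximum-TTL rule described above can be sketched roughly as follows. This is a hedged illustration, not the patch itself: `MaxTtlSketch`, its method names and the use of millisecond timestamps are assumptions. It also assumes that every forced candidate really does get fetched, which is what makes old segments safe to drop.

```java
/** Hypothetical sketch of the maximum-TTL rule; names and units are assumptions, not the patch's API. */
final class MaxTtlSketch {
    /** True if the page has gone unfetched for at least the system-wide maximum TTL. */
    static boolean forceRefetch(long lastFetchMs, long nowMs, long maxTtlMs) {
        return nowMs - lastFetchMs >= maxTtlMs;
    }

    /** True if the page is a fetchlist candidate: either normally due, or forced in by the TTL. */
    static boolean isCandidate(long lastFetchMs, long intervalMs, long nowMs, long maxTtlMs) {
        boolean due = nowMs >= lastFetchMs + intervalMs;
        return due || forceRefetch(lastFetchMs, nowMs, maxTtlMs);
    }

    /** True if a segment created at segmentCreatedMs is older than the TTL and may be discarded. */
    static boolean segmentExpired(long segmentCreatedMs, long nowMs, long maxTtlMs) {
        return nowMs - segmentCreatedMs > maxTtlMs;
    }
}
```

For example, with a 30-day TTL, a page whose own interval is 90 days still becomes a candidate once 30 days have passed since its last fetch, and a segment older than 30 days is reported as expired.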
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ https://issues.apache.org/jira/browse/NUTCH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465540 ]

Sami Siren commented on NUTCH-61:

OK, so in my usual use case, where there are far more URLs than I can fetch, this shouldn't have any effect at all, negative or positive.
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12449128 ]

Armel Nene commented on NUTCH-61:

Has this patch by any chance been included in a newer release of Nutch, or is anyone using it, as Otis asked? The reason I ask is that I am about to build a similar patch, but if this patch is already working, I can just adapt it to my context. Or is Nutch planning to provide this feature out of the box in the future?
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12449170 ]

Andrzej Bialecki commented on NUTCH-61:

Unfortunately, this patch hasn't been applied yet, due to its complexity and lack of testing. But it will be, sooner or later, because this functionality is required for any serious use. I'm planning to bring this patch up to date with the latest trunk, and then apply it piece-wise over the next couple of weeks.
Re: [Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
Andrzej,

the feature that I am after can be implemented by this patch if I adapt it correctly. I am not sure about this, but the patch seems a little too old to be applied to the latest release, Nutch 0.8.1. I want to implement a feature where the fetcher fetches files but only adds them if they have been modified since the latest fetch time. I want to implement that on a filesystem first and then extend it later to network fetching. I would like to have a look at the full source code for your patch, in a zip file if possible. Once the feature is implemented, I will post it back here. I'd like to start working from your code first. You can either make the source code available here or mail it to me at armel dot nene @ idna-solutions dot com.

-Original Message-
From: Andrzej Bialecki (JIRA) [mailto:[EMAIL PROTECTED]
Sent: 12 November 2006 19:39
To: nutch-dev@lucene.apache.org
Subject: [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content

[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12449170 ]

Andrzej Bialecki commented on NUTCH-61:

Unfortunately, this patch hasn't been applied yet, due to its complexity and lack of testing. But it will be, sooner or later, because this functionality is required for any serious use. I'm planning to bring this patch to the latest trunk, and then apply it piece-wise over the next couple of weeks.
Re: [Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
Armel T. Nene wrote:

> Andrzej, the feature that I am after can be implemented by this patch if I just adapt it right. I am not sure of this but the patch seems a little bit old to be implemented in the latest release of Nutch 0.8.1.

Right, that's why I wrote that it needs to be brought up to date with the current trunk/ .

> I want to implement a feature where the fetcher will fetch files but only add them if there have been modified after the latest fetch time. Now, I want to implement that on a filesystem first and then update later for network fetching. I would like to have a look at your full source code for your patch in a zip file if possible. Once the feature implemented, I will post it back here. I'd like to start working from your code first. You can either make the source code available here or mail them to me at armel dot nene @ idna-solutions dot com.

The patches attached to the JIRA issue already support this. Please bear in mind that the notion of change depends on how you compare the content of the old and new pages, especially if you lack the Last-Modified header from the server.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
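
To make the point about comparing old and new content concrete, here is a small Java sketch. It is an assumption-laden illustration rather than what the patches do: `ChangeDetectionSketch` is a hypothetical name, a timestamp of 0 stands for "unknown", and a plain MD5 digest over the raw bytes stands in for whatever signature is actually used. When both timestamps are known they decide; otherwise the content digests are compared.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical sketch of change detection; not the comparison used by the NUTCH-61 patches. */
final class ChangeDetectionSketch {
    /** Computes an MD5 digest of the raw content bytes (assumed signature scheme). */
    static byte[] signature(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Decides whether a page changed. Timestamps are in milliseconds; 0 means "unknown"
     * (e.g. the server sent no Last-Modified header).
     */
    static boolean isModified(long storedLastModifiedMs, long serverLastModifiedMs,
                              byte[] oldSignature, byte[] newContent) {
        if (storedLastModifiedMs > 0 && serverLastModifiedMs > 0) {
            return serverLastModifiedMs > storedLastModifiedMs; // trust the timestamps
        }
        // Fall back to comparing content signatures.
        return !Arrays.equals(oldSignature, signature(newContent));
    }
}
```

A byte-exact digest like this reports a change even when only a trivial part of the page differs, such as an embedded timestamp, which is one reason the choice of comparison matters.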
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12444514 ]

Otis Gospodnetic commented on NUTCH-61:

Has anyone been using the code with this patch applied? Just wondering if/how well it works.
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368050 ]

Jerome Charron commented on NUTCH-61:

Not an objection, just a simple comment. Why not make FetchSchedule a new ExtensionPoint, and then make DefaultFetchSchedule and AdaptiveFetchSchedule fetch schedule plugins?
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12368051 ]

Andrzej Bialecki commented on NUTCH-61:

I contemplated this for a while, and then decided against it. The main reason was that currently most of the pluggable extensions that result in running a single selected plugin are handled using a simple Factory pattern, as opposed to the ChainedFilter pattern, where we use extension points. I guess the original reason was that implementations would almost always consist of a single class, so it didn't make sense to complicate things and require the whole plugin infrastructure ... It would be the same in this case (just a single class), so I followed the same pattern. It's easy to change this to use an extension point, if people prefer it that way.
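
For readers unfamiliar with the distinction, the simple Factory approach mentioned above might look roughly like the following Java sketch. Everything here is hypothetical: the interface and class names carry a `Sketch` suffix to make clear they are not Nutch classes, and the property name `sketch.fetch.schedule.class` is invented for the example. The factory reads one class name from the configuration and instantiates it by reflection, with no plugin registry involved.

```java
import java.util.Properties;

/** Hypothetical stand-in for a fetch schedule interface; not the interface from the patch. */
interface FetchScheduleSketch {
    long nextInterval(long currentSec, boolean modified);
}

/** Keeps the interval fixed, whatever happened on the last fetch. */
final class DefaultScheduleSketch implements FetchScheduleSketch {
    public long nextInterval(long currentSec, boolean modified) {
        return currentSec;
    }
}

/** Doubles the interval when the content was unmodified, halves it otherwise. */
final class DoublingScheduleSketch implements FetchScheduleSketch {
    public long nextInterval(long currentSec, boolean modified) {
        return modified ? Math.max(1L, currentSec / 2) : currentSec * 2;
    }
}

/** Simple factory: one configured class name selects the single implementation to run. */
final class FetchScheduleFactorySketch {
    static final String KEY = "sketch.fetch.schedule.class"; // invented property name

    static FetchScheduleSketch create(Properties conf) {
        String className = conf.getProperty(KEY, DefaultScheduleSketch.class.getName());
        try {
            return (FetchScheduleSketch) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + className, e);
        }
    }
}
```

Because each implementation is a single class, switching schedules is only a matter of changing the configured class name, which is the trade-off discussed in the comment above.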
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361346 ]

byron miller commented on NUTCH-61:

Most definitely! I'll be happy to give it a whirl!
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361302 ]

byron miller commented on NUTCH-61:

Is there a patch modified for the current branch, or should I take a stab at this?
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361311 ]

Andrzej Bialecki commented on NUTCH-61:

I'm working on this; the patch will be available in a couple of days. I could then use your help with review and testing... ;-)
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361131 ]

raghavendra prabhu commented on NUTCH-61:

Will the same thing work for a filesystem? For a file system, we can get the modified date directly and store it in the db. The plugins will look at the content date and, if it is different, index the file; otherwise they will not fetch it. This can be a solution for file-based content. (The thing is that it does away entirely with the fetch interval and decides based only on the file modification date.)
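
The file modification date check suggested above could be sketched in Java as below. This is only an illustration under assumptions: `FileModifiedSketch` is a hypothetical name, the stored timestamp is assumed to come from the db as epoch milliseconds, and treating an unreadable timestamp as "needs refetch" is a design choice made for the example.

```java
import java.io.File;

/** Hypothetical sketch of a modification-date check for local files; not code from the patch. */
final class FileModifiedSketch {
    /**
     * Returns true if the file should be fetched and indexed again, given the
     * modification time (epoch milliseconds) stored when it was last fetched.
     */
    static boolean needsRefetch(File file, long storedModifiedMs) {
        long current = file.lastModified(); // 0L if the file is missing or unreadable
        if (current == 0L) {
            return true; // assumption: when the time is unknown, let the fetcher decide
        }
        return current != storedModifiedMs; // "different" date means re-index
    }
}
```

Note that this check replaces the fetch interval entirely for such content, as the comment points out, so it only makes sense where reading the timestamp is cheap.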
[Nutch-dev] [jira] Commented: (NUTCH-61) Adaptive re-fetch interval. Detecting umodified content
[ http://issues.apache.org/jira/browse/NUTCH-61?page=comments#action_12361133 ]

Andrzej Bialecki commented on NUTCH-61:

This patch already supports this. Anyway, it needs to be significantly reworked to fit into the current development version.