[jira] [Updated] (NUTCH-1341) NotModified time set to now but page not modified
[ https://issues.apache.org/jira/browse/NUTCH-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1341:
--------------------------------
    Attachment: NUTCH-1341-1.6-1.patch

Here's a patch for 1.6. It simply resets the modifiedTime to the CrawlDatum's previous value right after the reducer sets a STATUS_DB_NOTMODIFIED status value. Since I believe the status is correct, I assume the modifiedTime value can be reset here as well. Please comment. Did I overlook something? Should it be implemented differently?

Thanks

> NotModified time set to now but page not modified
> -------------------------------------------------
>
> Key: NUTCH-1341
> URL: https://issues.apache.org/jira/browse/NUTCH-1341
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1341-1.6-1.patch
>
>
> Servers tend to respond with incorrect or no value for LastModified. By
> comparing signatures or when (fetch.getStatus() ==
> CrawlDatum.STATUS_FETCH_NOTMODIFIED) the reducer correctly sets the
> db_notmodified status for the CrawlDatum. The modifiedTime value, however, is
> not set accordingly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
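The reset described in the NUTCH-1341 patch comment can be pictured with a small Java sketch. This is not the attached patch and does not use Nutch's real classes: `NotModifiedSketch`, `Datum`, `update` and the two status constants are hypothetical stand-ins, and the only behaviour taken from the message is that a record marked "not modified" keeps its previous modifiedTime instead of receiving the fetch time.

```java
import java.util.Arrays;

// Hypothetical, simplified model; these are not Nutch's real classes or constants.
class NotModifiedSketch {

    static final int DB_FETCHED = 1;      // placeholder status codes
    static final int DB_NOTMODIFIED = 2;

    /** Minimal stand-in for a crawl record: a status plus a modification time. */
    static final class Datum {
        int status;
        long modifiedTime;

        Datum(int status, long modifiedTime) {
            this.status = status;
            this.modifiedTime = modifiedTime;
        }
    }

    /**
     * Merges the previous record with a fresh fetch. The page counts as
     * unchanged when the fetcher reported "not modified" or when the content
     * signatures match; in that case the previous modifiedTime is kept
     * instead of the fetch time.
     */
    static Datum update(Datum previous, boolean fetchNotModified,
                        byte[] oldSignature, byte[] newSignature, long fetchTime) {
        Datum result = new Datum(DB_FETCHED, fetchTime);
        boolean unchanged = fetchNotModified
                || Arrays.equals(oldSignature, newSignature);
        if (unchanged) {
            result.status = DB_NOTMODIFIED;
            // The reset under discussion: restore the old value.
            result.modifiedTime = previous.modifiedTime;
        }
        return result;
    }

    public static void main(String[] args) {
        Datum previous = new Datum(DB_FETCHED, 1_000L);
        byte[] signature = {1, 2, 3};

        Datum same = update(previous, false, signature, signature.clone(), 5_000L);
        System.out.println(same.status + " " + same.modifiedTime);       // prints "2 1000"

        Datum changed = update(previous, false, signature, new byte[] {9}, 5_000L);
        System.out.println(changed.status + " " + changed.modifiedTime); // prints "1 5000"
    }
}
```

In this sketch the modified time only moves forward when the content is judged to have changed, which is the outcome the issue title asks for.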
[jira] [Created] (NUTCH-1341) NotModified time set to now but page not modified
NotModified time set to now but page not modified
-------------------------------------------------

Key: NUTCH-1341
URL: https://issues.apache.org/jira/browse/NUTCH-1341
Project: Nutch
Issue Type: Bug
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.6

Servers tend to respond with incorrect or no value for LastModified. By comparing signatures or when (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) the reducer correctly sets the db_notmodified status for the CrawlDatum. The modifiedTime value, however, is not set accordingly.
Re: [VOTE] Apache Nutch 1.5 release rc #1
Hey Julien thanks for the help below. I will try running some of the ant tasks (sorry I'm a Maven wonk ;) ) and get this working hopefully this week. I have a big proposal deadline on Friday but should come up for air after that heading into the weekend and get this done.

Cheers,
Chris

On Apr 19, 2012, at 3:56 AM, Julien Nioche wrote:

> Hi Chris
>
> > > -1 the versions of the deps for hadoop, tika and possibly others are not
> > > correct in the pom.xml found in the src archive and on the mvn repository,
> > > which will be a problem for whoever tries to use the pom.xml file e.g. in
> > > Eclipse or more annoyingly declare Nutch as a dependency with Ivy / Maven.
> > > Did you regenerate the pom file from the ivy one?
> >
> > I didn't regenerate it -- but will try and do so for RC #2.
>
> Should have been done automatically when calling 'ant deploy' - if not might
> be that the maven task jar is missing from lib
>
> > > I remember that we mentioned delivering the content of runtime/local in the
> > > binary archive instead of having the sources + runtime/deploy as well.
> > [..snip...]
> > > I don't think it would take much time to do that, so what about doing it
> > > now? We could rename the archive into apache-nutch-1.5-local-bin maybe to
> > > make the content clearer.
> >
> > +1 to the above, but I think we can just have it be apache-nutch-1.5-bin --
> > no need to rename it to local. We can just
> > reference this ML thread for documentation in the future.
>
> I've committed in trunk revision 1327896 a new ant task which will generate a
> binary package as described above. You'll probably need to modify the code
> for the tar / zip as well but this should give you a starting point
>
> Thanks
>
> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: [VOTE] Apache Nutch 1.5 release rc #1
Hi Chris

> > -1 the versions of the deps for hadoop, tika and possibly others are not
> > correct in the pom.xml found in the src archive and on the mvn repository,
> > which will be a problem for whoever tries to use the pom.xml file e.g. in
> > Eclipse or more annoyingly declare Nutch as a dependency with Ivy / Maven.
> > Did you regenerate the pom file from the ivy one?
>
> I didn't regenerate it -- but will try and do so for RC #2.

Should have been done automatically when calling 'ant deploy' - if not might be that the maven task jar is missing from lib

> > I remember that we mentioned delivering the content of runtime/local in
> > the binary archive instead of having the sources + runtime/deploy as well.
> [..snip...]
> > I don't think it would take much time to do that, so what about doing
> > it now? We could rename the archive into apache-nutch-1.5-local-bin maybe
> > to make the content clearer.
>
> +1 to the above, but I think we can just have it be apache-nutch-1.5-bin
> -- no need to rename it to local. We can just
> reference this ML thread for documentation in the future.

I've committed in trunk revision 1327896 a new ant task which will generate a binary package as described above. You'll probably need to modify the code for the tar / zip as well but this should give you a starting point

Thanks

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Commented] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257383#comment-13257383 ]

Ferdy Galema commented on NUTCH-882:
------------------------------------

Hey Patrick,

We are currently finishing the work for this issue. There is still one minor issue that is not fully working yet (namely, host inlinks/outlinks are not populated), but we are still trying to make that work. If this does not succeed in a few days, we will submit the patches anyhow. Thanks for your interest.

> Design a Host table in GORA
> ---------------------------
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for :
> * customising the behaviour of the fetching on a host basis e.g. number of
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages
> * keeping a copy of the robots.txt and possibly use that later to filter the
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments
> are of course already welcome
[jira] [Updated] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-882:
-------------------------------
    Assignee: (was: Julien Nioche)

> Design a Host table in GORA
> ---------------------------
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for :
> * customising the behaviour of the fetching on a host basis e.g. number of
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages
> * keeping a copy of the robots.txt and possibly use that later to filter the
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments
> are of course already welcome
[jira] [Commented] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257371#comment-13257371 ]

Patrick Hennig commented on NUTCH-882:
--------------------------------------

Hi, you wrote that you have updated the patches for the Host table to populate it from the webpages, but the attached files are from August and September. Where can I find your updated patch?

> Design a Host table in GORA
> ---------------------------
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and
> domains?) would be very useful for :
> * customising the behaviour of the fetching on a host basis e.g. number of
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages
> * keeping a copy of the robots.txt and possibly use that later to filter the
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments
> are of course already welcome
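For readers following NUTCH-882, the kind of per-host record the issue description lists might look roughly like the Java sketch below. It is purely illustrative: it is not the schema from NUTCH-882-v1.patch or hostdb.patch, it uses a plain in-memory map rather than any GORA API, and every name in it (`HostTableSketch`, `HostRow`, `rowFor`, `count`, the field names and their defaults) is an assumption made for the example.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for the GORA-backed table,
// and all field names are assumptions, not the schema from the patches.
class HostTableSketch {

    /** One row per host, covering the uses listed in the issue description. */
    static final class HostRow {
        final String host;
        int fetchThreads = 1;            // per-host fetch customisation
        long minDelayMillis = 1000L;     // minimum time between requests
        String robotsTxt = "";           // cached copy of robots.txt
        final Map<String, String> metadata = new HashMap<>();
        final Map<String, Long> stats = new HashMap<>();

        HostRow(String host) {
            this.host = host;
        }
    }

    private final Map<String, HostRow> rows = new HashMap<>();

    /** Returns the row for the URL's host, creating it on first use. */
    HostRow rowFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        return rows.computeIfAbsent(host.toLowerCase(Locale.ROOT), HostRow::new);
    }

    /** Increments a named counter on the host row, e.g. pages fetched. */
    void count(String url, String statName) {
        rowFor(url).stats.merge(statName, 1L, Long::sum);
    }

    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        HostTableSketch table = new HostTableSketch();
        table.count("http://Example.COM/a.html", "fetched");
        table.count("http://example.com/b.html", "fetched");
        table.rowFor("http://example.com/").fetchThreads = 2;

        HostRow row = table.rowFor("http://example.com/c.html");
        System.out.println(row.host + " " + row.stats.get("fetched") + " " + row.fetchThreads);
    }
}
```

Keying rows by lower-cased host name keeps pages from the same host on one row; whether domains get rows of their own, as the description wonders, is left open here.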