[jira] [Updated] (NUTCH-1341) NotModified time set to now but page not modified

2012-04-19 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1341:
-

Attachment: NUTCH-1341-1.6-1.patch

Here's a patch for 1.6. It simply resets the modifiedTime to the CrawlDatum's 
previous value right after the reducer sets a STATUS_DB_NOTMODIFIED status 
value. Since I believe the status is correct, I assume the modifiedTime value 
can be reset here as well.
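
For readers without the patch at hand, the idea can be sketched roughly as 
follows. This is not the patch itself: DatumSketch is a hypothetical, 
simplified stand-in for CrawlDatum, the status constants carry made-up values, 
and update() only mimics the relevant reducer step under those assumptions.

```java
// Hypothetical stand-in for Nutch's CrawlDatum; only the two fields discussed here.
final class DatumSketch {
    static final int STATUS_DB_FETCHED = 1;      // illustrative value, not Nutch's
    static final int STATUS_DB_NOTMODIFIED = 2;  // illustrative value, not Nutch's

    int status;
    long modifiedTime;

    DatumSketch(int status, long modifiedTime) {
        this.status = status;
        this.modifiedTime = modifiedTime;
    }
}

final class NotModifiedSketch {
    /**
     * Mimics the reducer step: build the new db entry from the old one.
     * 'notModified' stands for "signatures match or fetch status is not-modified".
     */
    static DatumSketch update(DatumSketch old, boolean notModified, long now) {
        // By default the entry is marked fetched and stamped with the current time.
        DatumSketch result = new DatumSketch(DatumSketch.STATUS_DB_FETCHED, now);
        if (notModified) {
            result.status = DatumSketch.STATUS_DB_NOTMODIFIED;
            // The page did not change, so keep the previous modifiedTime
            // instead of leaving it at 'now'.
            result.modifiedTime = old.modifiedTime;
        }
        return result;
    }

    public static void main(String[] args) {
        DatumSketch old = new DatumSketch(DatumSketch.STATUS_DB_FETCHED, 1000L);
        DatumSketch unchanged = update(old, true, 5000L);
        DatumSketch changed = update(old, false, 5000L);
        System.out.println("unchanged: modifiedTime=" + unchanged.modifiedTime);
        System.out.println("changed:   modifiedTime=" + changed.modifiedTime);
    }
}
```

In the sketch, an unmodified page keeps its old timestamp while a changed page 
gets the current one, which is the behaviour the patch aims for.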

Please comment. Did I overlook something? Should it be implemented differently?

Thanks

> NotModified time set to now but page not modified
> -
>
> Key: NUTCH-1341
> URL: https://issues.apache.org/jira/browse/NUTCH-1341
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.5
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1341-1.6-1.patch
>
>
> Servers tend to respond with an incorrect value, or no value at all, for 
> Last-Modified. When the signatures match or when (fetch.getStatus() == 
> CrawlDatum.STATUS_FETCH_NOTMODIFIED), the reducer correctly sets the 
> db_notmodified status for the CrawlDatum. The modifiedTime value, however, is 
> not set accordingly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1341) NotModified time set to now but page not modified

2012-04-19 Thread Markus Jelsma (Created) (JIRA)
NotModified time set to now but page not modified
-

Key: NUTCH-1341
URL: https://issues.apache.org/jira/browse/NUTCH-1341
Project: Nutch
Issue Type: Bug
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.6


Servers tend to respond with an incorrect value, or no value at all, for 
Last-Modified. When the signatures match or when (fetch.getStatus() == 
CrawlDatum.STATUS_FETCH_NOTMODIFIED), the reducer correctly sets the 
db_notmodified status for the CrawlDatum. The modifiedTime value, however, is 
not set accordingly.





Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-19 Thread Mattmann, Chris A (388J)
Hey Julien thanks for the help below. I will try running some of the ant tasks 
(sorry I'm a Maven wonk ;) ) and get this working hopefully this week. I have
a big proposal deadline on Friday but should come up for air after that
heading into the weekend and get this done.

Cheers,
Chris

On Apr 19, 2012, at 3:56 AM, Julien Nioche wrote:

> Hi Chris
> 
> 
> >
> > -1 the versions of the deps for hadoop, tika and possibly others are not 
> > correct in the pom.xml found in the src archive and on the mvn repository, 
> > which will be a problem for whoever tries to use the pom.xml file e.g. in 
> > Eclipse or more annoyingly declare Nutch as a dependency with Ivy / Maven. 
> > Did you regenerate the pom file from the ivy one?
> 
> I didn't regenerate it -- but will try and do so for RC #2.
> 
> Should have been done automatically when calling 'ant deploy' - if not might 
> be that the maven task jar is missing from lib 
>  
> 
> >
> > I remember that we mentioned delivering the content of runtime/local in the 
> > binary archive instead of having the sources + runtime/deploy as well.
> [..snip...]
> >  I don't think it would take much time to do that, so what about doing it 
> > now? We could rename the archive into apache-nutch-1.5-local-bin maybe to 
> > make the content clearer.
> 
> +1 to the above, but I think we can just have it be apache-nutch-1.5-bin -- 
> no need to rename it to local. We can just
> reference this ML thread for documentation in the future.
> 
> 
> I've committed in trunk revision 1327896 a new ant task which will generate a 
> binary package as described above. You'll probably need to modify the code 
> for the tar / zip as well but this should give you a starting point
>  
> Thanks
> 
> Julien
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-19 Thread Julien Nioche
Hi Chris


>
> > -1 the versions of the deps for hadoop, tika and possibly others are not
> correct in the pom.xml found in the src archive and on the mvn repository,
> which will be a problem for whoever tries to use the pom.xml file e.g. in
> Eclipse or more annoyingly declare Nutch as a dependency with Ivy / Maven.
> Did you regenerate the pom file from the ivy one?
>
> I didn't regenerate it -- but will try and do so for RC #2.
>

This should have been done automatically when calling 'ant deploy'. If not, it
might be that the Maven task jar is missing from lib.


>
> >
> > I remember that we mentioned delivering the content of runtime/local in
> the binary archive instead of having the sources + runtime/deploy as well.
> [..snip...]
> >  I don't think it would take much time to do that, so what about doing
> it now? We could rename the archive into apache-nutch-1.5-local-bin maybe
> to make the content clearer.
>
> +1 to the above, but I think we can just have it be apache-nutch-1.5-bin
> -- no need to rename it to local. We can just
> reference this ML thread for documentation in the future.
>
>
In trunk revision 1327896 I've committed a new ant task which will generate
a binary package as described above. You'll probably need to modify the
code for the tar / zip as well, but this should give you a starting point.

Thanks

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-04-19 Thread Ferdy Galema (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257383#comment-13257383
 ] 

Ferdy Galema commented on NUTCH-882:


Hey Patrick,

We are currently finishing the work for this issue. There is still one minor 
part that is not fully working yet (namely, host inlinks/outlinks are not 
populated), but we are still trying to make that work. If this does not 
succeed in a few days, we will submit the patches anyhow.

Thanks for your interest.


> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 
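
As a rough illustration of what a record in such a host table might hold, here 
is a minimal sketch in plain Java (8 or later). It is not a GORA schema and 
not taken from the attached patches: the class name, field names and default 
values are all assumptions made for illustration, one group per bullet above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-host record; every name and default here is illustrative only.
final class HostRecordSketch {
    final String host;

    // Per-host fetch behaviour (number of threads, minimum delay).
    int fetchThreads = 1;
    long minDelayMillis = 1000L;

    // Simple counters, e.g. pages fetched or failures.
    final Map<String, Long> stats = new HashMap<>();

    // Free-form metadata that could later be propagated to the webpages.
    final Map<String, String> metadata = new HashMap<>();

    // Copy of the host's robots.txt, if one was fetched (null otherwise).
    String robotsTxt;

    // Sitemap files known for this host.
    final List<String> sitemapUrls = new ArrayList<>();

    HostRecordSketch(String host) {
        this.host = host;
    }

    // Adds one to the named counter, creating it on first use.
    void incrementStat(String name) {
        stats.merge(name, 1L, Long::sum);
    }

    public static void main(String[] args) {
        HostRecordSketch record = new HostRecordSketch("example.org");
        record.fetchThreads = 2;
        record.incrementStat("fetched");
        record.sitemapUrls.add("http://example.org/sitemap.xml");
        System.out.println(record.host + " stats=" + record.stats);
    }
}
```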





[jira] [Updated] (NUTCH-882) Design a Host table in GORA

2012-04-19 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-882:


Assignee: (was: Julien Nioche)

> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 





[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-04-19 Thread Patrick Hennig (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257371#comment-13257371
 ] 

Patrick Hennig commented on NUTCH-882:
--

Hi, 

You wrote that you have updated the patches for the Host table to populate it 
from the webpages.

But the files are from August and September. Where can I find your updated 
patch?

> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
> Issue Type: New Feature
> Affects Versions: nutchgora
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 
