[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-19 Thread behnam nikbakht (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

behnam nikbakht updated NUTCH-1278:
---

Attachment: NUTCH-1278.zip

 Fetch Improvement in threads per host
 -

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht
 Attachments: NUTCH-1278.zip


 The value of maxThreads is equal to fetcher.threads.per.host and is constant 
 for every host.
 It would be possible to use a dynamic value for each host, influenced by the 
 number of blocked requests for that host.
 That is, if the number of blocked requests for a host increases, we 
 must decrease this value and increase http.timeout.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-19 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211294#comment-13211294
 ] 

behnam nikbakht commented on NUTCH-1278:


Here is an initial patch, which changes Fetcher.java, Protocol.java 
and its plugins such as lib-http.
I use a file on the local file system to maintain a hashtable that maps hosts 
to their http.timeout.
For each blocked response the timeout is incremented, and for each 
success it is decremented.
We can use different increment and decrement rates to strike a balance 
between the total time of the fetch job and the ratio of fetched to blocked 
requests. For example, it can be made configurable so that if 90% of the 
requests to a host succeed, there is no need to increase the timeout.
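
The increment/decrement scheme described above could be sketched roughly as 
follows. This is not the code from the attached zip: the class 
HostTimeoutTracker, its method names and all parameters are hypothetical, and 
the table is kept in memory rather than in a local file.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host timeout tracker; an illustration, not Nutch code. */
class HostTimeoutTracker {
  /** Mutable per-host state: current timeout plus outcome counters. */
  private static final class Stats {
    int timeoutMs;
    long succeeded;
    long blocked;
    Stats(int timeoutMs) { this.timeoutMs = timeoutMs; }
  }

  private final Map<String, Stats> hosts = new HashMap<>();
  private final int baseMs;
  private final int minMs;
  private final int maxMs;
  private final int incrementMs;
  private final int decrementMs;
  private final double okRatio;

  HostTimeoutTracker(int baseMs, int minMs, int maxMs,
                     int incrementMs, int decrementMs, double okRatio) {
    this.baseMs = baseMs;
    this.minMs = minMs;
    this.maxMs = maxMs;
    this.incrementMs = incrementMs;
    this.decrementMs = decrementMs;
    this.okRatio = okRatio;
  }

  /** Timeout to use for the next request to this host. */
  synchronized int timeoutFor(String host) {
    Stats s = hosts.get(host);
    return s == null ? baseMs : s.timeoutMs;
  }

  /** A success shortens the timeout, but never below minMs. */
  synchronized void recordSuccess(String host) {
    Stats s = statsFor(host);
    s.succeeded++;
    s.timeoutMs = Math.max(minMs, s.timeoutMs - decrementMs);
  }

  /** A blocked response lengthens the timeout unless the host is mostly healthy. */
  synchronized void recordBlocked(String host) {
    Stats s = statsFor(host);
    s.blocked++;
    double successRatio = (double) s.succeeded / (s.succeeded + s.blocked);
    if (successRatio < okRatio) {
      s.timeoutMs = Math.min(maxMs, s.timeoutMs + incrementMs);
    }
  }

  private Stats statsFor(String host) {
    return hosts.computeIfAbsent(host, h -> new Stats(baseMs));
  }
}
```

With okRatio set to 0.9, a host whose requests succeed at least 90% of the 
time keeps its current timeout even when an occasional request is blocked. 
Using a larger increment than decrement makes the tracker back off quickly 
and recover slowly.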






[jira] [Commented] (NUTCH-1246) Upgrade to Hadoop 1.0.0

2012-02-19 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211296#comment-13211296
 ] 

Hudson commented on NUTCH-1246:
---

Integrated in Nutch-nutchgora #165 (See 
[https://builds.apache.org/job/Nutch-nutchgora/165/])
commit to address NUTCH-1246 and update to CHANGES.txt (Revision 1245921)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/ivy/ivy.xml


 Upgrade to Hadoop 1.0.0
 ---

 Key: NUTCH-1246
 URL: https://issues.apache.org/jira/browse/NUTCH-1246
 Project: Nutch
  Issue Type: Improvement
Affects Versions: nutchgora, 1.5
Reporter: Julien Nioche







[jira] [Created] (NUTCH-1282) linkdb scalability

2012-02-19 Thread behnam nikbakht (Created) (JIRA)
linkdb scalability
--

 Key: NUTCH-1282
 URL: https://issues.apache.org/jira/browse/NUTCH-1282
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.4
Reporter: behnam nikbakht


As described in NUTCH-1054, the linkdb is optional in solrindex; it is used 
only for anchors and has no impact on scoring. 
It seems that the size of the linkdb grows very fast in an incremental crawl, 
which makes it unscalable for very large web sites.
So there are two choices: one, drop invertlinks and the linkdb from the crawl; 
two, make it scalable.
invertlinks runs 2 jobs: the first constructs a new linkdb from the newly 
parsed segments, and the second merges the new linkdb with the old linkdb. The 
second job is unscalable, and we can skip it with these changes in solrindex:
in the reduce method of the class IndexerMapReduce, if fetchDatum == null or 
dbDatum == null or parseText == null or parseData == null, then add the anchor 
to the doc and update Solr (no insert).
Some changes to NutchDocument are also required.
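
The proposed branch in the reduce method can be pictured with the following 
minimal sketch. It is self-contained and does not use the real Nutch or Hadoop 
types: the class IndexDecision, the Action enum and the plain Object 
parameters are stand-ins chosen for illustration only.

```java
/** Hypothetical stand-in for the decision made inside IndexerMapReduce.reduce. */
class IndexDecision {
  /** What the indexer would do with the values gathered for one URL. */
  enum Action { FULL_INDEX, ANCHOR_ONLY_UPDATE }

  /**
   * If any of the four inputs is missing, only the anchors are known for this
   * URL, so the proposal is to update the existing Solr document with them
   * instead of inserting a complete document.
   */
  static Action decide(Object fetchDatum, Object dbDatum,
                       Object parseText, Object parseData) {
    boolean incomplete = fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null;
    return incomplete ? Action.ANCHOR_ONLY_UPDATE : Action.FULL_INDEX;
  }
}
```

How the anchor-only update would be sent to Solr, and what NutchDocument needs 
for it, is left open here, as it is in the issue description.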






[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211314#comment-13211314
 ] 

Lewis John McGibbney commented on NUTCH-1281:
-

Hi behnam, there is a similar issue open and a patch has been submitted for 
Nutchgora. I wonder if you can check it out and comment on the link between 
these two. NUTCH-965

Also would it be possible for you to attach your code changes as a patch 
against trunk? Which I guess is what you are using. Thank you

 tika parser not work properly with unwanted file types that passed from 
 filters in nutch
 

 Key: NUTCH-1281
 URL: https://issues.apache.org/jira/browse/NUTCH-1281
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: behnam nikbakht

 When this mapping is set in parse-plugins.xml:
 <mimeType name="*">
   <plugin id="parse-tika" />
 </mimeType>
 all unwanted files that get past every filter are referred to Tika.
 But for some file types such as .flv, the Tika parser hangs and 
 causes the parse job to fail.
 If such file types get past regex-urlfilter and the other filters, the parse 
 job fails.
 To address this I suggest adding some properties for valid file types and 
 using code like this in TikaParser.java:
 public ParseResult getParse(Content content) {
   String mimeType = content.getContentType();
 + String[] validTypes = new String[] {
 +     "application/pdf", "application/x-tika-msoffice",
 +     "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text",
 +     "text/plain", "application/rtf", "application/rss+xml",
 +     "application/x-bzip2", "application/x-gzip",
 +     "application/x-javascript", "application/javascript", "text/javascript",
 +     "application/x-shockwave-flash", "application/zip",
 +     "text/xml", "application/xml" };
 + boolean valid = false;
 + for (int k = 0; k < validTypes.length; k++) {
 +   // equalsIgnoreCase also copes with a null mimeType
 +   if (validTypes[k].equalsIgnoreCase(mimeType))
 +     valid = true;
 + }
 + if (!valid)
 +   return new ParseStatus(ParseStatus.NOTPARSED,
 +       "Can't parse for unwanted filetype " + mimeType)
 +       .getEmptyParseResult(content.getUrl(), getConf());
   
   URL base;
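
 The suggestion to move the valid file types into properties could look 
 something like the sketch below. It is self-contained and makes no use of 
 the Nutch or Tika APIs. The class MimeTypeWhitelist is hypothetical, and the 
 comma-separated string is assumed to come from a configuration property of 
 the implementer's choosing.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical whitelist of MIME types built from a comma-separated string. */
class MimeTypeWhitelist {
  private final Set<String> allowed = new HashSet<>();

  MimeTypeWhitelist(String commaSeparatedTypes) {
    for (String type : commaSeparatedTypes.split(",")) {
      String normalized = normalize(type);
      if (!normalized.isEmpty()) {
        allowed.add(normalized);
      }
    }
  }

  /** True if the content type (parameters such as charset ignored) is listed. */
  boolean accepts(String contentType) {
    return contentType != null && allowed.contains(normalize(contentType));
  }

  /** Drops any ";charset=..." style parameters, trims and lower-cases. */
  static String normalize(String type) {
    int semicolon = type.indexOf(';');
    String base = semicolon >= 0 ? type.substring(0, semicolon) : type;
    return base.trim().toLowerCase(Locale.ROOT);
  }
}
```

 A parser would build the whitelist once and call accepts(mimeType) before 
 handing the content to Tika, returning a NOTPARSED status when it is false. 
 A set lookup also avoids looping over an array for every document.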





[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2012-02-19 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211316#comment-13211316
 ] 

behnam nikbakht commented on NUTCH-1281:


The problem is that actual MIME types cannot be filtered properly until the 
parse or fetch starts. There are many file types and we cannot filter all of 
them, and there may be bugs in the Tika parser for some file types.
So we can filter them in TikaParser against a list of valid file types.






[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211320#comment-13211320
 ] 

Lewis John McGibbney commented on NUTCH-1278:
-

Behnam, this looks interesting but there are a few problems here.
1) It would be much much easier for us to apply, test and comment on your 
contribution if you included it in a simple .patch file. This can be done like 
so 
{code}
$ cd $NUTCH_HOME
$ svn diff > NUTCH-patch-name.patch
{code}
The current zip format for the patch(es), plus the fact that every class has 
been patched separately from its own respective directory, makes it really 
hard for us to work with this.
2) It doesn't appear that this patch actually applies against trunk? Maybe 
1.4? You can check out trunk here [1]. I got errors when trying to apply 
HttpBase, then gave up and started writing this.
3) for a change to the fetcher of this scale, it would be really nice if you 
could provide a test within the test suite we already maintain [2].

As I said this looks really great, and sorry for the rather lengthy initial 
response, but for us to consider this for integration it would be great for 
your contributions to meet this minimum requirement as they are highly 
appreciated. Thank you

[1] https://svn.apache.org/repos/asf/nutch/trunk/
[2] 
https://svn.apache.org/viewvc/nutch/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java?view=markup







[jira] [Created] (NUTCH-1283) Radically update all Solr configuration in Nutchgora

2012-02-19 Thread Lewis John McGibbney (Created) (JIRA)
Radically update all Solr configuration in Nutchgora


 Key: NUTCH-1283
 URL: https://issues.apache.org/jira/browse/NUTCH-1283
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
 Fix For: nutchgora


We're currently running with a schema which states it's 1.4 :0| There should be 
better support for the newer stuff going on over in Solrland. This issue should 
track those improvements entirely.





[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=5&rev2=6

   
  
  == Look and feel Admin Gui: ==
+ This [[https://github.com/101tec/nutch/wiki|link]] provides the best working 
prototype of an example admin gui, it also provides a heap of material relating 
to what kind and level of functionality the Nutch webapp should support.
- The following link provide a non working prototype of the admin gui created 
by Frank Henze (credits).
- http://www.media-style.com/gfx/nutchadmin/index.html
  
  == Description Admin Gui: ==
  There are three main functionalities of the admin gui


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=6&rev2=7

  
  
  == Timetable: ==
- first beta version until end of Feburary.
+ TODO
  
  == people: ==
- Frank Hanze: jsp programming
  
- Marko Bauhard: re-factoring nutchConf and tools api
+ The Apache Nutch Development team.
  
- Stefan Groschupf: developing plugin Extension point and ground framework
+ Original developers working on this included Frank Hanze (jsp programming), 
Marko Bauhard (re-factoring nutchConf and tools api) & Stefan Groschupf 
(developing plugin Extension point and ground framework)
  
  Please add yourself here.
  
- If you wish to help but do not know how, please get in touch with Stefan.
+ If you wish to help but do not know how, please get in touch with the Nutch 
tean [[http://nutch.apache.org/mailing_lists.htmlhere]].
  
  == Download: ==
  
- Here are some mirrors where you can download a version of nutch-0.8-dev 
bundled with the administration GUI:
+ The code base we are working on is Nutch 2.0, which you can checkout 
[[https://svn.apache.org/repos/asf/nutch/branches/nutchgora/|here]]. If you are 
unfamiliar with using SVN repositories and SVN, then please see 
[[http://nutch.apache.org/version_control.html|here]].
+ 
+ == Old Resources ==
+ 
+ Here are some mirrors where you can download a version of nutch-0.8-dev 
bundled with the administration GUI, some of these mirrors no longer exist, and 
are there merely to provide you with a look and feel for the GUI.
  
   * http://85.214.26.67/nutch-admingui/nutch-0.8-dev_guiBundle_05_02_06.tar.gz
   * http://jerome.charron.free.fr/nutch/nutch-0.8-dev_guiBundle_05_02_06.tar.gz


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=7&rev2=8

  
  == Look and feel Admin Gui: ==
  This [[https://github.com/101tec/nutch/wiki|link]] provides the best working 
prototype of an example admin gui, it also provides a heap of material relating 
to what kind and level of functionality the Nutch webapp should support.
+ {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}}
+ {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}}
+ {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}}
+ {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}}
+ {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}}
+ {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}}
  
  == Description Admin Gui: ==
  There are three main functionalities of the admin gui


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=8&rev2=9

  
  == Look and feel Admin Gui: ==
  This [[https://github.com/101tec/nutch/wiki|link]] provides the best working 
prototype of an example admin gui, it also provides a heap of material relating 
to what kind and level of functionality the Nutch webapp should support.
+ 
+ === A New Nutch Instance ===
+ {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}}
+ 
+ === Congiguration UI ===
+ {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}}
+ 
+ === URL Upload ===
+ {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}}
+ 
+ === Example Crawl ===
+ {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}}
+ 
+ === Example Crawl ===
+ {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}}
+ 
+ === Example Crawl ===
+ {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}}
+ 
+  /!\ '''Edit conflict - other version:''' 
  {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}}
  {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}}
  {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}}
  {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}}
  {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}}
  {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}}
+ 
+  /!\ '''Edit conflict - your version:''' 
+ 
+  /!\ '''End of edit conflict''' 
  
  == Description Admin Gui: ==
  There are three main functionalities of the admin gui


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=9&rev2=10

  === URL Upload ===
  {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}}
  
- === Example Crawl ===
+ === Example Crawl 1 ===
  {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}}
  
- === Example Crawl ===
+ === Example Crawl 2 ===
  {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}}
  
- === Example Crawl ===
+ === Example Crawl 3 ===
  {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}}
- 
-  /!\ '''Edit conflict - other version:''' 
- {{http://101tec.com/wp-content/themes/101tec/images/instanceNew.jpg}}
- {{http://101tec.com/wp-content/themes/101tec/images/configuration.jpg}}
- {{http://101tec.com/wp-content/themes/101tec/images/urlUpload.jpg}}
- {{http://101tec.com/wp-content/themes/101tec/images/crawl1.jpg}}
- {{http://101tec.com/wp-content/themes/101tec/images/crawl2.jpg}}
- {{http://101tec.com/wp-content/themes/101tec/images/crawl3.jpg}}
- 
-  /!\ '''Edit conflict - your version:''' 
- 
-  /!\ '''End of edit conflict''' 
  
  == Description Admin Gui: ==
  There are three main functionalities of the admin gui


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=11&rev2=12

  
  == Summary: ==
  
- The goal is to extend nutch with a comfortable web based administration user 
interface to monitor, configure and manage one or a set of nutch search system 
instances. 
+ The goal is to extend Apach Nutch with a comfortable 
[[https://issues.apache.org/jira/browse/NUTCH-929|web based administration user 
interface]] to monitor, configure and manage one or a set of Nutch search 
system instances through the 
[[https://issues.apache.org/jira/browse/NUTCH-880|REST-API]]. This will tie 
together a number of issues, ultimately resulting in a 
[[https://issues.apache.org/jira/browse/NUTCH-841|Nutch 2.0 Webapp]]
   
  
  == Vision: ==


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=12&rev2=13

- = proposal Nutch appliance / Nutch admin gui =
+ = Proposal Nutch appliance / Nutch admin gui =
  
  == Summary: ==
  


[Nutch Wiki] Trivial Update of NutchAdministrationUserInterface by LewisJohnMcgibbney

2012-02-19 Thread Apache Wiki

The NutchAdministrationUserInterface page has been changed by 
LewisJohnMcgibbney:
http://wiki.apache.org/nutch/NutchAdministrationUserInterface?action=diff&rev1=14&rev2=15

  
  == Summary: ==
  
- The goal is to extend Apach Nutch with a comfortable 
[[https://issues.apache.org/jira/browse/NUTCH-929|web based administration user 
interface]] to monitor, configure and manage one or a set of Nutch search 
system instances through the 
[[https://issues.apache.org/jira/browse/NUTCH-880|REST-API]]. This will tie 
together a number of issues, ultimately resulting in a 
[[https://issues.apache.org/jira/browse/NUTCH-841|Nutch 2.0 Webapp]]
+ The goal is to extend [[http://nutch.apache.org|Apach Nutch]] with a 
comfortable [[https://issues.apache.org/jira/browse/NUTCH-929|web based 
administration user interface]] to monitor, configure and manage one or a set 
of Nutch search system instances through the 
[[https://issues.apache.org/jira/browse/NUTCH-880|REST-API]]. This will tie 
together a number of issues, ultimately resulting in a 
[[https://issues.apache.org/jira/browse/NUTCH-841|Nutch 2.0 Webapp]]
   
  
  == Vision: ==


[jira] [Commented] (NUTCH-929) Create a REST-based admin UI for Nutch

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211404#comment-13211404
 ] 

Lewis John McGibbney commented on NUTCH-929:


As we are using org.restlet as the underlying RESTlet framework, we will need 
to utilise the presentation technologies it supports, e.g. integration with 
three popular template technologies: XSLT, FreeMarker or Apache Velocity [1].

[1] 
http://wiki.restlet.org/docs_2.0/13-restlet/21-restlet/378-restlet/116-restlet.html

 Create a REST-based admin UI for Nutch
 --

 Key: NUTCH-929
 URL: https://issues.apache.org/jira/browse/NUTCH-929
 Project: Nutch
  Issue Type: New Feature
  Components: administration gui
Affects Versions: nutchgora
Reporter: Andrzej Bialecki 

 This is a follow up to NUTCH-880 - we need to expose the functionality of 
 REST API in a user-friendly admin UI. Thanks to the nature of the API the UI 
 can be implemented in any UI framework that speaks REST/JSON, so it could be 
 a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone 
 application.





[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Attachment: NUTCH-1253-nutchgora.patch
NUTCH-1253.patch

Trivial patches for both trunk and Nutchgora branch. Can you guys please test 
and get back on this issue. Thanks 

 Incompatible neko and xerces versions
 -

 Key: NUTCH-1253
 URL: https://issues.apache.org/jira/browse/NUTCH-1253
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
 Environment: Ubuntu 10.04
Reporter: Dennis Spathis
 Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch


 The Nutch 1.4 distribution includes
  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-nekohtml)
  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
 These two JARs appear to be incompatible versions. When the HtmlParser 
 (configured to use neko) is invoked during a local-mode crawl, the parse 
 fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
 rebuild the HtmlParser plugin and add a
 catch(Throwable) clause in the getParse method to log the stacktrace.)
 I found that substituting a later, compatible version of nekohtml (1.9.11)
 fixes the problem.
 Curiously, and in support of the above, the nekohtml plugin.xml file in
 Nutch 1.4 contains the following:
 <plugin
    id="lib-nekohtml"
    name="CyberNeko HTML Parser"
    version="1.9.11"
    provider-name="org.cyberneko">
    <runtime>
      <library name="nekohtml-0.9.5.jar">
        <export name="*"/>
      </library>
    </runtime>
 </plugin>
 Note the conflicting version numbers (version tag is 1.9.11 but the
 specified library is nekohtml-0.9.5.jar).
 Was the 0.9.5 version included by mistake? Was the intention rather to
 include 1.9.11?





[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Patch Info: Patch Available






[jira] [Updated] (NUTCH-728) Improve nutch release packaging

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-728:
---

Attachment: NUTCH-728-v2.patch
NUTCH-728-nutchgora.patch

Updated patches for trunk and Nutchgora

 Improve nutch release packaging
 ---

 Key: NUTCH-728
 URL: https://issues.apache.org/jira/browse/NUTCH-728
 Project: Nutch
  Issue Type: Improvement
Reporter: Sami Siren
 Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, 
 NUTCH-728.patch


 see the discussion from 
 http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact





[jira] [Closed] (NUTCH-1276) Fix [dep-ann]

2012-02-19 Thread Lewis John McGibbney (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1276.
---


 Fix [dep-ann]
 -

 Key: NUTCH-1276
 URL: https://issues.apache.org/jira/browse/NUTCH-1276
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.5


 Generally speaking these are more straightforward than others as it should be 
 a case of either annotating using
 {code}
 @Deprecated
 {code}
 or of course replacing the deprecated class method with another 
 non-deprecated implementation. Hopefully most of these occurrences will be 
 resolved within NUTCH-1273
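
 As an illustration of the first option: javac reports [dep-ann] (with 
 -Xlint:dep-ann) when an element carries the Javadoc @deprecated tag but not 
 the @Deprecated annotation. A minimal sketch of the fix follows; the class 
 and method names are made up and do not come from the Nutch code base.

```java
/** Made-up example class; not taken from Nutch. */
class DepAnnExample {

  /**
   * Old entry point, kept for callers that have not migrated yet.
   *
   * @deprecated use {@link #currentFetch()} instead.
   */
  @Deprecated // adding this annotation is what silences the [dep-ann] warning
  static int legacyFetch() {
    return currentFetch();
  }

  /** The non-deprecated replacement that new code should call. */
  static int currentFetch() {
    return 42;
  }
}
```

 Keeping the Javadoc tag next to the annotation tells callers what to use 
 instead, which the annotation alone cannot do.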





[jira] [Resolved] (NUTCH-1276) Fix [dep-ann]

2012-02-19 Thread Lewis John McGibbney (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1276.
-

Resolution: Fixed

Committed @ revision 1291030 in trunk
Committed @ revision 1291031 in Nutchgora branch

 Fix [dep-ann]
 -

 Key: NUTCH-1276
 URL: https://issues.apache.org/jira/browse/NUTCH-1276
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.5







[jira] [Commented] (NUTCH-1273) Fix [deprecation] javac warnings

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211464#comment-13211464
 ] 

Lewis John McGibbney commented on NUTCH-1273:
-

With this issue, do we wish to simply suppress the warnings? What other options 
do we have? It makes me think that we could upgrade the use of classes within 
our library dependencies. Any ideas?

 Fix [deprecation] javac warnings
 

 Key: NUTCH-1273
 URL: https://issues.apache.org/jira/browse/NUTCH-1273
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5


 As part of this task, these warnings should be resolved. However, this 
 particular strand of warnings can be resolved either by adding
 {code}
 @SuppressWarnings("deprecation")
 {code}
 or by actually upgrading our class usage to rely upon non-deprecated classes. 
 Which option is more appropriate for the project?
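
The two options can be sketched side by side. This is only an illustration using a deprecated JDK method (Date.getYear()); the class and method names are hypothetical and none of it comes from the Nutch code base.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

final class DeprecationOptions {

  /** Option 1: keep the deprecated call and suppress the warning locally. */
  @SuppressWarnings("deprecation")
  static int yearSuppressed(Date date) {
    return date.getYear() + 1900; // Date.getYear() is deprecated; it returns year minus 1900
  }

  /** Option 2: upgrade to the non-deprecated Calendar API. */
  static int yearUpgraded(Date date) {
    Calendar calendar = new GregorianCalendar();
    calendar.setTime(date);
    return calendar.get(Calendar.YEAR);
  }
}
```

Suppression keeps the change small but leaves the deprecated dependency in place. Upgrading removes the warning at its source, at the cost of touching every call site.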





[jira] [Assigned] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint argument

2012-02-19 Thread Lewis John McGibbney (Assigned) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-1249:
---

Assignee: Lewis John McGibbney

 Resolve all issues flagged up by adding javac -Xlint argument
 --

 Key: NUTCH-1249
 URL: https://issues.apache.org/jira/browse/NUTCH-1249
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5


 There are a heap of issues flagged up by NUTCH-1237. I think over time it 
 would be great to get these addressed and resolved.
 Interestingly, adding the same arguments to 
 /src/plugin/plugin-build.xml actually breaks my build, as tests begin to fail.
 Some of this is documented at the link below:
 http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options





[jira] [Resolved] (NUTCH-1271) Fix errors @ compile time

2012-02-19 Thread Lewis John McGibbney (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1271.
-

Resolution: Duplicate

This issue duplicates the more accurate NUTCH-1249

 Fix errors @ compile time
 -

 Key: NUTCH-1271
 URL: https://issues.apache.org/jira/browse/NUTCH-1271
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5


 After adding the -Xlint commands to build.xml, we see many errors when 
 compiling. These should be fixed.





[jira] [Closed] (NUTCH-1271) Fix errors @ compile time

2012-02-19 Thread Lewis John McGibbney (Closed) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1271.
---


 Fix errors @ compile time
 -

 Key: NUTCH-1271
 URL: https://issues.apache.org/jira/browse/NUTCH-1271
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5







[jira] [Commented] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-19 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211470#comment-13211470
 ] 

Lewis John McGibbney commented on NUTCH-978:


Hi Chris, did you mentor this project through GSoC? I've downloaded the .zip 
available in the description (which I've also attached in case the link goes 
AWOL) and I'm going to play about with it. I'll attach it as a patch if I get 
anywhere.

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: nutchgora

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png

   Original Estimate: 1,680h
  Remaining Estimate: 1,680h

 Nutch uses the parse-html plugin to parse web pages. It processes the 
 contents of a web page by removing HTML tags and components such as 
 JavaScript and CSS, leaving the extracted text to be stored in the index. By 
 default, Nutch cannot select specific elements of an HTML page, such as 
 certain tags, certain content, or some part of the page.
 An HTML page has a tree-like XML structure, with HTML tags as its branches 
 and text as its leaf nodes. These branches and nodes can be extracted using 
 XPath. XPath allows us to select a particular branch or node of an XML 
 document, so it can be used to extract specific information and treat it 
 differently based on its content and the user's requirements. Furthermore, 
 the pages of a single web domain, such as a news website, usually share the 
 same HTML structure for storing information. Pages with the same structure 
 can be parsed with the same XPath queries to retrieve the same content 
 elements. All of the XPath queries for selecting the various content 
 elements can be stored in an XPath configuration file.
 Nutch is intended to crawl a variety of web sources, and not all of the pages 
 retrieved from those sources have the same HTML structure, so they have to be 
 treated differently using the correct XPath configuration. The correct XPath 
 configuration can be selected automatically by matching the URL of the web 
 page against a regex of valid URL patterns for that configuration.
 This automatic mechanism allows Nutch users to process a variety of web pages 
 and keep only the information they want, making the index more accurate and 
 its content more flexible.
 The components for this idea have been tested on Nutch 1.2 for selecting 
 certain elements on various news websites for the purpose of document 
 clustering. This includes a Configuration Editor application built using the 
 NetBeans 6.9 Application Framework, though it still needs some debugging.
 http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
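
The mechanism described above (pick an XPath configuration by matching the page URL against a regex, then evaluate the XPath on the page) could be sketched roughly as follows. This is not the code in the attached zip: the class name, the rule format and the example URLs are assumptions, and the sketch only handles well-formed (X)HTML using the JDK's built-in XML and XPath APIs.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XPathConfigSketch {

  /** URL regex -> XPath expression, kept in insertion order (hypothetical rule format). */
  private final Map<Pattern, String> rules = new LinkedHashMap<Pattern, String>();

  void addRule(String urlRegex, String xpath) {
    rules.put(Pattern.compile(urlRegex), xpath);
  }

  /** Returns the XPath of the first rule whose regex matches the whole URL, or null. */
  String xpathFor(String url) {
    for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
      if (rule.getKey().matcher(url).matches()) {
        return rule.getValue();
      }
    }
    return null;
  }

  /** Evaluates the selected XPath against a well-formed (X)HTML string. */
  String extract(String url, String xhtml) {
    String xpath = xpathFor(url);
    if (xpath == null) {
      return null; // no configuration applies to this URL
    }
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
      return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
    } catch (Exception e) {
      throw new IllegalStateException("could not parse or evaluate " + url, e);
    }
  }
}
```

A rule such as addRule("http://news\\.example\\.com/.*", "/html/body/div[@id='article']/h1") would then select the headline on every page of that (made-up) site. Real-world HTML is rarely well-formed, so a production plugin would need a tolerant HTML parser in front of the XPath step.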





[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-978:
---

Attachment: for_GSoc.zip

In its present form this is quite literally all over the place and is merely 
for safe keeping.

 [GSoC 2011] A Plugin for extracting certain element of a web page on html 
 page parsing.
 ---

 Key: NUTCH-978
 URL: https://issues.apache.org/jira/browse/NUTCH-978
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.2
 Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
Reporter: Ammar Shadiq
Assignee: Chris A. Mattmann
Priority: Minor
  Labels: gsoc2011, mentor
 Fix For: nutchgora

 Attachments: 
 [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
 app_guardian_ivory_coast_news_exmpl.png, 
 app_screenshoot_configuration_result.png, 
 app_screenshoot_configuration_result_anchor.png, 
 app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
 for_GSoc.zip

   Original Estimate: 1,680h
  Remaining Estimate: 1,680h






[jira] [Created] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-02-19 Thread Lewis John McGibbney (Created) (JIRA)
Add site fetcher.max.crawl.delay as log output by default.
--

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: nutchgora, 1.5


Currently, when manually scanning our log output we cannot infer which pages 
are governed by a crawl delay between successive fetch attempts of any given 
page within the site. The value should be made available as something like:

{code}
2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
http://nutch.apache.org/ (crawl.delay=XXXms)
{code}

This way we can easily and quickly determine whether the fetcher is having to 
use this functionality or not. 
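
A minimal sketch of how such a message could be assembled is shown below. It is not the actual Fetcher code: the class and method names are hypothetical, and leaving the suffix off when no delay applies is an assumption.

```java
final class FetchLogSketch {

  /**
   * Builds the message part of the log line.
   * A crawlDelayMs of zero or less is treated as "no crawl delay" (assumption).
   */
  static String fetchingMessage(String url, long crawlDelayMs) {
    StringBuilder message = new StringBuilder("fetching ").append(url);
    if (crawlDelayMs > 0) {
      message.append(" (crawl.delay=").append(crawlDelayMs).append("ms)");
    }
    return message.toString();
  }
}
```

The returned string would then be passed to the fetcher's existing INFO-level log call.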





[jira] [Commented] (NUTCH-1276) Fix [dep-ann]

2012-02-19 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211482#comment-13211482
 ] 

Hudson commented on NUTCH-1276:
---

Integrated in Nutch-trunk #1762 (See 
[https://builds.apache.org/job/Nutch-trunk/1762/])
trivial commit to address NUTCH-1276 (Revision 1291030)

 Result = SUCCESS
lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1291030
Files : 
* /nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
* /nutch/trunk/src/java/org/apache/nutch/net/protocols/ProtocolException.java
* /nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java


 Fix [dep-ann]
 -

 Key: NUTCH-1276
 URL: https://issues.apache.org/jira/browse/NUTCH-1276
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.5







[jira] [Commented] (NUTCH-1276) Fix [dep-ann]

2012-02-19 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211484#comment-13211484
 ] 

Hudson commented on NUTCH-1276:
---

Integrated in nutch-trunk-maven #156 (See 
[https://builds.apache.org/job/nutch-trunk-maven/156/])
trivial commit to address NUTCH-1276 (Revision 1291030)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/src/java/org/apache/nutch/crawl/MapWritable.java
* /nutch/trunk/src/java/org/apache/nutch/net/protocols/ProtocolException.java
* /nutch/trunk/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* /nutch/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java


 Fix [dep-ann]
 -

 Key: NUTCH-1276
 URL: https://issues.apache.org/jira/browse/NUTCH-1276
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.5







[jira] [Commented] (NUTCH-1276) Fix [dep-ann]

2012-02-19 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211507#comment-13211507
 ] 

Hudson commented on NUTCH-1276:
---

Integrated in Nutch-nutchgora #166 (See 
[https://builds.apache.org/job/Nutch-nutchgora/166/])
trivial commit to address NUTCH-1276 (Revision 1291031)

 Result = SUCCESS
lewismc : 
Files : 
* 
/nutch/branches/nutchgora/src/java/org/apache/nutch/net/protocols/ProtocolException.java
* 
/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* /nutch/branches/nutchgora/src/test/org/apache/nutch/util/CrawlTestUtil.java


 Fix [dep-ann]
 -

 Key: NUTCH-1276
 URL: https://issues.apache.org/jira/browse/NUTCH-1276
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.5







[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-02-19 Thread Elisabeth Adler (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211586#comment-13211586
 ] 

Elisabeth Adler commented on NUTCH-809:
---

I haven't tested the plugin in 1.4 myself, but I think a few guys on the 
mailing list already used it with 1.4. 

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.5

 Attachments: NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, 
 metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com
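
To illustrate how a value such as metatags.names might be interpreted, here is a small sketch that splits the list on ';' and treats '*' (or a missing value) as "all metatags". It is not the plugin's actual code: the class name is hypothetical, and the case-insensitive comparison is an assumption.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class MetatagNamesSketch {

  private final Set<String> names = new HashSet<String>();
  private final boolean matchAll;

  /** Parses a ';'-separated list of metatag names; null or blank falls back to "*". */
  MetatagNamesSketch(String configured) {
    String value = (configured == null || configured.trim().isEmpty()) ? "*" : configured;
    boolean all = false;
    for (String name : value.split(";")) {
      String trimmed = name.trim().toLowerCase(Locale.ROOT);
      if (trimmed.equals("*")) {
        all = true; // wildcard: every metatag is accepted
      } else if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    matchAll = all;
  }

  /** Returns true if the metatag with this name should be extracted. */
  boolean accepts(String metatagName) {
    return matchAll || names.contains(metatagName.toLowerCase(Locale.ROOT));
  }
}
```

With the configuration shown above, new MetatagNamesSketch("description;keywords") would accept the description and keywords metatags and reject everything else.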
