[jira] [Updated] (NUTCH-1333) Introduce AvroStore, DataFileAvroStore and Accumulo Datastore implementations

2012-04-15 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1333:


Attachment: NUTCH-1333.patch

Patch adding specifics and also license headers to files in conf/ 

> Introduce AvroStore, DataFileAvroStore and Accumulo Datastore implementations
> -
>
> Key: NUTCH-1333
> URL: https://issues.apache.org/jira/browse/NUTCH-1333
> Project: Nutch
>  Issue Type: New Feature
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1333.patch
>
>
> This is to accommodate recent developments over @ Gora.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1333) Introduce AvroStore, DataFileAvroStore and Accumulo Datastore implementations

2012-04-15 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1333:


Patch Info: Patch Available

> Introduce AvroStore, DataFileAvroStore and Accumulo Datastore implementations
> -
>
> Key: NUTCH-1333
> URL: https://issues.apache.org/jira/browse/NUTCH-1333
> Project: Nutch
>  Issue Type: New Feature
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1333.patch
>
>
> This is to accommodate recent developments over @ Gora.





[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2012-03-21 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1104:


Description: 
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-809 Parse-metatags plugin
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN 
requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported



  was:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the 
NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:

* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Static field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN 
requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet


NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported




> Port issues from trunk NutchGora branch
> ---
>
> Key: NUTCH-1104
> URL: https://issues.apache.org/jira/browse/NUTCH-1104
> Project: Nutch
>  Issue Type: Task
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: nutchgora
>
>
> Umbrella issue for tracking issues that should be ported from 1.x trunk 

[jira] [Updated] (NUTCH-978) A Plugin for extracting certain element of a web page on html page parsing.

2012-03-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-978:
---

 Labels: gsoc2012 mentor  (was: gsoc2011 mentor)
Summary: A Plugin for extracting certain element of a web page on html page 
parsing.  (was: [GSoC 2011] A Plugin for extracting certain element of a web 
page on html page parsing.)

This is as I thought. Look, I've marked it for this year's GSoC; students can 
apply up until April 6th iirc, so if there is any interest then we can progress 
with it. Thanks Ammar

> A Plugin for extracting certain element of a web page on html page parsing.
> ---
>
> Key: NUTCH-978
> URL: https://issues.apache.org/jira/browse/NUTCH-978
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.2
> Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>Reporter: Ammar Shadiq
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: gsoc2012, mentor
> Fix For: nutchgora
>
> Attachments: 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
> for_GSoc.zip, version_alpha2.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the contents 
> of a web page by removing HTML tags and components such as JavaScript and CSS, 
> leaving the extracted text to be stored in the index. By default, Nutch cannot 
> select particular atomic elements of an HTML page, such as certain tags, 
> certain content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches and 
> text as its nodes. These branches and nodes can be extracted using XPath. XPath 
> lets us select a particular branch or node of an XML document, so it can be 
> used to extract certain information and treat it differently based on its 
> content and the user's requirements. Furthermore, a web domain such as a news 
> website usually uses the same HTML structure for the information on all of its 
> pages, so the same XPath query can be applied across those pages to retrieve 
> the same content element. All of the XPath queries for selecting various 
> content could be stored in an XPath Configuration File.
> Nutch is meant to handle many different web sources, and pages retrieved from 
> different sources do not share the same HTML structure, so each has to be 
> treated with the correct XPath configuration. The correct XPath configuration 
> could be selected automatically by matching the URL of the web page against 
> a regex describing the valid URL pattern for that configuration.
> This automatic mechanism allows Nutch users to process various web pages and 
> keep only the information they want, making the index more accurate and 
> its content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a Configuration Editor Application built using the 
> NetBeans 6.9 Application Framework, though it needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip
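
The URL-driven selection of an XPath configuration described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in for_GSoc.zip: the class XPathConfigSketch, its methods, the in-memory rule map standing in for the XPath Configuration File, and the example URL are all hypothetical, and the input is assumed to be well-formed XHTML (real-world HTML would need a tolerant parser first).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical illustration only; not the plugin attached to NUTCH-978. */
final class XPathConfigSketch {

  // URL regex -> XPath query; a stand-in for the "XPath Configuration File".
  private final Map<Pattern, String> rules = new LinkedHashMap<>();

  void addRule(String urlRegex, String xpathQuery) {
    rules.put(Pattern.compile(urlRegex), xpathQuery);
  }

  /** Returns the query of the first rule whose regex matches the whole URL, or null. */
  String selectQuery(String url) {
    for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
      if (rule.getKey().matcher(url).matches()) {
        return rule.getValue();
      }
    }
    return null;
  }

  /** Evaluates the selected query against well-formed XHTML; null if no rule matches. */
  String extract(String url, String xhtml) {
    String query = selectQuery(url);
    if (query == null) {
      return null;
    }
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      return xpath.evaluate(query, doc).trim();
    } catch (Exception e) {
      // Parsing or XPath failure; surfaced unchecked to keep the sketch short.
      throw new IllegalStateException("extraction failed for " + url, e);
    }
  }

  public static void main(String[] args) {
    XPathConfigSketch cfg = new XPathConfigSketch();
    cfg.addRule("https?://news\\.example\\.com/.*", "//h1[@class='title']");
    String page = "<html><body><h1 class=\"title\">Hello</h1></body></html>";
    System.out.println(cfg.extract("http://news.example.com/a/1", page));
  }
}
```

Keeping the rules in insertion order means the first matching URL pattern wins, which is one simple way to resolve overlapping patterns; the original proposal does not say how it handles that case.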





[jira] [Updated] (NUTCH-1307) Improve formatting of ant targets for clearer project help

2012-03-08 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1307:


Attachment: NUTCH-1307-trunk.patch
NUTCH-1307-nutchgora.patch

Trivial patches.

When running 
{code}
$ ant -projecthelp
{code}
(from $NUTCH_HOME), this gives nicer output.

> Improve formatting of ant targets for clearer project help
> --
>
> Key: NUTCH-1307
> URL: https://issues.apache.org/jira/browse/NUTCH-1307
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1307-nutchgora.patch, NUTCH-1307-trunk.patch
>
>
> This is a trivial formatting issue. I will submit a patch shortly and fix it.





[jira] [Updated] (NUTCH-475) Adaptive crawl delay

2012-03-02 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-475:
---

Attachment: NUTCH-475.patch

Updated patch which brings this issue up to speed as of Dogacan's comments. 
None of Todd's work was ever uploaded; however, I think we should work towards 
an implementation as Enis suggested. I suppose we can try/test this 
implementation, as I have not done so as of yet.

> Adaptive crawl delay
> 
>
> Key: NUTCH-475
> URL: https://issues.apache.org/jira/browse/NUTCH-475
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Doğacan Güney
> Attachments: NUTCH-475.patch, adaptive-delay_draft.patch
>
>
> Current fetcher implementation waits a default interval before making another 
> request to the same server (if crawl-delay is not specified in robots.txt). 
> IMHO, an adaptive implementation would be better. If the server is under 
> little load and can serve requests fast, then the fetcher can ask for more pages 
> in a given interval. Similarly, if the server is suffering from heavy load, 
> the fetcher can slow down (w.r.t. that host), easing the load on the server.
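
One way to picture the adaptive behaviour described in the issue is to derive the per-host delay from a smoothed response time. The sketch below is an assumption-laden illustration and is not taken from NUTCH-475.patch or adaptive-delay_draft.patch: the class AdaptiveDelaySketch, the exponential smoothing, and the factor/min/max parameters are all hypothetical choices.

```java
/** Hypothetical per-host delay calculator; not the approach in the attached patches. */
final class AdaptiveDelaySketch {
  private final long minDelayMs;
  private final long maxDelayMs;
  private final double factor;     // delay = factor * smoothed response time
  private final double alpha;      // weight given to the newest sample (0..1)
  private double smoothedMs = -1;  // -1 means no response observed yet

  AdaptiveDelaySketch(long minDelayMs, long maxDelayMs, double factor, double alpha) {
    this.minDelayMs = minDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.factor = factor;
    this.alpha = alpha;
  }

  /** Records how long the last fetch from this host took. */
  void recordResponseTime(long responseMs) {
    smoothedMs = (smoothedMs < 0)
        ? responseMs
        : alpha * responseMs + (1 - alpha) * smoothedMs;
  }

  /** Delay to wait before the next request to this host, clamped to [min, max]. */
  long nextDelayMs() {
    if (smoothedMs < 0) {
      return maxDelayMs; // be conservative until the host has answered once
    }
    long delay = Math.round(factor * smoothedMs);
    return Math.max(minDelayMs, Math.min(maxDelayMs, delay));
  }
}
```

A fast host drives the delay down towards the minimum, while slow responses push it back up towards the maximum, so the fetcher eases off a host that appears to be under load. An explicit crawl-delay from robots.txt would presumably still take precedence over any such calculation.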





[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-03-02 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1273:


Attachment: NUTCH-1273-v2-trunk.patch

This patch goes some way towards addressing the issues described on the user or 
dev list. I'm having some problems with Exceptions, and tbh I'm not really sure 
about the new API construction. I opted to switch the 
MimeUtil#autoResolveContentType code to use a mimetype String, as opposed to 
either of the following:
* Switch the code to use MediaType rather than MimeType, and call
 DefaultDetector directly (rather than using the Tika facade class)
* If we get back a String (not null) for the mimetype, create a MimeType
 object for it.

In all honesty, if the method I have used is not suitable then I think the 
latter of the above alternatives would be better, simply because we are not 
currently calling MediaType anywhere; I've been trying to keep things 
consistent when working on this one.

If someone could have a look it would be greatly appreciated. Thanks 

> Fix [deprecation] javac warnings
> 
>
> Key: NUTCH-1273
> URL: https://issues.apache.org/jira/browse/NUTCH-1273
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, 
> NUTCH-1273-v2-trunk.patch
>
>
> As part of this task, these warnings should be resolved, however this 
> particular strand of warnings can either be resolved by adding
> {code}
> @SuppressWarnings("deprecation")
> {code}
> or by actually upgrading our class usage to rely upon non-deprecated classes. 
> Which option is more appropriate for the project?
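
The two options in the description can be put side by side in a small sketch. The thread does not name the specific deprecated calls, so this example assumes java.util.Date#getYear() as a stand-in for them; the class DeprecationSketch and its methods are hypothetical.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

/** Hypothetical stand-in; the real warnings concern Nutch's use of other libraries. */
final class DeprecationSketch {

  // Option 1: keep the deprecated call and silence the javac warning.
  @SuppressWarnings("deprecation")
  static int yearSuppressed(Date date) {
    return date.getYear() + 1900; // Date#getYear() returns the year minus 1900
  }

  // Option 2: switch to a non-deprecated replacement.
  static int yearUpgraded(Date date) {
    Calendar calendar = new GregorianCalendar(); // default time zone, like Date#getYear()
    calendar.setTime(date);
    return calendar.get(Calendar.YEAR);
  }
}
```

Option 1 is the smaller change but leaves the dependency on the deprecated API in place, so the code breaks if that API is later removed; option 2 costs more up front and removes the warning at its source.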





[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-02-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1273:


Attachment: NUTCH-1273-nutchgora.patch

Preliminary patch for nutchgora branch. Same issue with protocol-http & tika 
methods. Will update shortly.

> Fix [deprecation] javac warnings
> 
>
> Key: NUTCH-1273
> URL: https://issues.apache.org/jira/browse/NUTCH-1273
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch
>
>
> As part of this task, these warnings should be resolved, however this 
> particular strand of warnings can either be resolved by adding
> {code}
> @SuppressWarnings("deprecation")
> {code}
> or by actually upgrading our class usage to rely upon non-deprecated classes. 
> Which option is more appropriate for the project?





[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-02-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1273:


Attachment: NUTCH-1273-trunk.patch

This is not 100% complete but fixes all but a trivial issue with use of 
deprecated Tika methods and one issue with http-client, which in all honesty 
I'm not going to fix for obvious reasons. I'll update the patch when I get news 
through about tika methods from user@tika. 

> Fix [deprecation] javac warnings
> 
>
> Key: NUTCH-1273
> URL: https://issues.apache.org/jira/browse/NUTCH-1273
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1273-trunk.patch
>
>
> As part of this task, these warnings should be resolved, however this 
> particular strand of warnings can either be resolved by adding
> {code}
> @SuppressWarnings("deprecation")
> {code}
> or by actually upgrading our class usage to rely upon non-deprecated classes. 
> Which option is more appropriate for the project?





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml

2012-02-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Attachment: NUTCH-1205-v5.patch
NUTCH-1205-v5.patch

This is getting laughable now. I've overcome god knows how many problems, but 
now there is a problem with the actual gora-core-0.2-SNAPSHOT jar which we pull 
from the nexus snapshot repository. For some reason we are pulling the test jar 
and not the functional one!!!
When applying the patch to Nutchgora and running 
{code}
$ ant compile > compile.txt
{code}
I get
{code}
compile-core:
[javac] /home/lewis/ASF/nutchgora/build.xml:97: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 170 source files to /home/lewis/ASF/nutchgora/build/classes
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/apache-cassandra-clientutil-1.0.2.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/apache-cassandra-thrift-1.0.2.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/bcel.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/dom4j-full.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/findbugs.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/plastic.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/jaxb-api.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/activation.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/jsr173_1.0_api.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/jaxb1-impl.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/xercesImpl.jar": no such file or directory
[javac] warning: [path] bad path element "/home/lewis/ASF/nutchgora/build/lib/xml-apis.jar": no such file or directory
{code} 

When you look in your /build directory and open the gora-core-0.2-SNAPSHOT.jar 
you'll see that it's the test jar right enough.

For tonight I'm giving it a by, but will try and pick this up tomorrow at some 
stage.
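
For anyone wanting to check a downloaded jar without opening it by hand, something along these lines might do. It is a rough sketch under assumptions: the class JarContentsSketch is hypothetical, and treating class names that start or end with "Test" as test classes is a naming heuristic of mine, not a documented Gora convention.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper; the "Test" naming heuristic is an assumption. */
final class JarContentsSketch {

  /** True if the entry is a .class file whose simple name starts or ends with "Test". */
  static boolean looksLikeTestClass(String entryName) {
    if (!entryName.endsWith(".class")) {
      return false;
    }
    String simpleName = entryName.substring(
        entryName.lastIndexOf('/') + 1,
        entryName.length() - ".class".length());
    return simpleName.startsWith("Test") || simpleName.endsWith("Test");
  }

  /** Counts the entries in the given jar that look like test classes. */
  static int countTestClasses(String jarPath) {
    try (JarFile jar = new JarFile(jarPath)) {
      int count = 0;
      Enumeration<JarEntry> entries = jar.entries();
      while (entries.hasMoreElements()) {
        if (looksLikeTestClass(entries.nextElement().getName())) {
          count++;
        }
      }
      return count;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Pass the path of the jar to inspect as the only argument.
    System.out.println(countTestClasses(args[0]) + " test-like classes");
  }
}
```

A jar in which nearly every class matches the heuristic is probably the test artifact rather than the functional one.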

> Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
> ---
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
> NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, 
> NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml

2012-02-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Priority: Blocker  (was: Minor)

> Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
> ---
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
> NUTCH-1205-v4.patch, NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1285) Debian Packaging for Nutch

2012-02-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1285:


Fix Version/s: 1.6
   nutchgora

> Debian Packaging for Nutch
> --
>
> Key: NUTCH-1285
> URL: https://issues.apache.org/jira/browse/NUTCH-1285
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.6
>
>
> This is a utopian type issue which will not be addressed for some time due to 
> many factors outwith our control, which exist within the Debian policy 
> ecosystem. 
> I've been in touch with Ioan over @ Apache James and they have recently 
> (after a number of years) made some real progress with this. Some links are 
> below
> [0] http://svn.apache.org/repos/asf/james/app
> [1] http://svn.apache.org/viewvc/james/app/trunk/pom.xml?view=markup
> [2] https://issues.apache.org/jira/browse/JAMES-1343
> [3] http://www.mail-archive.com/server-dev@james.apache.org/
> [4] http://www.debian.org/doc/debian-policy/
> [5] http://www.debian.org/doc/manuals/maint-guide/index.en.html





[jira] [Updated] (NUTCH-1283) Radically update all Solr configuration in Nutchgora

2012-02-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1283:


Summary: Radically update all Solr configuration in Nutchgora  (was: 
Ridically update all Solr configuration in Nutchgora)

Hi Markus. We maintain a rather radical/cutting edge 4.X Solr Schema in trunk 
[0]. What are your views on supporting this in Nutchgora?

[0] http://svn.apache.org/viewvc/nutch/trunk/conf/schema-solr4.xml?view=markup

> Radically update all Solr configuration in Nutchgora
> 
>
> Key: NUTCH-1283
> URL: https://issues.apache.org/jira/browse/NUTCH-1283
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: nutchgora
>
>
> We're currently running with a Schema which states it's 1.4 :0| There should 
> be better support for newer stuff going on over in Solrland. This issue 
> should track those improvements entirely.





[jira] [Updated] (NUTCH-978) [GSoC 2011] A Plugin for extracting certain element of a web page on html page parsing.

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-978:
---

Attachment: for_GSoc.zip

In its present form this is quite literally all over the place and is merely 
for safe keeping.

> [GSoC 2011] A Plugin for extracting certain element of a web page on html 
> page parsing.
> ---
>
> Key: NUTCH-978
> URL: https://issues.apache.org/jira/browse/NUTCH-978
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.2
> Environment: Ubuntu Linux 10.10; JDK 1.6; Netbeans 6.9
>Reporter: Ammar Shadiq
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: gsoc2011, mentor
> Fix For: nutchgora
>
> Attachments: 
> [Nutch-GSoC-2011-Proposal]Web_Page_Scrapper_Parser_Plugin.pdf, 
> app_guardian_ivory_coast_news_exmpl.png, 
> app_screenshoot_configuration_result.png, 
> app_screenshoot_configuration_result_anchor.png, 
> app_screenshoot_source_view.png, app_screenshoot_url_regex_filter.png, 
> for_GSoc.zip
>
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> Nutch uses the parse-html plugin to parse web pages. It processes the contents 
> of a web page by removing HTML tags and components such as JavaScript and CSS, 
> leaving the extracted text to be stored in the index. By default, Nutch cannot 
> select particular atomic elements of an HTML page, such as certain tags, 
> certain content, or some part of the page.
> An HTML page has a tree-like XML structure, with HTML tags as its branches and 
> text as its nodes. These branches and nodes can be extracted using XPath. XPath 
> lets us select a particular branch or node of an XML document, so it can be 
> used to extract certain information and treat it differently based on its 
> content and the user's requirements. Furthermore, a web domain such as a news 
> website usually uses the same HTML structure for the information on all of its 
> pages, so the same XPath query can be applied across those pages to retrieve 
> the same content element. All of the XPath queries for selecting various 
> content could be stored in an XPath Configuration File.
> Nutch is meant to handle many different web sources, and pages retrieved from 
> different sources do not share the same HTML structure, so each has to be 
> treated with the correct XPath configuration. The correct XPath configuration 
> could be selected automatically by matching the URL of the web page against 
> a regex describing the valid URL pattern for that configuration.
> This automatic mechanism allows Nutch users to process various web pages and 
> keep only the information they want, making the index more accurate and 
> its content more flexible.
> The components for this idea have been tested on Nutch 1.2 for selecting 
> certain elements on various news websites for the purpose of document 
> clustering. This includes a Configuration Editor Application built using the 
> NetBeans 6.9 Application Framework, though it needs some debugging.
> http://dl.dropbox.com/u/2642087/For_GSoC/for_GSoc.zip





[jira] [Updated] (NUTCH-728) Improve nutch release packaging

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-728:
---

Attachment: NUTCH-728-v2.patch
NUTCH-728-nutchgora.patch

Updated patches for trunk and Nutchgora

> Improve nutch release packaging
> ---
>
> Key: NUTCH-728
> URL: https://issues.apache.org/jira/browse/NUTCH-728
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Sami Siren
> Attachments: NUTCH-728-nutchgora.patch, NUTCH-728-v2.patch, 
> NUTCH-728.patch
>
>
> see the discussion from 
> http://www.lucidimagination.com/search/document/aa4d52cbd9af026a/discuss_contents_of_nutch_release_artifact





[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Patch Info: Patch Available

> Incompatible neko and xerces versions
> -
>
> Key: NUTCH-1253
> URL: https://issues.apache.org/jira/browse/NUTCH-1253
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Ubuntu 10.04
>Reporter: Dennis Spathis
> Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
>
>
> The Nutch 1.4 distribution includes
>  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-
> nekohtml)
>  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser 
> (configured to use neko) is invoked during a local-mode crawl, the parse 
> fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
> rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> id="lib-nekohtml"
>name="CyberNeko HTML Parser"
>version="1.9.11"
>provider-name="org.cyberneko">
>
>
>
>
>
> 
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?
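
The note above about adding a catch(Throwable) clause can be illustrated with a minimal, self-contained sketch. It is not the HtmlParser code: the class and method names are hypothetical, and the AbstractMethodError is thrown by hand to stand in for the failure caused by the mismatched JARs. It only shows why a catch(Exception) clause would not see the problem while catch(Throwable) does.

```java
// Hypothetical stand-in for a parser plugin; this is NOT the Nutch HtmlParser API.
class ThrowableLoggingSketch {

    /** Simulates a library call that fails with a linkage error at runtime. */
    static void parseWithIncompatibleJars() {
        // AbstractMethodError extends Error, not Exception.
        throw new AbstractMethodError("simulated neko/xerces mismatch");
    }

    /** Reports which catch clause handled the failure, or "none" on success. */
    static String parseAndReport() {
        try {
            parseWithIncompatibleJars();
            return "none";
        } catch (Exception e) {
            // Not reached for AbstractMethodError, so the failure would stay hidden.
            return "Exception:" + e.getClass().getSimpleName();
        } catch (Throwable t) {
            // Errors land here, so the stack trace can finally be logged.
            t.printStackTrace();
            return "Throwable:" + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAndReport());
    }
}
```

Catching Throwable is normally too broad for production code; here it is meant only as a temporary diagnostic, as the description suggests.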





[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2012-02-19 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1253:


Attachment: NUTCH-1253-nutchgora.patch
NUTCH-1253.patch

Trivial patches for both trunk and the Nutchgora branch. Could you please test 
and report back on this issue? Thanks 

> Incompatible neko and xerces versions
> -
>
> Key: NUTCH-1253
> URL: https://issues.apache.org/jira/browse/NUTCH-1253
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
> Environment: Ubuntu 10.04
>Reporter: Dennis Spathis
> Attachments: NUTCH-1253-nutchgora.patch, NUTCH-1253.patch
>
>
> The Nutch 1.4 distribution includes
>  - nekohtml-0.9.5.jar (under .../runtime/local/plugins/lib-
> nekohtml)
>  - xercesImpl-2.9.1.jar (under .../runtime/local/lib)
> These two JARs appear to be incompatible versions. When the HtmlParser 
> (configured to use neko) is invoked during a local-mode crawl, the parse 
> fails due to an AbstractMethodError. (Note: To see the AbstractMethodError, 
> rebuild the HtmlParser plugin and add a
> catch(Throwable) clause in the getParse method to log the stacktrace.)
> I found that substituting a later, compatible version of nekohtml (1.9.11)
> fixes the problem.
> Curiously, and in support of the above, the nekohtml plugin.xml file in
> Nutch 1.4 contains the following:
> id="lib-nekohtml"
>name="CyberNeko HTML Parser"
>version="1.9.11"
>provider-name="org.cyberneko">
>
>
>
>
>
> 
> Note the conflicting version numbers (version tag is "1.9.11" but the
> specified library is "nekohtml-0.9.5.jar").
> Was the 0.9.5 version included by mistake? Was the intention rather to
> include 1.9.11?





[jira] [Updated] (NUTCH-1086) Rewrite protocol-httpclient

2012-02-17 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1086:


Priority: Critical  (was: Major)

> Rewrite protocol-httpclient
> ---
>
> Key: NUTCH-1086
> URL: https://issues.apache.org/jira/browse/NUTCH-1086
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Markus Jelsma
>Priority: Critical
>
> There are several issues about protocol-httpclient and several comments about 
> rewriting the plugin with the new http client libraries. There is, however, 
> not yet an issue for rewriting/reimplementing protocol-httpclient.
> http://hc.apache.org/httpcomponents-client-ga/





[jira] [Updated] (NUTCH-1129) Any23 Nutch plugin

2012-02-14 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1129:


Attachment: NUTCH-1129.patch

This is a first ditch attempt at the parse-any23 plugin. In all honesty the 
patch is a monster due to a hugely excessive test suite. This will be cut down 
once I get the code implementation written properly. 

> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although at the time of 
> writing Any23 is not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 





[jira] [Updated] (NUTCH-1276) Fix [dep-ann]

2012-02-14 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1276:


Description: 
Generally speaking these are more straightforward than others as it should be a 
case of either annotating using
{code}
@Deprecated
{code}
or of course replacing the deprecated class method with another non-deprecated 
implementation. Hopefully most of these occurrences will be resolved within 
NUTCH-1273

  was:
Generally speaking these are more straightforward than others as it should be a 
case of either annotating using
{code}
@Deprecated
{code}
or of course replacing the deprecated class method with another non-deprecated 
implementation. Hopefully most of these occurrences will be resolved within 
NUTCH-1237


> Fix [dep-ann]
> -
>
> Key: NUTCH-1276
> URL: https://issues.apache.org/jira/browse/NUTCH-1276
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
>
> Generally speaking these are more straightforward than others as it should be 
> a case of either annotating using
> {code}
> @Deprecated
> {code}
> or of course replacing the deprecated class method with another 
> non-deprecated implementation. Hopefully most of these occurrences will be 
> resolved within NUTCH-1273
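
As a rough illustration of the first option in the description, the sketch below pairs a Javadoc @deprecated tag with the @Deprecated annotation, which is the combination the javac [dep-ann] lint check looks for. The class and method names are hypothetical and are not taken from the Nutch codebase.

```java
import java.lang.reflect.Method;

// Hypothetical class; names are illustrative only.
class DepAnnSketch {

    /**
     * @deprecated use {@link #fetchPage(String)} instead.
     */
    @Deprecated // with only the Javadoc tag and no annotation, -Xlint:dep-ann warns
    static String getPage(String url) {
        return fetchPage(url);
    }

    /** Non-deprecated replacement for getPage. */
    static String fetchPage(String url) {
        return "fetched:" + url;
    }

    /** Checks via reflection whether the named method carries @Deprecated. */
    static boolean isAnnotatedDeprecated(String methodName) {
        try {
            Method m = DepAnnSketch.class.getDeclaredMethod(methodName, String.class);
            return m.isAnnotationPresent(Deprecated.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no such method: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isAnnotatedDeprecated("getPage"));
        System.out.println(isAnnotatedDeprecated("fetchPage"));
    }
}
```

The second option in the description, replacing the deprecated method with a non-deprecated implementation, would correspond to callers switching from getPage to fetchPage in this sketch.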





[jira] [Updated] (NUTCH-1276) Fix [dep-ann]

2012-02-14 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1276:


Description: 
Generally speaking these are more straightforward than others as it should be a 
case of either annotating using
{code}
@Deprecated
{code}
or of course replacing the deprecated class method with another non-deprecated 
implementation. Hopefully most of these occurrences will be resolved within 
NUTCH-1237

  was:
Generally speaking these are more straightforward than others as it should be a 
case of either annotating using
{code}
@Deprecated
{code}
or of course replacing the deprecated class method with another non-deprecated 
implementation. Hopefully most of these occurrences will be resolved within 
NUTCH-


> Fix [dep-ann]
> -
>
> Key: NUTCH-1276
> URL: https://issues.apache.org/jira/browse/NUTCH-1276
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
>
> Generally speaking these are more straightforward than others as it should be 
> a case of either annotating using
> {code}
> @Deprecated
> {code}
> or of course replacing the deprecated class method with another 
> non-deprecated implementation. Hopefully most of these occurrences will be 
> resolved within NUTCH-1237





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml

2012-02-13 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Summary: Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml  (was: Upgrade 
gora modules to 0.2-incubating in ivy/ivy.xml)

> Upgrade gora modules to 0.2-SNAPSHOT in ivy/ivy.xml
> ---
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
> NUTCH-1205-v4.patch, NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-incubating in ivy/ivy.xml

2012-02-11 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Attachment: NUTCH-1205-v4.patch

This patch updates the deprecated ivy resolver, enabling us to utilise the 
bleeding edge Gora stuff. There is a problem with two maven-plugin 
dependencies. I don't know where they are coming from, so I thought I would put 
this patch up to see if anyone can resolve it themselves! Thanks   

> Upgrade gora modules to 0.2-incubating in ivy/ivy.xml
> -
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
> NUTCH-1205-v4.patch, NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1189) add commented out default settings to gora.properties files

2012-01-11 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1189:


Attachment: NUTCH-1189-v4.patch

Final patch attachment for now, hopefully we will be revisiting this issue when 
more data stores become available in Gora in the forthcoming months. Thanks 
Ferdy for the HBase commentary.

> add commented out default settings to gora.properties files 
> 
>
> Key: NUTCH-1189
> URL: https://issues.apache.org/jira/browse/NUTCH-1189
> Project: Nutch
>  Issue Type: Sub-task
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, 
> NUTCH-1189-v4.patch, NUTCH-1189.patch
>
>
> This issue should have been dealt with as part of its parent issue; however, 
> I think that as it is a fairly large task in itself, it needs to be done 
> independently. The gora.properties file should, amongst other settings, and 
> besides the extremely basic defaults for sqlstore, include defaults for opening 
> HBase, Cassandra, etc. servers on their default ports. Leaving this down 
> to individual interpretation puts a huge onus on the user, hence 
> constructing a barrier to entry for getting the configuration settings up and 
> running.
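
To make the idea of commented-out defaults concrete, here is a small sketch that loads a gora.properties-style text with java.util.Properties. The key names and values in SAMPLE are assumptions written from memory for illustration. They are not copied from the attached patches and should be checked against the Gora documentation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative only: the property keys and values below are assumptions.
class GoraPropertiesSketch {

    static final String SAMPLE =
          "# Basic SqlStore defaults (active)\n"
        + "gora.datastore.default=org.apache.gora.sql.store.SqlStore\n"
        + "gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver\n"
        + "gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest\n"
        + "\n"
        + "# HBase alternative, commented out so it is ignored until enabled\n"
        + "#gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n";

    /** Parses properties text; lines starting with '#' are treated as comments. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        System.out.println(props.getProperty("gora.datastore.default"));
        System.out.println(props.size());
    }
}
```

With this layout a user would switch data stores by commenting one block and uncommenting another instead of having to discover the key names themselves.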





[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

2012-01-11 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-965:
---

Attachment: NUTCH-965-v2.patch

Hi Guys,

I would ask you to comment, as this patch is not finished yet. Although I've 
made the functionality a boolean configurable, I've also intentionally 
neglected to address the second of your points, Julien, regarding 
FetcherJob.java.

I see that the boolean parsing value is set in this class [1], but would like 
you to confirm if the code I'm writing should live under the public Collection 
object on line 138.

Once this is addressed it would be great to get a patch for trunk.

Thanks to anyone who can comment on this. 

[1] 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java?view=markup

> Skip parsing for truncated documents
> 
>
> Key: NUTCH-965
> URL: https://issues.apache.org/jira/browse/NUTCH-965
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Alexis
>Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-965-v2.patch, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is 
> described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted 
> data caused by, for example, truncating big binary files at fetch time.
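
One way such a check might look is sketched below. This is not the code in the attached patches: the class, the method names and the idea of comparing the declared Content-Length header against the number of bytes actually fetched are all assumptions made for illustration, and the boolean flag stands in for whatever configuration property the patch introduces.

```java
// Hypothetical sketch; not the NUTCH-965 patch.
class TruncationCheckSketch {

    /**
     * Returns true when the declared Content-Length is larger than the number
     * of bytes actually fetched. A missing or malformed header is treated as
     * "not known to be truncated".
     */
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return false;
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return false;
        }
        return content.length < declared;
    }

    /** skipTruncated plays the role of the boolean configuration option. */
    static boolean shouldParse(boolean skipTruncated, String contentLengthHeader, byte[] content) {
        return !(skipTruncated && isTruncated(contentLengthHeader, content));
    }

    public static void main(String[] args) {
        byte[] fetched = new byte[1024];
        System.out.println(shouldParse(true, "4096", fetched));  // truncated, skipping enabled
        System.out.println(shouldParse(false, "4096", fetched)); // truncated, skipping disabled
        System.out.println(shouldParse(true, "1024", fetched));  // complete document
    }
}
```

Keeping the decision in a small predicate like this would let the same check be called from wherever parsing is triggered, which is the open question raised in the comment above.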





[jira] [Updated] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2012-01-09 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1138:


Attachment: NUTCH-1138-nutchgora.patch

Attached patch for Nutchgora which will hopefully put this one to bed. Compiles 
and tests pass with most recent Nutchgora code. Thanks 

{code}
BUILD SUCCESSFUL
Total time: 6 minutes 6 seconds

{code}

> remove LogUtil from trunk and nutch gora
> 
>
> Key: NUTCH-1138
> URL: https://issues.apache.org/jira/browse/NUTCH-1138
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: Document1.txt, NUTCH-1138-nutchgora.patch, 
> NUTCH-1138-trunk-20111023.patch
>
>
> This should move towards the removal of the LogUtil class from both codebases 
> as per comments in NUTCH-1078.





[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-01-09 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---

Attachment: NUTCH-840.patch

Hi Julien. I have absolutely no idea how or when I ended up working on this, 
but I think the attachment nearly addresses this issue. It is from a while back 
and to be honest I can't really remember working on it...

Anyway, I think the parse-tika tests fail as it is not quite working properly 
yet. The patch also changes the directory structure to o.a.n.p.tika rather than 
the existing o.a.n.tika, which is inconsistent with the other parser plugin 
implementations we ship with Nutch.

Sorry for hijacking this one slightly.

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch
>
>
> We don't have tests for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin





[jira] [Updated] (NUTCH-1237) Improve javac arguements for more verbose output

2011-12-28 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1237:


Attachment: NUTCH-1237-trunk.patch

> Improve javac arguements for more verbose output 
> -
>
> Key: NUTCH-1237
> URL: https://issues.apache.org/jira/browse/NUTCH-1237
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch, 
> NUTCH-1237-trunk.patch
>
>
> When trying to fix another problem I stumbled across this one. I think it is 
> important to ensure that the javac outputs info regarding deprecation and 
> unchecked operations.
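
For readers unfamiliar with these two warning categories, the following sketch contains one use of a deprecated method and two unchecked operations. Compiled without lint options, javac only prints summary notes suggesting a recompile with -Xlint:deprecation or -Xlint:unchecked; with those options it reports each location. The class names are hypothetical. How the options are passed in the Nutch build is not shown in this message; with Ant, the javac task's deprecation attribute and a nested compilerarg element are the usual route.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical legacy API; lives in its own class so that callers elsewhere get a warning.
class LegacyApiSketch {

    /** @deprecated retained only so that callers trigger a [deprecation] warning. */
    @Deprecated
    static int oldCount(List<String> items) {
        return items.size();
    }
}

class LintWarningsSketch {

    /** Raw-to-generic assignment: reported as [unchecked] under -Xlint:unchecked. */
    static List<String> asStrings(List raw) {
        List<String> typed = raw;
        return typed;
    }

    static int countViaLegacyApi() {
        List raw = new ArrayList();
        raw.add("a"); // [unchecked] call on a raw type
        raw.add("b");
        // Use of a deprecated method from another class: [deprecation]
        return LegacyApiSketch.oldCount(asStrings(raw));
    }

    public static void main(String[] args) {
        System.out.println(countViaLegacyApi());
    }
}
```

The code still compiles and runs either way; the lint options only change how much detail the compiler reports.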





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT

2011-12-27 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Attachment: NUTCH-1205-v2.patch

New patch which acknowledges Julien's comments regarding other dependencies. 
This patch ONLY upgrades the gora-module dependencies to 0.2-incubating 

> Upgrade gora modules to 0.2-SNAPSHOT
> 
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v2.patch, NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-incubating in ivy/ivy.xml

2011-12-27 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Summary: Upgrade gora modules to 0.2-incubating in ivy/ivy.xml  (was: 
Upgrade gora modules to 0.2-SNAPSHOT)

> Upgrade gora modules to 0.2-incubating in ivy/ivy.xml
> -
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v2.patch, NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1237) Improve javac arguements for more verbose output

2011-12-27 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1237:


Attachment: NUTCH-1237-nutchgora.patch
NUTCH-1237-trunk.patch

Patches for trunk and nutchgora branch

> Improve javac arguements for more verbose output 
> -
>
> Key: NUTCH-1237
> URL: https://issues.apache.org/jira/browse/NUTCH-1237
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch
>
>
> When trying to fix another problem I stumbled across this one. I think it is 
> important to ensure that the javac outputs info regarding deprecation and 
> unchecked operations.





[jira] [Updated] (NUTCH-1217) Update NOTICE.txt to drop some copyrights

2011-12-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1217:


Attachment: NUTCH-1217-trunk-v2.patch

new patch which greatly simplifies the trunk NOTICE.txt file.

> Update NOTICE.txt to drop some copyrights
> -
>
> Key: NUTCH-1217
> URL: https://issues.apache.org/jira/browse/NUTCH-1217
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1217-trunk-v2.patch, NUTCH-1217-trunk.patch
>
>
> We have many references to software copyrights which should be dropped. Most 
> of these relate to the Lucene legacy days.
> -Carrot2
> -saxpath
> -jaxen
> -jdom
> -snowball
> -violinstrings
> -Jena
> -bouncycastle
> -fontbox
> -jempbox
> -pdfbox
> -rome
> Also some need to be added
> -slf4j
> -activation
> -mortbay (jetty)
> -jline
> -junit
> -stax
> -wstx
> As I am unfamiliar with most of these, and as it is important to include all 
> references to software outside of the ASF, I would appreciate it if this list 
> could act as a starting point for completing this issue.





[jira] [Updated] (NUTCH-1236) Add link to site documentation to download older versions of Nutch.

2011-12-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1236:


Attachment: NUTCH-1236.patch

This small patch simply adds pages for older Nutch releases as well as a link to 
the trunk Sonar Analysis page. I will commit this once the svn site 
area has been updated to accommodate 1.4 changes. Thanks

> Add link to site documentation to download older versions of Nutch.
> ---
>
> Key: NUTCH-1236
> URL: https://issues.apache.org/jira/browse/NUTCH-1236
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Attachments: NUTCH-1236.patch
>
>
> As we are moving towards 2012 I thought it best to clear out my mailbox. I 
> found an older email which requested the link to download older versions of 
> Nutch. This was discussed and I think it would be best to get the link added 
> to the site documentation. 





[jira] [Updated] (NUTCH-1236) Add link to site documentation to download older versions of Nutch.

2011-12-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1236:


Patch Info: Patch Available

> Add link to site documentation to download older versions of Nutch.
> ---
>
> Key: NUTCH-1236
> URL: https://issues.apache.org/jira/browse/NUTCH-1236
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Attachments: NUTCH-1236.patch
>
>
> As we are moving towards 2012 I thought it best to clear out my mailbox. I 
> found an older email which requested the link to download older versions of 
> Nutch. This was discussed and I think it would be best to get the link added 
> to the site documentation. 





[jira] [Updated] (NUTCH-1218) Improve trunk API documentation

2011-12-13 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1218:


Attachment: NUTCH-1218.patch

This patch is a work in progress. So far it does the following:
1) Covers half of the core packages by expanding on the minimal 
package.html descriptions.
2) Fixes the issue with the ${Name} variable, which was incorrectly specified.
3) Adds missing plugins to the Javadoc Ant target in build.xml.

I have stumbled across one issue: can anyone explain why, in 
default.properties, there is a _*:\_ after some plugin class names but 
not after others?
{code}

#
# Parse Plugins
#
plugins.parse=\
   org.apache.nutch.parse.ext*:\
   org.apache.nutch.parse.js:\
   org.apache.nutch.parse.swf*:\
   org.apache.nutch.parse.tika:\
   org.apache.nutch.parse.zip

{code}
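
A possible reading of the snippet above, offered as an assumption and not a confirmed answer: the trailing backslash is the standard .properties line continuation, the colon separates entries in a list of package patterns, and a trailing asterisk is a wildcard that also matches sub-packages, as in the patterns accepted by the javadoc -group option. The sketch below loads the same text with java.util.Properties and applies a hypothetical prefix matcher. The matches helper is illustrative and is not how javadoc itself is implemented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative sketch; the wildcard interpretation is an assumption.
class PackagePatternSketch {

    static final String SAMPLE =
          "plugins.parse=\\\n"
        + "   org.apache.nutch.parse.ext*:\\\n"
        + "   org.apache.nutch.parse.js:\\\n"
        + "   org.apache.nutch.parse.swf*:\\\n"
        + "   org.apache.nutch.parse.tika:\\\n"
        + "   org.apache.nutch.parse.zip\n";

    /** Loads the property (continuation lines are joined) and splits it on ':'. */
    static String[] patterns(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("plugins.parse").split(":");
    }

    /** A trailing '*' matches any package starting with the prefix; otherwise exact match. */
    static boolean matches(String pattern, String packageName) {
        if (pattern.endsWith("*")) {
            return packageName.startsWith(pattern.substring(0, pattern.length() - 1));
        }
        return pattern.equals(packageName);
    }

    public static void main(String[] args) {
        for (String p : patterns(SAMPLE)) {
            System.out.println(p);
        }
    }
}
```

Under this reading, entries ending in an asterisk would be the plugins whose sub-packages should be grouped too, while the last entry needs neither a colon nor a backslash because nothing follows it.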



> Improve trunk API documentation
> ---
>
> Key: NUTCH-1218
> URL: https://issues.apache.org/jira/browse/NUTCH-1218
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.5
>
> Attachments: NUTCH-1218.patch
>
>
> The trunk API Java documentation could do with some improving. This issue 
> should track that. It should however not seek to change any functionality 
> within the codebase, only to substantiate and improve the existing 
> documentation.  





[jira] [Updated] (NUTCH-1094) create comprehensive documentation for Nutchgora branch

2011-12-12 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1094:


Summary: create comprehensive documentation for Nutchgora branch  (was: 
create comprehensive documentation for Nutch 2.0 trunk)

> create comprehensive documentation for Nutchgora branch
> ---
>
> Key: NUTCH-1094
> URL: https://issues.apache.org/jira/browse/NUTCH-1094
> Project: Nutch
>  Issue Type: Sub-task
>  Components: documentation
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: nutchgora
>
>
> This should shadow the core documentation for Nutch 1.4 (branch) and 
> mainstream users, however it should include fundamentals specific to Nutch 
> trunk. Until we release Nutch 2.0 this documentation should be stored in svn 
> under a /docs directory. 





[jira] [Updated] (NUTCH-1217) Update NOTICE.txt to drop some copyrights

2011-12-07 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1217:


Attachment: NUTCH-1217-trunk.patch

patch for trunk.
Nutchgora branch has a couple of additional dependencies so I will get a patch 
sorted for it when I have time.

> Update NOTICE.txt to drop some copyrights
> -
>
> Key: NUTCH-1217
> URL: https://issues.apache.org/jira/browse/NUTCH-1217
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1217-trunk.patch
>
>
> We have many references to software copyrights which should be dropped. Most 
> of these relate to the Lucene legacy days.
> -Carrot2
> -saxpath
> -jaxen
> -jdom
> -snowball
> -violinstrings
> -Jena
> -bouncycastle
> -fontbox
> -jempbox
> -pdfbox
> -rome
> Also some need to be added
> -slf4j
> -activation
> -mortbay (jetty)
> -jline
> -jsp
> -junit
> -log4j
> -stax
> -wstx
> As I am unfamiliar with most of these, and as it is important to include all 
> references to software outside of the ASF, I would appreciate it if this list 
> could act as a starting point for completing this issue.





[jira] [Updated] (NUTCH-1217) Update NOTICE.txt to drop some copyrights

2011-12-07 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1217:


Description: 
We have many references to software copyrights which should be dropped. Most of 
these relate to the Lucene legacy days.

-Carrot2
-saxpath
-jaxen
-jdom
-snowball
-violinstrings
-Jena
-bouncycastle
-fontbox
-jempbox
-pdfbox
-rome

Also some need to be added

-slf4j
-activation
-mortbay (jetty)
-jline
-junit
-stax
-wstx

As I am unfamiliar with most of these, and as it is important to include all 
references to software outside of the ASF, I would appreciate it if this list 
could act as a starting point for completing this issue.

  was:
We have many references to software copyrights which should be dropped. Most of 
these relate to the Lucene legacy days.

-Carrot2
-saxpath
-jaxen
-jdom
-snowball
-violinstrings
-Jena
-bouncycastle
-fontbox
-jempbox
-pdfbox
-rome

Also some need to be added

-slf4j
-activation
-mortbay (jetty)
-jline
-jsp
-junit
-log4j
-stax
-wstx

As I am unfamiliar with most of these, and as it is important to include all 
references to software outside of the ASF, I would appreciate it if this list 
could act as a starting point for completing this issue.

 Patch Info: Patch Available

> Update NOTICE.txt to drop some copyrights
> -
>
> Key: NUTCH-1217
> URL: https://issues.apache.org/jira/browse/NUTCH-1217
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1217-trunk.patch
>
>
> We have many references to software copyrights which should be dropped. Most 
> of these relate to the Lucene legacy days.
> -Carrot2
> -saxpath
> -jaxen
> -jdom
> -snowball
> -violinstrings
> -Jena
> -bouncycastle
> -fontbox
> -jempbox
> -pdfbox
> -rome
> Also some need to be added
> -slf4j
> -activation
> -mortbay (jetty)
> -jline
> -junit
> -stax
> -wstx
> As I am unfamiliar with most of these, and as it is important to include all 
> references to software outside of the ASF, I would appreciate it if this list 
> could act as a starting point for completing this issue.





[jira] [Updated] (NUTCH-1216) Add trivial comment to lib/native/README.txt

2011-12-06 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1216:


Attachment: NUTCH-1216-nutchgora.patch
NUTCH-1216-trunk.patch

Patches which fix this for both trunk and Nutchgora branch

> Add trivial comment to lib/native/README.txt
> 
>
> Key: NUTCH-1216
> URL: https://issues.apache.org/jira/browse/NUTCH-1216
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-1216-nutchgora.patch, NUTCH-1216-trunk.patch
>
>
> This trivial issue simply adds missing comments to the above file. The WARN 
> logging which is churned out has caused a small degree of confusion in the 
> past, therefore this sorts that out :0)





[jira] [Updated] (NUTCH-1216) Add trivial comment to lib/native/README.txt

2011-12-06 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1216:


Patch Info: Patch Available

> Add trivial comment to lib/native/README.txt
> 
>
> Key: NUTCH-1216
> URL: https://issues.apache.org/jira/browse/NUTCH-1216
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: nutchgora, 1.5
>
>
> This trivial issue simply adds missing comments to the above file. The WARN 
> logging which is churned out has caused a small degree of confusion in the 
> past, therefore this sorts that out :0)





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT

2011-11-23 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Attachment: NUTCH-1205.patch

This patch is breaking the build, but it is a work in progress. In short, the 
patch:
1) Upgrades all gora dependencies to 0.2-incubating
2) Upgrades all other dependencies which were also upgraded in current gora 
trunk.

The build fails as follows
{code}
compile-core:
[javac] /home/lewis/ASF/nutchgora/build.xml:97: warning: 
'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to 
false for repeatable builds
[javac] Compiling 171 source files to 
/home/lewis/ASF/nutchgora/build/classes
[javac] 
/home/lewis/ASF/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java:43:
 cannot find symbol
[javac] symbol  : method createDataStore(java.lang.Class>,java.lang.Class,java.lang.Class)
[javac] location: class org.apache.gora.store.DataStoreFactory
[javac] return DataStoreFactory.createDataStore(dataStoreClass,
[javac]^
[javac] 
/home/lewis/ASF/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java:59:
 cannot find symbol
[javac] symbol  : method createDataStore(java.lang.Class>,java.lang.Class,java.lang.Class,java.lang.String)
[javac] location: class org.apache.gora.store.DataStoreFactory
[javac] return DataStoreFactory.createDataStore(dataStoreClass,
[javac]^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 2 errors

BUILD FAILED

{code}
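
One plausible reading of the two "cannot find symbol" errors above is that the 
createDataStore overloads called from StorageUtils no longer exist with those 
parameter lists in the upgraded Gora jar. The sketch below only illustrates 
that kind of mismatch. FakeStoreFactory is a hypothetical stand-in, not the 
real org.apache.gora.store.DataStoreFactory, and the extra Properties 
parameter is an assumption chosen for illustration.

```java
import java.lang.reflect.Method;
import java.util.Properties;

class SignatureCheck {
    // Returns true if FakeStoreFactory declares createDataStore with exactly
    // these parameter types; false mirrors a compile-time "cannot find symbol".
    static boolean hasCreateDataStore(Class<?>... parameterTypes) {
        try {
            Method m = FakeStoreFactory.class.getDeclaredMethod("createDataStore", parameterTypes);
            return m != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The form the caller was written against (three Class arguments).
        boolean oldForm = hasCreateDataStore(Class.class, Class.class, Class.class);
        // The form the hypothetical upgraded factory actually offers.
        boolean newForm = hasCreateDataStore(Class.class, Class.class, Class.class, Properties.class);
        System.out.println("three-argument form present: " + oldForm);
        System.out.println("four-argument form present: " + newForm);
    }
}

// Hypothetical stand-in for a factory whose method gained a parameter between
// two library versions. It is NOT the real Gora API.
class FakeStoreFactory {
    static String createDataStore(Class<?> storeClass, Class<?> keyClass,
                                  Class<?> valueClass, Properties conf) {
        return storeClass.getSimpleName() + "<" + keyClass.getSimpleName()
                + "," + valueClass.getSimpleName() + ">";
    }
}
```

If this reading is right, the fix would be on the calling side: the calls in 
StorageUtils would need to match whatever overloads the new Gora version 
really exposes, which should be checked against its Javadoc.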

> Upgrade gora modules to 0.2-SNAPSHOT
> 
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1205) Upgrade gora modules to 0.2-SNAPSHOT

2011-11-23 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1205:


Patch Info: Patch Available

> Upgrade gora modules to 0.2-SNAPSHOT
> 
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Updated] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-11-11 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1200:


Attachment: NUTCH-1200-v2-trunk.patch

Patch locates missing dependencies. As suggested by Julien, this DOES NOT 
require us to add anything to NUTCH_ROOT/ivy/ivy.xml. I have a fully 
functioning Eclipse build environment (after some configuration) with the 
attached patch. I still don't like the ../../../ so a replacement is really 
needed for this to be worthwhile. Any ideas?

> Resolving Ivy dependencies in several plugins 
> --
>
> Key: NUTCH-1200
> URL: https://issues.apache.org/jira/browse/NUTCH-1200
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.5
>
> Attachments: NUTCH-1200-trunk.patch, NUTCH-1200-v2-trunk.patch
>
>
> When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins 
> requiring additional libraries OVER AND ABOVE the ones specified in 
> NUTCH_HOME/ivy/ivy.xml cannot resolve the dependencies. Specifically, the 
> classes are 
> {code}
> - FeedParser  conf="*->master"/>
> - URLAutomationFilter -  rev="???"/>
> - SWFParser  rev="2.0.1"/>
> - HTMLParserrev="1.9.15"/> 
> {code}
> Further to this, I cannot locate the dk.brics dependency!
> Finally, the plugin/ivy.xml files for the above plugins cannot be parsed 
> correctly due to the ${nutch.root} variable.





[jira] [Updated] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-11-10 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1200:


Attachment: NUTCH-1200-trunk.patch

A really nasty hack which includes the ../../../ that I mentioned on the mailing 
list. I DO NOT like this, so something should definitely replace it.
1) Define the dependencies in NUTCH_HOME/ivy/ivy.xml
2) ???
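
To make the ../../../ approach concrete, here is a minimal sketch of how such 
a relative include would resolve compared with a ${nutch.root} placeholder. 
The layout src/plugin/<name>/ivy.xml, the plugin name "feed", the included 
file ivy/ivy-configurations.xml and the root /opt/nutch are all assumptions 
for illustration, not taken from the attached patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class IvyIncludePaths {
    // Resolve an include path relative to the directory holding the plugin's ivy.xml.
    static Path resolveInclude(Path pluginIvyFile, String include) {
        return pluginIvyFile.getParent().resolve(include).normalize();
    }

    // Substitute a ${nutch.root} placeholder, as a build tool would if the property were set.
    static Path substituteRoot(String include, Path nutchRoot) {
        return Paths.get(include.replace("${nutch.root}", nutchRoot.toString())).normalize();
    }

    public static void main(String[] args) {
        Path root = Paths.get("/opt/nutch");                       // assumed checkout location
        Path pluginIvy = root.resolve("src/plugin/feed/ivy.xml");  // assumed plugin layout

        // Three levels up from src/plugin/feed/ lands on the checkout root.
        Path viaRelative = resolveInclude(pluginIvy, "../../../ivy/ivy-configurations.xml");
        Path viaProperty = substituteRoot("${nutch.root}/ivy/ivy-configurations.xml", root);

        System.out.println(viaRelative);
        System.out.println(viaProperty);
        System.out.println("same target: " + viaRelative.equals(viaProperty));
    }
}
```

Under these assumptions both forms point at the same file. The relative form 
works without any property being defined, but it silently depends on every 
plugin sitting exactly three directories below the root.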

> Resolving Ivy dependencies in several plugins 
> --
>
> Key: NUTCH-1200
> URL: https://issues.apache.org/jira/browse/NUTCH-1200
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.5
>
> Attachments: NUTCH-1200-trunk.patch
>
>
> When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins 
> requiring additional libraries OVER AND ABOVE the ones specified in 
> NUTCH_HOME/ivy/ivy.xml cannot resolve the dependencies. Specifically, the 
> classes are 
> {code}
> - FeedParser  conf="*->master"/>
> - URLAutomationFilter -  rev="???"/>
> - SWFParser  rev="2.0.1"/>
> - HTMLParserrev="1.9.15"/> 
> {code}
> Further to this, I cannot locate the dk.brics dependency!
> Finally, the plugin/ivy.xml files for the above plugins cannot be parsed 
> correctly due to the ${nutch.root} variable.





[jira] [Updated] (NUTCH-1200) Resolving Ivy dependencies in several plugins

2011-11-10 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1200:


Patch Info: Patch Available

> Resolving Ivy dependencies in several plugins 
> --
>
> Key: NUTCH-1200
> URL: https://issues.apache.org/jira/browse/NUTCH-1200
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.5
>
> Attachments: NUTCH-1200-trunk.patch
>
>
> When configuring Nutch 1.5-SNAPSHOT in Eclipse, I noticed that any plugins 
> requiring additional libraries OVER AND ABOVE the ones specified in 
> NUTCH_HOME/ivy/ivy.xml cannot resolve the dependencies. Specifically, the 
> classes are 
> {code}
> - FeedParser  conf="*->master"/>
> - URLAutomationFilter -  rev="???"/>
> - SWFParser  rev="2.0.1"/>
> - HTMLParserrev="1.9.15"/> 
> {code}
> Further to this, I cannot locate the dk.brics dependency!
> Finally, the plugin/ivy.xml files for the above plugins cannot be parsed 
> correctly due to the ${nutch.root} variable.





[jira] [Updated] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-04 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1189:


Attachment: NUTCH-1189-v2.patch

Second version added to acknowledge some pointers from the dev list. Admittedly I 
probably won't have much time to work on this over the next while, so I am attaching 
it before it gets lost.

> add commented out default settings to gora.properties files 
> 
>
> Key: NUTCH-1189
> URL: https://issues.apache.org/jira/browse/NUTCH-1189
> Project: Nutch
>  Issue Type: Sub-task
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1189-v2.patch, NUTCH-1189.patch
>
>
> This issue should have been dealt with as part of its parent issue; however, 
> as it is a fairly large task in itself, I think it needs to be done 
> independently. The gora.properties file should, amongst other settings and 
> besides the very basic defaults for sqlstore, include defaults for opening 
> HBase, Cassandra, etc. servers on their default ports. Leaving this down 
> to individual interpretation puts a huge onus on the user, hence 
> constructing a barrier to entry for getting the configuration settings up and 
> running.





[jira] [Updated] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-02 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1189:


Patch Info: Patch Available

> add commented out default settings to gora.properties files 
> 
>
> Key: NUTCH-1189
> URL: https://issues.apache.org/jira/browse/NUTCH-1189
> Project: Nutch
>  Issue Type: Sub-task
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1189.patch
>
>
> This issue should have been dealt with as part of its parent issue; however, 
> as it is a fairly large task in itself, I think it needs to be done 
> independently. The gora.properties file should, amongst other settings and 
> besides the very basic defaults for sqlstore, include defaults for opening 
> HBase, Cassandra, etc. servers on their default ports. Leaving this down 
> to individual interpretation puts a huge onus on the user, hence 
> constructing a barrier to entry for getting the configuration settings up and 
> running.





[jira] [Updated] (NUTCH-1189) add commented out default settings to gora.properties files

2011-11-02 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1189:


Attachment: NUTCH-1189.patch

So as far as I am aware, HBase doesn't need any additional properties specified 
within the gora.properties file, however both the SQL & Cassandra stores do. By 
default the minimum properties for the SQL store are attached with extra 
security features commented out. Finally, all the 'expected' Cassandra 
properties are included and commented out by default. This is a work in progress 
to lower the barrier to entry for Cassandra users.
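
As a rough illustration of what "commented out by default" means in practice, 
the sketch below loads a gora.properties-style text with java.util.Properties 
and shows that only the uncommented SQL store settings are visible to the 
application. The keys and values in SAMPLE are assumptions modelled on common 
Gora configuration, not the contents of the attached patch.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class GoraPropertiesSketch {
    // Illustrative file content only; every key and value here is an assumption.
    static final String SAMPLE = String.join("\n",
        "# Minimum settings for the SQL store",
        "gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver",
        "gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest",
        "# Optional credentials, commented out by default",
        "#gora.sqlstore.jdbc.user=sa",
        "#gora.sqlstore.jdbc.password=",
        "# Cassandra settings, commented out by default",
        "#gora.cassandrastore.servers=localhost:9160");

    // Parse properties text; lines starting with '#' are treated as comments.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        System.out.println("active keys: " + props.size());
        System.out.println("jdbc url: " + props.getProperty("gora.sqlstore.jdbc.url"));
        // Commented-out entries are absent until a user removes the leading '#'.
        System.out.println("cassandra servers: "
            + props.getProperty("gora.cassandrastore.servers", "<not set>"));
    }
}
```

Shipping the backend-specific entries commented out keeps the default 
configuration minimal while still showing users which keys to enable.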

> add commented out default settings to gora.properties files 
> 
>
> Key: NUTCH-1189
> URL: https://issues.apache.org/jira/browse/NUTCH-1189
> Project: Nutch
>  Issue Type: Sub-task
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1189.patch
>
>
> This issue should have been dealt with as part of its parent issue; however, 
> as it is a fairly large task in itself, I think it needs to be done 
> independently. The gora.properties file should, amongst other settings and 
> besides the very basic defaults for sqlstore, include defaults for opening 
> HBase, Cassandra, etc. servers on their default ports. Leaving this down 
> to individual interpretation puts a huge onus on the user, hence 
> constructing a barrier to entry for getting the configuration settings up and 
> running.





[jira] [Updated] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2011-11-01 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-902:
---

Attachment: NUTCH-902-v3.patch

patch to include previous config changes to NUTCHGORA/ivy/ivy.xml

> Add all necessary files and configuration so that nutch can be used with 
> different backends out-of-the-box
> --
>
> Key: NUTCH-902
> URL: https://issues.apache.org/jira/browse/NUTCH-902
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, storage
>Affects Versions: nutchbase
>Reporter: Enis Soztutar
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch
>
>
> As per the discussion in the mailing list and 
> http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
> necessary files and configuration. I propose that we maintain configuration 
> for at least SQL, HBase and Cassandra. 
> The following changes are needed:
> conf/gora-sql-mapping.xml
> conf/gora-hbase-mapping.xml
> conf/gora-cassandra-mapping.xml
> comments on nutch-default and ivy.xml 
> Shall we also include jars from gora-hbase, gora-cassandra and their 
> dependencies ? 





[jira] [Updated] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2011-10-31 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-902:
---

Attachment: NUTCH-902-v2.patch

Revised patch to incorporate additional comments.

> Add all necessary files and configuration so that nutch can be used with 
> different backends out-of-the-box
> --
>
> Key: NUTCH-902
> URL: https://issues.apache.org/jira/browse/NUTCH-902
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, storage
>Affects Versions: nutchbase
>Reporter: Enis Soztutar
>Assignee: Lewis John McGibbney
> Attachments: NUTCH-902-v2.patch, NUTCH-902.patch
>
>
> As per the discussion in the mailing list and 
> http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
> necessary files and configuration. I propose that we maintain configuration 
> for at least SQL, HBase and Cassandra. 
> The following changes are needed:
> conf/gora-sql-mapping.xml
> conf/gora-hbase-mapping.xml
> conf/gora-cassandra-mapping.xml
> comments on nutch-default and ivy.xml 
> Shall we also include jars from gora-hbase, gora-cassandra and their 
> dependencies ? 





[jira] [Updated] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-902:
---

Patch Info: Patch Available

> Add all necessary files and configuration so that nutch can be used with 
> different backends out-of-the-box
> --
>
> Key: NUTCH-902
> URL: https://issues.apache.org/jira/browse/NUTCH-902
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, storage
>Affects Versions: nutchbase
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Attachments: NUTCH-902.patch
>
>
> As per the discussion in the mailing list and 
> http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
> necessary files and configuration. I propose that we maintain configuration 
> for at least SQL, HBase and Cassandra. 
> The following changes are needed:
> conf/gora-sql-mapping.xml
> conf/gora-hbase-mapping.xml
> conf/gora-cassandra-mapping.xml
> comments on nutch-default and ivy.xml 
> Shall we also include jars from gora-hbase, gora-cassandra and their 
> dependencies ? 





[jira] [Updated] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-902:
---

Attachment: NUTCH-902.patch

This is the beginning of a patch to address the ticket. It smartens up some 
files here and there, however as I've not been able to test recently on 
cassandra I don't know which additional dependencies are required to be added 
to ivy/ivy.xml (hector???).

Finally, I've just used 'other' implementations from various resources for both 
cassandra and hbase xml mapping files. Obviously this is up for debate so 
please comment.

> Add all necessary files and configuration so that nutch can be used with 
> different backends out-of-the-box
> --
>
> Key: NUTCH-902
> URL: https://issues.apache.org/jira/browse/NUTCH-902
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, storage
>Affects Versions: nutchbase
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
> Attachments: NUTCH-902.patch
>
>
> As per the discussion in the mailing list and 
> http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
> necessary files and configuration. I propose that we maintain configuration 
> for at least SQL, HBase and Cassandra. 
> The following changes are needed:
> conf/gora-sql-mapping.xml
> conf/gora-hbase-mapping.xml
> conf/gora-cassandra-mapping.xml
> comments on nutch-default and ivy.xml 
> Shall we also include jars from gora-hbase, gora-cassandra and their 
> dependencies ? 





[jira] [Updated] (NUTCH-1104) Port issues from 1.x to trunk

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1104:


Description: 
A new issue to track issues that have not yet been ported from 1.x to trunk:

NUTCH-987
NUTCH-1028
NUTCH-1036
NUTCH-1057
NUTCH-1067
NUTCH-1101
NUTCH-1102
NUTCH-1105
NUTCH-940
NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk



  was:
A new issue to track issues that have not yet been ported from 1.x to trunk:

NUTCH-987
NUTCH-1028
NUTCH-1036
NUTCH-1057
NUTCH-1067
NUTCH-1101
NUTCH-1102
NUTCH-1105
NUTCH-940
NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
NUTCH-623 - Change plugin source directory "languageidentifier" to 
"language-identifier"



> Port issues from 1.x to trunk
> -
>
> Key: NUTCH-1104
> URL: https://issues.apache.org/jira/browse/NUTCH-1104
> Project: Nutch
>  Issue Type: Task
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: nutchgora
>
>
> A new issue to track issues that have not yet been ported from 1.x to trunk:
> NUTCH-987
> NUTCH-1028
> NUTCH-1036
> NUTCH-1057
> NUTCH-1067
> NUTCH-1101
> NUTCH-1102
> NUTCH-1105
> NUTCH-940
> NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk





[jira] [Updated] (NUTCH-842) AutoGenerate WebPage code

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-842:
---

Patch Info: Patch Available

> AutoGenerate WebPage code
> -
>
> Key: NUTCH-842
> URL: https://issues.apache.org/jira/browse/NUTCH-842
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: nutchgora
>
> Attachments: NUTCH-842.patch
>
>
> This issue will track the addition of an ant task that will automatically 
> generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
> src/gora/webpage.avsc.





[jira] [Updated] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1138:


Patch Info: Patch Available

> remove LogUtil from trunk and nutch gora
> 
>
> Key: NUTCH-1138
> URL: https://issues.apache.org/jira/browse/NUTCH-1138
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch
>
>
> This should move towards the removal of the LogUtil class from both codebases 
> as per comments in NUTCH-1078.





[jira] [Updated] (NUTCH-1175) Update ivy.xml to use correct dependancies with gora-cassandra as a backend

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1175:


Patch Info: Patch Available

> Update ivy.xml to use correct dependancies with gora-cassandra as a backend
> ---
>
> Key: NUTCH-1175
> URL: https://issues.apache.org/jira/browse/NUTCH-1175
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1175-20111020.patch
>
>
> This issue should add the correct target for the gora 0.1.1-incubating 
> dependency required to use Cassandra as storage mechanism.
> I will get a patch together and add in due course.





[jira] [Updated] (NUTCH-865) Format source code in unique style

2011-10-26 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-865:
---

Patch Info: Patch Available

> Format source code in unique style
> --
>
> Key: NUTCH-865
> URL: https://issues.apache.org/jira/browse/NUTCH-865
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Pham Tuan Minh
>Assignee: Lewis John McGibbney
> Fix For: 1.4
>
> Attachments: NUTCH-865-nutchgora-rev1188268.patch, 
> NUTCH-865-trunk-rev1188252.patch, NUTCH-865.patch
>
>
> We should define standard format rules for source code/comments, then use 
> an Eclipse tool to format the whole source code in the same style. 





[jira] [Updated] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-10-23 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1138:


Attachment: NUTCH-1138-trunk-20111023.patch

First crack at removal of all LogUtil imports, its usage and subsequently the 
LogUtil class itself. Some testing is seriously required here. I am not 
particularly confident about this patch and would appreciate some criticism 
and feedback to improve it where required.

> remove LogUtil from trunk and nutch gora
> 
>
> Key: NUTCH-1138
> URL: https://issues.apache.org/jira/browse/NUTCH-1138
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: Document1.txt, NUTCH-1138-trunk-20111023.patch
>
>
> This should move towards the removal of the LogUtil class from both codebases 
> as per comments in NUTCH-1078.





[jira] [Updated] (NUTCH-1138) remove LogUtil from trunk and nutch gora

2011-10-22 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1138:


Attachment: Document1.txt

A list of the files that have to be altered to get rid of LogUtil and replace 
it with an improved logging framework.

> remove LogUtil from trunk and nutch gora
> 
>
> Key: NUTCH-1138
> URL: https://issues.apache.org/jira/browse/NUTCH-1138
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.5
>
> Attachments: Document1.txt
>
>
> This should move towards the removal of the LogUtil class from both codebases 
> as per comments in NUTCH-1078.





[jira] [Updated] (NUTCH-865) Format source code in unique style

2011-10-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-865:
---

Attachment: NUTCH-865.patch

OK, since the rather useless commit, I've compiled a patch using the code 
formatter application bundled with Eclipse. For some reason a rather nasty 
number of files have been skipped. These are as follows:

{code}
lewis@lewis-desktop:~$ eclipse -application org.eclipse.jdt.core.JavaCodeFormatter -config ~/ASF/trunk/eclipse-codeformat.xml ~/ASF/trunk
Configuration Name: /home/lewis/ASF/trunk/eclipse-codeformat.xml
Starting format job ...
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/test/org/apache/nutch/util/TestURLUtil.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/test/org/apache/nutch/crawl/TestInjector.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/test/org/apache/nutch/crawl/TestGenerator.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/test/org/apache/nutch/fetcher/TestFetcher.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/index-static/src/java/org/apache/nutch/indexer/staticfield/StaticFieldIndexer.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/tld/src/java/org/apache/nutch/scoring/tld/TLDScoringFilter.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaConfig.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/GenericWritableConfigurable.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/TrieStringMatcher.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/domain/DomainSuffixes.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/domain/TopLevelDomain.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/domain/DomainSuffix.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/ObjectCache.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/EncodingDetector.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/util/NodeWalker.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/protocol/ProtocolException.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/protocol/ProtocolNotFound.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java/org/apache/nutch/protocol/ProtocolStatus.java. Skip the file.
The Eclipse formatter failed to format /home/lewis/ASF/trunk/src/java

[jira] [Updated] (NUTCH-1175) Update ivy.xml to use correct dependancies with gora-cassandra as a backend

2011-10-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1175:


Attachment: NUTCH-1175-20111020.patch

This has not been tested at all; I do not currently have access to resources 
to test on a development Cassandra node/cluster. I merely thought I would get 
this off to a start.

> Update ivy.xml to use correct dependancies with gora-cassandra as a backend
> ---
>
> Key: NUTCH-1175
> URL: https://issues.apache.org/jira/browse/NUTCH-1175
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1175-20111020.patch
>
>
> This issue should add the correct target for the gora 0.1.1-incubating 
> dependency required to use Cassandra as storage mechanism.
> I will get a patch together and add in due course.





[jira] [Updated] (NUTCH-1156) building errors with gora-hbase as a backend; update ivy.xml to use correct dependancies

2011-10-20 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1156:


Fix Version/s: nutchgora

> building errors with gora-hbase as a backend; update ivy.xml to use correct 
> dependancies
> 
>
> Key: NUTCH-1156
> URL: https://issues.apache.org/jira/browse/NUTCH-1156
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: nutchgora
>Reporter: Ferdy
> Fix For: nutchgora
>
> Attachments: NUTCH-1156-v1.patch, NUTCH-1156-v2.patch
>
>
> This patch makes sure nutchgora can actually be built when gora-hbase is 
> uncommented in ivy.xml. Note that it is still commented out, so sql is still 
> the default backend. However, whenever one wishes to use hbase (as we do), 
> simply uncommenting the section in ivy.xml won't do the trick. This patch 
> fixes this. Changes in ivy.xml:
> -Set correct version for gora-hbase and config.
> -Add thrift to exclude as it is not available in the repos; it is not needed 
> in most cases but please correct me if I'm wrong.
> -Additionally, it removes the comment that hbase library itself should be 
> manually added, as this is not needed anymore.





[jira] [Updated] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-10-11 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-623:
---

Attachment: NUTCH-623-nutchgora-20111011.patch

Patch attachment for the nutchgora branch.

> Change plugin source directory "languageidentifier" to "language-identifier"
> 
>
> Key: NUTCH-623
> URL: https://issues.apache.org/jira/browse/NUTCH-623
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ignacio J. Ortega
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-623-branch-1.4-20110810.patch, 
> NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-branch-1.4-20110910-v2.patch, 
> NUTCH-623-nutchgora-20111011.patch, NUTCH-623-trunk-1.4-20110924.patch, 
> NUTCH-623-trunk-2.0-20110810.patch
>
>
> When trying to develop and debug Nutch in Eclipse, following the 
> instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you 
> can't run unless languageidentifier is renamed to language-identifier. When 
> you later issue an svn update, you end up with two languageidentifier src 
> dirs, one with the dash and another without it. It's only an annoyance, I 
> know, but it had me stuck for 2 weeks, so if it can be corrected... 





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2011-10-10 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1053:


Attachment: seed.txt

I attach a seed file which I've used with the crawl command to parse and index 
several feed URLs. Using the crawl command, the only warning in my logs was as 
follows:
{code}
2011-10-10 22:10:37,853 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.feed.FeedParser mapped to contentType application/rss+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/rss+xml
{code} 

Additionally, I've used the command line to attempt to parse the feeds, but I'm 
getting the following. Any thoughts? Can you give a use case or a URL which 
will reproduce the problem you mention with the RSS parser?
{code}
lewis@lewis:~/ASF/trunk/runtime/local$ bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser http://feeds.bbci.co.uk/news/scotland/rss.xml
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
Caused by: java.io.FileNotFoundException: http:/feeds.bbci.co.uk/news/scotland/rss.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at org.apache.nutch.parse.feed.FeedParser.main(FeedParser.java:209)
    ... 5 more
{code}
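
One possible reading of the trace above, offered as a sketch rather than a confirmed diagnosis: the FileInputStream frame directly under FeedParser.main suggests the argument is being opened as a local file rather than fetched as a URL. java.io.File normalises the name it is given, which would explain the single slash in "http:/feeds.bbci.co.uk/..." in the message. The class and method names below are made up for illustration and are not part of Nutch.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical demo class, not part of Nutch.
class UrlAsFileDemo {

    /** Returns the path java.io.File keeps for the given argument. */
    static String asLocalPath(String arg) {
        // File normalises the name, collapsing repeated separators.
        return new File(arg).getPath();
    }

    /** Returns true only if the argument can be opened as a local file. */
    static boolean opensAsLocalFile(String arg) {
        try (FileInputStream in = new FileInputStream(arg)) {
            return true;
        } catch (IOException e) {
            // FileNotFoundException is a subclass of IOException.
            return false;
        }
    }

    public static void main(String[] args) {
        String url = "http://feeds.bbci.co.uk/news/scotland/rss.xml";
        System.out.println("path seen by File: " + asLocalPath(url));
        System.out.println("opens as local file: " + opensAsLocalFile(url));
    }
}
```

If that reading is right, passing the path of a locally saved copy of the feed would be the thing to try with this entry point, though that is an assumption about FeedParser.main and has not been checked against the source.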

> Parsing of RSS feeds fails 
> ---
>
> Key: NUTCH-1053
> URL: https://issues.apache.org/jira/browse/NUTCH-1053
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.4
>
> Attachments: seed.txt
>
>
> See discussion on 
> http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
> Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-1109) Add Sonar targets to Ant build.xml

2011-10-10 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1109:


Attachment: NUTCH-1109-nutchgora-20111010.patch

Patch attachment for a nutchgora 2.0 Sonar task. Depending on how the 1.4 trunk 
job goes, it would be nice to get a similar job established for nutchgora.

> Add Sonar targets to Ant build.xml
> --
>
> Key: NUTCH-1109
> URL: https://issues.apache.org/jira/browse/NUTCH-1109
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: build
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1109-branch-1.4-20110910.patch, 
> NUTCH-1109-nutchgora-20111010.patch, NUTCH-1109-trunk-1.4-20110927.patch, 
> NUTCH-1109-trunk-20111006-v2.patch, sonar-ant-task-1.1.jar
>
>
> Sonar [1] is an open platform to manage code quality. I was experimenting 
> today with what kind of analysis it allows us to do on a given codebase and 
> was pleasantly surprised with the results. For details on the documentation 
> please see here [2]. It can be easily integrated into our ant build.xml and 
> is an easy way to explicitly identify latent areas of code which we could 
> possibly improve upon. 
> At this stage I wish to highlight some of my statistics in findings...
> Running Sonar via the attached patch identifies (based upon the analysis 
> rules from Sonar) that the Branch-1.4 codebase contains issues as follows
> {code}
> Critical 28   
> Major 1,231   
> Minor 356 
> Info  119
> {code}
> These range from a catch statement being identified in o.a.n.crawl.Generator 
> which shouldn't be catching throwable since it includes errors, through to 
> trivial issues such as nested statements which could be combined in the same 
> class.
> Although on the face of it this seems an excellent way to make code more 
> consistent across the board, which may in turn lead to 'better' code, I am in 
> no way saying that this is a step we should move towards without thinking it 
> through and discussing at length. I also think that there needs to be a good 
> deal of our own judgement to decide whether any issues flagged up by Sonar 
> should be marked as false positives.
> To conclude, I would like to add that I only decided to open this issue in an 
> attempt to gauge people's views on the direction it takes us in.
> [1] http://www.sonarsource.org/
> [2] http://docs.codehaus.org/display/SONAR/Documentation
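
To illustrate the catch-Throwable finding mentioned in the description, here is a minimal sketch. It is not the code from o.a.n.crawl.Generator; the class and method names are hypothetical. It only shows why catching Throwable is broader than catching Exception: Errors are Throwables too, so they get swallowed along with ordinary exceptions.

```java
// Hypothetical illustration; not code from o.a.n.crawl.Generator.
class CatchScopeDemo {

    /** Handles ordinary exceptions only; Errors propagate to the caller. */
    static String runCatchingException(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (Exception e) {
            return "handled " + e.getClass().getSimpleName();
        }
    }

    /** Catches everything, including Errors the code cannot sensibly recover from. */
    static String runCatchingThrowable(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (Throwable t) {
            return "swallowed " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        Runnable failsWithException = () -> { throw new IllegalStateException("bad state"); };
        Runnable failsWithError = () -> { throw new LinkageError("simulated JVM-level problem"); };

        System.out.println(runCatchingException(failsWithException));
        System.out.println(runCatchingThrowable(failsWithError));
        try {
            runCatchingException(failsWithError);
        } catch (Error e) {
            // The narrower handler lets the Error reach the caller.
            System.out.println("propagated " + e.getClass().getSimpleName());
        }
    }
}
```

Whether a given catch block flagged by Sonar should actually be narrowed is still a judgement call, as the description says.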





[jira] [Updated] (NUTCH-1109) Add Sonar targets to Ant build.xml

2011-10-06 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1109:


Attachment: NUTCH-1109-trunk-20111006-v2.patch

Final patch adding a Sonar Ant task to replace the broken pmd target, which 
has now been deprecated.

> Add Sonar targets to Ant build.xml
> --
>
> Key: NUTCH-1109
> URL: https://issues.apache.org/jira/browse/NUTCH-1109
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: build
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1109-branch-1.4-20110910.patch, 
> NUTCH-1109-trunk-1.4-20110927.patch, NUTCH-1109-trunk-20111006-v2.patch, 
> sonar-ant-task-1.1.jar
>
>
> Sonar [1] is an open platform to manage code quality. I was experimenting 
> today with what kind of analysis it allows us to do on a given codebase and 
> was pleasantly surprised with the results. For details on the documentation 
> please see here [2]. It can be easily integrated into our ant build.xml and 
> is an easy way to explicitly identify latent areas of code which we could 
> possibly improve upon. 
> At this stage I wish to highlight some of my statistics in findings...
> Running Sonar via the attached patch identifies (based upon the analysis 
> rules from Sonar) that the Branch-1.4 codebase contains issues as follows
> {code}
> Critical 28   
> Major 1,231   
> Minor 356 
> Info  119
> {code}
> These range from a catch statement being identified in o.a.n.crawl.Generator 
> which shouldn't be catching throwable since it includes errors, through to 
> trivial issues such as nested statements which could be combined in the same 
> class.
> Although on the face of it this seems an excellent way to make code more 
> consistent across the board, which may in turn lead to 'better' code, I am in 
> no way saying that this is a step we should move towards without thinking it 
> through and discussing at length. I also think that there needs to be a good 
> deal of our own judgement to decide whether any issues flagged up by Sonar 
> should be marked as false positives.
> To conclude, I would like to add that I only decided to open this issue in an 
> attempt to gauge people's views on the direction it takes us in.
> [1] http://www.sonarsource.org/
> [2] http://docs.codehaus.org/display/SONAR/Documentation





[jira] [Updated] (NUTCH-1133) Fix TestInjector for Nutchgora

2011-10-05 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1133:


Patch Info: Patch Available

> Fix TestInjector for Nutchgora
> --
>
> Key: NUTCH-1133
> URL: https://issues.apache.org/jira/browse/NUTCH-1133
> Project: Nutch
>  Issue Type: Sub-task
>  Components: injector
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1081.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora





[jira] [Updated] (NUTCH-1134) Fix TestFetcher for Nutchgora

2011-10-05 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1134:


Patch Info: Patch Available

> Fix TestFetcher for Nutchgora
> -
>
> Key: NUTCH-1134
> URL: https://issues.apache.org/jira/browse/NUTCH-1134
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1081.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora





[jira] [Updated] (NUTCH-1134) Fix TestFetcher for Nutchgora

2011-10-05 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1134:


Attachment: NUTCH-1081.patch

Trivial patch fixes TestGenerator, TestInjector & TestFetcher in nutchgora 
branch.

> Fix TestFetcher for Nutchgora
> -
>
> Key: NUTCH-1134
> URL: https://issues.apache.org/jira/browse/NUTCH-1134
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1081.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora





[jira] [Updated] (NUTCH-1133) Fix TestInjector for Nutchgora

2011-10-05 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1133:


Attachment: NUTCH-1081.patch

Trivial patch fixes TestGenerator, TestInjector & TestFetcher in nutchgora 
branch.

> Fix TestInjector for Nutchgora
> --
>
> Key: NUTCH-1133
> URL: https://issues.apache.org/jira/browse/NUTCH-1133
> Project: Nutch
>  Issue Type: Sub-task
>  Components: injector
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1081.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora





[jira] [Updated] (NUTCH-1132) Fix TestGenerator for Nutchgora

2011-10-05 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1132:


Attachment: NUTCH-1081.patch

Trivial patch fixes TestGenerator, TestInjector & TestFetcher in nutchgora 
branch.

> Fix TestGenerator for Nutchgora
> ---
>
> Key: NUTCH-1132
> URL: https://issues.apache.org/jira/browse/NUTCH-1132
> Project: Nutch
>  Issue Type: Sub-task
>  Components: generator
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1081.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora





[jira] [Updated] (NUTCH-1132) Fix TestGenerator for Nutchgora

2011-10-05 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1132:


Patch Info: Patch Available

> Fix TestGenerator for Nutchgora
> ---
>
> Key: NUTCH-1132
> URL: https://issues.apache.org/jira/browse/NUTCH-1132
> Project: Nutch
>  Issue Type: Sub-task
>  Components: generator
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Critical
> Fix For: nutchgora
>
> Attachments: NUTCH-1081.patch
>
>
> This issue is part of a larger target which aims to fix broken JUnit tests 
> for Nutchgora





[jira] [Updated] (NUTCH-1136) Ant pmd target is broken

2011-09-30 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1136:


Attachment: NUTCH-1136-nutchgora-20110930.patch
NUTCH-1136-trunk-1.4-20110930.patch

Patches for both trunk-1.4 and nutchgora.

They simply remove all references to the broken PMD test reporting.

> Ant pmd target is broken
> 
>
> Key: NUTCH-1136
> URL: https://issues.apache.org/jira/browse/NUTCH-1136
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1136-nutchgora-20110930.patch, 
> NUTCH-1136-trunk-1.4-20110930.patch
>
>
> Issuing an 'ant pmd' command results in a failure as follows:
> {code}
> BUILD FAILED
> /home/lewis/ASF/trunk/build.xml:327: taskdef class 
> net.sourceforge.pmd.ant.PMDTask cannot be found
>  using the classloader AntClassLoader[]
> {code}
> The resulting fix should address this.
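
The taskdef failure in the description comes down to a class not being visible to the class loader in use. As a rough sketch of that kind of lookup (a hypothetical helper, not part of the Nutch build and not how Ant itself resolves taskdefs), one can ask a class loader directly whether a class name resolves:

```java
// Hypothetical helper, not part of the Nutch build.
class TaskdefCheck {

    /** Returns true if the named class is visible to the given class loader. */
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: only locate the class, do not run static initialisers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = TaskdefCheck.class.getClassLoader();
        String pmdTask = "net.sourceforge.pmd.ant.PMDTask";
        System.out.println(pmdTask + " loadable: " + isLoadable(pmdTask, loader));
    }
}
```

Run with and without the PMD jar on the classpath, this would show whether the class named in the error is reachable at all; it says nothing about the classpath Ant builds for the taskdef.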





[jira] [Updated] (NUTCH-1136) Ant pmd target is broken

2011-09-30 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1136:


Patch Info: Patch Available

> Ant pmd target is broken
> 
>
> Key: NUTCH-1136
> URL: https://issues.apache.org/jira/browse/NUTCH-1136
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1136-nutchgora-20110930.patch, 
> NUTCH-1136-trunk-1.4-20110930.patch
>
>
> Issuing an 'ant pmd' command results in a failure as follows:
> {code}
> BUILD FAILED
> /home/lewis/ASF/trunk/build.xml:327: taskdef class 
> net.sourceforge.pmd.ant.PMDTask cannot be found
>  using the classloader AntClassLoader[]
> {code}
> The resulting fix should address this.





[jira] [Updated] (NUTCH-809) Parse-metatags plugin

2011-09-30 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-809:
---

Affects Version/s: nutchgora
   1.4
Fix Version/s: nutchgora
   1.4

This is great Elisabeth, thank you. Marked for possible inclusion in 1.4 (and 
of course nutchgora :0))

> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.4, nutchgora
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-809.patch, NUTCH-809_metatags_1.3.patch
>
>
> h2. Parse-metatags plugin
> The parse-metatags plugin consists of an HTMLParserFilter that takes as a 
> parameter a list of metatag names, with '*' as the default value. The names 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify the following in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The query-basic plugin is used to include these fields in the search, e.g. in 
> nutch-site.xml:
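> {code:xml}
> <property>
>   <name>query.basic.description.boost</name>
>   <value>2.0</value>
> </property>
> <property>
>   <name>query.basic.keywords.boost</name>
>   <value>2.0</value>
> </property>
> {code}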
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com
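
As a rough illustration of the metatags.names setting described above, the following Java sketch shows one way a ';'-separated list with a '*' default could be turned into a simple matcher. The class and method names (MetatagNamesSketch, parse, accepts) are hypothetical and are not taken from the plugin, and lower-casing the names is an assumption made here for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the parse-metatags plugin.
final class MetatagNamesSketch {

  // Splits a ';'-separated value such as "description;keywords" into a set of
  // lower-cased names. A null or blank value falls back to the default "*".
  static Set<String> parse(String configured) {
    String value = (configured == null || configured.trim().isEmpty()) ? "*" : configured;
    Set<String> names = new LinkedHashSet<>();
    for (String part : value.split(";")) {
      String name = part.trim().toLowerCase(Locale.ROOT);
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  // A metatag is accepted if the set contains "*" or the lower-cased tag name.
  static boolean accepts(Set<String> names, String metatag) {
    return names.contains("*") || names.contains(metatag.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> names = parse("description;keywords");
    System.out.println(accepts(names, "Keywords"));     // true
    System.out.println(accepts(names, "author"));       // false
    System.out.println(accepts(parse(null), "author")); // true, default is "*"
  }
}
```

Keeping the parsed names in a set makes the per-tag check cheap, and treating '*' as a wildcard entry avoids a separate code path for the default.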





[jira] [Updated] (NUTCH-1136) Ant pmd target is broken

2011-09-29 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1136:


Affects Version/s: nutchgora
Fix Version/s: nutchgora

> Ant pmd target is broken
> 
>
> Key: NUTCH-1136
> URL: https://issues.apache.org/jira/browse/NUTCH-1136
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.4, nutchgora
>Reporter: Lewis John McGibbney
> Fix For: 1.4, nutchgora
>
>
> Issuing an 'ant pmd' command results in the following failure:
> {code}
> BUILD FAILED
> /home/lewis/ASF/trunk/build.xml:327: taskdef class 
> net.sourceforge.pmd.ant.PMDTask cannot be found
>  using the classloader AntClassLoader[]
> {code}
> The resulting fix should address this.





[jira] [Updated] (NUTCH-208) http: proxy exception list:

2011-09-29 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-208:
---

Fix Version/s: (was: 1.4)
   1.5

> http: proxy exception list:
> ---
>
> Key: NUTCH-208
> URL: https://issues.apache.org/jira/browse/NUTCH-208
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 0.8, 1.3, nutchgora
>Reporter: Matthias Günter
>Assignee: Lewis John McGibbney
>Priority: Trivial
>  Labels: patch
> Fix For: nutchgora, 1.5
>
> Attachments: NUTCH-208-branch-1.4-20110210-v3.patch, 
> NUTCH-208-branch-1.4-20110807.patch, NUTCH-208-branch-1.4-20110809-v2.patch, 
> NUTCH-208-trunk-2.0-20110810-v2.patch, NUTCH-208-trunk-2.0-20110810.patch, 
> patch.txt, patch.txt, proxy_exception_list-0.8.diff
>
>
> I suggest adding a parameter to nutch-default.xml that allows a proxy 
> exception list to be defined:
> <property>
>   <name>http.proxy.exception.list</name>
>   <value></value>
>   <description>URL's and hosts that don't use the proxy (e.g. intranets)</description>
> </property>
> This is useful when scanning intranet/internet combinations from behind a 
> firewall. A preliminary patch showing the changes is attached to this 
> request. We will test it and update it if necessary. This also reflects the 
> reality in web browsers, where in most cases there is an exception list.
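
To make the idea of a proxy exception list concrete, here is a minimal Java sketch of the lookup it implies. It is not taken from any of the attached patches: the class name ProxyExceptionSketch, the method useProxy, the example host names, and the assumption that the property value is a comma-separated list of host names are all hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not code from the attached patches.
final class ProxyExceptionSketch {

  private final Set<String> exceptions = new HashSet<>();

  // Assumption: the configured value is a comma-separated list of host names.
  ProxyExceptionSketch(String configured) {
    if (configured != null) {
      for (String part : configured.split(",")) {
        String host = part.trim().toLowerCase(Locale.ROOT);
        if (!host.isEmpty()) {
          exceptions.add(host);
        }
      }
    }
  }

  // Returns true when requests to this host should go through the proxy,
  // i.e. when the host is not on the exception list.
  boolean useProxy(String host) {
    return !exceptions.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    ProxyExceptionSketch list =
        new ProxyExceptionSketch("intranet.example.com, wiki.example.com");
    System.out.println(list.useProxy("intranet.example.com")); // false
    System.out.println(list.useProxy("www.apache.org"));       // true
  }
}
```

An empty property value leaves the set empty, so every host keeps using the proxy, which matches the empty default shown in the property above.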





[jira] [Updated] (NUTCH-672) allow unit tests to be run from bin/nutch

2011-09-29 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-672:
---

Attachment: NUTCH-672-nutchgora-20110929.patch

Patch attachment for nutchgora. In my opinion, it's even more important that 
this gets integrated into nutchgora as it will greatly help when I get round to 
fixing the tests/classes. Thanks again Julien for the guidance.

> allow unit tests to be run from bin/nutch
> -
>
> Key: NUTCH-672
> URL: https://issues.apache.org/jira/browse/NUTCH-672
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: 1.3
>Reporter: Todd Lipcon
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: 
> 0001-NUTCH-672-allow-junit-tests-to-be-run-from-bin-nutc.patch, 
> NUTCH-672-junit-test-commandline.patch, NUTCH-672-nutchgora-20110929.patch, 
> NUTCH-672-trunk-1.4-20110929.patch
>
>
> In development it's handy to be able to run a single test case easily. You 
> can do it with ant -Dtestcase=foo test, but that's slow since it still checks 
> all the plugins for changes, rebuilds jars, etc.
> This patch adds a command to bin/nutch to run a JUnit test against what's already 
> compiled. It's much faster than using ant. Recommended for use with nutch 
> -core





[jira] [Updated] (NUTCH-672) allow unit tests to be run from bin/nutch

2011-09-29 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-672:
---

Affects Version/s: 1.3
Fix Version/s: 2.0
   1.4

One additional bit of commentary. This patch requires ant compile-test-core 
prior to ant runtime as the generated test classes need to be copied to the 
runtime configurations.

> allow unit tests to be run from bin/nutch
> -
>
> Key: NUTCH-672
> URL: https://issues.apache.org/jira/browse/NUTCH-672
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Affects Versions: 1.3
>Reporter: Todd Lipcon
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: 
> 0001-NUTCH-672-allow-junit-tests-to-be-run-from-bin-nutc.patch, 
> NUTCH-672-junit-test-commandline.patch, NUTCH-672-trunk-1.4-20110929.patch
>
>
> In development it's handy to be able to run a single test case easily. You 
> can do it with ant -Dtestcase=foo test, but that's slow since it still checks 
> all the plugins for changes, rebuilds jars, etc.
> This patch adds a command to bin/nutch to run a JUnit test against what's already 
> compiled. It's much faster than using ant. Recommended for use with nutch 
> -core





[jira] [Updated] (NUTCH-672) allow unit tests to be run from bin/nutch

2011-09-29 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-672:
---

Attachment: NUTCH-672-trunk-1.4-20110929.patch

This small patch makes JUnit testing so much easier. It's taken me a while to 
figure this one out, therefore thank you Julien for the recent comments. 
Possible to include in 1.4 release?

> allow unit tests to be run from bin/nutch
> -
>
> Key: NUTCH-672
> URL: https://issues.apache.org/jira/browse/NUTCH-672
> Project: Nutch
>  Issue Type: New Feature
>  Components: build
>Reporter: Todd Lipcon
>Assignee: Lewis John McGibbney
>Priority: Minor
> Attachments: 
> 0001-NUTCH-672-allow-junit-tests-to-be-run-from-bin-nutc.patch, 
> NUTCH-672-junit-test-commandline.patch, NUTCH-672-trunk-1.4-20110929.patch
>
>
> In development it's handy to be able to run a single test case easily. You 
> can do it with ant -Dtestcase=foo test, but that's slow since it still checks 
> all the plugins for changes, rebuilds jars, etc.
> This patch adds a command to bin/nutch to run a JUnit test against what's already 
> compiled. It's much faster than using ant. Recommended for use with nutch 
> -core





[jira] [Updated] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-29 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1078:


Attachment: NUTCH-1078-trunk-20110929.patch

The attached patch changes the LogUtil class to accommodate the requirements of 
the slf4j framework.

I ask if this could be tested and if necessary I will update other classes as 
per my earlier comments.

At this stage in time my thinking is that we should separate the removal of 
LogUtil from this issue as, fundamentally, they mean different things for 
the Nutch codebase. 

> Upgrade all instances of commons logging to slf4j (with log4j backend)
> --
>
> Key: NUTCH-1078
> URL: https://issues.apache.org/jira/browse/NUTCH-1078
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.4
>
> Attachments: NUTCH-1078-branch-1.4-20110816.patch, 
> NUTCH-1078-branch-1.4-20110824-v2.patch, 
> NUTCH-1078-branch-1.4-20110911-v3.patch, 
> NUTCH-1078-branch-1.4-20110916-v4.patch, NUTCH-1078-trunk-20110929.patch
>
>
> Whilst working on another issue, I noticed that some classes still import and 
> use commons logging, for example HttpBase.java
> {code}
> import java.util.*;
> // Commons Logging imports
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> // Nutch imports
> import org.apache.nutch.crawl.CrawlDatum;
> {code}
> At this stage I am unsure how many others (if any) still import and rely 
> upon commons logging, however they should be upgraded to slf4j for branch-1.4.





[jira] [Updated] (NUTCH-1109) Add Sonar targets to Ant build.xml

2011-09-27 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1109:


Attachment: NUTCH-1109-trunk-1.4-20110927.patch

As per comments here [1] I attach a patch which will 'hopefully' enable us to 
get the Sonar task set up on the ASF Sonar instance. 

[1] http://mail-archives.apache.org/mod_mbox/www-builds/201109.mbox/browser

> Add Sonar targets to Ant build.xml
> --
>
> Key: NUTCH-1109
> URL: https://issues.apache.org/jira/browse/NUTCH-1109
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.4, 2.0
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: build
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1109-branch-1.4-20110910.patch, 
> NUTCH-1109-trunk-1.4-20110927.patch, sonar-ant-task-1.1.jar
>
>
> Sonar [1] is an open platform to manage code quality. I was experimenting 
> today with what kind of analysis it allows us to do on a given codebase and 
> was pleasantly surprised with the results. For details on the documentation 
> please see here [2]. It can be easily integrated into our ant build.xml and 
> is an easy way to explicitly identify latent areas of code which we could 
> possibly improve upon. 
> At this stage I wish to highlight some statistics from my findings.
> Running Sonar via the attached patch identifies (based upon the analysis 
> rules from Sonar) that the Branch-1.4 codebase contains issues as follows
> {code}
> Critical 28   
> Major 1,231   
> Minor 356 
> Info  119
> {code}
> These range from a catch statement being identified in o.a.n.crawl.Generator 
> which shouldn't be catching Throwable since it includes Errors, through to 
> trivial issues such as nested statements which could be combined in the same 
> class.
> Although on the face of it, this seems an excellent way to make code more 
> consistent across the board, which may in turn lead to 'better' code, I am in 
> no way saying that this is a step we should move towards without thinking it 
> through and discussing at length. I also think that there needs to be a good 
> deal of our own judgement to decide whether any issues flagged up by Sonar 
> should be marked as false positives.
> To conclude I would like to add that I only decided to open this issue in an 
> attempt to gauge people's views on the direction it takes us in.
> [1] http://www.sonarsource.org/
> [2] http://docs.codehaus.org/display/SONAR/Documentation
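
For readers unfamiliar with the "catching Throwable" finding mentioned above, the Java sketch below shows the general problem. It is a hypothetical illustration and not the code from o.a.n.crawl.Generator: the class CatchThrowableSketch and its methods are invented for this example, and the Error is simulated.

```java
// Hypothetical illustration only; not code from o.a.n.crawl.Generator.
final class CatchThrowableSketch {

  // Simulates a step that fails with an Error rather than an Exception.
  static void failWithError() {
    throw new Error("simulated");
  }

  // Catching Throwable also swallows Errors, hiding a potentially fatal condition.
  static String catchThrowable() {
    try {
      failWithError();
      return "completed";
    } catch (Throwable t) {
      return "swallowed " + t.getClass().getSimpleName();
    }
  }

  // Catching Exception leaves Errors alone, so they propagate to the caller.
  static String catchException() {
    try {
      failWithError();
      return "completed";
    } catch (Exception e) {
      return "handled " + e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(catchThrowable()); // swallowed Error
    try {
      catchException();
    } catch (Error e) {
      System.out.println("Error propagated: " + e.getMessage()); // Error propagated: simulated
    }
  }
}
```

Narrowing the catch clause to Exception (or to something more specific) is the usual remedy for this kind of finding, though whether a given case is a false positive remains a judgement call.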
