Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-08-10 Thread Julien Nioche
Hi Tom, I have been using Nutch 1.x for the last 9 months or so and it works well for large scale crawls up to around a billion pages. However, the inherent lack of random access in HDFS really starts to become a burden on our hadoop cluster when going through the whole

Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-08-10 Thread Markus Jelsma
Julien, devs, users, I'd like to see bugs fixed in 2.0 but some of them are way out of my league or would cost me an absurd amount of time. I'd also really like to use Gora but Gora must be maintained. Gora will play a fundamental role in 2.0 and if something is broken there it is not trivial

Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]

2011-08-10 Thread lewis john mcgibbney
Hi, Without changing the flow of conversation and the points which have already been touched upon, I would like to add: I am really split here between a couple of decisions. I like the abstraction that Gora provides, even though it is somewhat of a pain to configure, this also presents a barrier

Writing JUnit test classes for Nutch

2011-08-10 Thread lewis john mcgibbney
Hi, I have been working on NUTCH-208 [1] in an attempt to clean up some dated issues on our JIRA and have been considering Sami's comments regarding unit tests for this specific improvement. My questions are as follows: What is the procedure for defining whether a certain patch requires

Re: [jira] [Reopened] (NUTCH-917) Website Navigation Links

2011-08-10 Thread lewis john mcgibbney
Hi Julien, this has now been dealt with. Any chance of checking when you get round to it? Thank you. On Thu, Jul 28, 2011 at 7:28 PM, Julien Nioche (JIRA) j...@apache.org wrote: [

[jira] [Updated] (NUTCH-208) http: proxy exception list:

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-208: --- Attachment: NUTCH-208-trunk-2.0-20110810.patch Patch attached for trunk 2.0. I am

[jira] [Commented] (NUTCH-208) http: proxy exception list:

2011-08-10 Thread Markus Jelsma (JIRA)
Priority: Trivial Labels: patch Fix For: 1.4, 2.0 Attachments: NUTCH-208-branch-1.4-20110807.patch, NUTCH-208-branch-1.4-20110809-v2.patch, NUTCH-208-trunk-2.0-20110810.patch, patch.txt, patch.txt, proxy_exception_list-0.8.diff I suggest that a parameter

[jira] [Commented] (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082292#comment-13082292 ] Lewis John McGibbney commented on NUTCH-258: When I was viewing

[jira] [Commented] (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2011-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082298#comment-13082298 ] Julien Nioche commented on NUTCH-258: - Lewis - this issue is closed and I am not sure

[jira] [Closed] (NUTCH-917) Website Navigation Links

2011-08-10 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche closed NUTCH-917. --- Resolution: Fixed That's great, thanks Lewis Website Navigation Links

[jira] [Commented] (NUTCH-1028) Log parser keys

2011-08-10 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082304#comment-13082304 ] Markus Jelsma commented on NUTCH-1028: -- Committed for 1.4 in rev. 1156132. Log

Nutch 2.0 DOAP

2011-08-10 Thread lewis john mcgibbney
Hi, Just for information purposes, I committed our DOAP which can now be found under trunk svn. I have been informed by site-dev@ that the system they use does not support more than one doap file, however I thought it best to keep it in svn for the time being. If at some point in the future Nutch

[jira] [Updated] (NUTCH-208) http: proxy exception list:

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-208: --- Attachment: NUTCH-208-trunk-2.0-20110810-v2.patch new patch for trunk 2.0

[jira] [Created] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-08-10 Thread Lewis John McGibbney (JIRA)
Upgrade all instances of commons logging to slf4j (with log4j backend) -- Key: NUTCH-1078 URL: https://issues.apache.org/jira/browse/NUTCH-1078 Project: Nutch Issue Type:

Re: Nutch 2.0 DOAP

2011-08-10 Thread Julien Nioche
That's great, thanks! On 10 August 2011 14:58, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Just for information purposes, I committed our DOAP which can now be found under trunk svn. I have been informed by site-dev@ that the system they use does not support more than one doap

[jira] [Commented] (NUTCH-296) Image Search

2011-08-10 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082373#comment-13082373 ] Simão Fontes commented on NUTCH-296: The GSoC did generate some code. There have been

[jira] [Commented] (NUTCH-296) Image Search

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082388#comment-13082388 ] Lewis John McGibbney commented on NUTCH-296: Hi Simão, any chance we could

Re: [jira] [Commented] (NUTCH-296) Image Search

2011-08-10 Thread Simão Fontes
The code developed was for integration on nutchwax. The link to the project is: https://webarchive.jira.com/wiki/display/SOC06/Text-based+image+search+capability+for+NutchWAX The code has been made available to checkout, but it works on a previous version of nutch.

[jira] [Updated] (NUTCH-623) Change plugin source directory languageidentifier to language-identifier

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-623: --- Attachment: NUTCH-623-branch-1.4-20110810.patch This patch for branch-1.4 simply

[jira] [Updated] (NUTCH-623) Change plugin source directory languageidentifier to language-identifier

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-623: --- Attachment: NUTCH-623-branch-1.4-20110810.patch patch for trunk. Both of the above

[jira] [Reopened] (NUTCH-296) Image Search

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reopened NUTCH-296: This issue is back open... The code developed was for integration on nutchwax. The

[jira] [Commented] (NUTCH-672) allow unit tests to be run from bin/nutch

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082558#comment-13082558 ] Lewis John McGibbney commented on NUTCH-672: OK having tried to get this

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

2011-08-10 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082595#comment-13082595 ] Lewis John McGibbney commented on NUTCH-1075: - Hi Julien, Would it be