Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Markus Jelsma
Hi,

To keep up with the rest of the world, I believe we should move from the old 
Hadoop mapred API to the new MapReduce API, which has already been done for 
the nutchgora branch. Upgrading from hadoop-core to hadoop-common is easily 
done in Ivy, but all jobs must be tackled and we have many jobs!

Can anyone give pointers and a helping hand with this large task?

Cheers,

-- 
Markus Jelsma - CTO - Openindex


Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Lewis John Mcgibbney
Hi Markus,

I'm certainly in agreement here. If you'd like to open a Jira issue, we can
begin to build up a picture of what is required.

Lewis

-- 
Lewis


[jira] [Created] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2011-12-13 Thread Markus Jelsma (Created) (JIRA)
Upgrade all jobs to new MapReduce API
-------------------------------------

         Key: NUTCH-1219
         URL: https://issues.apache.org/jira/browse/NUTCH-1219
     Project: Nutch
  Issue Type: Task
    Reporter: Markus Jelsma
    Priority: Critical
     Fix For: 1.5


We should upgrade to the new Hadoop API for Nutch trunk, as has already been 
done for the Nutchgora branch.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

I guess the question is also whether 0.22 is compatible enough that the 
existing code, which uses the old API, still more or less compiles. If it 
is, we can make the transition gradually; if it isn't, it's a bigger issue.


This is easy to verify: just drop in the 0.22 jars and see whether it 
compiles and the tests pass.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2011-12-13 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1219:
---------------------------------

Description: 
We should upgrade to the new Hadoop API for Nutch trunk, as has already been 
done for the Nutchgora branch. If I'm not mistaken, we can already upgrade to 
the latest 0.20.5 version, which still carries the legacy API. That way we can 
port the jobs to the new API without immediately upgrading to 0.21 or higher 
and without needing a separate branch to work on.

To the committers who created or ported jobs in Nutchgora: please write down 
your advice and experience.

  was:We should upgrade to the new Hadoop API for Nutch trunk as already has 
been done for the Nutchgora branch.






Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Andrzej Bialecki

On 13/12/2011 18:04, Markus Jelsma wrote:

 Hi

 I did a quick test to see what happens and it won't compile. It cannot find
 our old mapred APIs in 0.22. I've also tried 0.20.205.0, which compiles but
 won't run, and many tests fail with errors like:

 Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/jackson/map/JsonMappingException
  at org.apache.nutch.util.dupedb.HostDeduplicator.deduplicator(HostDeduplicator.java:421)


Hmm... what's that? I don't see this class (or this package) in the 
Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.


  at org.apache.nutch.util.dupedb.HostDeduplicator.run(HostDeduplicator.java:443)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.util.dupedb.HostDeduplicator.main(HostDeduplicator.java:431)
 Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.JsonMappingException
  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
  ... 4 more

 I think this can be overcome, but we cannot hide from the fact that all jobs
 must be ported to the new API at some point.

 You did some work on the new APIs; did you come across any cumbersome issues
 when working on it?


It was quite some time ago, but I don't remember anything being really 
complicated. It was just tedious, and once you've done one class the 
other classes follow roughly the same pattern.





Re: Upgrading to Hadoop 0.22.0+

2011-12-13 Thread Markus Jelsma

 Hmm... what's that? I don't see this class (or this package) in the
 Nutch tree. Also, trunk doesn't use JSON for anything as far as I know.

It's thrown when the job is run, so it must be a mapred thing.
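
One way to check that guess is to probe the runtime classpath directly. The sketch below is an illustration only and not part of Nutch or Hadoop: the class name ClasspathProbe and its isPresent helper are hypothetical. It looks up the old-API org.apache.hadoop.mapred.Mapper, the new-API org.apache.hadoop.mapreduce.Mapper, and the Jackson class named in the stack trace, and it assumes it is run with the same classpath as the failing job.

```java
class ClasspathProbe {

  /** Returns true if the named class can be located; the class is not initialized. */
  static boolean isPresent(String className) {
    try {
      Class.forName(className, false, ClasspathProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      // Not on the classpath, or present but unloadable (e.g. a missing dependency).
      return false;
    }
  }

  public static void main(String[] args) {
    String[] names = {
      "org.apache.hadoop.mapred.Mapper",               // old mapred API
      "org.apache.hadoop.mapreduce.Mapper",            // new MapReduce API
      "org.codehaus.jackson.map.JsonMappingException"  // class from the stack trace
    };
    for (String name : names) {
      System.out.println(name + " -> " + (isPresent(name) ? "found" : "MISSING"));
    }
  }
}
```

If the Jackson class is reported as missing while both Mapper classes are found, that would suggest a dependency of the Hadoop jars is absent from the runtime classpath rather than a problem in the job itself.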

 
 It was quite some time ago .. but I don't remember anything being really
 complicated, it was just tedious - and once you've done one class the
 other classes follow roughly the same pattern.

Hmm, yes. I checked both Hadoop books and saw a few migration slides. It 
shouldn't be too hard. I'll just give it a try on some custom jobs.

Thanks


[jira] [Updated] (NUTCH-1218) Improve trunk API documentation

2011-12-13 Thread Lewis John McGibbney (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1218:


Attachment: NUTCH-1218.patch

This patch is a work in progress. So far it includes the following:
1) Covers half of the core packages by substantiating the minimal 
package.html descriptions.
2) Fixes the issue with the ${Name} variable, which was incorrectly specified.
3) Adds missing plugins to the Javadoc Ant target in build.xml.

There is one thing I have stumbled across: can anyone explain why, in 
default.properties, there is a _*:\_ after some plugin package names but 
not after others?
{code}

#
# Parse Plugins
#
plugins.parse=\
   org.apache.nutch.parse.ext*:\
   org.apache.nutch.parse.js:\
   org.apache.nutch.parse.swf*:\
   org.apache.nutch.parse.tika:\
   org.apache.nutch.parse.zip

{code}
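
For what it's worth, here is a minimal sketch of how such a value could be read. It assumes default.properties is loaded with java.util.Properties and that the resulting value is used as a colon-separated list of Javadoc group package patterns, where a trailing asterisk is a wildcard. The class PluginGroupPatterns and its methods are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class PluginGroupPatterns {

  /** Loads the properties text and splits the value of key on ':' into patterns. */
  static List<String> patterns(String propertiesText, String key) {
    Properties props = new Properties();
    try {
      // A trailing backslash continues the value on the next line;
      // leading whitespace on the continuation line is dropped.
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return Arrays.asList(props.getProperty(key, "").split(":"));
  }

  /** Assumed pattern semantics: a trailing '*' matches any package with that prefix. */
  static boolean matches(String pattern, String packageName) {
    if (pattern.endsWith("*")) {
      return packageName.startsWith(pattern.substring(0, pattern.length() - 1));
    }
    return pattern.equals(packageName);
  }

  public static void main(String[] args) {
    String text = String.join("\n",
        "plugins.parse=\\",
        "   org.apache.nutch.parse.ext*:\\",
        "   org.apache.nutch.parse.js:\\",
        "   org.apache.nutch.parse.swf*:\\",
        "   org.apache.nutch.parse.tika:\\",
        "   org.apache.nutch.parse.zip");
    for (String pattern : patterns(text, "plugins.parse")) {
      System.out.println(pattern);
    }
  }
}
```

Under these assumptions the backslash only continues the line and the colon separates one pattern from the next, so the last entry needs neither. The asterisk makes a pattern match every package whose name starts with that prefix, while an entry without it matches only the exact package name.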



 Improve trunk API documentation
 -------------------------------

                 Key: NUTCH-1218
                 URL: https://issues.apache.org/jira/browse/NUTCH-1218
             Project: Nutch
          Issue Type: Sub-task
          Components: documentation
    Affects Versions: 1.4
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.5

         Attachments: NUTCH-1218.patch


 The trunk API Java documentation could do with some improvement. This issue 
 should track that work. It should not, however, seek to change any 
 functionality within the codebase, only to substantiate and improve the 
 existing documentation.
