Re: Nutch 2.0 Press Announcement

2012-06-22 Thread Sally Khudairi
Hello Lewis -- great to hear from you, as always. Hello Nutch DevTeam!

Of course; I'm happy to help. What's your timeframe?

Traditionally, these sorts of announcements are something I work on with 
the PMC rather than dev (no offense, folks; it's more an issue of public 
exposure prior to the announcement being made). Whatever works best for you is 
fine...I'm flexible.

Having said that, what is your timeframe? In other words, has v2.0 already been 
released (I hope not!)? Also, if you would like to include supporting 
testimonial quotes from highly-visible users (organizations), we are going to 
have to plan to set aside at least a week for those to come in (some companies 
have strict vetting/clearance requirements by their legal teams).

And finally, in an ideal situation, we'll work on the announcement together 
(usually there's a point-person assigned to take the lead on this, and we'll 
run drafts by the list during the final editing stages) so I can get a better 
grasp of the project and be able to highlight what's new/important/sexy/*.

Thanks again. I look forward to working with y'all <g>

Chat soon,
Sally
 




 From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: Sally Khudairi s...@apache.org 
Cc: dev@nutch.apache.org 
Sent: Thursday, 21 June 2012, 16:49
Subject: Nutch 2.0 Press Announcement
 
Good Evening Sally,

First and foremost I hope you are keeping well and that the beginning
of the summer has been kind to you... all the good weather still to
come not to worry :0)

The reason I am contacting you is that we (the Apache Nutch community) are
nearly ready to release Nutch 2.0, which represents a pretty
significant milestone for Apache Nutch as a project. Although Nutch
2.0 is not considered mainstream development (a decision made by
the PMC some time ago), it still marks a real step forward for the
project as a whole and also pays serious merit to users, developers
and committers past and present. For these reasons I think it
would be excellent for the community if we could really get the
message out that the project is rocking, in addition to the fact that
it is an excellent, well followed, vibrant TLP within the foundation.

I wonder if it would be possible for us to get a formal press
announcement constructed based on input from ourselves in
collaboration with your experience in this area?

I am coming into the official press releases from an almost blind
tangent so would really appreciate your guidance and input on this one
if possible.

Thanks in advance for any input you have.

Best

Lewis

N.B. Would anyone from dev@ please chime in on this thread? I personally feel
that the better the announcement, the more our community grows. Thank you.




Re: Nutch 2.0 DOAP

2011-08-10 Thread Julien Nioche
That's great, thanks!

On 10 August 2011 14:58, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi,

 Just for information purposes, I committed our DOAP which can now be found
 under trunk svn. I have been informed by site-dev@ that the system they
 use does not support more than one doap file, however I thought it best to
 keep it in svn for the time being. If at some point in the future Nutch 2.0
 becomes the de facto Nutch release then no-one will need to recreate one.

 Thanks

 --
 *Lewis*




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Nutch 2.0 Documentation

2011-08-09 Thread Markus Jelsma
Hi,

Maybe a stupid question, but I don't see a trunk/docs?

Cheers


On Thursday 04 August 2011 12:47:54 lewis john mcgibbney wrote:
 Hi,
 
 Was mucking around on a totally separate personal issue with Gora today and
 couldn't help but like the /docs directory which is bundled when you svn co
 the project. I would really like to push to get this going as per [1] as I
 have been trying to get various documentation updated over the last while.
 This would be a reasonable milestone which would carve the way for a fully
 documented Nutch 2.0 (and branch 1.4) ;0)
 
 Would it be possible for me to invoke a small conversation on this topic to
 gather thoughts as it seems this issue has been forgotten about again.
 
 Thank you
 
 [1] https://issues.apache.org/jira/browse/NUTCH-881

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Nutch 2.0 roadmap

2011-07-04 Thread Julien Nioche
Hi Lewis,


Currently the slightly (in places) dated roadmap can be found here [1]. I
 was wondering if we could give this an overhaul/update as it would give a
 more robust overview of where trunk is going. Most of the points you make
 are still in development, however some have been achieved and integrated
 into trunk builds. Is there anything else we can add to this page to reflect
 initiatives currently in dev regarding trunk (major or minor)?


There isn't much happening to the trunk, partly because building it is not
very straightforward, but this should get better once the GORA artefacts are
published (I think Chris was about to do another RC). There are also
outstanding issues in GORA with some of the backends (e.g. disappearing
URLs), failing tests etc...


 You make a lot of good points in your Berlin Buzzwords presentation Julien,
 would it be possible to initiate further discussion amongst devs on these
 points?


Some of the points are relevant for the 1.x branch as well. We can
definitely list them on the Wiki.


  I noticed another point you mentioned was that we are thin on
 documentation for trunk... this is very much true. It would be great to get
 an up-to-date roadmap for trunk; as we plan to release this year, it is
 essential that this is seen to.


Having a roadmap would be good of course, but being able to compile, fix
essential bugs and have minimal documentation should probably be enough to
do an initial release.

Thanks

Julien
-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Nutch 2.0 Help

2010-09-08 Thread Julien Nioche
Hi guys,

I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on
http://wiki.apache.org/nutch/GORA_HBase

Feel free to amend and improve as you see fit.

Please bear in mind that Nutch 2.0 is at a very early stage and is far from
being bug-proof, see in particular [1].

HTH

Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 6 September 2010 13:35, Andrzej Bialecki a...@getopt.org wrote:

 On 2010-09-05 14:56, David Stuart wrote:

 Hi All,

 I have done as per below and can create a table from within the hbase
 shell. I found the appropriate create table method
 bin/nutch org.apache.nutch.storage.WebTableCreator webtable but it only
 returns null

 Any help would be great


 You don't have to create a table manually - this should happen
 automatically when you first run any Nutch tool. Just make sure you have
 hbase-site.xml on your classpath in Nutch - best if you put it in your conf/
 and rebuild, so that it's packed into a job jar.

 Here, for example, are my config files that work with HBase (I don't use any
 non-standard settings for HBase, so my hbase-site.xml has no properties, but
 it still needs to be included in the Nutch job jar):

 gora-hbase-mapping.xml:
 -

 <gora-orm>

 <table name="webtable">
  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
  <family name="f"/>
  <family name="s"/>
  <family name="il"/>
  <family name="ol"/>
  <family name="h"/>
  <family name="mtdt"/>
  <family name="mk"/>
 </table>

 <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="protocolStatus" family="f" qualifier="prot"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>

  <!-- parse fields -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="parseStatus" family="p" qualifier="st"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>

  <!-- score fields -->
  <field name="score" family="s" qualifier="s"/>

  <field name="headers" family="h"/>

  <field name="inlinks" family="il"/>

  <field name="outlinks" family="ol"/>

  <field name="metadata" family="mtdt"/>

  <field name="markers" family="mk"/>

 </class>

 </gora-orm>
 -

 nutch-site.xml:
 -
 ... blah blah, a lot of unrelated stuff...

 <property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>

  <description>Default class for storing data</description>
 </property>
 -

 Of course you need also to use the same hadoop files (hdfs-site and
 mapred-site) as the ones that HBase uses.
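
For reference, an hbase-site.xml with no properties, as described above, would be nothing more than an empty configuration element. This is a minimal sketch assuming the standard Hadoop-style configuration layout; the comment and file location are illustrative only.

```xml
<?xml version="1.0"?>
<!-- conf/hbase-site.xml: no overrides, so the HBase defaults apply.
     It still has to end up in the Nutch job jar (put it in conf/ and rebuild). -->
<configuration>
</configuration>
```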


 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0 Help

2010-09-08 Thread Enis Soztutar
Hi,

I think we need to commit all the necessary files to nutch so that it can
work out of the box for sql, hbase and cassandra. We can even write
commented-out entries in gora.properties, nutch-site.xml, etc. so that using
nutch with different backends becomes a configuration change. I will open an
issue to track this.
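
A minimal sketch of what such commented-out entries might look like in nutch-site.xml. The HBase class name is the one quoted in this thread, and the SQL class name is taken from the SqlStore stack trace reported on this list; the layout and comments are assumptions, not a committed file. A Cassandra entry would follow the same pattern, but its class name does not appear in the thread, so it is left out.

```xml
<!-- nutch-site.xml sketch: exactly one storage.data.store.class entry should be active -->
<configuration>

  <!-- HBase backend (active) -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>

  <!-- SQL backend: uncomment this block and comment out the HBase one to switch
  <property>
    <name>storage.data.store.class</name>
    <value>org.gora.sql.store.SqlStore</value>
    <description>Default class for storing data</description>
  </property>
  -->

</configuration>
```

With this shape, switching backends means moving the comment markers and supplying the matching Gora mapping file; no rebuild of the Nutch sources should be needed beyond repacking the job jar.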

Cheers,
Enis




Re: nutch 2.0 (trunk)

2010-09-07 Thread Andrzej Bialecki

On 2010-09-07 14:50, Faruk Berksöz wrote:

Dear all,

When I try to fetch a web page (e.g.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the mysql
storage definition,
I see the following error in my hadoop logs (no error with
hbase):

java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
 at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
 at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
 at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)

The type of the column 'content' is BLOB.
It may be important for the next developments of Gora.
Should I file this in nutch-jira or github/gora or nothing?

environment : ubuntu 10.04
JVM : 1.6.0_20
nutch 2.0 (trunk)
Mysql/HBase (0.20.6) / Hadoop(0.20.2) pseudo-distributed


Yes, please create a JIRA issue. Thanks!



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: nutch 2.0 (trunk)

2010-09-07 Thread Julien Nioche
Hi Faruk,

You can either set a lower value for the parameter http.content.limit or
modify the mapping and set

<field name="content" column="content" jdbc-type="MEDIUMBLOB"/>

which should work for mysql.

See the discussion on http://github.com/enis/gora/issues/closed#issue/48
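
For the first option, a hedged sketch of the nutch-site.xml override. A MySQL BLOB column holds at most 65,535 bytes, so if http.content.limit is at 65536 (which, as far as I recall, was the stock default at the time) a page truncated at the limit is one byte too long for the column. The value below is illustrative and assumes the content is stored as raw bytes.

```xml
<!-- nutch-site.xml: cap fetched content at the 65,535-byte capacity of a MySQL BLOB -->
<property>
  <name>http.content.limit</name>
  <value>65535</value>
  <description>Maximum number of bytes of content to download per page;
  longer content is truncated.</description>
</property>
```

The MEDIUMBLOB mapping is the less lossy choice, since it raises the column capacity to roughly 16 MB instead of cutting pages shorter.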

HTH

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com







Re: Nutch 2.0 Help

2010-09-02 Thread Julien Nioche
Hi David,

I haven't used the Hbase backend with GORA for quite some time but from what
I can remember you'll need the following things:

* conf/hbase-site.xml => this should correspond to your local configuration
* conf/gora-hbase-mapping.xml => see below
* conf/gora.properties => I don't think there is anything you need to specify
for Hbase

* in nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

and of course all the necessary Hbase jars in the /lib dir - probably easier
to modify ivy/ivy.xml so that it includes Hbase

gora-hbase-mapping.xml  : not sure this is the latest version though

<?xml version="1.0" encoding="UTF-8"?>

<gora-orm>

<table name="webtable">
  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
  <family name="f"/>
  <family name="s"/>
  <family name="il"/>
  <family name="ol"/>
  <family name="h"/>
  <family name="mtdt"/>
  <family name="mk"/>
</table>

<class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="protocolStatus" family="f" qualifier="prot"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>

  <!-- parse fields -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="parseStatus" family="p" qualifier="st"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>

  <!-- score fields -->
  <field name="score" family="s" qualifier="s"/>

  <field name="headers" family="h"/>

  <field name="inlinks" family="il"/>

  <field name="outlinks" family="ol"/>

  <field name="metadata" family="mtdt"/>

  <field name="markers" family="mk"/>

</class>

</gora-orm>


HTH

Good luck!

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 2 September 2010 12:58, David Stuart 
david.stu...@progressivealliance.co.uk wrote:

 Hey All,

 I have set up the latest version of nutch from trunk and am running into a few
 issues with hbase and injecting urls. When I run the command

 runtime/local/bin/nutch inject runtime/local/seed/

 I get
 InjectorJob: java.lang.RuntimeException: Could not create datastore
  at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:70)
  at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)
  at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:233)
  at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:246)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:256)

 Under the gora properties it should be pointing at localhost/nutchtest and
 I created that store manually in hbase; is that right?


 I have found a few tutorials around nutchbase but the api seems to have
 changed since the merge with Nutch trunk.

 Any help would be appreciated and I will try to do a how-to writeup.

 Regards,

 Dave


Re: Nutch 2.0 : Design issue

2010-07-02 Thread Julien Nioche
On 2 July 2010 12:22, Andrzej Bialecki a...@getopt.org wrote:

 On 2010-07-02 12:42, Julien Nioche wrote:

 Hi guys,

 You've probably seen that there has been some progress on 2.0 lately. We've
 updated the nutchbase svn branch with the latest developments done on
 Dogacan's Github, i.e. using GORA as a storage layer.
 One of the main issues [1] I raised after using nutchbase was that:

 NutchBase currently marks entries in the table to be fetched | parsed |
 etc... and needs to go through the whole table at every step. As the table
 gets bigger it takes more and more time to read through the entries and
 check their marks, which is not a viable option. NutchBase is currently
 slower than Nutch 1.1 (might be issues with Gora but still...)
 I suggest instead that we create fetchlists in separate tables, fetch and
 parse in these tables, then merge the entries back to the main table. The
 segment tables could then be deleted if necessary. We would then have a
 linear processing time for fetching + parsing + updating depending on the
 size of the segments and NOT on the size of the main table. This would be an
 improvement compared to 1.1 where the processing time in the updates is
 relative to the size of the crawldb.


 Doing this requires being able to separate the name of a schema from the
 name of a table in Gora [2], which should not be a big problem.


 I think this is a good idea - this model is conceptually close to the
 current model, and I bet it will be easier to debug problems when changes
 are limited to a separate table... we could create 1 table per segment.

 (Oh, and let's stop calling them segments, please - maybe call them a batch
 or crawl cycle or something. The name segments caused a lot of confusion
 already, and it doesn't convey any useful meaning..)


Makes sense



 As for the time savings .. this remains to be seen. At the end of the
 fetching/parsing job we need to merge this data back into the main table,
 which is a massive update that also takes time.


True





 On second thought I was wondering whether it would also make sense to
 actually keep the segments as they currently are, i.e. stored as
 NutchWritables in HDFS. The advantages of doing this would be that we'd keep
 exactly the same code for the fetching + parsing, would only need to modify
 the generation and update steps, and would be able to easily port pre-2.0
 segments to the webtable. The drawbacks being that there would be a dual
 storage GORA / HDFS and we'd need to keep the legacy Nutch Writable objects.


 The fetcher code is already ported in nutchbase not to use the plain files.
 I doubt there would be many users who want to jump to Nutch 2.0 and still
 want to hold on to their old segments... so I think this is not useful. Dual
 storage .. *shudder* that's asking for trouble.


Right, + am not too keen on keeping the legacy objects. Another advantage of
having the GORA-based tables for the segments (or fetch_cycles ;-) ) is that
it makes it easier to restart an interrupted fetch or parse.

Forget about the HDFS based storage, let's just do it with GORA




 Note that it would not change anything to the content of the main webtable
 nor the operations done on them. Maybe it would make sense to do that anyway,
 at least as a transition while we make the webtable and GORA operations
 stable, and then see if there is an advantage in storing the segments as GORA
 tables as well.

 I am pretty confident that we need to address the point raised in [1]
 anyway. What do you guys think?

 [1] http://github.com/dogacan/nutchbase/issues#issue/8
 [2] http://github.com/enis/gora/issues#issue/30


 +1 to both points, -1 to the dual storage.

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com


Re: Nutch 2.0

2010-06-29 Thread Doğacan Güney
Hi,

On Tue, Jun 29, 2010 at 11:49, Julien Nioche
lists.digitalpeb...@gmail.com wrote:

 Thanks Chris,

 I already shared my thoughts on this yesterday, but I still fail to see the
 advantage of keeping the details of the recent github nutchbase commits
 (some of them being just upgrades to the recent changes in 1.1) in svn
 nutchbase knowing that the point is actually to do incremental changes to
 the existing trunk (which already has the 1.1 changes) from svn nutchbase
 and review / comment / improve the code on this occasion.

 Since we also want to produce a patch in JIRA for the changes in svn
 nutchbase in order to put the "donated to Apache" stamp on it, it would make
 sense to do that just once and not for all the commits which have been done
 in github.

 I am probably missing an important point here, but if so I would appreciate
 if someone (Dogacan?) could explain why we should not stick to the original
 plan
 (a) clear the existing svn nutchbase
 (b) generate a large patch with the code from github and JIRA it


Do you mean generating a single patch vs nutch? There are a lot of fixes and
improvements in nutch 1.1 that I cherry-picked to nutchbase later. If we
generate a larger patch, and then this branch is blessed as trunk, then
history for those improvements will be lost.

Or am I misunderstanding you here?




 (c) commit the changes to svn nutchbase
 then get on with the interesting bits.

 My concern is that proceeding as Dogacan described yesterday might take
 quite some time and block the rest of the work on 2.0. I am happy to work on
 the 3 steps above BTW.

 Thanks

 Julien





 On 29 June 2010 06:44, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

  Okey dokey guys, (c), (e) and (g) are done.

 Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e)
 and (f)...

 Cheers,
 Chris




 On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:

 On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:

 On 2010-06-28 17:57, Mattmann, Chris A (388J) wrote:
  Hi Doğacan,
 
  So your proposition is to combine (a) and (b) then? That’s fine by me,
  so long as there are no objections from others. I can still move forward
  with (c), (e) and (g) then...


 No objections from me - but IMHO to satisfy the legal minds you still
 need to produce a patch and attach to an issue with the Grant to ASF
 checkbox marked...


 OK, I'll create a new issue in JIRA, and then attach a lot of patches :)

 I'll try to appropriately mark patches that are straightforward ports from
 nutch 1.1 into nutchbase so that the same committers can commit those
 patches _again_, hopefully preserving post nutch 1.0 history as much as
 possible.


 (Also, I always shudder when I imagine a massive merge failing ... but
 that's probably a leftover from my CVS days when a failed merge would
 leave a completely broken tree.. ah, well, good luck :) ).


 I regularly do large merges in git and it works beautifully. We'll see how
 well SVN does :)



 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.mattm...@jpl.nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




 --
 DigitalPebble Ltd

 Open Source Solutions for Text Engineering
 http://www.digitalpebble.com




-- 
Doğacan Güney


Re: Nutch 2.0

2010-06-29 Thread Mattmann, Chris A (388J)
Hey Guys,

On 6/29/10 2:30 AM, Andrzej Bialecki a...@getopt.org wrote:

 I am probably missing an important point here, but if so I would
 appreciate if someone (Dogacan?) could explain why we should not stick
 to the original plan
 (a) clear the existing svn nutchbase
 (b) generate a large patch with the code from github and JIRA it
 (c) commit the changes to svn nutchbase
 then get on with the interesting bits.

Like I said, whether we merge the Github Nutchbase into the Apache Nutchbase
branch or we blow away the Apache Nutchbase branch and then import the
Github Nutchbase branch wholesale, either way, we are left with an Apache
Nutchbase branch that needs to incrementally be merged into the Nutch 2.0
trunk, which I agree with Andrzej, and Julien, is the most important part.

So, either way works fine with me, so long as we are left with an Apache
Nutchbase branch that can be merged incrementally with the Apache Nutch 2.0
trunk. I'm just not going to be the one doing that first part (Github
transfer), so I didn't want to push one way or another.

Once the Apache Nutchbase branch is ready, can we identify a set of 5-10
JIRA patches that we can use to track how to bring the Apache Nutchbase
branch into the Apache Nutch 2.0 trunk? At that point, I'll likely be of use
again :) Until then, Julien, Dogacan, I think the floor is yours.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: Nutch 2.0

2010-06-28 Thread Andrzej Bialecki
On 2010-06-28 07:49, Sami Siren wrote:
 One aspect that has not been discussed yet is the legal aspect.
 According to
 http://incubator.apache.org/ip-clearance/index.html there is a formal
 process for integrating development efforts that have
 happened outside of Apache. Should we be following the ip clearance
 process in this case too?

The concept of a "substantial contribution" that should be subject to a
software grant is somewhat tenuous, though. Keep in mind that you do
something equivalent in JIRA already - when you check the "Grant license
to ASF" box you perform a micro-grant. So the question is whether we
should go through a full grant or through the JIRA micro-grant.

In my opinion it's ok to do the latter, since much of the code is simply
a modified version of Nutch classes - not counting GORA, of course, but
that part will be added as a third-party lib. So IMHO it's enough to zip
all source (without libs), attach it to a JIRA issue and mark the
checkbox. Then we follow the process outlined by Chris, which imports
the same codebase into our svn. What do you think?

If folks agree that this is sufficient, then Dogacan & Enis - can you
please create a separate JIRA issue, prepare a patch like this, mark the
checkbox, and list all dependencies and their licenses for those that
are not already in Nutch svn?

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch 2.0

2010-06-28 Thread Sami Siren

On 06/28/2010 10:10 AM, Andrzej Bialecki wrote:

On 2010-06-28 07:49, Sami Siren wrote:

One aspect that has not been discussed yet is the legal aspect.
According to http://incubator.apache.org/ip-clearance/index.html
there is a formal process for integrating development
efforts that have happened outside of Apache. Should we be
following the ip clearance process in this case too?


The concept of a substantial contribution that should be subject to
a software grant is somewhat tenuous, though. Keep in mind that you
do something equivalent in JIRA already - when you check the "Grant
license to ASF" box you perform a micro-grant. So the question is
whether we should go through a full grant or through the JIRA
micro-grant.

In my opinion it's ok to do the latter, since much of the code is
simply a modified version of Nutch classes - not counting GORA, of
course, but that part will be added as a third-party lib. So IMHO
it's enough to zip all source (without libs), attach it to a JIRA
issue and mark the checkbox. Then we follow the process outlined by
Chris, which imports the same codebase into our svn. What do you
think?


I do not know what the right approach is; that's why I asked the
question. Also, I have not looked at the donation, but the following
comment made me think it might fall into the substantial category:


There has been an enormous amount of changes between the nutchbase
branch and the version on GitHub - pretty much EVERY class has been
modified + a lot of classes have been removed etc...



If folks agree that this is sufficient, then Dogacan and Enis - can
you please create a separate JIRA issue, prepare a patch like this,
mark the checkbox, and list all dependencies and their licenses for
those that are not already in Nutch svn?


This would be a good thing to do in any case. It would help to
understand what the donation is about and also help to decide which process
(if any) needs to be followed.

--
 Sami Siren




Re: Nutch 2.0

2010-06-28 Thread Doğacan Güney
Hey all,

I will double check to make sure, but IIRC, there is no need to delete
svn:nutchbase since current code on
github simply builds on top of that. So why not simply merge the github branch
into svn? It will be a clean merge...
The only problem is contributor info is messed up in github but I tried to
preserve as much contrib info as possible
when I pulled in 1.1 changes (via git cherry-pick). So we can break the code
in github into smaller patches, apply them
on top of svn nutchbase (which, again, will be clean) then, 1.1 changes can
be applied by _original_ committers, thus
hopefully preserving contributor info as well.

Makes sense?

On Mon, Jun 28, 2010 at 16:45, Julien Nioche
lists.digitalpeb...@gmail.com wrote:

  Hi,

 (a) deleting svn:nutchbase
 (b) svn:importing Git Nutchbase.
 (c) branch current 1.2-trunk as 1.2-branch
 (d) iteratively apply patches from new svn:nutchbase to trunk to bring
 it up to snuff.
  (e) roll the version # in nutch trunk to 2.0-dev
 (f) all issues in JIRA should be updated to reflect 2.0-dev fixes
 where it makes sense
 (g) a 2.1 version is added to mark anything that we don't want in 2.0
 and we file post 2.0 issues there
 (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is
 removed. All unit tests should pass regression where it makes sense.
 (i) Nutch documentation is brought up to date on wiki and checked into
 SVN
 (j) We roll a 2.0 release


 +1



 I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to
 participate in (d) and (f).

 I'm thinking Julien and Doğacan would be the
 best people to do (b) and (i).


 Doğacan is in the process of writing the documentation



 (h) should be a result of all steps prior
 (a)-(g), and as for (j), I'd be happy to do (j) when the time comes.

 So, if I don't hear any objections, I'll do (a), (c), (e) and (g)
 tomorrow... (6/28, likely PM PST Los Angeles time)


 cool, thanks

 J.
 --
 DigitalPebble Ltd

 Open Source Solutions for Text Engineering
 http://www.digitalpebble.com




-- 
Doğacan Güney


Re: Nutch 2.0

2010-06-28 Thread Mattmann, Chris A (388J)
Hi Doğacan,

So your proposition is to combine (a) and (b) then? That’s fine by me, so long 
as there are no objections from others. I can still move forward with (c), (e) and 
(g) then...

Cheers,
Chris



On 6/28/10 8:39 AM, Doğacan Güney doga...@gmail.com wrote:

Hey all,

I will double check to make sure, but IIRC, there is no need to delete 
svn:nutchbase since current code on
github simply builds on top of that. So why not simply merge the github branch into 
svn? It will be a clean merge...
The only problem is contributor info is messed up in github but I tried to 
preserve as much contrib info as possible
when I pulled in 1.1 changes (via git cherry-pick). So we can break the code in 
github into smaller patches, apply them
on top of svn nutchbase (which, again, will be clean) then, 1.1 changes can be 
applied by _original_ committers, thus
hopefully preserving contributor info as well.

Makes sense?

On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com 
wrote:
 Hi,

(a) deleting svn:nutchbase
 (b) svn:importing Git Nutchbase.
 (c) branch current 1.2-trunk as 1.2-branch
 (d) iteratively apply patches from new svn:nutchbase to trunk to bring
it up to snuff.
 (e) roll the version # in nutch trunk to 2.0-dev
 (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where
it makes sense
 (g) a 2.1 version is added to mark anything that we don't want in 2.0
and we file post 2.0 issues there
 (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is
removed. All unit tests should pass regression where it makes sense.
 (i) Nutch documentation is brought up to date on wiki and checked into
SVN
 (j) We roll a 2.0 release

+1


I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to
participate in (d) and (f).
I'm thinking Julien and Doğacan would be the
best people to do (b) and (i).

Doğacan is in the process of writing the documentation


(h) should be a result of all steps prior
(a)-(g), and as for (j), I'd be happy to do (j) when the time comes.

So, if I don't hear any objections, I'll do (a), (c), (e) and (g)
tomorrow... (6/28, likely PM PST Los Angeles time)

cool, thanks

J.


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Nutch 2.0

2010-06-28 Thread Mattmann, Chris A (388J)
Hi Guys,

And, let me clarify my OK’ness with this. My assumption is that regardless of 
whether we physically svn:delete nutchbase in Apache SVN (the choice I went to 
after hearing there were *significant* changes in the Git version from that of 
the Apache one), and then import a fresh copy from Git, or whether we simply 
update Nutchbase in apache SVN with Git patches (my original suggestion), that 
in the end, we are left with a Nutchbase branch that we can move forward from 
in Apache SVN.

If that is the case, then I think my suggested plan below applies either way 
and we can move forward...

Cheers,
Chris



On 6/28/10 8:57 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov 
wrote:

Hi Doğacan,

So your proposition is to combine (a) and (b) then? That’s fine by me, so long 
as there are no objections from others. I can still move forward with (c), (e) and 
(g) then...

Cheers,
Chris



On 6/28/10 8:39 AM, Doğacan Güney doga...@gmail.com wrote:

Hey all,

I will double check to make sure, but IIRC, there is no need to delete 
svn:nutchbase since current code on
github simply builds on top of that. So why not simply merge the github branch into 
svn? It will be a clean merge...
The only problem is contributor info is messed up in github but I tried to 
preserve as much contrib info as possible
when I pulled in 1.1 changes (via git cherry-pick). So we can break the code in 
github into smaller patches, apply them
on top of svn nutchbase (which, again, will be clean) then, 1.1 changes can be 
applied by _original_ committers, thus
hopefully preserving contributor info as well.

Makes sense?

On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com 
wrote:
 Hi,

(a) deleting svn:nutchbase
 (b) svn:importing Git Nutchbase.
 (c) branch current 1.2-trunk as 1.2-branch
 (d) iteratively apply patches from new svn:nutchbase to trunk to bring
it up to snuff.
 (e) roll the version # in nutch trunk to 2.0-dev
 (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where
it makes sense
 (g) a 2.1 version is added to mark anything that we don't want in 2.0
and we file post 2.0 issues there
 (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is
removed. All unit tests should pass regression where it makes sense.
 (i) Nutch documentation is brought up to date on wiki and checked into
SVN
 (j) We roll a 2.0 release

+1


I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to
participate in (d) and (f).
I'm thinking Julien and Doğacan would be the
best people to do (b) and (i).

Doğacan is in the process of writing the documentation


(h) should be a result of all steps prior
(a)-(g), and as for (j), I'd be happy to do (j) when the time comes.

So, if I don't hear any objections, I'll do (a), (c), (e) and (g)
tomorrow... (6/28, likely PM PST Los Angeles time)

cool, thanks

J.


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Nutch 2.0

2010-06-28 Thread Mattmann, Chris A (388J)
Okey dokey guys, (c), (e) and (g) are done.

Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e) and 
(f)...

Cheers,
Chris



On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:

On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
On 2010-06-28 17:57, Mattmann, Chris A (388J) wrote:
 Hi Doğacan,

 So your proposition is to combine (a) and (b) then? That’s fine by me,
 so long as there are no objections from others. I can still move forward
 with (c), (e) and (g) then...


No objections from me - but IMHO to satisfy the legal minds you still
need to produce a patch and attach to an issue with the Grant to ASF
checkbox marked...


OK, I'll create a new issue in JIRA, and then attach a lot of patches :)

I'll try to appropriately mark patches that are straightforward ports from 
nutch 1.1
into nutchbase so that the same committers can commit those patches _again_
hopefully preserving post nutch 1.0 history as much as possible.

(Also, I always shudder when I imagine a massive merge failing ... but
that's probably a leftover from my CVS days when a failed merge would
leave a completely broken tree.. ah, well, good luck :) ).


I regularly do large merges in git and it works beautifully. We'll see how well
SVN does :)







++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++