Re: Nutch 2.0 Press Announcement
Hello Lewis --great to hear from you, as always. Hello Nutch DevTeam!

Of course; I'm happy to help. What's your timeframe?

Traditionally, these sorts of announcements are something I work on with the PMC, vs. dev (no offense, folks, it's more an issue of public exposure prior to the announcement being made). Whatever works best for you is fine...I'm flexible.

Having said that, what is your timeframe? In other words, has v2.0 already been released (I hope not!)?

Also, if you would like to include supporting testimonial quotes from highly-visible users (organizations), we will have to set aside at least a week for those to come in (some companies have strict vetting/clearance requirements by their legal teams).

And finally, in an ideal situation, we'll work on the announcement together (usually there's a point-person assigned to take the lead on this, and we'll run drafts by the list during the final editing stages) so I can get a better grasp of the project and be able to highlight what's new/important/sexy/*.

Thanks again. I look forward to working with y'all <g>

Chat soon,
Sally

From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
To: Sally Khudairi s...@apache.org
Cc: dev@nutch.apache.org
Sent: Thursday, 21 June 2012, 16:49
Subject: Nutch 2.0 Press Announcement

Good Evening Sally,

First and foremost I hope you are keeping well and that the beginning of the summer has been kind to you... all the good weather still to come, not to worry :0)

The reason I contact you is that we (the Apache Nutch community) are nearly ready to release Nutch 2.0, which represents a pretty significant milestone for Apache Nutch as a project. Although Nutch 2.0 is not considered mainstream development (a decision made by the PMC some time ago), it still marks a real step forward for the project as a whole and also gives due credit to users, developers and committers past and present.

For these reasons I think it would be excellent for the community if we could really get the message out that the project is rocking, in addition to the fact that it is an excellent, well followed, vibrant TLP within the foundation. I wonder if it would be possible for us to get a formal press announcement constructed, based on input from ourselves in collaboration with your experience in this area? I am coming at official press releases from an almost blind tangent, so I would really appreciate your guidance and input on this one if possible.

Thanks in advance for any input you have.

Best
Lewis

N.B. Anyone from dev@, please chime in on this thread. I personally feel the better an announcement, the more our community grows. Thank you
Re: Nutch 2.0 DOAP
That's great, thanks!

On 10 August 2011 14:58, lewis john mcgibbney lewis.mcgibb...@gmail.com wrote:
> Hi,
>
> Just for information purposes, I committed our DOAP which can now be found under trunk svn. I have been informed by site-dev@ that the system they use does not support more than one DOAP file, however I thought it best to keep it in svn for the time being. If at some point in the future Nutch 2.0 becomes the de facto Nutch release then no-one will need to recreate one.
>
> Thanks
> --
> Lewis

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Re: Nutch 2.0 Documentation
Hi,

Maybe a stupid question but I don't see a trunk/docs?

Cheers

On Thursday 04 August 2011 12:47:54 lewis john mcgibbney wrote:
> Hi,
>
> Was mucking around on a totally separate personal issue with Gora today and couldn't help but like the /docs directory which is bundled when you svn co the project. I would really like to push to get this going as per [1] as I have been trying to get various documentation updated over the last while. This would be a reasonable milestone which would carve the way for a fully documented Nutch 2.0 (and branch 1.4) ;0)
>
> Would it be possible for me to invoke a small conversation on this topic to gather thoughts, as it seems this issue has been forgotten about again.
>
> Thank you
>
> [1] https://issues.apache.org/jira/browse/NUTCH-881

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Nutch 2.0 roadmap
Hi Lewis,

> Currently the slightly (in places) dated roadmap can be found here [1]. I was wondering if we could give this an overhaul/update as it would give a more robust overview of where trunk is going. Most of the points you make are still in development, however some have been achieved and integrated into trunk builds. Is there anything else we can add to this page to reflect current initiatives in dev regarding trunk (major or minor)?

There isn't much happening to the trunk, partly because building it is not very straightforward, but this should get better once the GORA artefacts are published (I think Chris was about to do another RC). There are also outstanding issues in GORA with some of the backends (e.g. disappearing URLs), failing tests etc...

> You make a lot of good points in your Berlin Buzzwords presentation Julien, would it be possible to initiate further discussion amongst devs on these points.

Some of the points are relevant for the 1.x branch as well. We can definitely list them on the Wiki.

> I noticed another point you mentioned was that we are thin on documentation for trunk... this is very much true. It would be great to get an up-to-date roadmap for trunk; as we plan to release this year, it is essential that this is seen to.

Having a roadmap would be good of course, but being able to compile, fixing essential bugs and having minimal documentation should probably be enough to do an initial release.

Thanks
Julien
Re: Nutch 2.0 Help
Hi guys,

I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on http://wiki.apache.org/nutch/GORA_HBase
Feel free to amend and improve as you see fit. Please bear in mind that Nutch 2.0 is at a very early stage and is far from being bug-proof, see in particular [1].

HTH
Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

On 6 September 2010 13:35, Andrzej Bialecki a...@getopt.org wrote:
> On 2010-09-05 14:56, David Stuart wrote:
>> Hi All,
>> I have done as per below and can create a table from within the hbase shell. I found the appropriate create table method
>> bin/nutch org.apache.nutch.storage.WebTableCreator webtable
>> but it only returns null. Any help would be great
>
> You don't have to create a table manually - this should happen automatically when you first run any Nutch tool. Just make sure you have hbase-site.xml on your classpath in Nutch - best if you put it in your conf/ and rebuild, so that it's packed into a job jar.
>
> Here's for example my config files that work with HBase (I don't use any non-standard settings for HBase, so my hbase-site.xml has no properties, but still it needs to be included in the Nutch job jar):
>
> gora-hbase-mapping.xml:
> -----
> <gora-orm>
>   <table name="webtable">
>     <family name="p"/> <!-- This can also have params like compression, bloom filters -->
>     <family name="f"/>
>     <family name="s"/>
>     <family name="il"/>
>     <family name="ol"/>
>     <family name="h"/>
>     <family name="mtdt"/>
>     <family name="mk"/>
>   </table>
>   <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
>     <!-- fetch fields -->
>     <field name="baseUrl" family="f" qualifier="bas"/>
>     <field name="status" family="f" qualifier="st"/>
>     <field name="prevFetchTime" family="f" qualifier="pts"/>
>     <field name="fetchTime" family="f" qualifier="ts"/>
>     <field name="fetchInterval" family="f" qualifier="fi"/>
>     <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>     <field name="reprUrl" family="f" qualifier="rpr"/>
>     <field name="content" family="f" qualifier="cnt"/>
>     <field name="contentType" family="f" qualifier="typ"/>
>     <field name="protocolStatus" family="f" qualifier="prot"/>
>     <field name="modifiedTime" family="f" qualifier="mod"/>
>     <!-- parse fields -->
>     <field name="title" family="p" qualifier="t"/>
>     <field name="text" family="p" qualifier="c"/>
>     <field name="parseStatus" family="p" qualifier="st"/>
>     <field name="signature" family="p" qualifier="sig"/>
>     <field name="prevSignature" family="p" qualifier="psig"/>
>     <!-- score fields -->
>     <field name="score" family="s" qualifier="s"/>
>     <field name="headers" family="h"/>
>     <field name="inlinks" family="il"/>
>     <field name="outlinks" family="ol"/>
>     <field name="metadata" family="mtdt"/>
>     <field name="markers" family="mk"/>
>   </class>
> </gora-orm>
> -----
>
> nutch-site.xml:
> -----
> ... blah blah, a lot of unrelated stuff...
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.gora.hbase.store.HBaseStore</value>
>   <description>Default class for storing data</description>
> </property>
> -----
>
> Of course you need also to use the same hadoop files (hdfs-site and mapred-site) as the ones that HBase uses.
>
> --
> Best regards,
> Andrzej Bialecki
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
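One thing the mapping above depends on is that every field points at a column family declared on its table. As a minimal sketch (not part of Gora or Nutch), the check below parses a shortened copy of the mapping with Python's standard library; the helper name `undeclared_families` and the trimmed `MAPPING` string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the mapping from this thread (only three families/fields kept).
MAPPING = """<gora-orm>
  <table name="webtable">
    <family name="p"/>
    <family name="f"/>
    <family name="s"/>
  </table>
  <class table="webtable" keyClass="java.lang.String"
         name="org.apache.nutch.storage.WebPage">
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="score" family="s" qualifier="s"/>
  </class>
</gora-orm>"""


def undeclared_families(mapping_xml):
    """Return {field name: family} for fields whose family is not declared on the table."""
    root = ET.fromstring(mapping_xml)
    # Collect the declared family names per table.
    declared = {
        table.get("name"): {fam.get("name") for fam in table.findall("family")}
        for table in root.findall("table")
    }
    problems = {}
    for cls in root.findall("class"):
        families = declared.get(cls.get("table"), set())
        for field in cls.findall("field"):
            if field.get("family") not in families:
                problems[field.get("name")] = field.get("family")
    return problems


print(undeclared_families(MAPPING))
```

An empty result means every field refers to a declared family; a typo in a `family` attribute would show up as an entry keyed by the field name.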
Re: Nutch 2.0 Help
Hi,

I think we need to commit all the necessary files to nutch so that it can work out of the box for sql, hbase and cassandra. We can even write commented-out entries in gora.properties, nutch-site.xml, etc. so that using nutch with different backends becomes a configuration change. I will open an issue to track this.

Cheers,
Enis

On Wed, Sep 8, 2010 at 1:53 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote:
> Hi guys,
>
> I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on http://wiki.apache.org/nutch/GORA_HBase
> Feel free to amend and improve as you see fit. Please bear in mind that Nutch 2.0 is at a very early stage and is far from being bug-proof, see in particular [1].
>
> HTH
> Julien
>
> [1] https://issues.apache.org/jira/browse/NUTCH-893
>
> On 6 September 2010 13:35, Andrzej Bialecki a...@getopt.org wrote:
>> [...]
Re: nutch 2.0 (trunk)
On 2010-09-07 14:50, Faruk Berksöz wrote:
> Dear all,
>
> when I try to fetch a web page (e.g. http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html ) with the mysql storage definition, I see the following error in my hadoop logs (no error with hbase):
>
> java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'content' at row 1
>     at org.gora.sql.store.SqlStore.flush(SqlStore.java:316)
>     at org.gora.sql.store.SqlStore.close(SqlStore.java:163)
>     at org.gora.mapreduce.GoraOutputFormat$1.close(GoraOutputFormat.java:72)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
>
> The type of the column 'content' is BLOB. It may be important for the next developments of Gora. Should I file this in nutch-jira or github/gora or nothing?
>
> Environment: ubuntu 10.04, JVM 1.6.0_20, nutch 2.0 (trunk), Mysql/HBase (0.20.6) / Hadoop (0.20.2) pseudo-distributed

Yes, please create a JIRA issue. Thanks!

--
Best regards,
Andrzej Bialecki
Re: nutch 2.0 (trunk)
Hi Faruk,

You can either set a lower value for the parameter http.content.limit or modify the mapping and set

<field name="content" column="content" jdbc-type="MEDIUMBLOB"/>

which should work for mysql. See the discussion on http://github.com/enis/gora/issues/closed#issue/48

HTH
Julien

On 7 September 2010 14:02, Andrzej Bialecki a...@getopt.org wrote:
> On 2010-09-07 14:50, Faruk Berksöz wrote:
>> [...]
>
> Yes, please create a JIRA issue. Thanks!
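The two workarounds above follow from MySQL's column sizes: a BLOB holds at most 2^16 - 1 = 65,535 bytes, while a MEDIUMBLOB holds up to 2^24 - 1 bytes. A small sketch of that arithmetic follows; the helper `smallest_blob_type` is hypothetical, and 65536 is used only as an illustrative value for http.content.limit (check the value in your own configuration).

```python
# Maximum payload size, in bytes, of MySQL's binary column types.
MYSQL_BLOB_LIMITS = [
    ("TINYBLOB", 2**8 - 1),
    ("BLOB", 2**16 - 1),
    ("MEDIUMBLOB", 2**24 - 1),
    ("LONGBLOB", 2**32 - 1),
]


def smallest_blob_type(content_bytes):
    """Return the smallest MySQL blob type able to hold content_bytes bytes."""
    for name, limit in MYSQL_BLOB_LIMITS:
        if content_bytes <= limit:
            return name
    raise ValueError("content does not fit in any MySQL blob type")


# Illustrative content limit (an assumption, not read from any Nutch config).
assumed_content_limit = 65536
print(smallest_blob_type(assumed_content_limit))
```

Under that assumed limit a fully sized page is one byte too large for a plain BLOB, which would be consistent with the "Data too long for column 'content'" error; either lowering the limit below 65,536 or widening the column to MEDIUMBLOB removes the mismatch.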
Re: Nutch 2.0 Help
Hi David,

I haven't used the Hbase backend with GORA for quite some time but from what I can remember you'll need the following things:

* conf/hbase-site.xml = this should correspond to your local configuration
* conf/gora-hbase-mapping.xml = see below
* conf/gora.properties = I don't think there is anything you need to specify for Hbase
* in nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

and of course all the necessary Hbase jars in the /lib dir - probably easier to modify ivy/ivy.xml so that it includes Hbase.

gora-hbase-mapping.xml (not sure this is the latest version though):

<?xml version="1.0" encoding="UTF-8"?>
<gora-orm>
  <table name="webtable">
    <family name="p"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f"/>
    <family name="s"/>
    <family name="il"/>
    <family name="ol"/>
    <family name="h"/>
    <family name="mtdt"/>
    <family name="mk"/>
  </table>
  <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="protocolStatus" family="f" qualifier="prot"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="parseStatus" family="p" qualifier="st"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="s" qualifier="s"/>
    <field name="headers" family="h"/>
    <field name="inlinks" family="il"/>
    <field name="outlinks" family="ol"/>
    <field name="metadata" family="mtdt"/>
    <field name="markers" family="mk"/>
  </class>
</gora-orm>

HTH
Good luck!
Julien

On 2 September 2010 12:58, David Stuart david.stu...@progressivealliance.co.uk wrote:
> Hey All,
>
> I have set up the latest version of nutch from trunk and am running into a few issues with hbase and injecting urls. When I run the command
>
> runtime/local/bin/nutch inject runtime/local/seed/
>
> I get
>
> InjectorJob: java.lang.RuntimeException: Could not create datastore
>     at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:70)
>     at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)
>     at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:233)
>     at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:246)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:256)
>
> Under the gora properties it should be pointing at localhost/nutchtest and I created that store manually in hbase, is that right? I have found a few tutorials around nutchbase but the api seems to have changed since the merge with Nutch trunk.
>
> Any help would be appreciated and I will try to do a how-to writeup.
>
> Regards,
> Dave
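The checklist in Julien's reply (three files under conf/ plus one property in nutch-site.xml) can be verified mechanically before running the injector. The sketch below is only an illustration: `check_conf` and `REQUIRED_FILES` are hypothetical names, the file list is taken from the email rather than from Nutch itself, and the demo runs against a throwaway temporary directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Files named in the thread as needed for the HBase backend (assumed complete).
REQUIRED_FILES = ("hbase-site.xml", "gora-hbase-mapping.xml",
                  "gora.properties", "nutch-site.xml")

NUTCH_SITE = """<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.gora.hbase.store.HBaseStore</value>
  </property>
</configuration>"""


def check_conf(conf_dir):
    """Return a list of human-readable problems found in a Nutch conf directory."""
    conf = Path(conf_dir)
    problems = [f"missing {name}" for name in REQUIRED_FILES
                if not (conf / name).is_file()]
    site = conf / "nutch-site.xml"
    if site.is_file():
        root = ET.parse(str(site)).getroot()
        values = {p.findtext("name"): p.findtext("value")
                  for p in root.iter("property")}
        if not values.get("storage.data.store.class"):
            problems.append("storage.data.store.class is not set in nutch-site.xml")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    empty_report = check_conf(tmp)           # nothing present yet
    for name in REQUIRED_FILES:
        (Path(tmp) / name).write_text("")    # placeholder files
    (Path(tmp) / "nutch-site.xml").write_text(NUTCH_SITE)
    full_report = check_conf(tmp)            # everything in place

print(empty_report)
print(full_report)
```

A check like this only confirms that the files exist and the property is set; it says nothing about whether the HBase jars are on the classpath, which the email lists as a separate requirement.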
Re: Nutch 2.0 : Design issue
On 2 July 2010 12:22, Andrzej Bialecki a...@getopt.org wrote:
> On 2010-07-02 12:42, Julien Nioche wrote:
>> Hi guys,
>>
>> You've probably seen that there has been some progress on 2.0 lately. We've updated the nutchbase svn branch with the latest developments done on Dogacan's Github, i.e. using GORA as a storage layer. One of the main issues [1] I raised after using nutchbase was that:
>>
>>   NutchBase currently marks entries in the table to be fetched | parsed | etc... and needs to go through the whole table at every step. As the table gets bigger it takes more and more time to read through the entries and check their marks, which is not a viable option. NutchBase is currently slower than Nutch 1.1 (might be issues with Gora but still...)
>>
>> I suggest instead that we create fetchlists in separate tables, fetch and parse in these tables, then merge the entries back to the main table. The segment tables could then be deleted if necessary. We would then have a linear processing time for fetching + parsing + updating depending on the size of the segments and NOT on the size of the main table. This would be an improvement compared to 1.1, where the processing time in the updates is relative to the size of the crawldb. Doing this requires being able to separate the name of a schema from the name of a table in Gora [2], which should not be a big problem.
>
> I think this is a good idea - this model is conceptually close to the current model, and I bet it will be easier to debug problems when changes are limited to a separate table... we could create 1 table per segment. (Oh, and let's stop calling them segments, please - maybe call them a batch or crawl cycle or something. The name segments caused a lot of confusion already, and it doesn't convey any useful meaning..)

Makes sense

> As for the time savings .. this remains to be seen. At the end of the fetching/parsing job we need to merge this data back into the main table, which is a massive update that also takes time.

True

>> On second thought I was wondering whether it would also make sense to actually keep the segments as they currently are, i.e. stored as NutchWritables in HDFS. The advantages of doing this would be that we'd keep exactly the same code for the fetching + parsing + would only need to modify the generation and update steps + would be able to easily port pre-2.0 segments to the webtable. The drawbacks being that there would be a dual storage GORA / HDFS and we'd need to keep the legacy Nutch Writable objects.
>
> The fetcher code is already ported in nutchbase not to use the plain files. I doubt there would be many users who want to jump to Nutch 2.0 and still want to hold on to their old segments... so I think this is not useful. Dual storage .. *shudder* that's asking for trouble.

Right, + am not too keen on keeping the legacy objects. Another advantage of having the GORA-based tables for the segments (or fetch_cycles ;-) ) is that it makes it easier to restart an interrupted fetch or parse. Forget about the HDFS-based storage, let's just do it with GORA

>> Note that it would not change anything to the content of the main webtable nor the operations done on them. Maybe it would make sense to do that anyway, at least as a transition while we make the webtable and GORA operations stable, and then see if there is an advantage in storing the segments as GORA tables as well. I am pretty confident that we need to address the point raised in [1] anyway.
>>
>> What do you guys think?
>>
>> [1] http://github.com/dogacan/nutchbase/issues#issue/8
>> [2] http://github.com/enis/gora/issues#issue/30
>
> +1 to both points, -1 to the dual storage.
>
> --
> Best regards,
> Andrzej Bialecki

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com
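To make the scaling argument in this thread concrete, here is a deliberately crude model of how many rows each approach reads per crawl cycle. It is a toy under stated assumptions, not a measurement: `rows_scanned` is a made-up helper, each step is assumed to read every row of whichever table it works on, and the cost of generating the batch and of merging it back (the caveat Andrzej raises) is ignored.

```python
def rows_scanned(main_table_rows, batch_rows,
                 steps=("fetch", "parse", "update"), per_batch_tables=False):
    """Rows read in one crawl cycle under the toy model described above."""
    # With marks in the main table, every step walks the whole webtable;
    # with per-batch tables, every step only walks the batch.
    per_step = batch_rows if per_batch_tables else main_table_rows
    return per_step * len(steps)


batch = 1_000_000
for table_size in (10_000_000, 100_000_000):
    marked = rows_scanned(table_size, batch)
    batched = rows_scanned(table_size, batch, per_batch_tables=True)
    print(table_size, marked, batched)
```

In this model the marker approach grows with the size of the webtable while the batch-table approach stays tied to the batch size, which is the linear behaviour Julien describes; how much of that survives the final merge is exactly the open question in the replies.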
Re: Nutch 2.0
Hi,

On Tue, Jun 29, 2010 at 11:49, Julien Nioche lists.digitalpeb...@gmail.com wrote:
> Thanks Chris,
>
> I already shared my thoughts on this yesterday, but I still fail to see the advantage of keeping the details of the recent github nutchbase commits (some of them being just upgrades to the recent changes in 1.1) in svn nutchbase, knowing that the point is actually to do incremental changes to the existing trunk (which already has the 1.1 changes) from svn nutchbase and review / comment / improve the code on this occasion. Since we also want to produce a patch in JIRA for the changes in svn nutchbase in order to put the "donated to Apache" stamp on it, it would make sense to do that just once and not for all the commits which have been done in github.
>
> I am probably missing an important point here, but if so I would appreciate if someone (Dogacan?) could explain why we should not stick to the original plan
> (a) clear the existing svn nutchbase
> (b) generate a large patch with the code from github and JIRA it

Do you mean generating a single patch vs nutch? There are a lot of fixes and improvements in nutch 1.1 that I cherry-picked to nutchbase later. If we generate a larger patch, and then this branch is blessed as trunk, then history for those improvements will be lost. Or am I misunderstanding you here?

> (c) commit the changes to svn nutchbase
> then get on with the interesting bits. My concern is that proceeding as Dogacan described yesterday might take quite some time and block the rest of the work on 2.0. I am happy to work on the 3 steps above BTW.
>
> Thanks
> Julien
>
> On 29 June 2010 06:44, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
>> Okey dokey guys,
>>
>> (c), (e) and (g) are done. Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e) and (f)...
>>
>> Cheers,
>> Chris
>>
>> On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:
>>> On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
>>>> On 2010-06-28 17:57, Mattmann, Chris A (388J) wrote:
>>>>> Hi Doğacan,
>>>>> So your proposition is to combine (a) and (b) then? That's fine by me, so long as there are no objections from others. I can still move forward with (c), (e) and (g) then...
>>>>
>>>> No objections from me - but IMHO to satisfy the legal minds you still need to produce a patch and attach it to an issue with the "Grant to ASF" checkbox marked...
>>>
>>> OK, I'll create a new issue in JIRA, and then attach a lot of patches :) I'll try to appropriately mark patches that are straightforward ports from nutch 1.1 into nutchbase so that the same committers can commit those patches _again_, hopefully preserving post nutch 1.0 history as much as possible.
>>>
>>>> (Also, I always shudder when I imagine a massive merge failing ... but that's probably a leftover from my CVS days when a failed merge would leave a completely broken tree.. ah, well, good luck :) ).
>>>
>>> I regularly do large merges in git and it works beautifully. We'll see how well SVN does :)
>>>
>>>> --
>>>> Best regards,
>>>> Andrzej Bialecki
>>
>> ++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.mattm...@jpl.nasa.gov
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
Doğacan Güney
Re: Nutch 2.0
Hey Guys,

On 6/29/10 2:30 AM, Andrzej Bialecki a...@getopt.org wrote:
> I am probably missing an important point here, but if so I would appreciate if someone (Dogacan?) could explain why we should not stick to the original plan
> (a) clear the existing svn nutchbase
> (b) generate a large patch with the code from github and JIRA it
> (c) commit the changes to svn nutchbase
> then get on with the interesting bits.

Like I said, whether we merge the Github Nutchbase into the Apache Nutchbase branch or we blow away the Apache Nutchbase branch and then import the Github Nutchbase branch wholesale, either way, we are left with an Apache Nutchbase branch that needs to incrementally be merged into the Nutch 2.0 trunk, which I agree with Andrzej, and Julien, is the most important part.

So, either way works fine with me, so long as we are left with an Apache Nutchbase branch that can be merged incrementally with the Apache Nutch 2.0 trunk. I'm just not going to be the one doing that first part (Github transfer), so I didn't want to push one way or another.

Once the Apache Nutchbase branch is ready, can we identify a set of 5-10 JIRA patches that we can use to track how to bring the Apache Nutchbase branch into the Apache Nutch 2.0 trunk? At that point, I'll likely be of use again :) Until then, Julien, Dogacan, I think the floor is yours.

Cheers,
Chris
Re: Nutch 2.0
On 2010-06-28 07:49, Sami Siren wrote:
> One aspect that has not been discussed yet is the legal aspect. According to http://incubator.apache.org/ip-clearance/index.html there is a formal process for integrating external development efforts that have happened outside of Apache. Should we be following the ip clearance process in this case too?

The concept of a substantial contribution that should be subject to a software grant is somewhat tenuous, though. Keep in mind that you do something equivalent in JIRA already - when you check the "Grant license to ASF" box you perform a micro-grant. So the question is whether we should go through a full grant or through the JIRA micro-grant.

In my opinion it's ok to do the latter, since much of the code is simply a modified version of Nutch classes - not counting GORA, of course, but that part will be added as a third-party lib. So IMHO it's enough to zip all source (without libs), attach it to a JIRA issue and mark the checkbox. Then we follow the process outlined by Chris, which imports the same codebase into our svn. What do you think?

If folks agree that this is sufficient, then Dogacan and Enis - can you please create a separate JIRA issue, prepare a patch like this, mark the checkbox, and list all dependencies and their licenses for those that are not already in Nutch svn?

--
Best regards,
Andrzej Bialecki
Re: Nutch 2.0
On 06/28/2010 10:10 AM, Andrzej Bialecki wrote:
> On 2010-06-28 07:49, Sami Siren wrote:
>> One aspect that has not been discussed yet is the legal aspect. According to http://incubator.apache.org/ip-clearance/index.html there is a formal process for integrating external development efforts that have happened outside of Apache. Should we be following the ip clearance process in this case too?
>
> The concept of a substantial contribution that should be subject to a software grant is somewhat tenuous, though. Keep in mind that you do something equivalent in JIRA already - when you check the "Grant license to ASF" box you perform a micro-grant. So the question is whether we should go through a full grant or through the JIRA micro-grant. In my opinion it's ok to do the latter, since much of the code is simply a modified version of Nutch classes - not counting GORA, of course, but that part will be added as a third-party lib. So IMHO it's enough to zip all source (without libs), attach it to a JIRA issue and mark the checkbox. Then we follow the process outlined by Chris, which imports the same codebase into our svn. What do you think?

I do not know what the right approach is, that's why I asked the question. Also I have not looked at the donation, but the following comment made me think it might fall into the substantial category:

  "There has been an enormous amount of changes between the nutchbase branch and the version on GitHub - pretty much EVERY class has been modified + a lot of classes have been removed etc..."

> If folks agree that this is sufficient, then Dogacan and Enis - can you please create a separate JIRA issue, prepare a patch like this, mark the checkbox, and list all dependencies and their licenses for those that are not already in Nutch svn?

This would be a good thing to do in any case. It would help to understand what the donation is about and also help to decide which process (if any) needs to be followed.

--
Sami Siren
Re: Nutch 2.0
Hey all,

I will double check to make sure, but IIRC, there is no need to delete svn:nutchbase since the current code on github simply builds on top of that. So why not simply merge the github branch into svn? It will be a clean merge... The only problem is that contributor info is messed up in github, but I tried to preserve as much contrib info as possible when I pulled in 1.1 changes (via git cherry-pick). So we can break the code in github into smaller patches, apply them on top of svn nutchbase (which, again, will be clean), then 1.1 changes can be applied by the _original_ committers, thus hopefully preserving contributor info as well.

Makes sense?

On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com wrote:
> Hi,
>
>> (a) deleting svn:nutchbase
>> (b) svn:importing Git Nutchbase.
>> (c) branch current 1.2-trunk as 1.2-branch
>> (d) iteratively apply patches from new svn:nutchbase to trunk to bring it up to snuff.
>> (e) roll the version # in nutch trunk to 2.0-dev
>> (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where it makes sense
>> (g) a 2.1 version is added to mark anything that we don't want in 2.0 and we file post 2.0 issues there
>> (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is removed. All unit tests should pass regression where it makes sense.
>> (i) Nutch documentation is brought up to date on wiki and checked into SVN
>> (j) We roll a 2.0 release
>
> +1
>
>> I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to participate in (d) and (f). I'm thinking Julien and Doğacan would be the best people to do (b) and (i).
>
> Doğacan is in the process of writing the documentation
>
>> (h) should be a result of all steps prior (a)-(g), and as for (j), I'd be happy to do (j) when the time comes. So, if I don't hear any objections, I'll do (a), (c), (e) and (g) tomorrow... (6/28, likely PM PST Los Angeles time)
>
> cool, thanks
>
> J.
>
> --
> DigitalPebble Ltd
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
Doğacan Güney
Re: Nutch 2.0
Hi Doğacan,

So your proposition is to combine (a) and (b) then? That’s fine by me, so long as there are no objections from others. I can still move forward with (c), (e) and (g) then...

Cheers,
Chris

On 6/28/10 8:39 AM, Doğacan Güney doga...@gmail.com wrote:
> Hey all,
>
> I will double check to make sure, but IIRC, there is no need to delete svn:nutchbase, since the current code on GitHub simply builds on top of it. So why not simply merge the GitHub branch into svn? It will be a clean merge...
>
> The only problem is that contributor info is messed up on GitHub, but I tried to preserve as much contributor info as possible when I pulled in the 1.1 changes (via git cherry-pick). So we can break the code on GitHub into smaller patches and apply them on top of svn nutchbase (which, again, will be clean); then the 1.1 changes can be applied by the _original_ committers, thus hopefully preserving contributor info as well.
>
> Makes sense?
>
> On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com wrote:
>> Hi,
>>
>>> (a) deleting svn:nutchbase
>>> (b) svn:importing Git Nutchbase.
>>> (c) branch current 1.2-trunk as 1.2-branch
>>> (d) iteratively apply patches from new svn:nutchbase to trunk to bring it up to snuff.
>>> (e) roll the version # in nutch trunk to 2.0-dev
>>> (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where it makes sense
>>> (g) a 2.1 version is added to mark anything that we don't want in 2.0 and we file post 2.0 issues there
>>> (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is removed. All unit tests should pass regression where it makes sense.
>>> (i) Nutch documentation is brought up to date on wiki and checked into SVN
>>> (j) We roll a 2.0 release
>>
>> +1
>>
>>> I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to participate in (d) and (f). I'm thinking Julien and Doğacan would be the best people to do (b) and (i).
>>
>> Doğacan is in the process of writing the documentation
>>
>>> (h) should be a result of all steps prior (a)-(g), and as for (j), I'd be happy to do (j) when the time comes.
>>>
>>> So, if I don't hear any objections, I'll do (a), (c), (e) and (g) tomorrow... (6/28, likely PM PST Los Angeles time)
>>
>> cool, thanks
>>
>> J.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: Nutch 2.0
Hi Guys,

And, let me clarify my OK’ness with this. My assumption is that, regardless of whether we physically svn:delete nutchbase in Apache SVN (the choice I went to after hearing there were *significant* changes in the Git version from that of the Apache one) and then import a fresh copy from Git, or whether we simply update Nutchbase in Apache SVN with Git patches (my original suggestion), in the end we are left with a Nutchbase branch that we can move forward from in Apache SVN.

If that is the case, then I think my suggested plan below applies either way and we can move forward...

Cheers,
Chris

On 6/28/10 8:57 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
> Hi Doğacan,
>
> So your proposition is to combine (a) and (b) then? That’s fine by me, so long as there are no objections from others. I can still move forward with (c), (e) and (g) then...
>
> Cheers,
> Chris
>
> On 6/28/10 8:39 AM, Doğacan Güney doga...@gmail.com wrote:
>> Hey all,
>>
>> I will double check to make sure, but IIRC, there is no need to delete svn:nutchbase, since the current code on GitHub simply builds on top of it. So why not simply merge the GitHub branch into svn? It will be a clean merge...
>>
>> The only problem is that contributor info is messed up on GitHub, but I tried to preserve as much contributor info as possible when I pulled in the 1.1 changes (via git cherry-pick). So we can break the code on GitHub into smaller patches and apply them on top of svn nutchbase (which, again, will be clean); then the 1.1 changes can be applied by the _original_ committers, thus hopefully preserving contributor info as well.
>>
>> Makes sense?
>>
>> On Mon, Jun 28, 2010 at 16:45, Julien Nioche lists.digitalpeb...@gmail.com wrote:
>>> Hi,
>>>
>>>> (a) deleting svn:nutchbase
>>>> (b) svn:importing Git Nutchbase.
>>>> (c) branch current 1.2-trunk as 1.2-branch
>>>> (d) iteratively apply patches from new svn:nutchbase to trunk to bring it up to snuff.
>>>> (e) roll the version # in nutch trunk to 2.0-dev
>>>> (f) all issues in JIRA should be updated to reflect 2.0-dev fixes where it makes sense
>>>> (g) a 2.1 version is added to mark anything that we don't want in 2.0 and we file post 2.0 issues there
>>>> (h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is removed. All unit tests should pass regression where it makes sense.
>>>> (i) Nutch documentation is brought up to date on wiki and checked into SVN
>>>> (j) We roll a 2.0 release
>>>
>>> +1
>>>
>>>> I'd be happy to do (a), (c), (e) and (g) tomorrow, and would like to participate in (d) and (f). I'm thinking Julien and Doğacan would be the best people to do (b) and (i).
>>>
>>> Doğacan is in the process of writing the documentation
>>>
>>>> (h) should be a result of all steps prior (a)-(g), and as for (j), I'd be happy to do (j) when the time comes.
>>>>
>>>> So, if I don't hear any objections, I'll do (a), (c), (e) and (g) tomorrow... (6/28, likely PM PST Los Angeles time)
>>>
>>> cool, thanks
>>>
>>> J.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
Re: Nutch 2.0
Okey dokey guys,

(c), (e) and (g) are done. Julien, Doğacan, your turn on (a) and (d) and then we can all work on (e) and (f)...

Cheers,
Chris

On 6/28/10 12:55 PM, Doğacan Güney doga...@gmail.com wrote:
> On Mon, Jun 28, 2010 at 20:23, Andrzej Bialecki a...@getopt.org wrote:
>> On 2010-06-28 17:57, Mattmann, Chris A (388J) wrote:
>>> Hi Doğacan,
>>>
>>> So your proposition is to combine (a) and (b) then? That’s fine by me, so long as there are no objections from others. I can still move forward with (c), (e) and (g) then...
>>
>> No objections from me - but IMHO to satisfy the legal minds you still need to produce a patch and attach it to an issue with the "Grant to ASF" checkbox marked...
>
> OK, I'll create a new issue in JIRA, and then attach a lot of patches :) I'll try to appropriately mark patches that are straightforward ports from nutch 1.1 into nutchbase so that the same committers can commit those patches _again_, hopefully preserving post nutch 1.0 history as much as possible.
>
>> (Also, I always shudder when I imagine a massive merge failing ... but that's probably a leftover from my CVS days when a failed merge would leave a completely broken tree.. ah, well, good luck :) ).
>
> I regularly do large merges in git and it works beautifully. We'll see how well SVN does :)
>
>> --
>> Best regards,
>> Andrzej Bialecki
>> Information Retrieval, Semantic Web
>> Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++