[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
[ https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500604#comment-14500604 ] Mattmann, Chris A (388J) commented on NUTCH-1927: - +1 please commit! Thanks seb Sent from my iPhone Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing --- Key: NUTCH-1927 URL: https://issues.apache.org/jira/browse/NUTCH-1927 Project: Nutch Issue Type: New Feature Components: fetcher Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Labels: available, patch Fix For: 1.10 Attachments: NUTCH-1927.2015-04-16.patch, NUTCH-1927.2015-04-17.patch, NUTCH-1927.Mattmann.041115.patch.txt, NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt, test_NUTCH-1927.2015-04-17.txt Based on discussion on the dev list, to use Nutch for some security research valid use cases (DDoS; DNS and other testing), I am going to create a patch that allows a whitelist: {code:xml} property namerobot.rules.whitelist/name value132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov/value descriptionComma separated list of hostnames or IP addresses to ignore robot rules parsing for. /description /property {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1832) Make Nutch work without an indexer
[ https://issues.apache.org/jira/browse/NUTCH-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121477#comment-14121477 ] Mattmann, Chris A (388J) commented on NUTCH-1832: - Will reply in more detail soon but will look into enabling plugin back then Sent from my iPhone Make Nutch work without an indexer -- Key: NUTCH-1832 URL: https://issues.apache.org/jira/browse/NUTCH-1832 Project: Nutch Issue Type: Bug Affects Versions: 1.9 Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.10 Attachments: NUTCH-1832.Mattmann.090314.patch.2.txt, NUTCH-1832.Mattmann.090314.patch.txt Nutch used to work out of the box, without requiring an indexing backend. As of 1.9, that's not the case anymore (it's possible even before that). Thanks to [~markus17] for pointing out that this is due to the indexing-solr plugin being enabled by default. We should disable it by default, so that the regression is removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Important : Bunch of Spam Created under Nutch Wiki!!
Hi Kiran, I would give comm...@nutch.apache.org. Please add ChrisMattmann as a username. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Monday, April 1, 2013 6:52 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!! Hi guys, Do you know what is the destination for commit mails ? Can I give 'dev@nutch.apache.org' ? I am planning on giving the below information so far for creating a moin wiki [1] Wiki Name: Nutch Usernames: LewisJohnMcgibbney, kiranchitturi, SebastianNagel, JulienNioche Destination for Commit mails: dev@nutch.apache.org Please let me know if any of the information is incorrect or needed any modifications. [1] - http://wiki.apache.org/general/OurWikiFarm#per_wiki_access_control_-_tight en_your_wiki_just_a_little.2C_benefit_just_a_lot On Sat, Mar 30, 2013 at 4:29 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Kiran, I think here: http://wiki.apache.org/general/OurWikiFarm#per_wiki_access_control_-_tight e n_your_wiki_just_a_little.2C_benefit_just_a_lot Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Saturday, March 30, 2013 12:55 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!! Does anyone know what details we need to provide for the new wiki controls ? I have posted a JIRA [0] to control our spam but the infrabot is asking more information [1] [0] - https://issues.apache.org/jira/browse/INFRA-6081 https://issues.apache.org/jira/browse/INFRA-6081 [1] - http://www.apache.org/dev/infra-contact#what-we-need-to-know On Thu, Mar 28, 2013 at 3:18 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Kiran, Yes, my recommendation: 1. Jump into #asfinfra on freeonode, find Joe, or Gavin or Daniel, ask for help. If you don't have IRC, email infrastruct...@apache.org mailto:infrastruct...@apache.org and/or file a https://issues.apache.org/jira/browse/INFRA https://issues.apache.org/jira/browse/INFRA ticket 2. Request that they enable ASAP ContributorsGroup only acls I know that many Apache wikis (MoinMon) are being attackedŠ Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, March 28, 2013 12:15 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Fwd: Important : Bunch of Spam Created under Nutch Wiki!! Thanks to Ken (check message below) for reporting our insecure wiki. I have checked it and anyone can create an fake account and edit any of our wiki pages or create new ones. When I first registered to the wiki, all the pages are immutable and Lewis had to add me to Contributors group to make changes to the wiki. Probably, the setting was hacked for now and that is the reason we are facing lot of spam. Can we contact the infra@apache and request them to lock down the wiki as the other groups did ? -- Forwarded message -- From: Ken Krugler kkrugler_li...@transpac.com Date: Thu, Mar 28, 2013 at 1:35 PM Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!! To: dev@nutch.apache.org
Re: Important : Bunch of Spam Created under Nutch Wiki!!
Thanks Kiran! ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Monday, April 1, 2013 12:30 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!! I have posted the information on the JIRA issue page [0]. Let's hope the issue will be taken care of soon. [0] - https://issues.apache.org/jira/browse/INFRA-6081 On Mon, Apr 1, 2013 at 3:27 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Kiran, On Mon, Apr 1, 2013 at 6:53 AM, dev-digest-h...@nutch.apache.org wrote: Re: Important : Bunch of Spam Created under Nutch Wiki!! 22926 by: kiran chitturi Hi guys, Do you know what is the destination for commit mails ? Can I give 'dev@nutch.apache.org' ? No, we should put commit emails to the styatic archive here http://mail-archives.apache.org/mod_mbox/nutch-commits/ Thanks for sorting this out Kiran, we are truly getting hounded with spam just now. Best Lewis -- Kiran Chitturi http://www.linkedin.com/in/kiranchitturi
Re: Nutch Wiki
Seconded! ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Saturday, March 30, 2013 3:07 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Nutch Wiki @Kiran Others who have been updating the wiki, Great work on the command line options and elsewhere where you guys have been cleaning up and writing better documentation for Nutch. This is a crucial part of the workload and is greatly appreciated. Have a great weekend. Lewis -- Lewis
Re: Important : Bunch of Spam Created under Nutch Wiki!!
Hi Kiran, Yes, my recommendation: 1. Jump into #asfinfra on freeonode, find Joe, or Gavin or Daniel, ask for help. If you don't have IRC, email infrastruct...@apache.org and/or file a https://issues.apache.org/jira/browse/INFRA ticket 2. Request that they enable ASAP ContributorsGroup only acls I know that many Apache wikis (MoinMon) are being attackedŠ Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, March 28, 2013 12:15 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Fwd: Important : Bunch of Spam Created under Nutch Wiki!! Thanks to Ken (check message below) for reporting our insecure wiki. I have checked it and anyone can create an fake account and edit any of our wiki pages or create new ones. When I first registered to the wiki, all the pages are immutable and Lewis had to add me to Contributors group to make changes to the wiki. Probably, the setting was hacked for now and that is the reason we are facing lot of spam. Can we contact the infra@apache and request them to lock down the wiki as the other groups did ? -- Forwarded message -- From: Ken Krugler kkrugler_li...@transpac.com Date: Thu, Mar 28, 2013 at 1:35 PM Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!! To: dev@nutch.apache.org Hi Kiran, On Mar 28, 2013, at 2:03am, kiran chitturi wrote: Thank you Ken for the information. I think the access is already restricted to Contributors Only. Someone can please confirm, if it is not. It's not, as far as I know. I just created a fake account, logged in with it, and edited the front page. If anyone needs to edit wiki, they would need to ask someone to get access to wiki pages. Do you know if Solr still got hit by spam after locking down the wiki ? I think that change helped cut down most of the spam, but I don't monitor the Solr list that closely, sorry. -- Ken On Thu, Mar 28, 2013 at 1:40 AM, Ken Krugler kkrugler_li...@transpac.com wrote: On Mar 27, 2013, at 6:54pm, kiran chitturi wrote: Thank you Binoy for reporting. We have been monitoring the pages and deleting them when we get time but there are more coming up. Today, I have seen a spam editing on the home page of Nutch wiki. It has inserted spam links under tutorials. We need to find a permanent solution to this. I wonder if any other list-servs are facing the same issue. Yes - Solr recently had to lock down editing on their wiki: The wiki at http://wiki.apache.org/solr/ has come under attack by spammers more frequently of late, so the PMC has decided to lock it down in an attempt to reduce the work involved in tracking and removing spam. From now on, only people who appear on http://wiki.apache.org/solr/ContributorsGroup will be able to create/modify/delete wiki pages. Please request either on the solr-u...@lucene.apache.org or on d...@lucene.apache.org to have your wiki username added to the ContributorsGroup page - this is a one-time step. So I think you need to make a request to Infra to lock down the wiki, then add people (generally in response to explicit requests) to the ContributorsGroup page. -- Ken On Thu, Mar 28, 2013 at 12:49 AM, Binoy d binoy...@gmail.com wrote: I am quite suprised looking at the notification I am getting for new pages for Nutch Wiki Example : http://wiki.apache.org/nutch/KarlPuent I see at least 25-35 emails regarding such notification. All of the links I got are rooted under http://wiki.apache.org/nutch/ http://wiki.apache.org/nutch/ Is some one looking into this , If needed I can gladly forward emails to the person cleaning it up as I am not sure if every one has access to delete the pages. Regards, b -- Forwarded message -- From: Apache Wiki wikidi...@apache.org Date: Wed, Mar 27, 2013 at 9:32 PM Subject: [Nutch Wiki] Trivial Update of EdwinaBro by EdwinaBro To: Apache Wiki wikidi...@apache.org Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The EdwinaBro page has been changed by EdwinaBro: http://wiki.apache.org/nutch/EdwinaBro New page: I am 24 years old and my name is Edwina Brownlee. I life in Corjolens (Switzerland).BR BR BR Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue http://modform.org/SolomonKr%7CContinue]] -- Kiran Chitturi
Re: [Nutch Wiki] Trivial Update of PGOSimone by PGOSimone
Hey Julien, I heard on #asfinfra that any of our MoinMoin wikis have been attacked recently by SPAM. I think we may want to contact infra and ask for specific ContributorsGroup only Nutch wiki access. http://wiki.apache.org/general/OurWikiFarm Cheers, Chris From: Julien Nioche lists.digitalpeb...@gmail.commailto:lists.digitalpeb...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Monday, March 25, 2013 1:55 AM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: [Nutch Wiki] Trivial Update of PGOSimone by PGOSimone I thought we had to have a login / password to modify the Wiki. If so how come we got so much spam lately? Julien On 25 March 2013 04:26, Apache Wiki wikidi...@apache.orgmailto:wikidi...@apache.org wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The PGOSimone page has been changed by PGOSimone: http://wiki.apache.org/nutch/PGOSimone [..snip..] -- [http://digitalpebble.com/img/logo.gif] Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: GSOC 2013 project: Apache-Wicket based Nutch webapp
Hi Evert, Thanks. Velocity would be fine, but the big issue is that I don't know Velocity, and I know Wicket. The great part about Wicket is that it's pure XHTML + Java code. No config, no anything in-between. So if you understand the component model of widgets behind the scenes, and understand HTML, JS and CSS, you can easily maintain a Wicket web app. And as a Nutch PMC member there's one person here at least (me) who's willing to maintain and steward such a web app. So we're in business! Cheers, Chris From: Evert Wagenaar evert.wagen...@yahoo.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org, Evert Wagenaar evert.wagen...@mint.nl Date: Sunday, March 24, 2013 12:46 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp I agree as well. The jsp version has become a mess and is currently almost not s supportable anymore. Would velocity be a good alternative? It is very good with Solr Facets and also fits into any CMS. Evert Wagenaar evert.wagen...@me.com +31 653 606 293 From: kiran chitturi chitturikira...@gmail.com To: dev@nutch.apache.org Sent: Saturday, March 23, 2013 9:36 PM Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Thank you Chris for your interest. I would love to share my thesis and the work but I am still in experimenting stage and I will share with you soon once I have a decent UI running with functionalities. Regards, Kiran. On Sat, Mar 23, 2013 at 2:33 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: That is so awesome Kiran. Great job and I would love a link to your thesis (or even seeing the work in progress) if you are willing to share and have the time. Good plane reading material for me and congrats again. Looking forward to working with you. Cheers, Chris From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Saturday, March 23, 2013 9:54 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Thanks Chris! I am planning to graduate with Masters degree in Computer Science from Virginia Tech University and my advisor is Dr.Fox. My thesis work mostly relates to building search engine for the 10TB crises event data that we have collected over last three years. The data is collected using Internet Archive crawler (Archive-it) and I am indexing data using LucidWorks Big Data Software. The process also involves finding more metadata and clustering. All of this work is related to 'Crisis, Tragedy and Recovery Network Project (CTRnet)' (www.ctrnet.net http://www.ctrnet.net/) My thesis, library work and Nutch are all closely related. It has been a great learning experience so far :) On Sat, Mar 23, 2013 at 12:23 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Kiran, Awesome that works fine for me! Happy to have you contribute, and whether you are a formal mentor or not, if we get a GSoC 2013 student for this you can help me, Lewis, (and others) shepherd it in! Thanks man and congrats on graduating soon! Where are you graduating from and in what subject? Cheers, Chris From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Saturday, March 23, 2013 8:51 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp I am very much interested in the Apache Wicket project but I wouldn't be able to be a student since i am finishing my graduation and looking for full-time jobs. I have discussed with Lewis previously about this, and it wouldn't be ideal for me to be a GSoc 2013 student as I can't devote my full-time work on this. However, I will be very happy to work on this in my free time. This is something I am interested in for long time and I would try to contribute in anyway possible. Thank you, Kiran. On Sat, Mar 23, 2013 at 11:23 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Kiran, Great, yes the REST services need work for sure. They haven't been worked on in a while. I'm privy to Apache CXF, but I haven't done anything with it, and Andrzej did an awesome job using Restlet, so we've got Reslet for now. If you are interested in documenting the services, then awesome! Do you want to be a GSoC 2013 student, and are you interested in this project? Cheers, Chris From: kiran chitturi chitturikira...@gmail.com Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Friday, March 22, 2013 9:19 PM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Hi Chris, I was just thinking about that this evening. First, to start with this I want to do well documentation of the Nutch REST API. What is the status of Rest API ? Does it need any
Re: Google Summer of Code 2013 - Giraph implementation of Nutch LinkRank Algorithm
Super +1 -- sounds awesome Lewis. Cheers, Chris On 3/24/13 12:38 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi All, After some discussion and drumming up of interest within the Giraph community, I've logged a Google Summer of Code issue [0] for this topic. We are looking for interested students to come forward and participate in the effort. I logged this over in Giraph as there was no GSoC eefort already going on there, we already have an issue for the Wicket-based User Interface implementation in Nutch. I would be very happy if people (users and developers) could chime in on the thread so we can get the project started with the right direction and intention in mind. I propose this for Nutch TRUNK. Thanks for now Best Lewis [0] https://issues.apache.org/jira/browse/GIRAPH-584 -- *Lewis*
Re: GSOC 2013 project: Apache-Wicket based Nutch webapp
Cool thanks! From: kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Saturday, March 23, 2013 1:36 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Thank you Chris for your interest. I would love to share my thesis and the work but I am still in experimenting stage and I will share with you soon once I have a decent UI running with functionalities. Regards, Kiran. On Sat, Mar 23, 2013 at 2:33 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: That is so awesome Kiran. Great job and I would love a link to your thesis (or even seeing the work in progress) if you are willing to share and have the time. Good plane reading material for me and congrats again. Looking forward to working with you. Cheers, Chris From: kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Saturday, March 23, 2013 9:54 AM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Thanks Chris! I am planning to graduate with Masters degree in Computer Science from Virginia Tech University and my advisor is Dr.Fox. My thesis work mostly relates to building search engine for the 10TB crises event data that we have collected over last three years. The data is collected using Internet Archive crawler (Archive-it) and I am indexing data using LucidWorks Big Data Software. The process also involves finding more metadata and clustering. All of this work is related to 'Crisis, Tragedy and Recovery Network Project (CTRnet)' (www.ctrnet.nethttp://www.ctrnet.net) My thesis, library work and Nutch are all closely related. It has been a great learning experience so far :) On Sat, Mar 23, 2013 at 12:23 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: Hi Kiran, Awesome that works fine for me! Happy to have you contribute, and whether you are a formal mentor or not, if we get a GSoC 2013 student for this you can help me, Lewis, (and others) shepherd it in! Thanks man and congrats on graduating soon! Where are you graduating from and in what subject? Cheers, Chris From: kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Saturday, March 23, 2013 8:51 AM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp I am very much interested in the Apache Wicket project but I wouldn't be able to be a student since i am finishing my graduation and looking for full-time jobs. I have discussed with Lewis previously about this, and it wouldn't be ideal for me to be a GSoc 2013 student as I can't devote my full-time work on this. However, I will be very happy to work on this in my free time. This is something I am interested in for long time and I would try to contribute in anyway possible. Thank you, Kiran. On Sat, Mar 23, 2013 at 11:23 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: Hi Kiran, Great, yes the REST services need work for sure. They haven't been worked on in a while. I'm privy to Apache CXF, but I haven't done anything with it, and Andrzej did an awesome job using Restlet, so we've got Reslet for now. If you are interested in documenting the services, then awesome! Do you want to be a GSoC 2013 student, and are you interested in this project? Cheers, Chris From: kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Friday, March 22, 2013 9:19 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Hi Chris, I was just thinking about that this evening. First, to start with this I want to do well documentation of the Nutch REST API. What is the status of Rest API ? Does it need any fixes and working examples ? Hopefully my start would be helpful and it be soon. Thanks for opening up the issue. Regards, kIran. On Fri, Mar 22, 2013 at 11:43 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, I posted: https://issues.apache.org/jira/browse/NUTCH-841 As a potential GSOC 2013 summer project
Re: GSOC 2013 project: Apache-Wicket based Nutch webapp
Hi Kiran, Awesome that works fine for me! Happy to have you contribute, and whether you are a formal mentor or not, if we get a GSoC 2013 student for this you can help me, Lewis, (and others) shepherd it in! Thanks man and congrats on graduating soon! Where are you graduating from and in what subject? Cheers, Chris From: kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Saturday, March 23, 2013 8:51 AM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp I am very much interested in the Apache Wicket project but I wouldn't be able to be a student since i am finishing my graduation and looking for full-time jobs. I have discussed with Lewis previously about this, and it wouldn't be ideal for me to be a GSoc 2013 student as I can't devote my full-time work on this. However, I will be very happy to work on this in my free time. This is something I am interested in for long time and I would try to contribute in anyway possible. Thank you, Kiran. On Sat, Mar 23, 2013 at 11:23 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: Hi Kiran, Great, yes the REST services need work for sure. They haven't been worked on in a while. I'm privy to Apache CXF, but I haven't done anything with it, and Andrzej did an awesome job using Restlet, so we've got Reslet for now. If you are interested in documenting the services, then awesome! Do you want to be a GSoC 2013 student, and are you interested in this project? Cheers, Chris From: kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Friday, March 22, 2013 9:19 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp Hi Chris, I was just thinking about that this evening. First, to start with this I want to do well documentation of the Nutch REST API. What is the status of Rest API ? Does it need any fixes and working examples ? Hopefully my start would be helpful and it be soon. Thanks for opening up the issue. Regards, kIran. On Fri, Mar 22, 2013 at 11:43 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, I posted: https://issues.apache.org/jira/browse/NUTCH-841 As a potential GSOC 2013 summer project. I'm willing to mentor it, since I love Wicket, and I'm willing to maintain the result as a Nutch committer. If NUTCH-841 doesn't get selected, I'll start implementing it this summer if no one beats me to it. Cheers, Chris -- Kiran Chitturi [http://www.linkedin.com/img/webpromo/btn_viewmy_160x33.png]http://www.linkedin.com/in/kiranchitturi -- Kiran Chitturi [http://www.linkedin.com/img/webpromo/btn_viewmy_160x33.png]http://www.linkedin.com/in/kiranchitturi
GSOC 2013 project: Apache-Wicket based Nutch webapp
Hey Guys, I posted: https://issues.apache.org/jira/browse/NUTCH-841 As a potential GSOC 2013 summer project. I'm willing to mentor it, since I love Wicket, and I'm willing to maintain the result as a Nutch committer. If NUTCH-841 doesn't get selected, I'll start implementing it this summer if no one beats me to it. Cheers, Chris
FW: GSoC 2013
[Apologies for cross post] Guys, to play in the GSoC 2013 spec, we just need to tag issues in JIRA with the gsoc2013 tag. I'll try and come up with few projects soon :) Cheers, Chris On 3/15/13 11:15 AM, Luciano Resende luckbr1...@gmail.com wrote: On Fri, Mar 15, 2013 at 11:01 AM, Manish Agrawal text2man...@gmail.com wrote: Hi I am Manish Agrawal, a 3rd year student of Mathematics and computing department from IIT Delhi. I want to participate in GSoC 2013 through one of the ASF projects. I would be really thankful if you could please suggest me how should I proceed for the same. Hoping for a reply. Thanks Manish Agrawal Google is sponsoring GSoC 2013, and Apache Software Foundation is planing to participate again. More information about Apache Participation in GSoC is available at : http://community.apache.org/gsoc.html. The proper way to find a project idea would be to identify an Apache Project in the area of your interest and start discussions with them via the project mailing list. The projects are starting to create their project ideas, and you can start browsing them at https://issues.apache.org/jira/secure/IssueNavigator!executeAdvanced.jspa? jqlQuery=labels+=+gsoc2013runQuery=trueclear=true -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/
Re: [ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer
This is great to hear Kiran, welcome to the team! Cheers, Chris From: Julien Nioche lists.digitalpeb...@gmail.commailto:lists.digitalpeb...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Sunday, March 10, 2013 2:15 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Re: [ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer Great to hear about your use of Nutch at your library and welcome on board Kiran! Julien On 10 March 2013 01:27, kiran chitturi chitturikira...@gmail.commailto:chitturikira...@gmail.com wrote: Thanks a lot guys for inviting me and for the wishes. I am a graduate student in Virginia Tech University doing my Masters in Computer Science. I have been using Apache Nutch for the last one year as part of my assistantship with our University Library. The Digital Libraries and Archives division of our libraries was using Google Mini Search Engine for their website that hosts 600k files but Google Mini was no longer supported and we want to try building Search Engine using Open Source technologies. That is when i started my journey with Nutch and we were able to successfully achieve our Goals using Nutch and Solr. The library was pleased with the project and they are more interested now to work with Open Source software whenever possible. I liked working with Nutch community and it has been a great learning experience for me. I would like to learn and contribute back even after my graduation. Few things that I have in my mind right now other than committing patches are to improve our documentation (Wiki), helping users to my best and also to start the Apache Wicket UI work soon for 2.x in Nutch. Regards, Kiran. On Sat, Mar 9, 2013 at 4:06 PM, Tejas Patil tejas.patil...@gmail.commailto:tejas.patil...@gmail.com wrote: Welcome aboard Kiran :) On Sat, Mar 9, 2013 at 12:56 PM, lewis john mcgibbney lewi...@apache.orgmailto:lewi...@apache.org wrote: Hi All, Over the last while we have been aware of Kiran's ongoing contribution to the Nutch community. It is with great pleasure that we invite Kiran to join the Nutch PMC and also take up Committer role. @Kiran, please feel free to say a bit about yourself and introduce what brought you to Apache Nutch. Have a great weekend. Best Lewis -- Kiran Chitturi -- [http://digitalpebble.com/img/logo.gif] Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
FW: [OPENING] Google Summer of Code Applications
FYI On 3/10/13 5:10 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: I just told a huge lie. I got my dates mixed up... Students have from between April 22nd and May 3rd to get proposals in. Sorry about the mix up. Lewis On Sun, Mar 10, 2013 at 5:09 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi All, We have from the 18th until the 29th to submit this years GSoC proposals[0]. Just a gentle reminder for any potential guys wanting to formally apply... The idea would be to sort out any discrepancies just now and to develop your proposal to a comprehensive standard. I am interested in mentoring another project this year, so can work with folks who wish to progress with proposals. Thanks Lewis [0] http://www.google-melange.com/gsoc/events/google/gsoc2013 -- *Lewis* -- *Lewis*
Re: Review board giving issue
Hi Tejas, Yeah I was having some issue at the time, but will try and see if it is working tomorrow. If it's still not working we can contact infra@ Cheers, Chris From: Tejas Patil tejas.patil...@gmail.commailto:tejas.patil...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Tuesday, March 5, 2013 9:07 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: Review board giving issue Hi all, I am trying to use review board to upload a patch for a Jira and it is giving me same issue as I had before [0]. Below are the steps that I follow: 1. Generate a patch file using svn diff command. 2. On review board page, I select repository as Nutch 3. Repository as https://svn.apache.org/repos/asf/nutch/trunk; (the patch is for 1.x) 4. Attach the diff file. I used to follow the same steps at work and it worked out well. But over here I get this error message: The file 'https://svn.apache.org/repos/asf/nutch/trunk/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java' (r1453161) could not be found in the repository There was a review request in nutch group in last month [1] after the thread [0]. So I have a feeling that there is something weird with my account or I am doing something wrong. Can anyone help me here ? [0] : http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/%3cfa2d97dfc830824e9040174e7f89744925085...@ap-embx-sp40.res.ad.jpl%3E [1] : https://reviews.apache.org/r/9119/
Re: [DISCUSS] Google Summer of Code
Hey Lewis, Great job starting this thread. +1 Giraph is welcome here. Multi-project GSoCs always do well. One thing I had in mind was taking an implementation of Hubs and Authorities developed for Nutch 1.3 a few years back in my USC class and then having someone integrate it into the current Nutch 1.x branch to start. If folks are interested I can create a JIRA. Cheers, Chris From: Lewis John Mcgibbney lewis.mcgibb...@gmail.commailto:lewis.mcgibb...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Monday, March 4, 2013 12:23 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: [DISCUSS] Google Summer of Code Hi All, I thought I would ask the question as to who (if anyone) is intending on engaging as a mentor (or student if you are one) within this years GSoC project. There are plenty of projects we could do within Nutch. Obvious ones that come to mind are - Wicket webapp for Nutch 2.x - Integration of Giraph with Nutch We already have one proposal which I would consider mentoring over on Apache Gora, but I will certainly not back down from any proposals in Nutch. Would the Giraph project be welcomed here? If so I can head over to user@ Giraph in an attempt to attract interest. Of course this is a discussion based on what folks want to do and the list above should be added to. Thanks for now Lewis -- Lewis
Re: [DISCUSS] Google Summer of Code
Hey Markus, Yep my student implement HITS (on the fly) ranking, and classification (I think). It's sitting on my HD for 2 years :( So if someone can pick it up it would be a nice GSoC project. Glad to hear there is interest. Cheers, Chris On 3/4/13 1:21 PM, Markus Jelsma markus.jel...@openindex.io wrote: Chris! Do you mean automatic classification of hub and authority pages? If so, we're more than interested in that. This is still an issue for our site search platform and one that haven't given much more attention than some research and prototypes. Cheers -Original message- From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov Sent: Mon 04-Mar-2013 22:02 To: dev@nutch.apache.org Subject: Re: [DISCUSS] Google Summer of Code Hey Lewis, Great job starting this thread. +1 Giraph is welcome here. Multi-project GSoCs always do well. One thing I had in mind was taking an implementation of Hubs and Authorities developed for Nutch 1.3 a few years back in my USC class and then having someone integrate it into the current Nutch 1.x branch to start. If folks are interested I can create a JIRA. Cheers, Chris From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com mailto:lewis.mcgibb...@gmail.com Reply-To: dev@nutch.apache.org mailto:dev@nutch.apache.org dev@nutch.apache.org mailto:dev@nutch.apache.org Date: Monday, March 4, 2013 12:23 PM To: dev@nutch.apache.org mailto:dev@nutch.apache.org dev@nutch.apache.org mailto:dev@nutch.apache.org Subject: [DISCUSS] Google Summer of Code Hi All, I thought I would ask the question as to who (if anyone) is intending on engaging as a mentor (or student if you are one) within this years GSoC project. There are plenty of projects we could do within Nutch. Obvious ones that come to mind are - Wicket webapp for Nutch 2.x - Integration of Giraph with Nutch We already have one proposal which I would consider mentoring over on Apache Gora, but I will certainly not back down from any proposals in Nutch. Would the Giraph project be welcomed here? If so I can head over to user@ Giraph in an attempt to attract interest. Of course this is a discussion based on what folks want to do and the list above should be added to. Thanks for now Lewis -- Lewis
Re: [DISCUSS] Google Summer of Code
Hey Markus: https://issues.apache.org/jira/browse/NUTCH-1539 Will submit the code soon. Cheers, Chris On 3/4/13 1:43 PM, Markus Jelsma markus.jel...@openindex.io wrote: Ah yes! Please open an issue and if you can attach anything that matters such as a description of the algorithm, how it should work with Nutch/MapReduce or even code/tests. If there's code i may be able to patch it up for trunk rather quickly and see how it performs. Cheers, Markus -Original message- From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov Sent: Mon 04-Mar-2013 22:27 To: dev@nutch.apache.org Subject: Re: [DISCUSS] Google Summer of Code Hey Markus, Yep my student implement HITS (on the fly) ranking, and classification (I think). It's sitting on my HD for 2 years :( So if someone can pick it up it would be a nice GSoC project. Glad to hear there is interest. Cheers, Chris On 3/4/13 1:21 PM, Markus Jelsma markus.jel...@openindex.io wrote: Chris! Do you mean automatic classification of hub and authority pages? If so, we're more than interested in that. This is still an issue for our site search platform and one that haven't given much more attention than some research and prototypes. Cheers -Original message- From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov Sent: Mon 04-Mar-2013 22:02 To: dev@nutch.apache.org Subject: Re: [DISCUSS] Google Summer of Code Hey Lewis, Great job starting this thread. +1 Giraph is welcome here. Multi-project GSoCs always do well. One thing I had in mind was taking an implementation of Hubs and Authorities developed for Nutch 1.3 a few years back in my USC class and then having someone integrate it into the current Nutch 1.x branch to start. If folks are interested I can create a JIRA. Cheers, Chris From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com mailto:lewis.mcgibb...@gmail.com Reply-To: dev@nutch.apache.org mailto:dev@nutch.apache.org dev@nutch.apache.org mailto:dev@nutch.apache.org Date: Monday, March 4, 2013 12:23 PM To: dev@nutch.apache.org mailto:dev@nutch.apache.org dev@nutch.apache.org mailto:dev@nutch.apache.org Subject: [DISCUSS] Google Summer of Code Hi All, I thought I would ask the question as to who (if anyone) is intending on engaging as a mentor (or student if you are one) within this years GSoC project. There are plenty of projects we could do within Nutch. Obvious ones that come to mind are - Wicket webapp for Nutch 2.x - Integration of Giraph with Nutch We already have one proposal which I would consider mentoring over on Apache Gora, but I will certainly not back down from any proposals in Nutch. Would the Giraph project be welcomed here? If so I can head over to user@ Giraph in an attempt to attract interest. Of course this is a discussion based on what folks want to do and the list above should be added to. Thanks for now Lewis -- Lewis
Re: Nutch JAVA Application
Hi Shann, Thank you for reaching out! If your goal is to get your project integrated into Apache Nutch, proper, then I would recommend simply: 0. File some JIRA issues in Apache Nutch http://issues.apache.org/jira/browse/NUTCH Small incremental patches and issues are preferred and this will let people know what your plan is so you can get committers and PMC members attention. 1. svn co http://svn.apache.org/repos/asf/nutch/branches/2.x/ 2. cd 2.x 3. Edit files 4. svn status (make sure the files you edited looked correct) 6. svn diff NUTCH-xxx.sleduc.yyMMdd.patch.txt for each issue you created 7. Attach patches from #6 to issues from #1 Otherwise if you go off onto Github, and work it's going to be harder to get your patch accepted since it will represent large change when instead you can effect the change here at the ASF, incrementally making sure your code gets in. ALv2 is the license to use, BTW, either way you decide. Cheers, Chris On 2/12/13 12:25 PM, Shann stanislas.le...@mailoo.org wrote: Hi, Part of my internship, we must develop a specialized search engine using Nutch, Solr, HBase, Tika. I began to develop a Java application for crawler with Nuth branch 2.x. Functions inject, generate, fetch, parse, updatedb, solrindex based on the actual execution of nutch via a shell command from Java application. As an advocate of free software, I propose therefore to give you access to my git project. Using nutch in the background, under what license should I put my application ? -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-JAVA-Application-tp4040050.html Sent from the Nutch - Dev mailing list archive at Nabble.com.
FW: [GSoC Mentors] Google Summer of Code 2013
[Sorry for cross posting] Guys, FYI please note that you can participate as a mentor from a PMC via Apache as they are a GSoC org. ComDev will coordinate our participation but start thinking about what projects we may want to do. Cheers, Chris From: Carol Smith car...@google.commailto:car...@google.com Date: Monday, February 11, 2013 11:02 AM To: Google Summer of Code Mentors List google-summer-of-code-mentors-l...@googlegroups.commailto:google-summer-of-code-mentors-l...@googlegroups.com Subject: [GSoC Mentors] Google Summer of Code 2013 Hi GSoC mentors and org admins, We've announced that we're doing Google Summer of Code 2013 [1]. Yay! If you would like to help spread the word about GSoC, we have presentations [2], logos [3], and flyers [4] for you to use. Please host meetups, tell your friends and colleagues about the program, go to conferences, talk to people about the program, and just generally do all the awesome word-of-mouth stuff you do every year to promote the program. The GSoC calendar, FAQ, and events timeline have all been updated with this year's important dates, so please refer to those for the milestones for this year's program. NB: the normal timeline for the program has been modified for this year. You'll probably want to examine the dates closely to make sure you know when important things are happening. Please consider translating the presentations and/or flyers into your native language and submitting them directly to me to post on the wiki. Localization for our material is integral to reaching the widest possible audience around the world. If you decide to translate a flyer, please fill out our form to request a thank you gift for your effort. [5] If you decide to host a meetup, please email me to let me know the date, time, and location so I can put it on the GSoC calendar. Also, remember to take pictures at your meetup and write up a blog post for our blog using our provided template for formatting [6]. If you need promotional items for your attendees, please fill out our form [7] to request some; we're happy to send some along. We can provide up to about 25 pens, notebooks, or stickers and/or a few t-shirts. Please keep in mind, though, that shipping restrictions and timeline vary country-to-country; request items early to make sure they get there on time! If you have questions about hosting meetups, please see the section in our FAQ [8]. Please consider applying to participate as an organization again this year or maybe joining as a mentor for your favorite organization if they are selected this year. We rely on you for your help for the success of this program, so thank you in advance for all the work you do! [1] - http://google-opensource.blogspot.com/2013/02/flip-bits-not-burgers-google-summer-of.html [2] - http://code.google.com/p/google-summer-of-code/wiki/ProgramPresentations [3] - http://code.google.com/p/google-summer-of-code/wiki/GsocLogos [4] - http://code.google.com/p/google-summer-of-code/wiki/GsocFlyers [5] - http://goo.gl/gEHDO [6] - http://goo.gl/wbZrt [7] - http://goo.gl/0BsR8 [8] - http://goo.gl/2NGfp Cheers, Carol -- You received this message because you are subscribed to the Google Groups Google Summer of Code Mentors List group. To unsubscribe from this group and stop receiving emails from it, send an email to google-summer-of-code-mentors-list+unsubscr...@googlegroups.commailto:google-summer-of-code-mentors-list+unsubscr...@googlegroups.com. To post to this group, send email to google-summer-of-code-mentors-l...@googlegroups.commailto:google-summer-of-code-mentors-l...@googlegroups.com. Visit this group at http://groups.google.com/group/google-summer-of-code-mentors-list?hl=en-US. For more options, visit https://groups.google.com/groups/opt_out.
Re: [DISCUSS] Nutch Policy/Opinion on Review Board
I love it and will use it but don't think it needs to be a policy to each their own :) Thanks buddy Sent from my iPhone On Jan 31, 2013, at 3:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi All, I thought I would create this thread as the Review Board platform has been floating around now for a bit and I wonder if we can leverage it to improve/streamline the efficiency of Nutch community contributions. So I thought I'd leave this thread nice and short. 1) I am new to Review Board. I don't know much about it. I haven't used it before. 2) I am interested to see if we can make contributions and particularly reviewing a more open and transparent process. 3) I want to hear what you guys think. Some links which may be of interest [0][1][2] Ta Lewis [0] https://blogs.apache.org/infra/entry/reviewboard_instance_running_at_the [1] https://reviews.apache.org [2] http://www.reviewboard.org/ -- Lewis
Re: review board
Hey Tejas, Yeah I think this has to do with something in the repo URL on the RB server side. I would file an INFRA ticket, or jump on #asfinfra on IRC and ask one of the guys for help there. Cheers, Chris From: Tejas Patil tejas.patil...@gmail.commailto:tejas.patil...@gmail.com Reply-To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Date: Friday, January 25, 2013 10:28 PM To: dev@nutch.apache.orgmailto:dev@nutch.apache.org dev@nutch.apache.orgmailto:dev@nutch.apache.org Subject: review board Hi, Has anyone recently faced an issue with Review Board while uploading a patch ? I created a patch for a change and tried to upload it via web UI of review board. It says: The file 'https://svn.apache.org/repos/asf/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java' (r1438860) could not be found in the repository Quite similar to the description given in [0]. HttpBase.java exists at the link given. My patch involves few changes to it. I think what I did is right, but still want to confirm. I generated the patch file using svn diff command. I am using svn, version 1.7.5. The patch was for nutch trunk. For uploading, I obtained the base directory from svn info command. Meanwhile I am googling for this issue, it would be great if someone can point out the problem here. [0] : https://issues.apache.org/jira/browse/INFRA-5046 Thanks, Tejas Patil
Re: 1.8 in Jira
woot yep ;) On 12/21/12 2:55 AM, Markus Jelsma markus.jel...@openindex.io wrote: forget it, i meant 1.7 but it's there already! -Original message- From:Markus Jelsma markus.jel...@openindex.io Sent: Fri 21-Dec-2012 11:54 To: dev@nutch.apache.org dev@nutch.apache.org Subject: 1.8 in Jira Anyone here with rights to add 1.8 to Jira? Thanks
Re: [VOTE] Apache Nutch 1.6 Release Candidate
Thanks guys. I should review this today. Cheers, Chris On Nov 29, 2012, at 5:31 AM, Lewis John Mcgibbney wrote: Hi, On Wed, Nov 28, 2012 at 10:11 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: - CHANGES.txt contains dates in both MM/DD/ and DD/MM/ formats. Shall we write the month in text form e.g. 7th July 2012 from now on? Done - Don't we need to have signatures as part of the RC? Done, thanks for the attention to detail Julien. Best Lewis
Re: Strategy for Assigning Issues by Version
Hey Lewis, On Nov 29, 2012, at 5:54 AM, Lewis John Mcgibbney wrote: Hi All, Right now I found myself facing a bit of a dilemma w.r.t bumping on the issues for the next Nutch release. Currently due to legacy workflows, we have some 120 issues assigned for 1.6... however ALL issues have been addressed for 1.6 meaning that the 120 issues are for 1.6 however not necessarily for 1.7. I would just set them for 1.7. I just use N+1 as the next release whether or not we actually plan to solve them for 1.7. Then when 1.7 comes along you can bump those 1.7s that we didn't get to, to 1.8, etc. A suggestion from myself, can I mark these issues as no fix version? This means that we can carve/manufacture the next development drive to what developers want to fix and to what features requests we receive from the community rather than sitting with a constant pile of issues which are always for the next development drive. Marking them as no fix version destroys pretty important reporting that I like to use which is pulling up a list of all the upcoming issues of relevance set for the next release. Without setting a Fix version you have to use the other JIRA search tools to search by things other than next version. Additionally, may I suggest (and please shoot me down here if I sound cheeky) that we make it a priority in the next development drive, to harness the issues which are marked as patch submitted? It seems to be a waste for such issues to be stagnating. I am conscious that this comment may sound wide of me, this is not the intention, I do think however that it would be nice to work our way towards Nucth releases in a more strategic manner than we have been doing. Hopefully this proposal is a step in the right direction. +50. That was one of my keys to success when I had more time. I would look for issues sitting with patches and just commit them. If I can wrangle some Nutch time over Christmas, I'll do a bunch of this as well. :) Thanks for any feedback. The issue at the top I suppose is the most important one in the short term. Cheers my friend. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Strategy for Assigning Issues by Version
+50 :) On Nov 29, 2012, at 8:32 AM, Lewis John Mcgibbney wrote: So in summary, We retain the legacy behavior and bump them ALL to 1.7 In the 1.7 development drive (if and when we can) we make an effort to act on patched issues in an attempt to pick the low hanging fruit so to speak... if such a thing exists. best Lewis On Thu, Nov 29, 2012 at 3:56 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Good idea! I suspect that most of them will be dating from a looong time ago and it won't be such a straightforward task to apply them, however this would be a good way of sorting them Additionally, may I suggest (and please shoot me down here if I sound cheeky) that we make it a priority in the next development drive, to harness the issues which are marked as patch submitted? It seems to be a waste for such issues to be stagnating. I am conscious that this comment may sound wide of me, this is not the intention, I do think however that it would be nice to work our way towards Nucth releases in a more strategic manner than we have been doing. Hopefully this proposal is a step in the right direction. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis
Re: [DISCUSS] trunk release?
Release early, release often :) I'd say I'd be happy to try and spin it, but you'd beat me to it so I just will say I'll be happy to test the RC and voice my VOTE when you roll it Lewis :) Happy Thanksgiving (even though you're not in the States yet)! Cheers, Chris On Nov 22, 2012, at 7:15 AM, Lewis John Mcgibbney wrote: Hi All, A while ago I asked if it was time to get another release of trunk... Markus expressed the valid opinion that there were some issues with recently committed material that had maybe not been given the chance to mature enough and that could do with more testing. So far in trunk (since 1.5.1), we've resolved some 45 issues [0], but we have some critical issues open [1] which could do with some attention as well. None of these issues are mine therefore I don't know how those of us feel (with patches available) about integrating these issues prior/post 1.6 release... or indeed whether a 1.6 release is welcomed at the moment? The codebase seems to be stable and getting better so from my perspective I would back a 1.X release. All the best for now Lewis [0] http://tinyurl.com/cf3vcpr [1] http://tinyurl.com/d4omnrc -- Lewis
Re: [ANNOUNCE] Apache Nutch 2.1 Released
Great job everyone! Cheers, Chris On Oct 5, 2012, at 9:29 AM, Julien Nioche wrote: Thanks Lewis and well done everyone! Enjoy your week end Julien On 5 October 2012 16:12, lewis john mcgibbney lewi...@apache.org wrote: Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.1. This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search, amongst various others. A full PMC Announcement can be seen here [0] Thanks you, have a great weekend on behalf of the Nutch community. Lewis [0] http://nutch.apache.org/#05+October+2012+-+Apache+Nutch+v2.1+Released -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available
Thanks for your VOTE! Cheers, Chris On Oct 4, 2012, at 1:08 AM, j.sulli...@thomsonreuters.com j.sulli...@thomsonreuters.com wrote: A bit late but my two cents. I have done a couple of installs on Ubuntu 12.04 using MySQL for the backend and have noticed a couple of the improvements and no regressions so +1 for releasing from my end. -Original Message- From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] Sent: Monday, October 01, 2012 9:18 PM To: dev@nutch.apache.org; u...@nutch.apache.org Subject: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available Hi All, Anyone else for this VOTE? Sorry to be a pest! Thanks Lewis On Fri, Sep 21, 2012 at 4:07 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Everyone, A candidate for Apache Nutch 2.1 is available at: http://people.apache.org/~lewismc/apache-nutch-2.1 The release candidate is a src.zip and src.tar.gz ONLY archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.1/ We release Nutch 2.1 in this fashion due to the inclusion of Apache Gora and the likelihood that users will regularly recompile the code to suit dynamic requirements. Further, a staged Maven repository of the 2.1 jar, sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-020/ Please vote on releasing this package as Apache Nutch 2.1. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 2.1 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Kind Regards, Lewis P.S. Here's my +1. -- Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Status of 2.1 release
Take care dude! I'll give trunk a shot... Cheers, Chris On Sep 21, 2012, at 7:34 AM, Lewis John Mcgibbney wrote: Hi All, Basically thank god it was brought to our attention that giora-cassandra 0.2.1 is buggy and needs some work before it is ready to be integrated into a stable Nutch 2.x release. For the time being I've committed a revert for gora-cassandra v0.2 to the 2.1 branch and to 2.x branch (the latter of which can continue development regardless). I'll run the RC for 2.1 just now. @Markus, How are your thoughts on trunk? @Chris, Depending on outcome of discussion on trunk, do you want to spin an RC? Have a great weekend everyone. Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: svn commit: r1387363 - in /nutch/branches/2.1: CHANGES.txt build.xml pom.xml
Lewis you beat me to it, you ROCK! Cheers, Chris On Sep 18, 2012, at 5:11 PM, lewi...@apache.org lewi...@apache.org wrote: Author: lewismc Date: Tue Sep 18 21:11:06 2012 New Revision: 1387363 URL: http://svn.apache.org/viewvc?rev=1387363view=rev Log: forward port of NUTCH-1415 Modified: nutch/branches/2.1/CHANGES.txt nutch/branches/2.1/build.xml nutch/branches/2.1/pom.xml Modified: nutch/branches/2.1/CHANGES.txt URL: http://svn.apache.org/viewvc/nutch/branches/2.1/CHANGES.txt?rev=1387363r1=1387362r2=1387363view=diff == --- nutch/branches/2.1/CHANGES.txt (original) +++ nutch/branches/2.1/CHANGES.txt Tue Sep 18 21:11:06 2012 @@ -3,6 +3,8 @@ Nutch Change Log Release 2.1 (19/09/2012) ddmm Full Jira Report - https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680version=12321040 +* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x (snagel) + * NUTCH-1432 property storage.schema does not work anymore, should be storage.schema.webpage and storage.schema.host (lewismc) * NUTCH-1468 Redirects that are external links not adhering to db.ignore.external.links (Matt MacDonald via ferdy) Modified: nutch/branches/2.1/build.xml URL: http://svn.apache.org/viewvc/nutch/branches/2.1/build.xml?rev=1387363r1=1387362r2=1387363view=diff == --- nutch/branches/2.1/build.xml (original) +++ nutch/branches/2.1/build.xml Tue Sep 18 21:11:06 2012 @@ -700,14 +700,13 @@ !-- == -- target name=tar-src depends=package-src description=-- generate src.tar.gz distribution package tar compression=gzip longfile=gnu - destfile=${src.dist.version.dir}.tar.gz basedir=${src.dist.version.dir} - tarfileset dir=${dist.dir} mode=664 - exclude name=${src.dist.version.dir}/bin/* / - exclude name=${src.dist.version.dir}/runtime/* / -include name=${src.dist.version.dir}/** / + destfile=${src.dist.version.dir}.tar.gz + tarfileset dir=${src.dist.version.dir} mode=664 prefix=${final.name} +exclude name=src/bin/* / +include name=** / /tarfileset - tarfileset dir=${dist.dir} mode=755 -include name=${src.dist.version.dir}/bin/* / + tarfileset dir=${src.dist.version.dir} mode=755 prefix=${final.name} +include name=src/bin/* / /tarfileset /tar /target @@ -717,13 +716,13 @@ !-- == -- target name=tar-bin depends=package-bin description=-- generate bin.tar.gz distribution package tar compression=gzip longfile=gnu - destfile=${bin.dist.version.dir}.tar.gz basedir=${bin.dist.version.dir} - tarfileset dir=${dist.dir} mode=664 - exclude name=${bin.dist.version.dir}/bin/* / -include name=${bin.dist.version.dir}/** / + destfile=${bin.dist.version.dir}.tar.gz + tarfileset dir=${bin.dist.version.dir} mode=664 prefix=${final.name} +exclude name=bin/* / +include name=** / /tarfileset - tarfileset dir=${dist.dir} mode=755 -include name=${bin.dist.version.dir}/bin/* / + tarfileset dir=${bin.dist.version.dir} mode=755 prefix=${final.name} +include name=bin/* / /tarfileset /tar /target @@ -733,14 +732,13 @@ !-- == -- target name=zip-src depends=package-src description=-- generate src.zip distribution package zip compress=true casesensitive=yes - destfile=${src.dist.version.dir}.zip basedir=${src.dist.version.dir} - zipfileset dir=${dist.dir} filemode=664 - exclude name=${src.dist.version.dir}/bin/* / - exclude name=${src.dist.version.dir}/runtime/* / - include name=${src.dist.version.dir}/** / + destfile=${src.dist.version.dir}.zip + zipfileset dir=${src.dist.version.dir} filemode=664 prefix=${final.name} + exclude name=src/bin/* / + include name=** / /zipfileset - zipfileset dir=${dist.dir} filemode=755 - include name=${src.dist.version.dir}/bin/* / + zipfileset dir=${src.dist.version.dir} filemode=755 prefix=${final.name} + include name=src/bin/* / /zipfileset /zip /target @@ -750,13 +748,13 @@ !-- == -- target name=zip-bin depends=package-bin description=-- generate bin.zip distribution package zip compress=true casesensitive=yes - destfile=${bin.dist.version.dir}.zip basedir=${bin.dist.version.dir} - zipfileset dir=${dist.dir} filemode=664 - exclude name=${bin.dist.version.dir}/bin/* / - include name=${bin.dist.version.dir}/** / + destfile=${bin.dist.version.dir}.zip +
Re: Nutch 2.1 Release???
+1 I'd be happy to help! Cheers, Chris On Sep 15, 2012, at 9:24 AM, Lewis John Mcgibbney wrote: Hi Everyone, Without me slevering on, this suggestion speaks for itself. We have resolved 32 issues, including pulling in upgrades on the Gora dependency. It would be nice to push these improvements in a stable release to the Nutch community. Any thoughts. Best Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Nutch 2.1 Release???
Awesome Lewis. I'll try and roll a 2.1 RC by mid next week if no one beats me to it. Cheers, Chris On Sep 15, 2012, at 2:18 PM, Lewis John Mcgibbney wrote: Actually when I look at it now we're at nearly 30 tickets for trunk as well. Up to you guys @Chris Nice one. Fire in my friend. If you can do RM role it would be great. Best Lewis On Sat, Sep 15, 2012 at 6:07 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: +1 I'd be happy to help! Cheers, Chris On Sep 15, 2012, at 9:24 AM, Lewis John Mcgibbney wrote: Hi Everyone, Without me slevering on, this suggestion speaks for itself. We have resolved 32 issues, including pulling in upgrades on the Gora dependency. It would be nice to push these improvements in a stable release to the Nutch community. Any thoughts. Best Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Nutch talk accepted at ApacheCon Europe
Great to hear, Julien, nice! Cheers, Chris On Sep 13, 2012, at 3:39 AM, Julien Nioche wrote: Hi, I'd just like to mention that I will be giving a talk about Nutch at the Apache Conference Europe (Sinsheim, Germany 5–8 November 2012). The Apache Conference should be a good opportunity for the Nutch community (committers as well as users) to get together and I hope to see many of you there. Early Birds tickets are available until the 1st October. The talk itself will be an overview of Nutch and will be part of the Lucene/SOLR Ecosystem track. If you have an interesting use case using Nutch or have something in particular that you'd like me to talk about, please do get in touch and I'll try to blend that in the presentation. I look foward to seeing you in Sinsheim. Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Happy 10th Birthday Nutch!
Awesome, Jerome! I need to get a Nutch hat! Cheers, Chris On Aug 21, 2012, at 3:59 PM, Markus Jelsma wrote: Hehehe, nice! Cheers -Original message- From:Jérôme Charron jerome.char...@gmail.com Sent: Tue 21-Aug-2012 23:58 To: dev@nutch.apache.org Cc: u...@nutch.apache.org Subject: Re: Happy 10th Birthday Nutch! Oups! Sorry... These one should be ok : http://statigr.am/p/254365383887354210_4414285 http://statigr.am/p/254365383887354210_4414285 ;) On Tue, Aug 21, 2012 at 11:40 PM, Markus Jelsma markus.jel...@openindex.io mailto:markus.jel...@openindex.io wrote: Hi Jérôme, It asks for a login. Cheers -Original message- From:Jérôme Charron jerome.char...@gmail.com mailto:jerome.char...@gmail.com Sent: Tue 21-Aug-2012 22:22 To: u...@nutch.apache.org mailto:u...@nutch.apache.org Cc: dev@nutch.apache.org mailto:dev@nutch.apache.org dev@nutch.apache.org mailto:dev@nutch.apache.org Subject: Re: Happy 10th Birthday Nutch! My small contribution to Nutch birthday... http://statigr.am/viewer.php#/detail/254365383887354210_4414285 http://statigr.am/viewer.php#/detail/254365383887354210_4414285 http://statigr.am/viewer.php#/detail/254365383887354210_4414285 http://statigr.am/viewer.php#/detail/254365383887354210_4414285 Cheers, Jérôme On Fri, Aug 10, 2012 at 1:44 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov mailto:chris.a.mattm...@jpl.nasa.gov mailto:chris.a.mattm...@jpl.nasa.gov mailto:chris.a.mattm...@jpl.nasa.gov wrote: Super cool. Proud to have been around since 2005 (7 of them!) :) Cheers, Chris On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote: Nice one Julien I'm going to update the site with this as its a pretty huge milestone @Apache and a lot of projects and current developers owe a lot to the great work done by all you guys over the years. Thank you for sharing. Lewis On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche lists.digitalpeb...@gmail.com mailto:lists.digitalpeb...@gmail.com mailto:lists.digitalpeb...@gmail.com mailto:lists.digitalpeb...@gmail.com wrote: Doug Cutting on twitter : https://twitter.com/cutting/status/233415059798372353 *RT @StefanGroschupf: Happy 10th birthday#Nutch! Registered at sourceforce august 2002. Turned out to be quite a game changer. #Hadoop * Happy birthday Nutch and thanks to all contributors past and present! Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://digitalpebble.blogspot.com/ http://digitalpebble.blogspot.com/ http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://www.digitalpebble.com http://www.digitalpebble.com http://www.digitalpebble.com http://twitter.com/digitalpebble http://twitter.com/digitalpebble http://twitter.com/digitalpebble http://twitter.com/digitalpebble -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov mailto:chris.a.mattm...@nasa.gov mailto:chris.a.mattm...@nasa.gov mailto:chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ http://sunset.usc.edu/~mattmann/ http://sunset.usc.edu/~mattmann/ http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- @jcharron http://www.twitter.com/jcharron http://www.twitter.com/jcharron http://motre.ch/ http://motre.ch/ http://motre.ch/ http://motre.ch/ http://jcharron.posterous.com/ http://jcharron.posterous.com/ http://jcharron.posterous.com/ http://jcharron.posterous.com/ http://www.shopreflex.fr/ http://www.shopreflex.fr/ http://www.shopreflex.fr/ http://www.shopreflex.fr/ http://www.staragora.com/ http://www.staragora.com/ http://www.staragora.com/ http://www.staragora.com/ http://feeds.feedburner.com/Bligblagblog.1.gif http://feeds.feedburner.com/Bligblagblog.1.gif http://feeds.feedburner.com/~r/Bligblagblog/~6/1 http://feeds.feedburner.com/~r/Bligblagblog/~6/1 Hi -- @jcharron http://www.twitter.com/jcharron http://motre.ch/ http://motre.ch/ http://jcharron.posterous.com/ http://jcharron.posterous.com/ http://www.shopreflex.fr/ http://www.shopreflex.fr/ http://www.staragora.com/ http://www.staragora.com/ http://feeds.feedburner.com/Bligblagblog.1.gif http://feeds.feedburner.com/~r/Bligblagblog/~6/1 ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm
Fwd: Call for Papers for ApacheCon Europe 2012 now open!
FYI... Begin forwarded message: From: Nick Burch nick.bu...@alfresco.com Date: July 19, 2012 1:14:57 PM CDT To: committ...@apache.org Subject: Call for Papers for ApacheCon Europe 2012 now open! Reply-To: apachecon-disc...@apache.org Hi All We're pleased to announce that the Call for Papers for ApacheCon Europe 2012 is finally open! (For those who don't already know, ApacheCon Europe will be taking place between the 5th and the 9th of November this year, in Sinsheim, Germany.) If you'd like to submit a talk proposal, please visit the conference website at http://www.apachecon.eu/ and sign up for a new account. Once you've signed up, use your dashboard to enter your speaker bio, then submit your talk proposal(s). There's more information on the CFP page on the conference website. We welcome talk proposals from all projects, from right across the bredth of projects at the foundation! To make things easier for talk selection and scheduling, we'd ask that you tag your proposal with the track that it most closely fits within. The details of the tracks, and what projects they expect to cover, are available at http://www.apachecon.eu/tracks/. (If your project/group of projects was intending to submit a track, and missed the deadline, then please get in touch with us on apachecon-disc...@apache.org straight away, so we can work out if it's possible to squeeze you in...) The CFP will close on Friday 3rd August, so you've a little over weeks to send in your talk proposal. Don't put it off! We'll look forward to seeing some great ones shortly! Thanks Nick (On behalf of the Conferences committee) ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation
Hi Markus, Great question. I am CC'ing Ruth Duerr and Ian Truslove and Ruth Duerr at NSIDC -- maybe they can provide more information? Ruth, ian, please consider subcribing to dev@nutch.apache.org and/or u...@nutch.apache.org by sending blank emails to: dev-subscr...@nutch.apache.org user-subscr...@nutch.apache.org To follow along in the conversation. Thanks all! Cheers, Chris On Jul 17, 2012, at 5:27 PM, Markus Jelsma wrote: Cool! What are they exactly doing with Apache Nutch? And, more interesting, what non-standard stuff do they use? Cheers -Original message- From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov Sent: Tue 17-Jul-2012 21:29 To: dev@nutch.apache.org Subject: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation Hey Folks, Ruth Duerr is presenting at today's ESIP Federation and Discovery Hackathon: http://commons.esipfed.org/node/424 The U.S. National Snow and Ice Data Center (NSIDC) is deploying Apache Nutch and Solr to support discovery of datasets (called casting). Really interesting stuff, and worth contacting Ruth and NSIDC if you're interested. I'm highly suggesting to to the NSIDC folks to try and contribute any updates or plugins they are making to the software upstream here to the ASF. Thanks! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [ANNOUNCEMENT] Apache Nutch v1.5.1 Released
Congrats, all! Cheers, Chris On Jul 10, 2012, at 8:03 AM, Julien Nioche wrote: Great Job Lewis! Thanks a lot On 10 July 2012 15:40, lewis john mcgibbney lewi...@apache.org wrote: Good Afternoon Everyone, The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v1.5.1. This release is a maintenance release of the popular mainstream 1.5.X series of the Apache Nutch web search software project. Please see the list of changes http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt made in this version for a full breakdown.. A full PMC release statement can be found below http://nutch.apache.org/#10+July+2012+-+Apache+Nutch+v1.5.1+Released Nutch v1.5.1 is available in source and binary (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/1.5.1 When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: http://www.apache.org/dist/nutch/KEYS For more information on Apache Nutch, visit the project home page: http://nutch.apache.org Thank you very much Lewis John McGibbney (on behalf of the Apache Nutch community) -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [PROPOSAL] Rename branch nutchgora into 2.x
+1 from me. Cheers, Chris On Jul 9, 2012, at 3:37 AM, Julien Nioche wrote: Guys, Now that we've released 2.0, wouldn't it be better to rename the 'nutchgora' branch into something like 'branch-2.x'? Any thoughts on this? Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 1.5.1 RC#3
Hi Lewis, +1 from me! SIGS check out: [chipotle:~/tmp/nutch-1.5.1] mattmann% $HOME/bin/verify_md5_checksums md5sum: stat '*.bz2': No such file or directory apache-nutch-1.5.1-bin.tar.gz: OK apache-nutch-1.5.1-src.tar.gz: OK apache-nutch-1.5.1-bin.zip: OK apache-nutch-1.5.1-src.zip: OK checksums check out: [chipotle:~/tmp/nutch-1.5.1] mattmann% $HOME/bin/verify_gpg_sigs Verifying Signature for file apache-nutch-1.5.1-bin.tar.gz.asc gpg: Signature made Tue Jul 3 11:31:31 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 Verifying Signature for file apache-nutch-1.5.1-bin.zip.asc gpg: Signature made Tue Jul 3 11:32:16 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 Verifying Signature for file apache-nutch-1.5.1-src.tar.gz.asc gpg: Signature made Tue Jul 3 11:31:58 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 Verifying Signature for file apache-nutch-1.5.1-src.zip.asc gpg: Signature made Tue Jul 3 11:32:33 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 [chipotle:~/tmp/nutch-1.5.1] mattmann% Builds fine! runtime: [mkdir] Created dir: /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime [mkdir] Created dir: /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local [mkdir] Created dir: /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/deploy [copy] Copying 1 file to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/deploy [copy] Copying 1 file to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/deploy/bin [copy] Copying 1 file to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/lib [copy] Copying 1 file to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/lib/native [copy] Copying 21 files to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/conf [copy] Copying 1 file to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/bin [copy] Copying 48 files to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/lib [copy] Copying 123 files to /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/plugins [copy] Copied 2 empty directories to 2 empty directories under /Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/test BUILD SUCCESSFUL Total time: 1 minute 28 seconds [chipotle:~/tmp/nutch-1.5.1/apache-nutch-1.5.1] mattmann% Cheers, Chris On Jul 3, 2012, at 11:42 AM, Lewis John Mcgibbney wrote: Hi Everyone, A candidate for the Apache Nutch 1.5.1 RC#3 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5.1-rc3 The release candidate is a src.zip, src.tar.gz, bin-zip and bin-tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5.1-rc3/ This Release Candidate (and subsequent release) is a bug fix of the recently released Apache Nutch 1.5 and CHANGES.txt can be seen below http://people.apache.org/~lewismc/apache-nutch-1.5.1-rc3/CHANGES.txt Further, a staged Maven repository of the 1.5.1 jar, sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-023 Please vote on releasing this package as Apache Nutch 1.5.1. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.5.1 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Kind Regards, Lewis P.S. Here's my +1. -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
Thanks for your hard work here, Lewis! Cheers, Chris On Jul 7, 2012, at 3:44 PM, Lewis John Mcgibbney wrote: Hi Julien, Believe it or not I've just spent around 45 mins waiting on committing the site... broadband in Paris is nothing short of utterly abysmal to say the very best. Please see my comments below On Sat, Jul 7, 2012 at 9:58 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Looks like you've released 2.0. If so can you make an announcement to the mailing list + update the website. Done It's not really something that should go unnoticed. I know about the press release but surely it does not mean that NOTHING should be said about the release then. Quite right. I see a 1.5 on a mirror (http://apache.mirrors.timporter.net/nutch/) with the same release date as 2.0. Shouldn't it be 1.5.1? Can you please clarify? This relates to the message on private@ the other night and concerns the rearranging (cleaning up) of the dist/nutch directory on people.apache.org to accommodate the additional 2.0 directory. The 1.5 artifacts are identical to the ones we VOTE'd on, same goes with 2.0's. The mirror will confusingly display that these have been mirrored at the same time, which of course is the case, but they were certainly not released in parallel. OK so now concerning 1.5.1, we have still to VOTE on the rc#3 so I've gently put out a ping for this on dev@ and user@ I hope this answers all and I can only really apologise and say thanks to everyone who has made time and effort to VOTE over the last few months. There has been a very encouraging amount of work done within the dev community and it's been very rewarding to see us getting Nutch moving at a really steady pace. All for now Have a great weekend Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
OK, +1 from me :) ant runtime works: job: [jar] Building jar: /Users/mattmann/tmp/nutch2/build/apache-nutch-2.0.job runtime: [mkdir] Created dir: /Users/mattmann/tmp/nutch2/runtime [mkdir] Created dir: /Users/mattmann/tmp/nutch2/runtime/local [mkdir] Created dir: /Users/mattmann/tmp/nutch2/runtime/deploy [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/deploy [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/deploy/bin [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/local/lib [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/local/lib/native [copy] Copying 25 files to /Users/mattmann/tmp/nutch2/runtime/local/conf [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/local/bin [copy] Copying 84 files to /Users/mattmann/tmp/nutch2/runtime/local/lib [copy] Copying 97 files to /Users/mattmann/tmp/nutch2/runtime/local/plugins [copy] Copied 2 empty directories to 2 empty directories under /Users/mattmann/tmp/nutch2/runtime/local/test BUILD SUCCESSFUL Total time: 3 minutes 24 seconds [chipotle:~/tmp/nutch2] mattmann% Good enough for me! Cheers, Chris On Jul 3, 2012, at 11:24 AM, Mattmann, Chris A (388J) wrote: Hey Lewis, I was running ant test -- sorry -- will try ant runtime now (any idea what's up with test?) Cheers, Chris On Jul 3, 2012, at 11:11 AM, Lewis John Mcgibbney wrote: What commands are you using? I just grabbed the src-tar.gz from my local area with wget extracted it to ~/Desktop rm -r ~/.ivy2 cd ~/Desktop/$nutch_folder ant runtime runtime: [mkdir] Created dir: /home/lewismc/Desktop/nutch/runtime [mkdir] Created dir: /home/lewismc/Desktop/nutch/runtime/local [mkdir] Created dir: /home/lewismc/Desktop/nutch/runtime/deploy [copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/deploy [copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/deploy/bin [copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/local/lib [copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/local/lib/native [copy] Copying 25 files to /home/lewismc/Desktop/nutch/runtime/local/conf [copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/local/bin [copy] Copying 84 files to /home/lewismc/Desktop/nutch/runtime/local/lib [copy] Copying 97 files to /home/lewismc/Desktop/nutch/runtime/local/plugins [copy] Copied 2 empty directories to 2 empty directories under /home/lewismc/Desktop/nutch/runtime/local/test BUILD SUCCESSFUL Total time: 2 minutes 40 seconds This is every dependency being down loaded to ivy cache Lewis On Tue, Jul 3, 2012 at 5:12 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Julien, I ran this command: rm -rf /Users/mattmann/.ivy2/ But it still failed with the below messages: [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] [FAILED ] org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: invalid sha1: expected=d7d8610ba4aad504475e568fd3badb412a0beae9 computed=f8369ff1a71e1a8febbb8e9c3a54ffbb08048f19 (1598ms) [ivy:resolve] [FAILED ] org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/org.apache.hadoop/hadoop-core/1.0.3/jars/hadoop-core.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.0.3/hadoop-core-1.0.3.jar [ivy:resolve] [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar: invalid sha1: expected=8231a3ff71ba5889f9e2d01ce13503cbdd4038e9 computed=81a7e8d5d1802c7acbc8f8f81d3e4680a4b2441c (523ms) [ivy:resolve] [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/org.hsqldb/hsqldb/2.2.8/jars/hsqldb.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/org/hsqldb/hsqldb/2.2.8/hsqldb-2.2.8.jar [ivy:resolve] [FAILED ] org.apache.lucene#lucene-core;3.4.0!lucene-core.jar: invalid sha1: expected=4426bf0764ec5fa634abca236b469d2519c74f65 computed=112d2454390cba8c7c35b34b8f7a821c6cec3f73 (775ms) [ivy:resolve] [FAILED ] org.apache.lucene#lucene-core;3.4.0!lucene-core.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/org.apache.lucene/lucene-core/3.4.0/jars/lucene-core.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/3.4.0/lucene-core-3.4.0.jar [ivy:resolve] [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar: invalid sha1: expected=06362db7a2556bb58a04e991029196e2aad632d4 computed=d9862ffbc6cd6241a03c06b5911bf22a079d2cda (1544ms) [ivy:resolve] [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar: (0ms) [ivy:resolve
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
Thanks Lewis, here are mine: [chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% ant -version Apache Ant(TM) version 1.8.2 compiled on May 17 2012 [chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% java -version java version 1.6.0_33 Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720) Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode) [chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% [chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% uname -a Darwin chipotle.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:32:41 PDT 2011; root:xnu-1504.15.3~1/RELEASE_X86_64 x86_64 [chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% I'll try one more time today with a fresh build and see where I get :/ Thanks! Cheers, Chris On Jul 4, 2012, at 3:27 AM, Lewis John Mcgibbney wrote: Hi Chris, lewismc@lewismc-HP-Mini-110-3100:~$ java -showversion java version 1.6.0_25 Java(TM) SE Runtime Environment (build 1.6.0_25-b06) Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing) lewismc@lewismc-HP-Mini-110-3100:~$ ant -v Apache Ant(TM) version 1.8.2 compiled on August 19 2011 Trying the default build file: build.xml Buildfile: build.xml does not exist! Build failed Lewis On Wed, Jul 4, 2012 at 7:18 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Lewis, Odd, I don't get that. I'll try futzing around again with it tomorrow -- what system are you on? What is your Ant version and Java version? Cheers, Chris On Jul 3, 2012, at 11:49 AM, Lewis John Mcgibbney wrote: Hi Chris, I've no clue whats going on locally with you... em I just did ant test and I get copy-generated-lib: test: [echo] Testing plugin: subcollection [junit] Running org.apache.nutch.collection.TestSubcollection [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.305 sec test: BUILD SUCCESSFUL Total time: 12 minutes 28 seconds On Tue, Jul 3, 2012 at 7:24 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Lewis, I was running ant test -- sorry -- will try ant runtime now (any idea what's up with test?) Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
Hey Julien, On Jul 3, 2012, at 7:49 AM, Julien Nioche wrote: [..snip..] OK, so basically signatures and checksums are fine +1, yep they are great. Tried to build and test and got this: [ivy:resolve] :: [..snip...] Try deleting your entire .ivy dir and re-run ant. Just did that on my machine and Nutch compiles fine OK will do now. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
Hey Julien, I ran this command: rm -rf /Users/mattmann/.ivy2/ But it still failed with the below messages: [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] [FAILED ] org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: invalid sha1: expected=d7d8610ba4aad504475e568fd3badb412a0beae9 computed=f8369ff1a71e1a8febbb8e9c3a54ffbb08048f19 (1598ms) [ivy:resolve] [FAILED ] org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/org.apache.hadoop/hadoop-core/1.0.3/jars/hadoop-core.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.0.3/hadoop-core-1.0.3.jar [ivy:resolve] [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar: invalid sha1: expected=8231a3ff71ba5889f9e2d01ce13503cbdd4038e9 computed=81a7e8d5d1802c7acbc8f8f81d3e4680a4b2441c (523ms) [ivy:resolve] [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/org.hsqldb/hsqldb/2.2.8/jars/hsqldb.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/org/hsqldb/hsqldb/2.2.8/hsqldb-2.2.8.jar [ivy:resolve] [FAILED ] org.apache.lucene#lucene-core;3.4.0!lucene-core.jar: invalid sha1: expected=4426bf0764ec5fa634abca236b469d2519c74f65 computed=112d2454390cba8c7c35b34b8f7a821c6cec3f73 (775ms) [ivy:resolve] [FAILED ] org.apache.lucene#lucene-core;3.4.0!lucene-core.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/org.apache.lucene/lucene-core/3.4.0/jars/lucene-core.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/3.4.0/lucene-core-3.4.0.jar [ivy:resolve] [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar: invalid sha1: expected=06362db7a2556bb58a04e991029196e2aad632d4 computed=d9862ffbc6cd6241a03c06b5911bf22a079d2cda (1544ms) [ivy:resolve] [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/com.ibm.icu/icu4j/4.0.1/jars/icu4j.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/com/ibm/icu/icu4j/4.0.1/icu4j-4.0.1.jar [ivy:resolve] [FAILED ] xerces#xercesImpl;2.9.1!xercesImpl.jar: invalid sha1: expected=7bc7e49ddfe4fb5f193ed37ecc96c12292c8ceb6 computed=88931c057b31ba3ff7ac96e53817b25ff355c4a1 (393ms) [ivy:resolve] [FAILED ] xerces#xercesImpl;2.9.1!xercesImpl.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/xerces/xercesImpl/2.9.1/jars/xercesImpl.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar [ivy:resolve] [FAILED ] com.google.guava#guava;11.0.2!guava.jar: invalid sha1: expected=35a3c69e19d72743cac83778aecbee68680f63eb computed=1e8507869d7db99f60f8d949bc5ba2b5410ce2db (355ms) [ivy:resolve] [FAILED ] com.google.guava#guava;11.0.2!guava.jar: (0ms) [ivy:resolve] local: tried [ivy:resolve] /Users/mattmann/.ivy2/local/com.google.guava/guava/11.0.2/jars/guava.jar [ivy:resolve] maven2: tried [ivy:resolve] http://repo1.maven.org/maven2/com/google/guava/guava/11.0.2/guava-11.0.2.jar [ivy:resolve] :: [ivy:resolve] :: FAILED DOWNLOADS:: [ivy:resolve] :: ^ see resolution messages for details ^ :: [ivy:resolve] :: [ivy:resolve] :: org.apache.lucene#lucene-core;3.4.0!lucene-core.jar [ivy:resolve] :: org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar [ivy:resolve] :: org.hsqldb#hsqldb;2.2.8!hsqldb.jar [ivy:resolve] :: com.ibm.icu#icu4j;4.0.1!icu4j.jar [ivy:resolve] :: xerces#xercesImpl;2.9.1!xercesImpl.jar [ivy:resolve] :: com.google.guava#guava;11.0.2!guava.jar [ivy:resolve] :: [ivy:resolve] [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS BUILD FAILED /Users/mattmann/tmp/nutch2/apache-nutch-2.0/build.xml:431: impossible to resolve dependencies: resolve failed - see output for details Total time: 1 minute 56 seconds [chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% Any ideas? Cheers, Chris On Jul 3, 2012, at 7:49 AM, Julien Nioche wrote: Hi Chris [chipotle:~/tmp/nutch2] mattmann% $HOME/bin/verify_gpg_sigs Verifying Signature for file apache-nutch-2.0-src.tar.gz.asc gpg: Signature made Mon Jun 25 09:28:36 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
I'll try to scope this by tomorrow...thanks Lewis. Cheers, Chris On Jul 2, 2012, at 10:49 AM, Lewis John Mcgibbney wrote: Anyone else for this RC? I've been slighyl distracted with a number of things recently and only just getting round to following this one up so apologies about that. Best Lewis On Wed, Jun 27, 2012 at 10:23 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote: +1 Crawling with HBaseStore works from injecting to indexing. Great work Lewis. On Mon, Jun 25, 2012 at 6:32 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Everyone, A candidate for the Apache Nutch 2.0 RC3 is available at: http://people.apache.org/~lewismc/apache-nutch-2.0rc3 The release candidate is a src.zip and src.tar.gz ONLY archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc3 We release Nutch 2.0 in this fashion due to the inclusion of Apache Gora and the likelihood that users will regularly recompile the code to suit dynamic requirements. Further, a staged Maven repository of the 2.0 jar, sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-275 Please vote on releasing this package as Apache Nutch 2.0. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 2.0 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Kind Regards, Lewis P.S. Here's my +1. -- Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: 1.5.1 release
Hey Guys, (sorry for the top post) There's no reason to freeze trunk during releases. In fact, during the RC, once the branch (or tag for that matter) is created, trunk can continue on, no need to stop. Heck, we can always just tag or branch from a specific revision too so it's not really a biggie. Cheers, Chris On Jun 21, 2012, at 2:43 PM, Lewis John Mcgibbney wrote: Hi Markus, On Thu, Jun 21, 2012 at 10:02 PM, Markus Jelsma markus.jel...@openindex.io wrote: It's still not clear to me what 1.5.1 is going to look like. Will it be current trunk incl. the script bugfix or just 1.5 plus the bugfix? I would vote for the latter as it makes more sense for a bugfix release. I am easy on this one... I suggest we do it the normal way. Lets let folks chime in and see where we are on Saturday. It looks like 2.0 is going to be shifted with the new commits so do we wish to try and keep at least the minimal consistency between both releases? There is another debate behind this, in my opinion, about freezing trunk prior to releases and thus stopping active development. This has been an issue in the past. Is this something for another thread? Yeah I must also agree that we should branch trunk, keep the branch for the release then run the RC's from the branch regardless of how trunk comes on. My only suggestion for backporting patches from trunk to the release candidate branch is if it is a pretty critical bug fix as we've now discovered in 1.5! Additionally there is another note here as well w.r.t release managers. We've relied on the excellent work done by Chris (and others) as RM's for a number of releases but during the release period (on occasion, more recently) as you mention trunk has frozen temporarily. Of course it is the aim to prevent this happening should the RC not progress as we would all like. Hopefully we are moving towards a more adaptable and sustainable RM process within Nutch where the RM responsibility can be undertaken/overseen by more than one individual over the entire duration of the process. I think (and hope) we can consider the slight struggle we've had for 1.5 as an exception. As far back as I can remember RC's have always been efficient and smooth and I personally am committed to ensuring we return to the high precedent set by previous RM's. We've also seen an alternative (and in my opinion an improved) publication of Nutch atrifacts for 1.5. For reference I direct you to Julien's commentary [0] on this topic. Due to this, we've had to run additional RC's which has taken a bit longer than usual and I must personally apologise to everyone for at least one RC cock up which could have been avoided had I been more familiar with the Nutch specific release process. I think I'm ranting here so I'm going to give it a bye now. Lewis [0] http://digitalpebble.blogspot.co.uk/2012/06/whats-new-in-nutch-15.html ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Nutch 1.5 Deploy Mode Doesn't Work like Nutch 1.4 Deploy Mode
+1! Cheers, Chris On Jun 19, 2012, at 2:26 AM, Julien Nioche wrote: Quite annoying that we did not spot this before releasing. What about a 1.5.1 soonish with this fix + couple smallish improvements e.g. upgrade to Hadoop 1.0.3? J. -- Forwarded message -- From: Julien Nioche lists.digitalpeb...@gmail.com Date: 19 June 2012 08:56 Subject: Re: Nutch 1.5 Deploy Mode Doesn't Work like Nutch 1.4 Deploy Mode To: u...@nutch.apache.org Alternatively modify the bin/nutch script to make it more robust # NUTCH_JOB if [ -f ${NUTCH_HOME}/*nutch*.job ]; then local=false for f in $NUTCH_HOME/*nutch*.job; do NUTCH_JOB=$f; done fi On 19 June 2012 00:09, sidbatra siddharthaba...@gmail.com wrote: This turns out to be a genuine bug with an easy fix. build.xml is configured to generate a job file titled apache-nutch-1.5.job but the deploy binary is still looking for nutch-1.5.job Renaming apache-nutch-1.5.job to nutch-1.5.job fixes this bug in deploy mode. -- View this message in context: http://lucene.472066.n3.nabble.com/Nutch-1-5-Deploy-Mode-Doesn-t-Work-like-Nutch-1-4-Deploy-Mode-tp3990169p3990196.html Sent from the Nutch - User mailing list archive at Nabble.com. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: VOTE Apache Nutch 2.0 RC1
OK you are just making us all look bad now Juls ;) Super fast! Cheers, Chris On Jun 15, 2012, at 2:54 AM, Julien Nioche wrote: see https://issues.apache.org/jira/browse/NUTCH-1396 On 15 June 2012 10:43, Julien Nioche lists.digitalpeb...@gmail.com wrote: Before you do, could you check that NutchGora passes ant test successfully. I just tried and got an error related to the parse-tika tests. Am about to open a JIRA to update to the latest version of Tika for NutchGora which should fix the problem and put it at the same level as trunk J On 15 June 2012 10:01, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:ly I'll push this in an hour or so guys. Thanks for the input. Lewis On Fri, Jun 15, 2012 at 9:39 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: +1 On 15 June 2012 09:00, Ferdy Galema ferdy.gal...@kalooga.com wrote: Agree with only releasing src. On Thu, Jun 14, 2012 at 11:32 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Or just not ship a bin release at all. Src is the only thing we really VOTE on legally though bin is provided for convenience purposes. Will type more on this later... Sent from my iPhone On Jun 14, 2012, at 2:18 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Julien, Do you suggest with the binary release that we simply open up all gora-* deps and ship it with every jar available? Lewis On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: I disagree. You'd expect a binary release to work out of the box - which is not the case. Plus we'd have to spend more time explaining the workaround, answering the same questions over and over on the ML etc... Fixing this should not be a big deal (i.e. add the gore-x modules for the backends to the ivy deps file). Julien On 14 June 2012 20:27, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, I think the annoyance is probably something folks can live with as they have been waiting for an official release of 2.x for years :) My +1 to roll RC #2 with or without a solution to this and mark it as a TODO. release eary, release often :) Cheers, Chris On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote: Aye this is no good at all. Depending on which backend you wish to use with Gora, you will need to go and manually fetch the correct .jar's from maven central. Does anyone else have either solution or a workaround before I push RC2 with just src dists? Thanks Lewis On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: We only supply src distributions... Does this principle apply to Nutch 2 as well? Maybe, yes. The situation with the current binary package is uncomfortable: I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running. 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Hi Guys, Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't supply binary distributions of the code, this is because when using Gora a user may wish/require to recompile the code to accomodate config changes etc. We only supply src distributions... Does this principle apply to Nutch 2 as well? I mean, what if your using the gora-sql dependency, then you wish to switch to HBase and recompile, is this possible within the binary distribution? Best Lewis On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Ferdy The Nutch job jar is not present in the binary archive. This means distributed running of jobs is not supported. I'm not sure if this is a problem (since users can always build one themselves), merely pointing it out. The recently released 1.5 also lacks this job jar, so at least no difference there. The binary distrib corresponds to runtime/local and as such should NOT have the job file there. This is now the norm since 1.5 Will try and do some testing of the RC Thanks Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http
Re: VOTE Apache Nutch 2.0 RC1
Hey Guys, I think the annoyance is probably something folks can live with as they have been waiting for an official release of 2.x for years :) My +1 to roll RC #2 with or without a solution to this and mark it as a TODO. release eary, release often :) Cheers, Chris On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote: Aye this is no good at all. Depending on which backend you wish to use with Gora, you will need to go and manually fetch the correct .jar's from maven central. Does anyone else have either solution or a workaround before I push RC2 with just src dists? Thanks Lewis On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: We only supply src distributions... Does this principle apply to Nutch 2 as well? Maybe, yes. The situation with the current binary package is uncomfortable: I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running. 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com Hi Guys, Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't supply binary distributions of the code, this is because when using Gora a user may wish/require to recompile the code to accomodate config changes etc. We only supply src distributions... Does this principle apply to Nutch 2 as well? I mean, what if your using the gora-sql dependency, then you wish to switch to HBase and recompile, is this possible within the binary distribution? Best Lewis On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Ferdy The Nutch job jar is not present in the binary archive. This means distributed running of jobs is not supported. I'm not sure if this is a problem (since users can always build one themselves), merely pointing it out. The recently released 1.5 also lacks this job jar, so at least no difference there. The binary distrib corresponds to runtime/local and as such should NOT have the job file there. This is now the norm since 1.5 Will try and do some testing of the RC Thanks Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: VOTE Apache Nutch 2.0 RC1
Or just not ship a bin release at all. Src is the only thing we really VOTE on legally though bin is provided for convenience purposes. Will type more on this later... Sent from my iPhone On Jun 14, 2012, at 2:18 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.commailto:lewis.mcgibb...@gmail.com wrote: Hi Julien, Do you suggest with the binary release that we simply open up all gora-* deps and ship it with every jar available? Lewis On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche lists.digitalpeb...@gmail.commailto:lists.digitalpeb...@gmail.com wrote: I disagree. You'd expect a binary release to work out of the box - which is not the case. Plus we'd have to spend more time explaining the workaround, answering the same questions over and over on the ML etc... Fixing this should not be a big deal (i.e. add the gore-x modules for the backends to the ivy deps file). Julien On 14 June 2012 20:27, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.govmailto:chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, I think the annoyance is probably something folks can live with as they have been waiting for an official release of 2.x for years :) My +1 to roll RC #2 with or without a solution to this and mark it as a TODO. release eary, release often :) Cheers, Chris On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote: Aye this is no good at all. Depending on which backend you wish to use with Gora, you will need to go and manually fetch the correct .jar's from maven central. Does anyone else have either solution or a workaround before I push RC2 with just src dists? Thanks Lewis On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel wastl.na...@googlemail.commailto:wastl.na...@googlemail.com wrote: We only supply src distributions... Does this principle apply to Nutch 2 as well? Maybe, yes. The situation with the current binary package is uncomfortable: I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running. 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.commailto:lewis.mcgibb...@gmail.com Hi Guys, Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't supply binary distributions of the code, this is because when using Gora a user may wish/require to recompile the code to accomodate config changes etc. We only supply src distributions... Does this principle apply to Nutch 2 as well? I mean, what if your using the gora-sql dependency, then you wish to switch to HBase and recompile, is this possible within the binary distribution? Best Lewis On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche lists.digitalpeb...@gmail.commailto:lists.digitalpeb...@gmail.com wrote: Ferdy The Nutch job jar is not present in the binary archive. This means distributed running of jobs is not supported. I'm not sure if this is a problem (since users can always build one themselves), merely pointing it out. The recently released 1.5 also lacks this job jar, so at least no difference there. The binary distrib corresponds to runtime/local and as such should NOT have the job file there. This is now the norm since 1.5 Will try and do some testing of the RC Thanks Julien -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.govmailto:chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/http://sunset.usc.edu/%7Emattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- [http://digitalpebble.com/img/logo.gif] Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis
Re: Suitable Nutch 2.0 Project Description
+1 to the description w/o experimental too (I agree with Ferdy). You guys ROCK. Cheers, Chris On Jun 13, 2012, at 5:29 AM, Lewis John Mcgibbney wrote: Hi, Seeing as we have the ball rolling with the 2.0 RC. I thought I'd ask about a suitable project descriptor. So far on trunk we have ** Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and and array other document formats. This is merely a pot shot, but I was thinking for Nutch 2.0, something like ** Apache Nutch 2.X is an experimental branch of the Apache Nutch open source web-search software project. It builds on Apache Gora for data persistence and Apache Solr for indexing adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and and array other document formats. Although there are not many changes here I just wanted to run it by you folks...? Thanks Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: VOTE Apache Nutch 2.0 RC1
Hey Lewis, I will get to this tonight, for sure. Thanks! Cheers, Chris On Jun 12, 2012, at 1:16 PM, Lewis John Mcgibbney wrote: Hi Everyone, I appreciate that most of the core dev's are using trunk, however I would appeal to you guys to at least check out the artifacts and check sigs, tests, license headers if possible. Although this does not fully satisfy the requirements of a thoroughly reviewed RC, hopefully the thorough stuff can be undertaken by those directly using the artifacts and code in development/production. Thanks very much in advance Best Lewis On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org wrote: Good Evening Everyone, A candidate for the Apache Nutch 2.0 RC1 is available at: http://people.apache.org/~lewismc/nutch-2.0 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1 Further, a staged Maven repository of the 2.0 jar, sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-215 Please vote on releasing this package as Apache Nutch 2.0. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 2.0 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Have a great weekend, Kind Regards, Lewis P.S. Here's my +1. -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: VOTE Apache Nutch 2.0 RC1
Hey Guys, #2 is probably reason enough for a respin. Lewis if you don't have time to do it before Thursday, I could probably give it a whack. Let me know. Cheers, Chris On Jun 12, 2012, at 3:33 PM, Sebastian Nagel wrote: Hi Lewis, my first steps with 2.0 (to be continued, still struggling). Two points (I'll try to give a final vote tomorrow): 1 some guidance would be nice. README.txt points to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x (I'm using http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html) 2 the package contains your nutch-site.xml: namehttp.agent.email/name valuelewi...@apache.org/value I guess that's not intended :) Cheers, Sebastian On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote: Hi Everyone, I appreciate that most of the core dev's are using trunk, however I would appeal to you guys to at least check out the artifacts and check sigs, tests, license headers if possible. Although this does not fully satisfy the requirements of a thoroughly reviewed RC, hopefully the thorough stuff can be undertaken by those directly using the artifacts and code in development/production. Thanks very much in advance Best Lewis On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org wrote: Good Evening Everyone, A candidate for the Apache Nutch 2.0 RC1 is available at: http://people.apache.org/~lewismc/nutch-2.0 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1 Further, a staged Maven repository of the 2.0 jar, sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-215 Please vote on releasing this package as Apache Nutch 2.0. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 2.0 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Have a great weekend, Kind Regards, Lewis P.S. Here's my +1. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 1.5 release-1.5RC4
Hey Lewis, +1 from me! SIGS check out: [chipotle:nutch-dev/1.5-release/rc4] mattmann% ls apache-nutch-1.5-bin.tar.gz apache-nutch-1.5-bin.zip apache-nutch-1.5-src.tar.gz apache-nutch-1.5-src.zip apache-nutch-1.5-bin.tar.gz.asc apache-nutch-1.5-bin.zip.asc apache-nutch-1.5-src.tar.gz.asc apache-nutch-1.5-src.zip.asc apache-nutch-1.5-bin.tar.gz.md5 apache-nutch-1.5-bin.zip.md5 apache-nutch-1.5-src.tar.gz.md5 apache-nutch-1.5-src.zip.md5 apache-nutch-1.5-bin.tar.gz.sha apache-nutch-1.5-bin.zip.sha apache-nutch-1.5-src.tar.gz.sha apache-nutch-1.5-src.zip.sha [chipotle:nutch-dev/1.5-release/rc4] mattmann% $HOME/bin/verify_gpg_sigs Verifying Signature for file apache-nutch-1.5-bin.tar.gz.asc gpg: Signature made Thu May 31 13:24:55 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 Verifying Signature for file apache-nutch-1.5-bin.zip.asc gpg: Signature made Thu May 31 13:25:57 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 Verifying Signature for file apache-nutch-1.5-src.tar.gz.asc gpg: Signature made Thu May 31 13:25:34 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 Verifying Signature for file apache-nutch-1.5-src.zip.asc gpg: Signature made Thu May 31 13:26:15 2012 PDT using RSA key ID C601BCA7 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) lewi...@apache.org gpg: WARNING: This key is not certified with a trusted signature! gpg: There is no indication that the signature belongs to the owner. Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1 89C1 F45E 7970 C601 BCA7 [chipotle:nutch-dev/1.5-release/rc4] mattmann% checkums check out: [chipotle:nutch-dev/1.5-release/rc4] mattmann% $HOME/bin/verify_md5_checksums md5sum: stat '*.bz2': No such file or directory apache-nutch-1.5-bin.tar.gz: OK apache-nutch-1.5-src.tar.gz: OK apache-nutch-1.5-bin.zip: OK apache-nutch-1.5-src.zip: OK [chipotle:nutch-dev/1.5-release/rc4] mattmann% Built source. All good! runtime: [mkdir] Created dir: /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime [mkdir] Created dir: /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local [mkdir] Created dir: /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/deploy [copy] Copying 1 file to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/deploy [copy] Copying 1 file to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/deploy/bin [copy] Copying 1 file to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/lib [copy] Copying 1 file to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/lib/native [copy] Copying 21 files to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/conf [copy] Copying 1 file to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/bin [copy] Copying 48 files to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/lib [copy] Copying 123 files to /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/plugins [copy] Copied 2 empty directories to 2 empty directories under /Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/test BUILD SUCCESSFUL Total time: 2 minutes 17 seconds [chipotle:1.5-release/rc4/apache-nutch-1.5] mattmann% Minor nit: source package unzips into the current directory as opposed to prior practice of having it unzip into apache-nutch-X.Y folder. No biggie though. Thanks for stepping up and rocking the release process! Cheers, Chris On May 31, 2012, at 1:37 PM, Lewis John Mcgibbney wrote: Good Evening Everyone, A candidate for the Apache Nutch 1.5 RC4 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5-rc4/ The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in:
Re: [VOTE] Apache Nutch release 1.5 RC3
Hey Guys, Does this warrant a respin, or are you +1 Juls? Cheers, Chris On May 31, 2012, at 1:44 AM, Julien Nioche wrote: Hi Lewis, Minor nitpick : the directory /runtime is not necessary as it is built with ANT. Removing it would massively reduce the size of the archive. Could we fix it for the final release? All fine apart from this. The content of the src archive compiles fine, the pom on the Maven repo looks good. Thanks a lot Julien On 30 May 2012 21:59, lewis john mcgibbney lewi...@apache.org wrote: Good Evening Everyone, A candidate for the Apache Nutch 1.5 RC3 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5-rc3/ The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5-rc3/ Further, a staged Maven repository of the 1.5 sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-167/ Please vote on releasing this package as Apache Nutch 1.5. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.5 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Kind Regards, Lewis P.S. Here's my +1. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch release 1.5 RC3
okey dokey. I will try and take the time to review the RC today. Thanks for pushing this Lewis! Cheers, Chris On May 31, 2012, at 7:36 AM, Julien Nioche wrote: Hi, Depends on Lewis :-) Let's say I am +1 but if it is not too much hassle it would be nice to fix it J. On 31 May 2012 15:24, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, Does this warrant a respin, or are you +1 Juls? Cheers, Chris On May 31, 2012, at 1:44 AM, Julien Nioche wrote: Hi Lewis, Minor nitpick : the directory /runtime is not necessary as it is built with ANT. Removing it would massively reduce the size of the archive. Could we fix it for the final release? All fine apart from this. The content of the src archive compiles fine, the pom on the Maven repo looks good. Thanks a lot Julien On 30 May 2012 21:59, lewis john mcgibbney lewi...@apache.org wrote: Good Evening Everyone, A candidate for the Apache Nutch 1.5 RC3 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5-rc3/ The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5-rc3/ Further, a staged Maven repository of the 1.5 sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-167/ Please vote on releasing this package as Apache Nutch 1.5. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.5 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Kind Regards, Lewis P.S. Here's my +1. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch release 1.5 RC3
Hey Lewis, Actually if the bits change, in the past, I've been pushed to generate a new RC (as the SIG files, checksum, etc. will change too). My +1 for a new RC to accommodate that. If you don't have time today I would be happy to help. Cheers, Chris (who now has more time *grin*) On May 31, 2012, at 8:42 AM, Lewis John Mcgibbney wrote: If I were to change to artifacts to accommodate the removal of the runtime dir I don't think it would require a completely new RC. I am happy to generate the same sources via the tag, sign, then push them pending the VOTE result. Does this comply with release policy? Thanks Lewis On Thu, May 31, 2012 at 3:49 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: okey dokey. I will try and take the time to review the RC today. Thanks for pushing this Lewis! Cheers, Chris On May 31, 2012, at 7:36 AM, Julien Nioche wrote: Hi, Depends on Lewis :-) Let's say I am +1 but if it is not too much hassle it would be nice to fix it J. On 31 May 2012 15:24, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hey Guys, Does this warrant a respin, or are you +1 Juls? Cheers, Chris On May 31, 2012, at 1:44 AM, Julien Nioche wrote: Hi Lewis, Minor nitpick : the directory /runtime is not necessary as it is built with ANT. Removing it would massively reduce the size of the archive. Could we fix it for the final release? All fine apart from this. The content of the src archive compiles fine, the pom on the Maven repo looks good. Thanks a lot Julien On 30 May 2012 21:59, lewis john mcgibbney lewi...@apache.org wrote: Good Evening Everyone, A candidate for the Apache Nutch 1.5 RC3 is available at: http://people.apache.org/~lewismc/apache-nutch-1.5-rc3/ The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5-rc3/ Further, a staged Maven repository of the 1.5 sources.jar and javadoc.jar is available here: https://repository.apache.org/content/repositories/orgapachenutch-167/ Please vote on releasing this package as Apache Nutch 1.5. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.5 [ ] -1 Do not release this package because... Many Thanks and heres to plenty more. Kind Regards, Lewis P.S. Here's my +1. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: 1.5 RC2
+1 happy for Lewis to try I've been swamped! Sent from my iPhone On May 22, 2012, at 2:16 AM, Julien Nioche lists.digitalpeb...@gmail.commailto:lists.digitalpeb...@gmail.com wrote: Hi Lewis, I am sure that Chris will have no problem with you doing the RC2. Chris? It would be a good thing to have more than one person who knows how to do it anyway :-) Note that to generate a fresh pom.xml you need to * get maven-ant-tasks-2.1.3.jar and put it in the ivy dir * ant -lib ivy deploy The resulting pom.xml file should reflect the content of the main ivy.xml. I have committed some minor changes to the pom template in trunk, this will need to be copied to the 1.5 branch as well. We recently discussed a move to Maven, another option would be to manage the dependencies with the Maven Ant task, which would save us the hassle of having to keep the ivy.xml and pom.xml in sync. We'll see Thanks Julien -- [http://digitalpebble.com/img/logo.gif] Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: 1.5 RC2
+1 Sent from my iPhone On May 22, 2012, at 4:43 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.commailto:lewis.mcgibb...@gmail.com wrote: Hi, As I say, I am able to stick time in tonight to roll this RC, however does anyone have a problem with me rolling the 2.0 RC tonight after the 1.5RC2? I would like to get them out the way saving me time during this week if possible. Thanks Lewis On Tue, May 22, 2012 at 10:35 AM, Lewis John Mcgibbney lewis.mcgibb...@gmail.commailto:lewis.mcgibb...@gmail.com wrote: OK doke this sounds fine to me then. I will make the relevant commits to the 1.5 branch then work at it later this evening. I'll make a new thread when the stuff is sorted out and we are ready to VOTE on the new RC. Thanks Lewis On Tue, May 22, 2012 at 10:15 AM, Julien Nioche lists.digitalpeb...@gmail.commailto:lists.digitalpeb...@gmail.com wrote: Hi Lewis, I am sure that Chris will have no problem with you doing the RC2. Chris? It would be a good thing to have more than one person who knows how to do it anyway :-) Note that to generate a fresh pom.xml you need to * get maven-ant-tasks-2.1.3.jar and put it in the ivy dir * ant -lib ivy deploy The resulting pom.xml file should reflect the content of the main ivy.xml. I have committed some minor changes to the pom template in trunk, this will need to be copied to the 1.5 branch as well. We recently discussed a move to Maven, another option would be to manage the dependencies with the Maven Ant task, which would save us the hassle of having to keep the ivy.xml and pom.xml in sync. We'll see Thanks Julien -- [http://digitalpebble.com/img/logo.gif] Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble -- Lewis -- Lewis
Re: [VOTE] Apache Nutch 1.5 release rc #1
Hey Julien, On May 9, 2012, at 3:11 AM, Julien Nioche wrote: Hi Chris Any chance you could do a RC2 for the trunk soonish? We've been a bit stuck since mid April and it would be nice to move on. If not I can try and spin a RC myself but it is likely to be hilarious :-) Haha, no worries. I will try and get one going for this weekend. And I'm sure you'd do fine! :) Re-Maven : I am not against moving to Maven at all : it would make it easier to publish the artefacts + nice integration with Eclipse + most devs familiar with it etc... not sure about the best way to deal with the plugins though - treat them as modules? any thoughts on this? Yeah this is something I would definitely like to explore for 1.6+ -- I think we could just do Maven pom.xml files for each plugin and then do a multi-aggregator core project that built core first, then all the plugins post facto. I will file an issue to explore this for 1.6. Thanks! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Suitable naming for Nutchgora branch?
Great work Lewis, thanks! Cheers, Chris On Apr 25, 2012, at 4:01 PM, Lewis John Mcgibbney wrote: Hi Everyone, As you guys will have seen I've quickly polluted our dev list again (sorry!!!) with set and classify for 2.1. The open issues for 2.0 are ones which I think we could address within the 2.0 release. This is merely my opinion, based upon the assertion that they all contain patches which could be up for review. With the exception of NUTCH-879 which is pretty alarming. I'll test shortly. I'm now away to bed. Best Lewis On Wed, Apr 25, 2012 at 3:06 PM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Guys, ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: NUTCH-1129
Hey Lewis, On Apr 17, 2012, at 3:35 AM, Lewis John Mcgibbney wrote: 3) We previously discussed implementing the Any23 parser plugin as a tika wrapper, therefore it would look very similar to parse-tika? I think it would be super awesome to add the Any23 parsing functionality as a Tika parser, and potentially an extension to the MIME repository to detect microformats, etc. Then in Nutch, we could take advantage of the any23 parser with the existing tika-parser interface. Thoughts? Thanks! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 1.5 release rc #1
Hi Julien, On Apr 16, 2012, at 2:02 AM, Julien Nioche wrote: Thanks Chris, -1 the versions of the deps for hadoop, tika and possibly others are not correct in the pom.xml found in the src archive and on the mvn repository, which will be a problem for whoever tries to use the pom.xml file e.g. in Eclipse or more annoyingly declare Nutch as a dependency with Ivy / Maven. Did you regenerate the pom file from the ivy one? I didn't regenerate it -- but will try and do so for RC #2. I remember that we mentioned delivering the content of runtime/local in the binary archive instead of having the sources + runtime/deploy as well. [..snip...] I don't think it would take much time to do that, so what about doing it now? We could rename the archive into apache-nutch-1.5-local-bin maybe to make the content clearer. +1 to the above, but I think we can just have it be apache-nutch-1.5-bin -- no need to rename it to local. We can just reference this ML thread for documentation in the future. I'll include the above 2 things when I re-roll an RC #2 hopefully in the next few days. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 1.5 release rc #1
Hey Sami, Thanks. I'll fix the 4 license headers you mention below as part of RC #2. Cheers, Chris On Apr 16, 2012, at 3:02 AM, Sami Siren wrote: On Mon, Apr 16, 2012 at 8:43 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, A candidate for the Nutch 1.5 release is available at: http://people.apache.org/~mattmann/apache-nutch-1.5/rc1/ The release candidate is a zip and tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5/ And a binary build suitable for deployment. A staged Maven repository is available here: https://repository.apache.org/content/repositories/orgapachenutch-054/ Please vote on releasing this package as Apache Nutch 1.5. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.5 [ ] -1 Do not release this package because... The basics are good: md5 and sha1 checksums for apache-nutch-1.5-bin.tar.gz and apache-nutch-1.5-src.tar.gz match ant clean test completes succesfully for the source package completed a simple crawl with local mode and a small hadoop 1.0.2 cluster by using the artifacts in the binary package but it seems there are some license headers missing from source files: [rat:report] ==/home/sam/nutch/apache-nutch-1.5/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java [rat:report] ==/home/sam/nutch/apache-nutch-1.5/src/plugin/creativecommons/src/web/web.xml [rat:report] ==/home/sam/nutch/apache-nutch-1.5/src/plugin/protocol-httpclient/src/test/conf/httpclient-auth-test.xml [rat:report] ==/home/sam/nutch/apache-nutch-1.5/src/plugin/protocol-httpclient/src/test/conf/nutch-site-test.xml -1 because of missing license headers -- Sami Siren ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 1.5 release rc #1
Hey Lewis, Hmm, not sure on the MD5 and SHA -- they seem to validate for me and seemed to work at least Sami (and Markus?). Guys, any idea what's up with Lewis's verification step here? Lewis, you may try re-downloading and verifying them again, but wait until RC #2 on that. I'll fix the NOTICE file for RC #2 as you mention below and not sure why the extension was .tar.gz.tar.gz, I'll fix that too. Cheers, Chris On Apr 16, 2012, at 3:12 AM, Lewis John Mcgibbney wrote: Hi Chris, On Mon, Apr 16, 2012 at 6:43 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, A candidate for the Nutch 1.5 release is available at: http://people.apache.org/~mattmann/apache-nutch-1.5/rc1/ I used the KEYS file stored on SVN under the 1.5 tag (as below), and got the following when verifying the above RC (stored on your p.a.o area) lewis@lewis-01:~/Desktop$ gpg --import KEYS gpg: key A7239D59: Doug Cutting (Lucene guy) cutt...@apache.org not changed gpg: key 7C491924: public key Piotr Kosiorowski pkosiorow...@apache.org imported gpg: key 0B7E6CFA: public key Sami Siren si...@apache.org imported gpg: key 57163A4D: public key Dennis E. Kubes ku...@apache.org imported gpg: key 24BCF054: public key Chris A. Mattmann mattm...@apache.org imported gpg: Total number processed: 5 gpg: imported: 4 gpg: unchanged: 1 gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model gpg: depth: 0 valid: 1 signed: 0 trust: 0-, 0q, 0n, 0m, 0f, 1u lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-bin.tar.tar.gz.asc gpg: no signed data gpg: can't hash datafile: file open error lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-bin.zip.asc gpg: Signature made Mon 16 Apr 2012 06:00:20 BST using DSA key ID B876884A gpg: Can't check signature: public key not found lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-src.tar.gz.asc gpg: Signature made Mon 16 Apr 2012 06:00:18 BST using DSA key ID B876884A gpg: Can't check signature: public key not found lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-src.zip.asc gpg: Signature made Mon 16 Apr 2012 06:00:22 BST using DSA key ID B876884A gpg: Can't check signature: public key not found lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-bin.tar.tar.gz.asc e32088205efd59ffc882c79add0bafae apache-nutch-1.5-bin.tar.tar.gz.asc lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-bin.zip.asc ff7960b8540673a86756f6b3f53ffd79 apache-nutch-1.5-bin.zip.asc lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-src.tar.gz.asc 9da161bcd5ec0de3f702a12e6bfbf9e6 apache-nutch-1.5-src.tar.gz.asc lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-src.zip.asc 6750bbc93b028776fa888f988df3a614 apache-nutch-1.5-src.zip.asc Some comments: 1) I don't think the tar should be appended twice for the apache-nutch-1.5-bin.tar.tar.gz artefact and accompanying sigs. 2) None of my other attempts to verify the other artefacts via gpg worked! 3) All attempts to verify via md5sum did not match the strings present in your p.a.o area! 4) Really really trivial, but in our NOTICE file, it stated a date of 2009. I should have picked this up a while ago when I updated the other dates in these files, this one seems to have slipped through the net. The release candidate is a zip and tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5/ Stuff in SVN tag looks OK apart from the stuff I mentioned above. And a binary build suitable for deployment. A staged Maven repository is available here: https://repository.apache.org/content/repositories/orgapachenutch-054/ I've not got around to checking the gpg and md5sum verifications yet, as I'm waiting for someone to confirm that the above failed verifications are correct before I do so. I'm hoping that I've made a mistake somewhere. [X ] -1 Do not release this package because... Because of the above, unless I discover that I've done something wrong then I can't VOTE yes. I'm open to discussion on this, if someone can display that I've taken a wrong turn somewhere then I might change my VOTE however for the time being I need to call this one down. Thanks for spinning the RC Chris. Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[VOTE] Apache Nutch 1.5 release rc #1
Hi Folks, A candidate for the Nutch 1.5 release is available at: http://people.apache.org/~mattmann/apache-nutch-1.5/rc1/ The release candidate is a zip and tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.5/ And a binary build suitable for deployment. A staged Maven repository is available here: https://repository.apache.org/content/repositories/orgapachenutch-054/ Please vote on releasing this package as Apache Nutch 1.5. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.5 [ ] -1 Do not release this package because... Thanks! Cheers, Chris P.S. Here's my +1. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Nutch 1.x trunk release
Hey Julien, Yeah my weekend flew by -- this and the SIS RC are the top items on my opensource TODO :) Hopefully this week... Cheers, Chris On Apr 10, 2012, at 8:07 AM, Julien Nioche wrote: Hi guys, Chris - any idea of if / when you'll have the time to do a RC for trunk? Thanks Julien On 3 April 2012 15:30, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Thanks Lewis! Cheers, Chris P.S. Hopefully by this weekend... On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote: Hi, On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: Seems fine. Only updating KEYS is no longer necessary. Now sorted. Thanks whenever you can get round to this Chris. Best Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: NutchGora release, and Nutch 1.x trunk release
Hi Markus, On Apr 3, 2012, at 5:50 AM, Markus Jelsma wrote: Cool! Next time i'll ask infra to allow to supress notifications. Chris, will you RM one RC? And if possible list the detailed steps/command in the process in case you don't have to time RM 1.6 when the time comes. The wiki is dated. Happy to RM it. Check the wiki here: http://wiki.apache.org/nutch/Release_HOWTO Lewis and I updated this after the last release. It's more or less what's required to release the project and what I run. It's also really similar to the OODT release process: https://cwiki.apache.org/confluence/display/OODT/Release+Process Was there something specific that you weren't seeing there? I'm looking forward to yet another big release with lots of fixes and improvements! Agreed, thanks everyone! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: NutchGora release, and Nutch 1.x trunk release
Thanks Lewis! Cheers, Chris P.S. Hopefully by this weekend... On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote: Hi, On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote: Seems fine. Only updating KEYS is no longer necessary. Now sorted. Thanks whenever you can get round to this Chris. Best Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
NutchGora release, and Nutch 1.x trunk release
Hey Guys, I've got some cycles this weekend -- anyone up for a 1.5 release off trunk (stable), and a NutchGora branch release? I suggested this before [1] regarding NutchGora. I'm inclined to say let's do the following: 1. NutchGora: apache-nutch-2.0 - release 2.x series based on this branch 2. Nutch: apache-nutch-1.x - stable trunk branch Then, when the time comes, we can try and create a: 3. Nutch: apache-nutch-3.x - merge of 1.x and 2.x feature branches Would this make sense? Anyways we don't have to decide anything now that we can't undo later, but are folks OK with me doing an RC for NutchGora and for 1.x this weekend? Cheers, Chris [1] http://s.apache.org/GD2 ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: NutchGora release, and Nutch 1.x trunk release
Hey Guys, OK, sounds good. Looks like we need to wait for the Tika 1.1 release (seems to be going well so far), and then try and push Gora 0.2 (which I know Lewis is pushing, and which I'm happy to RM once we're ready there). So, maybe I'll shoot for next weekend or the weekend after to push Nutch 1.5 and 2.0 RCs. Cheers, Chris On Mar 8, 2012, at 7:23 AM, Lewis John Mcgibbney wrote: Yeah I agree Chris Markus. On the Nutchgora note, I would like to see Gora 0.2. released before hand, as we have a blocking issue NUTCH-1205 with Ivy retrieving alien Gora 0.2-SNAPSHOT dependencies from repository.apache.org. We should be able to overcome this issue by releasing Gora 0.2 to maven central then just pulling those dependencies with Ivy in Nutchgora rather than messing about with chain/multiple/snapshot resolvers in the Ivy configuration. My 2 cents On Thu, Mar 8, 2012 at 3:03 PM, Markus Jelsma markus.jel...@openindex.io wrote: +1 1.5 has, again, many fixes and improvements, just as 1.4 had over 1.3. But i'd like to integrate Tika 1.1 after its pending release. Cheers On Thursday 08 March 2012 15:38:15 Mattmann, Chris A (388J) wrote: Hey Guys, I've got some cycles this weekend -- anyone up for a 1.5 release off trunk (stable), and a NutchGora branch release? I suggested this before [1] regarding NutchGora. I'm inclined to say let's do the following: 1. NutchGora: apache-nutch-2.0 - release 2.x series based on this branch 2. Nutch: apache-nutch-1.x - stable trunk branch Then, when the time comes, we can try and create a: 3. Nutch: apache-nutch-3.x - merge of 1.x and 2.x feature branches Would this make sense? Anyways we don't have to decide anything now that we can't undo later, but are folks OK with me doing an RC for NutchGora and for 1.x this weekend? Cheers, Chris [1] http://s.apache.org/GD2 ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Markus Jelsma - CTO - Openindex -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Fwd: Google Summer of Code 2012 upcoming
Guys, FYI...in case anyone is thinking of GSoC, deadlines are approaching. Process is described below... Thanks! Cheers, Chris Begin forwarded message: From: Ulrich Stärk u...@apache.org Date: March 4, 2012 9:01:07 AM PST To: p...@apache.org p...@apache.org Cc: d...@community.apache.org d...@community.apache.org Subject: Google Summer of Code 2012 upcoming Reply-To: priv...@hadoop.apache.org priv...@hadoop.apache.org Hello PMCs, Google Summer of Code is the ideal opportunity for you to attract new contributors to your projects. If you want to participate with your project you NOW need to - understand what it means to be a mentor [1] - propose your project ideas. Just label your issues with gsoc2012 in JIRA and they will show up at [2]. See also [1]. - subscribe to code-awa...@apache.org (restricted to potential mentors, meant to be used as a private list - general discussions on the public d...@community.apache.org list as much as possible please) The ASF will apply as a participating organization with GSoC, your project doesn't need to do that. See [3] for more information. Note that the ASF isn't accepted yet, nevertheless you *really* should start recording your ideas now. Last year we had 38 students completing GSoC successfully, some of which are now active contributors to the projects they worked on. Let's make this a success again this year! On behalf of the GSoC 2012 admins, Uli [1] http://community.apache.org/guide-to-being-a-mentor.html [2] http://s.apache.org/gsoc2012tasks [3] http://community.apache.org/gsoc.html ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Fwd: [blog post] Accumulo, Nutch, and Gora
FYI...awesome! Begin forwarded message: From: Jason Trost jason.tr...@gmail.com Date: February 28, 2012 5:41:23 PM PST To: common-u...@hadoop.apache.org common-u...@hadoop.apache.org Subject: [blog post] Accumulo, Nutch, and Gora Reply-To: common-u...@hadoop.apache.org common-u...@hadoop.apache.org Blog post for anyone who's interested. I cover a basic howto for getting Nutch to use Apache Gora to store web crawl data in Accumulo. Let me know if you have any questions. Accumulo, Nutch, and GORA http://www.covert.io/post/18414889381/accumulo-nutch-and-gora --Jason ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [DISCUSS] Nutchgora 2.0 release
+1 guys. Just let me know when you are ready and I can RM it. Cheers, Chris On Feb 20, 2012, at 8:01 AM, Lewis John Mcgibbney wrote: Hi, Not ignoring Chris' comments, but addressing the points below first, please see comments. On Mon, Feb 20, 2012 at 2:57 PM, Ferdy Galema ferdy.gal...@kalooga.com wrote: Aside from the licensing issue, the only thing I really see as a blocker or as something we need to deal with first is Nutch-1205 (upgrade Gora libs). What are we going to do with that one? I'm going to have another crack with these Ivy resolvers, really quite hard to debug. I can only assume the unresolved dependencies are picked up somewhere upstream! As I said I'm going to try and crack this one maybe today if I get the time. About the Nutch API (webapp), my colleague and I have some ideas about how to improve it, in such as way that it is really easy to use. It won't definitely be ready in a upcoming release, especially when there will be a release very soon. Please see the issue[1] for details. I'm not sure what to do with the current webapp implementation, but my suggestion is to to just leave it be as it. (Perhaps mark it as a work-in-progress) This sounds really encouraging. Somewhere in my crazy pot of thoughts was to progress with establishing this task as a GSoC project. In reflection, I think it would be excellent if the work could be dev/user community driven as it would cater exactly for what we need and want. Please see here for the most up-to-date work I could get in this stuff. I updated it slightly to reflect some recent findings. I'll report back when I get more time on the blocker you mention above. http://wiki.apache.org/nutch/NutchAdministrationUserInterface ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [DISCUSS] Nutchgora 2.0 release
Hey Lewis, I'd be +1 to roll a Nutchgora 2.0 release. I could see dealing with this in two ways, neither of which I like better than the other: 1. Release the nutchgora branch as apache-nutch-2.0, and then nutchgora becomes the 2.0 branch of the system (and we could create branch-2.0) The 1.x trunk branch, as it evolves and gets closer to 2.0, the last release of it is 1.9, then we do 3.0, which could either be: - a merge or combination of 1.x features and 2.x features - simply the next path for 1.x, and independent of 2.x 2. Call the artifact, apache-nutchgora-2.0, independent of the current trunk artifact and its release cycle. Either way, is fine with me. Cheers, Chris On Feb 17, 2012, at 7:23 AM, Lewis John Mcgibbney wrote: Hi Guys, Here we are again :0) What are the perceptions with aiming for a 2.0 release? We have one blocking issue, the webapp, which I got no response from the community at large about. I would like to see this addressed but this is another issue. Speaking with the future in mind, we are hoping to get a Gora 0.2 release out of the door, once a licensing issue is dealt with (the only blocker) and a few other things. Therefore would it be realistic to aim for a Nutch 2.0 release shortly after that? My justification for raising this thread again, is that we are seeing (some) more users interested in this branch/code, I think it is a real shame that we have not been able to get a release yet. I would really like to get more people using the code and hopefully getting involved in identifying bugs, and fixing them if possible. The question has been open for ages, so I just wonder if anything has changed now that Gora is doing better as of recent. Thanks Lewis -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Fwd: [Announce] Google Summer of Code 2012
Any Nutch Devs interested in a GSoC student? Begin forwarded message: From: Luciano Resende luckbr1...@gmail.com Date: February 4, 2012 10:40:03 AM PST To: d...@community.apache.org d...@community.apache.org, code-awards code-awa...@apache.org Subject: Fwd: [Announce] Google Summer of Code 2012 Reply-To: d...@community.apache.org d...@community.apache.org -- Forwarded message -- From: Carol Smith car...@google.com Date: Sat, Feb 4, 2012 at 8:44 AM Subject: [Announce] Google Summer of Code 2012 To: Google Summer of Code Discuss google-summer-of-code-disc...@googlegroups.com Hi all, We're pleased to announce that Google Summer of Code will be happening for its eighth year this year. Please check out the blog post [1] about the program and read the FAQs [2] and Timeline [3] on Melange for more information. Please consider translating the presentations and/or flyers into your native language and submitting them directly to me to post on the wiki. Localization for our material is integral to reaching the widest possible audience around the world. [1] - http://google-opensource.blogspot.com/2012/02/google-summer-of-code-2012-is-on.html [2] - http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2012/faqs [3] - http://www.google-melange.com/gsoc/events/google/gsoc2012 Cheers, Carol -- Luciano Resende http://people.apache.org/~lresende http://twitter.com/lresende1975 http://lresende.blogspot.com/ ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Fwd: [Announce] Google Summer of Code 2012
FYI Begin forwarded message: From: Ross Gardler rgard...@opendirective.com Date: February 5, 2012 1:45:18 PM PST To: d...@community.apache.org d...@community.apache.org Subject: RE: [Announce] Google Summer of Code 2012 Reply-To: d...@community.apache.org d...@community.apache.org For those new to GSoC you might want to review the roles defined at http://community.apache.org/mentoringprogramme.html and the GSoC specific info at http://community.apache.org/gsoc.html (yet to be updated for 2012) Sent from my mobile device, please forgive errors and brevity. On Feb 5, 2012 8:31 PM, Franklin, Matthew B. mfrank...@mitre.org wrote: ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: % of different content types out there on the web
Hi Markus, Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes compared to the size of the entire corpus? Cheers, Chris On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote: We only crawl HTML and PDF files for a lot of cc-TLD's so we only have data on those two. However, we also explicitly filter out all/most unwanted suffixes. We do have a lot of suffixes that we encountered so far. On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote: (sorry for the cross post) Hey Guys, I'm trying to find a good citation or estimate (if anyone has done one) that estimates the breakout (by % or some other metric) of content types out there out the web (with a whole web crawl or a meaningful representative dataset) that are non HTML. Anyone have any ideas about this? Thanks! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Markus Jelsma - CTO - Openindex ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [DISCUSS] Issues with Fetcher
Hi Ken, On Jan 21, 2012, at 10:33 AM, Ken Krugler wrote: My own personal favorite area would be to integrate with crawler-commons. +1. Would you crawler-commons guys be interested in bringing that code to Apache? How about bringing it over to Nutch? Would that be something you'd be interested in? Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output
Yay, all I heard was that it's building again woo hoo! On Jan 6, 2012, at 9:03 AM, Markus Jelsma wrote: Ah, i get 88 warnings now but things build fine. This is indeed quite more verbose :) On Tuesday 27 December 2011 17:28:31 Lewis John McGibbney (Commented) (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.p lugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176220#comm ent-13176220 ] Lewis John McGibbney commented on NUTCH-1237: - If I can get a +1 I'll commit. Thank you Improve javac arguements for more verbose output - Key: NUTCH-1237 URL: https://issues.apache.org/jira/browse/NUTCH-1237 Project: Nutch Issue Type: Improvement Components: build Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Fix For: nutchgora, 1.5 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch When trying to fix another problem I stumbled across this one. I think it is important to ensure that the javac outputs info regarding deprecation and unchecked operations. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira -- Markus Jelsma - CTO - Openindex ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Build failed in Jenkins: Nutch-trunk #1702
Merry Christmas buddy! Cheers, Chris On Dec 25, 2011, at 9:14 AM, Lewis John Mcgibbney wrote: Hi Guys, Our trunk builds have been broken since migrating to new Hadoop 0.20.2 and migrating CrawlDBScanner to new MR API e.g. trunk build [1] 1698. Looking to the stack trace, I'm assuming that this has to do with how we are specifying the new file reads. Hopefully this shouldn't be too hard to solve so maybe we can get on to it at some stage in the near future. I just want to say Merry Christmas to EVERYONE celebrating and happy holidays to everyone else who may not be. Best Lewis [1] https://builds.apache.org/view/M-R/view/Nutch/job/Nutch-trunk/1698/ On Sat, Dec 24, 2011 at 7:36 AM, Apache Jenkins Server jenk...@builds.apache.org wrote: See https://builds.apache.org/job/Nutch-trunk/1702/changes Changes: [markus] Updated pom to reflect Hadoop upgrade -- [...truncated 2836 lines...] resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlfilter-validator [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-validator/classes jar: [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-validator/urlfilter-validator.jar deps-test: deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlmeta [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 2 source files to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes jar: [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/urlmeta.jar deps-test: deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes jar: [jar] Building jar:
Re: get rid of outlink code for Tika
+1 from me -- those 3 Tika content handlers should take care of it... Cheers, Chris On Dec 21, 2011, at 6:51 AM, Markus Jelsma wrote: Hi, For using Boilerpipe we need LinkCH, BoilerpipeCH and TeeCH in Tika. LinkCH returns all URL's with some meta data such as title etc. Fixes for old parsers such as Neko are then obsolete. I propose to rely on Tika for all outlinks. Right now this means not all types are returned such as area, form and whatelse. Is this a big problem? Rel is also not returned but i patched Tika to do that so we can still do something with nofollow which is important. Thanks -- Markus Jelsma - CTO - Openindex ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Improving API Java Documentation
Hi Lewis, +1 from me to the update and to logging a JIRA issue. Always nice to see an associated changelog entry for any (even non trivial) updates, short of typos and error corrections in docs/etc. Up to you though, since you're the one doing the work :-) Cheers, Chris On Dec 12, 2011, at 10:28 AM, Lewis John Mcgibbney wrote: Hi Guys, Been doing some snooping around the code recently and think that the API documentation [1] could do with some improving in some areas, please see corresponding Jira issue [2]. A real minor discrepancy I've encountered early on is that the ${name} variable is set in default.properties as ${name} and in build.xml as ${Name}, this means that it is not recognized within the Javadocs [1]. I propose to change this to ${name}, and to additionally add a Capital to the variable value therefore making it Nutch? Any thoughts? Does this require a Jira to be logged as well? Thanks [1] http://nutch.apache.org/apidocs-1.4/index.html [2] https://issues.apache.org/jira/browse/NUTCH-1218 -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Best way to get files out of segment directories
Hey Lewis, Makes total sense. I'll get a patch going this week. Cheers, Chris On Nov 30, 2011, at 8:05 AM, Lewis John Mcgibbney wrote: Hi Chris, There is absolutely no doubt that this is of use, exactly for the issues Markus highlights. I wonder if it is worth adding general options similar to that which are offered by readseg [1]. This would mean that it would be possible to ignore certain directories within a segments directory, therefore reducing overhead on the SegmentContentDumper tool and possibly providing a more accurate content dump. Does this make any sense? [1] http://wiki.apache.org/nutch/bin/nutch_readseg On Tue, Nov 29, 2011 at 8:01 AM, Markus Jelsma markus.jel...@openindex.io wrote: Sounds useful indeed! Especially with the regex pattern. Reading files from the fs is a lot faster then using segread all the time. CTO - Openindex.io Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov schreef: OK, of course, I figured it out, and updated my program :-) You can see it on Github below. I'm going to clean up and generalize this program because I think it's of general use. I'll create an issue shortly. I'm thinking the tool could be something like: ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options] -segmentRootDir full file path to the root segment directory, e.g., crawl/segments -regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment -outputDir The output directory to write file names to. -metadata --key=value where key is a Content Metadata key and value is a value to check. If the URL and its content metadata have a matching key,value pair, dump it. Allow for regex matching on the value. This would allow users to unravel the content hidden in segment directories and in sequence files into useable files that were downloaded by Nutch. Do you guys see this as a useful tool? If so, I'll contribute it this week for 1.5. Cheers, Chris On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote: Hey Guys, One more thing. Just to let you know I've followed this blog here: http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/ And started to write a simple program to read the keys in a Segment file, and then dump out the byte content if the key matches the desired URL. You can find my code here: https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java Unfortunately, this code keeps dying due to OOM issues, clearly because the data file is too big, and because I likely have to M/R this. Just wanted to let you guys know where I'm at, and what I've been trying. Thanks, Chris On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote: Hey Guys, So, I've completed my crawl of the vault.fbi.gov website for my class that I'm preparing for. I've got: [chipotle:local/nutch/framework] mattmann% du -hs crawl 28Gcrawl [chipotle:local/nutch/framework] mattmann% [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/ total 0 drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:49 2027104947/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:50 2027104955/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:52 2027105006/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 12:57 2027105251/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 14:46 2027125721/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 16:42 2027144648/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 18:43 2027164220/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 20:44 2027184345/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 22:48 2027204447/ drwxr-xr-x 8 mattmann wheel 272 Nov 28 00:50 2027224816/ [chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/ NAMEGENERATED FETCHER STARTFETCHER END FETCHED PARSED 2027104947 12011-11-27T10:49:502011-11-27T10:49:50 1 1 2027104955 31 2011-11-27T10:49:572011-11-27T10:49:58 31 31 2027105006 4898 2011-11-27T10:50:082011-11-27T10:51:40 48984890 2027105251 9890 2011-11-27T10:52:522011-11-27T11:56:06 714 713 2027125721 9202 2011-11-27T12:57:242011-11-27T14:00:17 971 686 2027144648 8261 2011-11-27T14:46:502011-11-27T15:48:25 714 712 2027164220 7575 2011-11-27T16:42:222011-11-27T17:45:50 720 718 2027184345 6871 2011-11-27T18:43:482011-11-27T19:47:11 767 766 2027204447 6116 2011-11-27T20:44:502011-11-27T21:48:07 725 724 2027224816 5406 2011-11-27T22:48:182011-11-27T23:51:33 744
Best way to get files out of segment directories
Hey Guys, So, I've completed my crawl of the vault.fbi.gov website for my class that I'm preparing for. I've got: [chipotle:local/nutch/framework] mattmann% du -hs crawl 28Gcrawl [chipotle:local/nutch/framework] mattmann% [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/ total 0 drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:49 2027104947/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:50 2027104955/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:52 2027105006/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 12:57 2027105251/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 14:46 2027125721/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 16:42 2027144648/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 18:43 2027164220/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 20:44 2027184345/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 22:48 2027204447/ drwxr-xr-x 8 mattmann wheel 272 Nov 28 00:50 2027224816/ [chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/ NAMEGENERATED FETCHER START FETCHER END FETCHED PARSED 2027104947 1 2011-11-27T10:49:50 2011-11-27T10:49:50 1 1 2027104955 31 2011-11-27T10:49:57 2011-11-27T10:49:58 31 31 2027105006 48982011-11-27T10:50:08 2011-11-27T10:51:40 48984890 2027105251 98902011-11-27T10:52:52 2011-11-27T11:56:06 714 713 2027125721 92022011-11-27T12:57:24 2011-11-27T14:00:17 971 686 2027144648 82612011-11-27T14:46:50 2011-11-27T15:48:25 714 712 2027164220 75752011-11-27T16:42:22 2011-11-27T17:45:50 720 718 2027184345 68712011-11-27T18:43:48 2011-11-27T19:47:11 767 766 2027204447 61162011-11-27T20:44:50 2011-11-27T21:48:07 725 724 2027224816 54062011-11-27T22:48:18 2011-11-27T23:51:33 744 744 [chipotle:local/nutch/framework] mattmann% So the reality is, after crawling vault.fbi.gov, all I really wanted is the extracted PDF files that are housed in those segments. I've been playing around with ./bin/nutch readseg, and all I can say based on my initial impressions here are that it's really hard to get it to fulfill these simple requirements that I want it to do: 1. Iterate over all the segments - pull out URLs that have at_download/file in them - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor is the readable PDF name, the actual URL is a Plone CMS url, with little meaning) 2. for each PDF file anchor name - create a file in output_dir with the PDF file data read from the segment My guess is that even at the scale of data that I'm dealing with (10s of GB), that it's impossible and impractical to do anything that's not M/R here. Unfortunately there isn't a tool that will simply grab me the PDF files out of the segment files and then output those into a director, appropriately named with the anchor text. Or...is there? ;-) I'm running in Local mode, with no Hadoop cluster behind me, and with a Mac Book Pro, 4 core, 2.8 Ghz, with 8 GB RAM behind me to get this working, intentionally as I don't want it to be a requirement for folks to have a cluster to do this assignment that I'm working on. I was talking to Ken Krugler about this, and after picking his brain, I think that I'm going to have to end up writing a tool to do what I want. So, if that's the case, fine, but can someone point me in the right direction for a good starting point for this? Ken also thought Andrzej might have like 10 magic solutions to make this happen, so here's hoping he's out there listening :-) Thanks for the help, guys. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Best way to get files out of segment directories
Hey Guys, One more thing. Just to let you know I've followed this blog here: http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/ And started to write a simple program to read the keys in a Segment file, and then dump out the byte content if the key matches the desired URL. You can find my code here: https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java Unfortunately, this code keeps dying due to OOM issues, clearly because the data file is too big, and because I likely have to M/R this. Just wanted to let you guys know where I'm at, and what I've been trying. Thanks, Chris On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote: Hey Guys, So, I've completed my crawl of the vault.fbi.gov website for my class that I'm preparing for. I've got: [chipotle:local/nutch/framework] mattmann% du -hs crawl 28G crawl [chipotle:local/nutch/framework] mattmann% [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/ total 0 drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:49 2027104947/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:50 2027104955/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:52 2027105006/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 12:57 2027105251/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 14:46 2027125721/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 16:42 2027144648/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 18:43 2027164220/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 20:44 2027184345/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 22:48 2027204447/ drwxr-xr-x 8 mattmann wheel 272 Nov 28 00:50 2027224816/ [chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/ NAME GENERATED FETCHER START FETCHER END FETCHED PARSED 20271049471 2011-11-27T10:49:50 2011-11-27T10:49:50 1 1 202710495531 2011-11-27T10:49:57 2011-11-27T10:49:58 31 31 202710500648982011-11-27T10:50:08 2011-11-27T10:51:40 48984890 202710525198902011-11-27T10:52:52 2011-11-27T11:56:06 714 713 202712572192022011-11-27T12:57:24 2011-11-27T14:00:17 971 686 202714464882612011-11-27T14:46:50 2011-11-27T15:48:25 714 712 202716422075752011-11-27T16:42:22 2011-11-27T17:45:50 720 718 202718434568712011-11-27T18:43:48 2011-11-27T19:47:11 767 766 202720444761162011-11-27T20:44:50 2011-11-27T21:48:07 725 724 202722481654062011-11-27T22:48:18 2011-11-27T23:51:33 744 744 [chipotle:local/nutch/framework] mattmann% So the reality is, after crawling vault.fbi.gov, all I really wanted is the extracted PDF files that are housed in those segments. I've been playing around with ./bin/nutch readseg, and all I can say based on my initial impressions here are that it's really hard to get it to fulfill these simple requirements that I want it to do: 1. Iterate over all the segments - pull out URLs that have at_download/file in them - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor is the readable PDF name, the actual URL is a Plone CMS url, with little meaning) 2. for each PDF file anchor name - create a file in output_dir with the PDF file data read from the segment My guess is that even at the scale of data that I'm dealing with (10s of GB), that it's impossible and impractical to do anything that's not M/R here. Unfortunately there isn't a tool that will simply grab me the PDF files out of the segment files and then output those into a director, appropriately named with the anchor text. Or...is there? ;-) I'm running in Local mode, with no Hadoop cluster behind me, and with a Mac Book Pro, 4 core, 2.8 Ghz, with 8 GB RAM behind me to get this working, intentionally as I don't want it to be a requirement for folks to have a cluster to do this assignment that I'm working on. I was talking to Ken Krugler about this, and after picking his brain, I think that I'm going to have to end up writing a tool to do what I want. So, if that's the case, fine, but can someone point me in the right direction for a good starting point for this? Ken also thought Andrzej might have like 10 magic solutions to make this happen, so here's hoping he's out there listening :-) Thanks for the help, guys. Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246
Re: Best way to get files out of segment directories
OK, of course, I figured it out, and updated my program :-) You can see it on Github below. I'm going to clean up and generalize this program because I think it's of general use. I'll create an issue shortly. I'm thinking the tool could be something like: ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options] -segmentRootDir full file path to the root segment directory, e.g., crawl/segments -regexUrlPattern a regex URL pattern to select URL keys to dump from the content DB in each segment -outputDir The output directory to write file names to. -metadata --key=value where key is a Content Metadata key and value is a value to check. If the URL and its content metadata have a matching key,value pair, dump it. Allow for regex matching on the value. This would allow users to unravel the content hidden in segment directories and in sequence files into useable files that were downloaded by Nutch. Do you guys see this as a useful tool? If so, I'll contribute it this week for 1.5. Cheers, Chris On Nov 28, 2011, at 7:32 PM, Mattmann, Chris A (388J) wrote: Hey Guys, One more thing. Just to let you know I've followed this blog here: http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/ And started to write a simple program to read the keys in a Segment file, and then dump out the byte content if the key matches the desired URL. You can find my code here: https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java Unfortunately, this code keeps dying due to OOM issues, clearly because the data file is too big, and because I likely have to M/R this. Just wanted to let you guys know where I'm at, and what I've been trying. Thanks, Chris On Nov 28, 2011, at 7:23 PM, Mattmann, Chris A (388J) wrote: Hey Guys, So, I've completed my crawl of the vault.fbi.gov website for my class that I'm preparing for. I've got: [chipotle:local/nutch/framework] mattmann% du -hs crawl 28G crawl [chipotle:local/nutch/framework] mattmann% [chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/ total 0 drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:49 2027104947/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:50 2027104955/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 10:52 2027105006/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 12:57 2027105251/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 14:46 2027125721/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 16:42 2027144648/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 18:43 2027164220/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 20:44 2027184345/ drwxr-xr-x 8 mattmann wheel 272 Nov 27 22:48 2027204447/ drwxr-xr-x 8 mattmann wheel 272 Nov 28 00:50 2027224816/ [chipotle:local/nutch/framework] mattmann% ./bin/nutch readseg -list -dir crawl/segments/ NAME GENERATED FETCHER START FETCHER END FETCHED PARSED 2027104947 1 2011-11-27T10:49:50 2011-11-27T10:49:50 1 1 2027104955 31 2011-11-27T10:49:57 2011-11-27T10:49:58 31 31 2027105006 48982011-11-27T10:50:08 2011-11-27T10:51:40 48984890 2027105251 98902011-11-27T10:52:52 2011-11-27T11:56:06 714 713 2027125721 92022011-11-27T12:57:24 2011-11-27T14:00:17 971 686 2027144648 82612011-11-27T14:46:50 2011-11-27T15:48:25 714 712 2027164220 75752011-11-27T16:42:22 2011-11-27T17:45:50 720 718 2027184345 68712011-11-27T18:43:48 2011-11-27T19:47:11 767 766 2027204447 61162011-11-27T20:44:50 2011-11-27T21:48:07 725 724 2027224816 54062011-11-27T22:48:18 2011-11-27T23:51:33 744 744 [chipotle:local/nutch/framework] mattmann% So the reality is, after crawling vault.fbi.gov, all I really wanted is the extracted PDF files that are housed in those segments. I've been playing around with ./bin/nutch readseg, and all I can say based on my initial impressions here are that it's really hard to get it to fulfill these simple requirements that I want it to do: 1. Iterate over all the segments - pull out URLs that have at_download/file in them - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor is the readable PDF name, the actual URL is a Plone CMS url, with little meaning) 2. for each PDF file anchor name - create a file in output_dir with the PDF file data read from the segment My guess is that even at the scale of data that I'm dealing with (10s of GB), that it's impossible and impractical to do anything that's not M/R here. Unfortunately there isn't a tool
[RESULT] [VOTE] Apache Nutch 1.4 release rc #2
Hi Everyone, This VOTE has passed: +1 PMC Julien Nioche Markus Jelsma Lewis John McGibbney Chris Mattmann I'll go ahead and update the website and push the release out to the mirrors. Thanks for VOTE'ing and for your patience! Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
[ANNOUNCE] Apache Nutch 1.4 released
(...apologies for the cross posting...) The Apache Nutch project is pleased to announce the release of Apache Nutch 1.4. The release contents have been pushed out to the main Apache release site so the releases should be available as soon as the mirrors get the syncs. Apache Nutch is an extensible framework for building out large-scale web-based search. Layered on top of fellow Apache projects Hadoop, Lucene/Solr, and Tika, Nutch provides an out of the box platform for fetching web pages, pdf files, word documents, and more. Nutch parses the content and its relevant information, indexes its metadata, and makes it available for efficient query and retrieval over modern Internet protocols. Apache Nutch 1.4 contains a number of improvements and bug fixes. Details can be found in the changes file: http://www.apache.org/dist/nutch/CHANGES-1.4.txt Apache Nutch is available in source and binary form from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/ Nutch is also available as a Jar dependency from the Central repository: http://repo2.maven.org/maven2/org/apache/nutch/ In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site: http://www.apache.org/dist/nutch/KEYS For more information on Apache Nutch, visit the project home page: http://nutch.apache.org -- Chris Mattmann (on behalf of the Apache Nutch community) ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: 2 things I noticed that I will file JIRA issues + fix
Hi Markus, Super +1. Thanks for incorporating it as part of your patch. 1184 looks good -- my +1 to commit it, even if i progress. Then we can close out 1212 at that point. Thanks! Cheers, Chris On Nov 25, 2011, at 5:16 AM, Markus Jelsma wrote: Hi On Friday 25 November 2011 01:13:47 Mattmann, Chris A (388J) wrote: Hi Markus, On Nov 24, 2011, at 12:03 PM, Markus Jelsma wrote: So, what's the point of that initial if(...) block outside of the for loop. Isn't it redundant? This is trunk? I've been and still am working on some issues for a new feature in this part of that source file. https://issues.apache.org/jira/browse/NUTCH-1184 https://issues.apache.org/jira/browse/NUTCH-1174 Yep it's trunk alright. I'm fine with you making the update I suggested, or with me doing it. 2 questions: 1. Am I right in observing that the code is redundant and should be removed? I believe so. Ive tested the removal of that part with the code of NUTCH-1184 and all goes well. 2. If I am right on #1, do you want me to make the update, or are you saying that you want to make it as part of NUTCH-1184 and NUTCH-1174? 1174 is already committed. Ive added a patch for ParseOutputformat to 1184 incorporating your newly created patch. cheers Cheers, Chris ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -- Markus Jelsma - CTO - Openindex ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
2 things I noticed that I will file JIRA issues + fix
...after I get back from Thanksgiving dinner :-) 1. In URLFilterChecker, the cmd line tool requires URLs to be fed into it on STDIN, but that isn't documented anywhere, even in the tool help printed to STDOUT. I'll fix that. 2. In ParseOutputFormat, I see a code block: {code} // collect outlinks for subsequent db update Outlink[] links = parseData.getOutlinks(); int outlinksToStore = Math.min(maxOutlinks, links.length); if (ignoreExternalLinks) { try { fromHost = new URL(fromUrl).getHost().toLowerCase(); } catch (MalformedURLException e) { fromHost = null; } } else { fromHost = null; } {code} The if(ignoreExternalLinks) part then gets subsequently set and reset in the ensuing for loop: {code} int validCount = 0; CrawlDatum adjust = null; ListEntryText, CrawlDatum targets = new ArrayListEntryText, CrawlDatum(outlinksToStore); ListOutlink outlinkList = new ArrayListOutlink(outlinksToStore); for (int i = 0; i links.length validCount outlinksToStore; i++) { String toUrl = links[i].getToUrl(); // ignore links to self (or anchors within the page) if (fromUrl.equals(toUrl)) { continue; } if (ignoreExternalLinks) { try { toHost = new URL(toUrl).getHost().toLowerCase(); } catch (MalformedURLException e) { toHost = null; } if (toHost == null || !toHost.equals(fromHost)) { // external links continue; // skip it } } {code} So, what's the point of that initial if(...) block outside of the for loop. Isn't it redundant? If so, I'll file an issue and fix that. Cheers, Chris P.S. Happy Thanksgiving to Nutch'ers in the US! ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Dependency Injection
Hey PJ, On Nov 22, 2011, at 10:47 AM, PJ Herring wrote: Hey Chris, Thanks for the response. I looked at the documents you sent me, and I really do think incorporating some kind of DI Framework could be a great addition to Nutch. I have a general plan of attack, but I'll try to write that up more formally and send it out to get some kind of feedback. +1, would love to see it. One question I had when looking at this stuff is what is the status of Nutch 2? It looks like the architecture has shifted quite a bit from 1.3? Nutch2 was originally slated to be the Nutch Gora branch (see here [1]). We ended up deciding [2] that the trunk was more akin to folks who were maintaining the 1.x series of Nutch and thus moved the Nutch Gora branch into [1]. We still have a lot of goals though for Nutch2, which I think we're just working to more incrementally, rather than radically, as before. There are still folks here working on Nutch Gora though so if you're interested in that, check it out. Cheers, Chris [1] http://svn.apache.org/repos/asf/nutch/branches/nutchgora [2] http://s.apache.org/zX ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Dependency Injection
Hey PJ, You aren't being an ass at all. You're asking an important question, and something I've been interested in for a while. Here are some relevant threads to take a look at: http://wiki.apache.org/nutch/Nutch2Architecture http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12688.html http://www.slideshare.net/chrismattmann/lessons-learned-in-the-development-of-a-webscale-search-engine-nutch2-and-beyond https://issues.apache.org/jira/browse/NUTCH-609 http://osdir.com/ml/user.nutch.apache/2011-07/msg00080.html http://5341.com/list/48/349985.html If you're interested in contributing to Apache Nutch, check this great guide out written by Dennis Kubes: wiki.apache.org/nutch/Becoming_A_Nutch_Developer Before there wasn't a ton of interest in replacing the plugin system since it worked and we didn't get a lot of complaints or anything. That interest turned into the perception that a DI framework wouldn't be welcome. On the contrary, I'd say if you figured out how to get a DI framework working with the existing plugin system, I can personally say I'd dedicate some of my time towards helping you shepherd it in and I think the rest of the committers would be on board. Thanks for your interest. If you have any more questions, please ask! Cheers, Chris On Nov 21, 2011, at 1:14 PM, PJ Herring wrote: Hey, So I am admittedly a noob with Nutch, but have spent some time digging through the source code. I am just curious if anyone has talked about, in future developments of Nutch, replacing the whole way we register plugins? I ask because I am using Nutch on a project with Maven. At the moment I have to copy a whole bunch of JAR files with there plugin.xml files into a certain directory on build, which is fine, but is kind of breaking the whole Maven paradigm. It would be nice to have some sort of Maven repository where plugins lived, and then wire up which plugins I want to use using some kind of DI framework, like Spring or Guice. Then instead of writing XML Plugin Descriptor Files, every plugin could write a class extending PluginDescriptor and register its self with the PluginRepo, or something of the sort. Also, I have never contributed to an open source project, so if I am being an ass I don't mean to be. Just would love to help make a great tool better in any way. Best, PJ Herring ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: 0.2-SNAPSHOT now on apache repository
+1 from me, Lewis, great work. Cheers, Chris On Nov 19, 2011, at 4:11 AM, Lewis John Mcgibbney wrote: Hi, Please see here [1], and associated issue logged in Nucth Jira [2]. As I explain in the issue, although Gora trunk is not stable there is ongoing work to fix this. Thanks for now [1] https://repository.apache.org/index.html#nexus-search;quick~gora [2] https://issues.apache.org/jira/browse/NUTCH-1205 -- *Lewis* ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Lewis John McGibbney sent a message via SimilarPages – A web discovery and search add-on
Awesome news, great to hear! Cheers, Chris On Nov 17, 2011, at 8:57 AM, Lewis John Mcgibbney wrote: Hi, Some more positives here. Lewis -- Forwarded message -- From: Pietro Borradori pietro.borrad...@similarpages.com Date: Thu, Nov 17, 2011 at 4:46 PM Subject: Fw: Lewis John McGibbney sent a message via SimilarPages – A web discovery and search add-on To: lewi...@apache.org lewi...@apache.org Cc: Marco Laurita marco.laur...@similarpages.com Hi Lewis, Thanks for your email... I'm sorry to reply you late... Nutch is a fundamental piece of SimilarPages architecture, because of its crawling features and for the solid base on which it is built that is Hadoop. Hadoop allows us to make all the computations on the crawled data, it is really a fantastic project! Hadoop gives us some headache sometimes when we need large clusters to perform the computation on the crawled data, especially when there are some instances whith hardware failures where Hadoop is supposed to overcome such situations without problems. Marco co-founder/CTO of SimilarPages is at you disposal for any deeper insight re Nutch/Hadoop implementation if helpful. Here is the page of our site re Nutch/Hadoop http://www.similarpages.com/web/index.php?option=com_contentview=articleid=8Itemid=20 We liked Nutch/hadoop projects in our 2 official FB pages: http://www.facebook.com/pages/SimilarPagescom/303352486359786?sk=wall http://www.facebook.com/pages/SimilarPages-A-web-discovery-and-search-addon/149182788451193 A take a tour video here... http://www.similarpages.com/web/index.php?option=com_contentview=articleid=15Itemid=4 You can follow me on twitter @MrCappuccini We've finally released the beta of the SimilarPages search engine!! Check it out at www.similarpages.com and let us know what you think!! my best Pietro Pietro Borradori Founder CEO ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: [VOTE] Apache Nutch 1.4 release rc #1
Thanks for the FYI guys. I've got this on my open source radar, along with reviewing the Airavata release (incubating), and the MRUnit release (incubating) for this week. I'll git er' done. Also, since the release updates for rc #2 were largely aesthetic (aka packaging and naming of the output folder, I might not even have to create a new code branch of entry in repository.apache.org for the Maven artifacts). Yay! Should be next day or two for rc #2 spin up. Also I pointed Lewis at the OODT release guide (which is basically my generic Apache release guide for most Java projects), and he has updated the release wiki for Nutch to be based off of this. Cheers, Chris On Nov 16, 2011, at 9:41 AM, Markus Jelsma wrote: Chris, Any idea of when you'll be able to push a new RC for 1.4? Note : I think some stuff marked as 1.5 has been committed - we might need to check the CHANGES Definately, i've committed several items. When i did my first trunk was already prepared for 1.5. Here's the list of changes since 1.4, please note that CHANGES also already contained the release note and date. This is the first rev. for 1.5: 1200344 (NUTCH-1153) This is the last rev. for 1.4: 1197319 (NUTCH-1195) If this has caused any inconvenience then i apologize. Thanks * NUTCH-1090 InvertLinks should inform when ignoring internal links (Marek Backmann via markus) * NUTCH-1174 Outlinks are not properly normalized (markus) * NUTCH-1203 ParseSegment to show number of milliseconds per parse (markus) * NUTCH-1185 Decrease solr.commit.size to 250 (markus) * NUTCH-1180 UpdateDB to backup previous CrawlDB (markus) * NUTCH-1173 DomainStats doesn't count db_not_modified (markus) * NUTCH-1155 Host/domain limit in generator is generate.max.count+1 (markus) * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex (markus) * NUTCH-1178 Incorrect CSV header CrawlDatumCsvOutputFormat (markus) * NUTCH-1142 Normalization and filtering in WebGraph (markus) * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file (markus) Thanks Julien On 9 November 2011 10:21, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Julien, Thanks. OK, so I will respin an RC for 1.4 that fixes the naming screw up. I already created the KEYS file so we're fine there. Hopefully will get it done this week while at ApacheCon NA. BTW, had a great time meeting Lewis in person today, nice to meet you dude! Cheers, Chris On Nov 8, 2011, at 3:27 AM, Julien Nioche wrote: Hi Chris Thanks for the review. Would you consider the below blockers, or would-be-nice-to-fix? If none are blockers I propose fixing them in 1.5 and pushing 1.4. Thoughts? see below I agree on the naming, sorry for the screw-up. no probs. Do you think this could be fixed for 1.4? The KEYS file isn't really needed, since we just maintain a global keys file at http://www.apache.org/dist/nutch/KEYS. 1.4? would need to modify build.xml Odd on the bin version containing the pom.xml file -- wonder why it's not part of the src -- I just did an SVN export? strange indeed. About the runtime/local thing, I think we can do that for 1.5, but I am totally +1 for it. OK for 1.5 Thanks a lot Julien Let me know what you think. Thanks! Cheers, Chris On Nov 7, 2011, at 7:59 AM, Julien Nioche wrote: Thanks Chris, * it would be good to have the same folder name for the src and bin versions. They are currently 'nutch-1.4' and 'apache-nutch-1.4' * do we really need to include the KEYS file? * bin version contains pom.xml, src version does not. Either include in both or remove altogether * What about having the content of 'runtime/local' as a ready-to-use 'bin' distrib instead? Doesnt make sense to have runtime/deploy as the content of the job file (e.g. nutch-site.xml) would have to be generated from the source anyway. Julien On 5 November 2011 01:03, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, A candidate for the Nutch 1.4 release is available at: http://people.apache.org/~mattmann/apache-nutch-1.4/rc1/ The release candidate is a zip and tar.gz archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-1.4/ And a binary build suitable for deployment. A staged Maven repository is available here: https://repository.apache.org/content/repositories/orgapachenutch-161/ Please vote on releasing this package as Apache Nutch 1.4. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.4 [ ] -1 Do not release this package because... Thanks! Cheers, Chris P.S. Here's my +1. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA
Re: Community Comments
+1 to the GUI comment, even though I haven't made one yet, it's definitely on my list of items should I find the cycles to do more besides releasing. Thanks! Cheers, Chris On Nov 15, 2011, at 1:01 PM, Markus Jelsma wrote: Hi Guys, During ApacheCon I made a point of trying to gauge how people that used Nutch found it. From the outset I would like to say that my reasoning behind this exercise was not to pick holes in the work that we put in to the project as a community, the great ideas, improvements and subsequently Apache product which we develop and maintain is a fantastic piece of software. I thought it could benefit us if we could get, at least a few comments regarding users experience. Here's one for starters --- Hi Lewis, Thank you for contacting us regarding Apache Nutch. Yes, we have been using Nutch for web crawling, and thank you for making it possible! We will gladly share our opinions and comments with you. Here is several items that we like and some that we would like to see addressed in future Nutch development. What we like about Nutch: 1. Open source, Apache license 2. Integrates with Solr 3. Modular architecture, we are a development shop and value the extendability the most 4. Plans for 2.0 to remove search and index from Nutch and only focus on crawling Clearly good points indeed. What we do not like about Nutch: 1. Lack of incremental index update, needs twice the storage to build a new index (will go away in 2.0) I'm not sure what he/she means. The index is in Solr. Perhaps he/she works with old Nutch? 2. Integration with Hadoop FS, it takes disproportional/large amount of space to do segment merging or indexing Seems like old Nutch indeed with embedded Lucene. Segments merging is not something that is required anymore but may be useful from a maintenenace point of view, not for daily operations. 3. Unstable, out of memory exceptions on large crawls during segment merging or indexing, worker threads hang occasionally OOM's are indeed a possibility, we also sometimes suffer from this. However, if one calculates worst case scenario you will most likely never run OOM during fetch, parse or indexing. We rely on good distribution of pages and our average heap consumption is just right, except once in a while ;) The problem is that handling and recovering from OOM is extremely difficult if not impossible. 4. Lack of GUI/web management/reporting Well, i never have and still don't see any useful case for some GUI. It's a complex package of many jobs. What would one want to manage through a GUI? We hope our comments will help you to continue making Nutch an even better Web crawler. Interesting, i'd like to hear more if there is any. Thanks --- Any comments guys? I've already explained to the guy that his first point 4. has been fully addressed in 1.3 onwards. I am curious to get you guys opinions on the rest fo the stuff (over and above the obvious GUI/web management/reporting) stuff. Thank you. ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
Re: Update to release information tutorial
WOOT! Lewis and I talked about updating this at ApacheCon NA and I sent him the OODT release guide and he's done a masterful job updating ours. Thanks Lewis you rock man. Cheers, Chris On Nov 15, 2011, at 1:56 PM, Lewis John Mcgibbney wrote: Hi guys, Please see here [1] for my attempt at updating the release stuff. There WILL be mistakes so please correct where you find them. Thanks [1] http://wiki.apache.org/nutch/Release_HOWTO -- Lewis ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++