[jira] [Commented] (NUTCH-1927) Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing

2015-04-17 Thread Mattmann, Chris A (388J) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14500604#comment-14500604
 ] 

Mattmann, Chris A (388J) commented on NUTCH-1927:
-

+1 please commit! Thanks seb 

Sent from my iPhone



 Create a whitelist of IPs/hostnames to allow skipping of RobotRules parsing
 ---

 Key: NUTCH-1927
 URL: https://issues.apache.org/jira/browse/NUTCH-1927
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: available, patch
 Fix For: 1.10

 Attachments: NUTCH-1927.2015-04-16.patch, 
 NUTCH-1927.2015-04-17.patch, NUTCH-1927.Mattmann.041115.patch.txt, 
 NUTCH-1927.Mattmann.041215.patch.txt, NUTCH-1927.Mattmann.041415.patch.txt, 
 test_NUTCH-1927.2015-04-17.txt


 Based on discussion on the dev list, to use Nutch for some security research 
 valid use cases (DDoS; DNS and other testing), I am going to create a patch 
 that allows a whitelist:
 {code:xml}
 <property>
   <name>robot.rules.whitelist</name>
   <value>132.54.99.22,hostname.apache.org,foo.jpl.nasa.gov</value>
   <description>Comma separated list of hostnames or IP addresses to ignore 
 robot rules parsing for.
   </description>
 </property>
 {code}
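
 As a rough illustration of how a comma-separated value like the one above 
 might be consumed, here is a minimal sketch. It is not taken from the 
 attached patches: the class name RobotRulesWhitelistSketch and the method 
 isWhitelisted are hypothetical, and the sketch assumes exact, 
 case-insensitive matching of host names and IP addresses (no wildcards, no 
 DNS resolution).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not the NUTCH-1927 implementation.
final class RobotRulesWhitelistSketch {
  // Normalized (trimmed, lower-cased) whitelist entries.
  private final Set<String> entries;

  // rawValue is assumed to be the comma-separated value of robot.rules.whitelist.
  RobotRulesWhitelistSketch(String rawValue) {
    if (rawValue == null || rawValue.trim().isEmpty()) {
      this.entries = Collections.emptySet();
    } else {
      this.entries = Arrays.stream(rawValue.split(","))
          .map(String::trim)
          .filter(s -> !s.isEmpty())
          .map(s -> s.toLowerCase(Locale.ROOT))
          .collect(Collectors.toSet());
    }
  }

  // Returns true when the host name or IP address matches an entry exactly
  // (ignoring case and surrounding whitespace).
  boolean isWhitelisted(String hostOrIp) {
    return hostOrIp != null
        && entries.contains(hostOrIp.trim().toLowerCase(Locale.ROOT));
  }
}
```

 A fetcher could, under these assumptions, build the whitelist once from the 
 configured value and call isWhitelisted(host) before deciding whether to 
 parse robots.txt for that host.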



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1832) Make Nutch work without an indexer

2014-09-04 Thread Mattmann, Chris A (388J) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14121477#comment-14121477
 ] 

Mattmann, Chris A (388J) commented on NUTCH-1832:
-

Will reply in more detail soon but will look into enabling plugin back then

Sent from my iPhone



 Make Nutch work without an indexer
 --

 Key: NUTCH-1832
 URL: https://issues.apache.org/jira/browse/NUTCH-1832
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.9
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.10

 Attachments: NUTCH-1832.Mattmann.090314.patch.2.txt, 
 NUTCH-1832.Mattmann.090314.patch.txt


 Nutch used to work out of the box, without requiring an indexing backend. As 
 of 1.9, that's not the case anymore (it's possible even before that). Thanks 
 to [~markus17] for pointing out that this is due to the indexing-solr plugin 
 being enabled by default. We should disable it by default, so that the 
 regression is removed.





Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-04-01 Thread Mattmann, Chris A (388J)
Hi Kiran,

I would give comm...@nutch.apache.org. Please add ChrisMattmann
as a username.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Monday, April 1, 2013 6:52 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!!

Hi guys,


Do you know what is the destination for commit mails ? Can I give
'dev@nutch.apache.org' ?


I am planning on giving the below information so far for creating a moin
wiki [1] 


Wiki Name: Nutch
Usernames: LewisJohnMcgibbney, kiranchitturi, SebastianNagel, JulienNioche
Destination for Commit mails: dev@nutch.apache.org


Please let me know if any of the information is incorrect or needed any
modifications.


[1] - 
http://wiki.apache.org/general/OurWikiFarm#per_wiki_access_control_-_tighten_your_wiki_just_a_little.2C_benefit_just_a_lot




On Sat, Mar 30, 2013 at 4:29 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:

Hey Kiran,

I think here:

http://wiki.apache.org/general/OurWikiFarm#per_wiki_access_control_-_tighten_your_wiki_just_a_little.2C_benefit_just_a_lot


Cheers,
Chris







-Original Message-
From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org

Date: Saturday, March 30, 2013 12:55 PM
To: dev@nutch.apache.org dev@nutch.apache.org

Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!!


Does anyone know what details we need to provide for the new wiki
controls ?



I have posted a JIRA [0] to control our spam but the infrabot is asking
more information [1]

[0] - 
https://issues.apache.org/jira/browse/INFRA-6081
[1] -  http://www.apache.org/dev/infra-contact#what-we-need-to-know



On Thu, Mar 28, 2013 at 3:18 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:

Hi Kiran,

Yes, my recommendation:

1. Jump into #asfinfra on Freenode, find Joe, or Gavin or Daniel,
ask for help. If you don't have IRC, email
infrastruct...@apache.org
and/or file a
https://issues.apache.org/jira/browse/INFRA ticket

2. Request that they enable ASAP ContributorsGroup only acls

I know that many Apache wikis (MoinMoin) are being attacked...

Cheers,
Chris






-Original Message-
From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, March 28, 2013 12:15 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Fwd: Important : Bunch of Spam Created under Nutch Wiki!!

Thanks to Ken (check message below) for reporting our insecure wiki. I
have checked it and anyone can create a fake account and edit any of our
wiki pages or create new ones.


When I first registered to the wiki, all the pages are immutable and
Lewis had to add me to Contributors group to make changes to the wiki.


Probably, the setting was hacked for now and that is the reason we are
facing lot of spam.


Can we contact the infra@apache and request them to lock down the wiki as
the other groups did ?




-- Forwarded message --
From: Ken Krugler kkrugler_li...@transpac.com
Date: Thu, Mar 28, 2013 at 1:35 PM
Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!!
To: dev@nutch.apache.org

Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-04-01 Thread Mattmann, Chris A (388J)
Thanks Kiran!







-Original Message-
From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Monday, April 1, 2013 12:30 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!!

I have posted the information on the JIRA issue page [0]. Let's hope the
issue will be taken care of soon.




[0] - https://issues.apache.org/jira/browse/INFRA-6081



On Mon, Apr 1, 2013 at 3:27 PM, Lewis John Mcgibbney
lewis.mcgibb...@gmail.com wrote:

Hi Kiran,


On Mon, Apr 1, 2013 at 6:53 AM, dev-digest-h...@nutch.apache.org wrote:


Re: Important : Bunch of Spam Created under Nutch Wiki!!

22926 by: kiran chitturi





Hi guys,


Do you know what is the destination for commit mails ? Can I give
'dev@nutch.apache.org' ?






No, we should send commit emails to the static archive here
http://mail-archives.apache.org/mod_mbox/nutch-commits/
 



Thanks for sorting this out Kiran, we are truly getting hounded with spam
just now.

Best
Lewis











-- 
Kiran Chitturi


 http://www.linkedin.com/in/kiranchitturi











Re: Nutch Wiki

2013-03-30 Thread Mattmann, Chris A (388J)
Seconded!







-Original Message-
From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Saturday, March 30, 2013 3:07 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Nutch Wiki

@Kiran & Others who have been updating the wiki,

Great work on the command line options and elsewhere where you guys have
been cleaning up and writing better documentation for Nutch.

This is a crucial part of the workload and is greatly appreciated.

Have a great weekend.
Lewis

-- 
Lewis








Re: Important : Bunch of Spam Created under Nutch Wiki!!

2013-03-28 Thread Mattmann, Chris A (388J)
Hi Kiran, 

Yes, my recommendation:

1. Jump into #asfinfra on Freenode, find Joe, or Gavin or Daniel,
ask for help. If you don't have IRC, email infrastruct...@apache.org
and/or file a https://issues.apache.org/jira/browse/INFRA ticket

2. Request that they enable ASAP ContributorsGroup only acls

I know that many Apache wikis (MoinMoin) are being attacked...

Cheers,
Chris






-Original Message-
From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, March 28, 2013 12:15 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Fwd: Important : Bunch of Spam Created under Nutch Wiki!!

Thanks to Ken (check message below) for reporting our insecure wiki. I
have checked it and anyone can create a fake account and edit any of our
wiki pages or create new ones.


When I first registered to the wiki, all the pages are immutable and
Lewis had to add me to Contributors group to make changes to the wiki.


Probably, the setting was hacked for now and that is the reason we are
facing lot of spam.


Can we contact the infra@apache and request them to lock down the wiki as
the other groups did ?




-- Forwarded message --
From: Ken Krugler kkrugler_li...@transpac.com
Date: Thu, Mar 28, 2013 at 1:35 PM
Subject: Re: Important : Bunch of Spam Created under Nutch Wiki!!
To: dev@nutch.apache.org


Hi Kiran,

On Mar 28, 2013, at 2:03am, kiran chitturi wrote:


Thank you Ken for the information. I think the access is already
restricted to Contributors Only. Someone can please confirm, if it is
not. 





It's not, as far as I know. I just created a fake account, logged in with
it, and edited the front page.


If anyone needs to edit wiki, they would need to ask someone to get
access to wiki pages.


Do you know if Solr still got hit by spam after locking down the wiki ?






I think that change helped cut down most of the spam, but I don't monitor
the Solr list that closely, sorry.


-- Ken






On Thu, Mar 28, 2013 at 1:40 AM, Ken Krugler
kkrugler_li...@transpac.com wrote:



On Mar 27, 2013, at 6:54pm, kiran chitturi wrote:


Thank you Binoy for reporting.


We have been monitoring the pages and deleting them when we get time but
there are more coming up. Today, I have seen a spam editing on the home
page of Nutch wiki. It has inserted spam links under tutorials.


We need to find a permanent solution to this. I wonder if any other
list-servs are facing the same issue.






Yes - Solr recently had to lock down editing on their wiki:



The wiki at http://wiki.apache.org/solr/ has come under attack by
spammers more frequently of late, so the PMC has decided to lock it down
 in an attempt to reduce the work involved in tracking and removing spam.

From now on, only people who appear on
http://wiki.apache.org/solr/ContributorsGroup will be able to
create/modify/delete wiki pages.

Please request either on the solr-u...@lucene.apache.org or on
d...@lucene.apache.org to have your wiki username added to the
ContributorsGroup
 page - this is a one-time step.




So I think you need to make a request to Infra to lock down the wiki,
then add people (generally in response to explicit requests) to the
ContributorsGroup page.


-- Ken






On Thu, Mar 28, 2013 at 12:49 AM, Binoy d
binoy...@gmail.com wrote:

I am quite surprised looking at the notifications I am getting for new
pages for Nutch Wiki
Example :
http://wiki.apache.org/nutch/KarlPuent

I see at least 25-35 emails regarding such notification.

All of the links I got are  rooted under
http://wiki.apache.org/nutch/


Is some one looking into this , If needed I can gladly forward emails to
the person cleaning it up as I am not sure if every one has access to
delete the pages.

Regards,
b

-- Forwarded message --
From: Apache Wiki wikidi...@apache.org
Date: Wed, Mar 27, 2013 at 9:32 PM
Subject: [Nutch Wiki] Trivial Update of EdwinaBro by EdwinaBro
To: Apache Wiki wikidi...@apache.org


Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for
change notification.

The EdwinaBro page has been changed by EdwinaBro:
http://wiki.apache.org/nutch/EdwinaBro

New page:
I am 24 years old and my name is Edwina Brownlee. I life in Corjolens
(Switzerland).BR
BR
BR
Take a look at my web-site ... [[http://modform.org/SolomonKr|Continue
http://modform.org/SolomonKr%7CContinue]]









-- 
Kiran Chitturi


 

Re: [Nutch Wiki] Trivial Update of PGOSimone by PGOSimone

2013-03-25 Thread Mattmann, Chris A (388J)
Hey Julien,

I heard on #asfinfra that any of our MoinMoin wikis have been attacked recently 
by SPAM.

I think we may want to contact infra and ask for specific ContributorsGroup 
only Nutch wiki access.

http://wiki.apache.org/general/OurWikiFarm

Cheers,
Chris


From: Julien Nioche lists.digitalpeb...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Monday, March 25, 2013 1:55 AM
To: dev@nutch.apache.org
Subject: Re: [Nutch Wiki] Trivial Update of PGOSimone by PGOSimone

I thought we had to have a login / password to modify the Wiki. If so how come 
we got so much spam lately?

Julien

On 25 March 2013 04:26, Apache Wiki wikidi...@apache.org wrote:
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The PGOSimone page has been changed by PGOSimone:
http://wiki.apache.org/nutch/PGOSimone
[..snip..]


--
[http://digitalpebble.com/img/logo.gif]
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

2013-03-24 Thread Mattmann, Chris A (388J)
Hi Evert,

Thanks. Velocity would be fine, but the big issue is that I don't know
Velocity, and I know Wicket.

The great part about Wicket is that it's pure XHTML + Java code. No
config, no anything in-between.
So if you understand the component model of widgets behind the scenes, and
understand HTML, JS and CSS,
you can easily maintain a Wicket web app.

And as a Nutch PMC member there's one person here at least (me) who's
willing to maintain and steward
such a web app. So we're in business!

Cheers,
Chris


From:  Evert Wagenaar evert.wagen...@yahoo.com
Reply-To:  dev@nutch.apache.org dev@nutch.apache.org, Evert Wagenaar
evert.wagen...@mint.nl
Date:  Sunday, March 24, 2013 12:46 AM
To:  dev@nutch.apache.org dev@nutch.apache.org
Subject:  Re: GSOC 2013 project: Apache-Wicket based Nutch webapp


 I agree as well. The jsp version has become a
mess and is currently almost not supportable anymore. Would velocity be
a good alternative? It is very good with Solr Facets and also fits into
any CMS.

 


Evert Wagenaar
evert.wagen...@me.com
+31 653 606 293






From: kiran chitturi chitturikira...@gmail.com
To: dev@nutch.apache.org
Sent: Saturday, March 23, 2013 9:36 PM
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp


Thank you Chris for your interest.

I would love to share my thesis and the work but I am still in
experimenting stage and I will share with you soon once I have a decent UI
running with functionalities.

Regards,
Kiran.


On Sat, Mar 23, 2013 at 2:33 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:

That is so awesome Kiran.

Great job and I would love a link to your thesis (or even seeing the work
in progress) 
if you are willing to share and have the time.

Good plane reading material for me and congrats again. Looking forward to
working
with you.

Cheers,
Chris


From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org

Date: Saturday, March 23, 2013 9:54 AM

To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp




Thanks Chris!

I am planning to graduate with Masters degree in Computer Science from
Virginia Tech University and my advisor is Dr.Fox.

My thesis work mostly relates to building search engine for the 10TB
crises event data that we have collected over last three years. The data
is collected using Internet Archive crawler (Archive-it) and I am indexing
data using LucidWorks Big Data Software.
 The process also involves finding more metadata and clustering. All of
this work is related to 'Crisis, Tragedy and Recovery Network Project
(CTRnet)' (www.ctrnet.net)

My thesis, library work and Nutch are all closely related. It has been a
great learning experience so far :)




On Sat, Mar 23, 2013 at 12:23 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:

Hi Kiran,

Awesome that works fine for me! Happy to have you contribute, and whether
you are a formal mentor or not,
if we get a GSoC 2013 student for this you can help me, Lewis, (and
others) shepherd it in!

Thanks man and congrats on graduating soon! Where are you graduating from
and in what subject?

Cheers,
Chris

From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org

Date: Saturday, March 23, 2013 8:51 AM

To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp




I am very much interested in the Apache Wicket project but I wouldn't be
able to be a student since i am finishing my graduation and looking for
full-time jobs. I have discussed with Lewis previously about this, and it
wouldn't be ideal for me
 to be a GSoc 2013 student as I can't devote my full-time work on this.

However, I will be very happy to work on this in my free time. This is
something I am interested in for long time and I would try to contribute
in anyway possible.

Thank you,
Kiran.







On Sat, Mar 23, 2013 at 11:23 AM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:

Hi Kiran,

Great, yes the REST services need work for sure. They haven't been worked
on in a while.

I'm privy to Apache CXF, but I haven't done anything with it, and Andrzej
did an awesome job
using Restlet, so we've got Restlet for now.

If you are interested in documenting the services, then awesome! Do you
want to be a GSoC 2013 student,
and are you interested in this project?

Cheers,
Chris


From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Friday, March 22, 2013 9:19 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp


Hi Chris,

I was just thinking about that this evening. First, to start with this I
want to do well documentation of the Nutch REST API.

What is the status of Rest API ? Does it need any

Re: Google Summer of Code 2013 - Giraph implementation of Nutch LinkRank Algorithm

2013-03-24 Thread Mattmann, Chris A (388J)
Super +1 -- sounds awesome Lewis.

Cheers,
Chris


On 3/24/13 12:38 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

Hi All,

After some discussion and drumming up of interest within the Giraph
community, I've logged a Google Summer of Code issue [0] for this topic.
We are looking for interested students to come forward and participate in
the effort.
I logged this over in Giraph as there was no GSoC effort already going on
there, we already have an issue for the Wicket-based User Interface
implementation in Nutch.
I would be very happy if people (users and developers) could chime in on
the thread so we can get the project started with the right direction and
intention in mind.
I propose this for Nutch TRUNK.

Thanks for now

Best

Lewis

[0] https://issues.apache.org/jira/browse/GIRAPH-584

-- 
*Lewis*



Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

2013-03-24 Thread Mattmann, Chris A (388J)
Cool thanks!

From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Saturday, March 23, 2013 1:36 PM
To: dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

Thank you Chris for your interest.

I would love to share my thesis and the work but I am still in experimenting 
stage and I will share with you soon once I have a decent UI running with 
functionalities.

Regards,
Kiran.


On Sat, Mar 23, 2013 at 2:33 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
That is so awesome Kiran.

Great job and I would love a link to your thesis (or even seeing the work in 
progress)
if you are willing to share and have the time.

Good plane reading material for me and congrats again. Looking forward to 
working
with you.

Cheers,
Chris


From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Saturday, March 23, 2013 9:54 AM
To: dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

Thanks Chris!

I am planning to graduate with Masters degree in Computer Science from Virginia 
Tech University and my advisor is Dr.Fox.

My thesis work mostly relates to building search engine for the 10TB crises 
event data that we have collected over last three years. The data is collected 
using Internet Archive crawler (Archive-it) and I am indexing data using 
LucidWorks Big Data Software. The process also involves finding more metadata 
and clustering. All of this work is related to 'Crisis, Tragedy and Recovery 
Network Project (CTRnet)' (www.ctrnet.net)

My thesis, library work and Nutch are all closely related. It has been a great 
learning experience so far :)




On Sat, Mar 23, 2013 at 12:23 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hi Kiran,

Awesome that works fine for me! Happy to have you contribute, and whether you 
are a formal mentor or not,
if we get a GSoC 2013 student for this you can help me, Lewis, (and others) 
shepherd it in!

Thanks man and congrats on graduating soon! Where are you graduating from and 
in what subject?

Cheers,
Chris

From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Saturday, March 23, 2013 8:51 AM
To: dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

I am very much interested in the Apache Wicket project but I wouldn't be able 
to be a student since i am finishing my graduation and looking for full-time 
jobs. I have discussed with Lewis previously about this, and it wouldn't be 
ideal for me to be a GSoc 2013 student as I can't devote my full-time work on 
this.

However, I will be very happy to work on this in my free time. This is 
something I am interested in for long time and I would try to contribute in 
anyway possible.

Thank you,
Kiran.






On Sat, Mar 23, 2013 at 11:23 AM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hi Kiran,

Great, yes the REST services need work for sure. They haven't been worked on in 
a while.

I'm privy to Apache CXF, but I haven't done anything with it, and Andrzej did 
an awesome job
using Restlet, so we've got Restlet for now.

If you are interested in documenting the services, then awesome! Do you want to 
be a GSoC 2013 student,
and are you interested in this project?

Cheers,
Chris


From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Friday, March 22, 2013 9:19 PM
To: dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

Hi Chris,

I was just thinking about that this evening. First, to start with this I want 
to do well documentation of the Nutch REST API.

What is the status of Rest API ? Does it need any fixes and working examples ?

Hopefully my start would be helpful and it be soon.

Thanks for opening up the issue.

Regards,
kIran.






On Fri, Mar 22, 2013 at 11:43 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hey Guys,

I posted:

https://issues.apache.org/jira/browse/NUTCH-841


As a potential GSOC 2013 summer project

Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

2013-03-23 Thread Mattmann, Chris A (388J)
Hi Kiran,

Awesome that works fine for me! Happy to have you contribute, and whether you 
are a formal mentor or not,
if we get a GSoC 2013 student for this you can help me, Lewis, (and others) 
shepherd it in!

Thanks man and congrats on graduating soon! Where are you graduating from and 
in what subject?

Cheers,
Chris

From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Saturday, March 23, 2013 8:51 AM
To: dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

I am very much interested in the Apache Wicket project but I wouldn't be able 
to be a student since i am finishing my graduation and looking for full-time 
jobs. I have discussed with Lewis previously about this, and it wouldn't be 
ideal for me to be a GSoc 2013 student as I can't devote my full-time work on 
this.

However, I will be very happy to work on this in my free time. This is 
something I am interested in for long time and I would try to contribute in 
anyway possible.

Thank you,
Kiran.






On Sat, Mar 23, 2013 at 11:23 AM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hi Kiran,

Great, yes the REST services need work for sure. They haven't been worked on in 
a while.

I'm privy to Apache CXF, but I haven't done anything with it, and Andrzej did 
an awesome job
using Restlet, so we've got Restlet for now.

If you are interested in documenting the services, then awesome! Do you want to 
be a GSoC 2013 student,
and are you interested in this project?

Cheers,
Chris


From: kiran chitturi chitturikira...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Friday, March 22, 2013 9:19 PM
To: dev@nutch.apache.org
Subject: Re: GSOC 2013 project: Apache-Wicket based Nutch webapp

Hi Chris,

I was just thinking about that this evening. First, to start with this I want 
to do well documentation of the Nutch REST API.

What is the status of Rest API ? Does it need any fixes and working examples ?

Hopefully my start would be helpful and it be soon.

Thanks for opening up the issue.

Regards,
kIran.






On Fri, Mar 22, 2013 at 11:43 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hey Guys,

I posted:

https://issues.apache.org/jira/browse/NUTCH-841


As a potential GSOC 2013 summer project. I'm willing to mentor it, since I
love
Wicket, and I'm willing to maintain the result as a Nutch committer.

If NUTCH-841 doesn't get selected, I'll start implementing it this summer
if no
one beats me to it.

Cheers,
Chris




--
Kiran Chitturi

http://www.linkedin.com/in/kiranchitturi





--
Kiran Chitturi

http://www.linkedin.com/in/kiranchitturi




GSOC 2013 project: Apache-Wicket based Nutch webapp

2013-03-22 Thread Mattmann, Chris A (388J)
Hey Guys,

I posted:

https://issues.apache.org/jira/browse/NUTCH-841


As a potential GSOC 2013 summer project. I'm willing to mentor it, since I
love
Wicket, and I'm willing to maintain the result as a Nutch committer.

If NUTCH-841 doesn't get selected, I'll start implementing it this summer
if no
one beats me to it.

Cheers,
Chris



FW: GSoC 2013

2013-03-18 Thread Mattmann, Chris A (388J)
[Apologies for cross post]

Guys, to play in the GSoC 2013 spec, we just need to tag issues in JIRA
with the gsoc2013 tag.

I'll try and come up with a few projects soon :)

Cheers,
Chris


On 3/15/13 11:15 AM, Luciano Resende luckbr1...@gmail.com wrote:

On Fri, Mar 15, 2013 at 11:01 AM, Manish Agrawal text2man...@gmail.com
wrote:
 Hi

 I am Manish Agrawal, a 3rd year student of Mathematics and computing
 department from IIT Delhi.

 I want to participate in GSoC 2013 through one of the ASF projects. I
would
 be really thankful if you could please suggest me how should I proceed
for
 the same.

 Hoping for a reply.

 Thanks
 Manish Agrawal

Google is sponsoring GSoC 2013, and Apache Software Foundation is
planning to participate again.
More information about Apache Participation in GSoC is available at :
http://community.apache.org/gsoc.html.

The proper way to find a project idea would be to identify an Apache
Project in the area of your interest and start discussions with them
via the project mailing list.

The projects are starting to create their project ideas, and you can
start browsing them at
https://issues.apache.org/jira/secure/IssueNavigator!executeAdvanced.jspa?jqlQuery=labels+=+gsoc2013&runQuery=true&clear=true


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/



Re: [ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer

2013-03-10 Thread Mattmann, Chris A (388J)
This is great to hear Kiran, welcome to the team!

Cheers,
Chris


From: Julien Nioche lists.digitalpeb...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Sunday, March 10, 2013 2:15 PM
To: dev@nutch.apache.org
Subject: Re: [ANNOUNCEMENT] Welcome Kiran Chitturi as Apache Nutch PMC and Committer

Great to hear about your use of Nutch at your library and welcome on board 
Kiran!

Julien

On 10 March 2013 01:27, kiran chitturi chitturikira...@gmail.com wrote:
Thanks a lot guys for inviting me and for the wishes.

I am a graduate student in Virginia Tech University doing my Masters in 
Computer Science. I have been using Apache Nutch for the last one year as part 
of my assistantship with our University Library.

The Digital Libraries and Archives division of our libraries was using Google 
Mini Search Engine for their website that hosts 600k files but Google Mini was 
no longer supported and we want to try building Search Engine using Open Source 
technologies.

That is when I started my journey with Nutch and we were able to successfully 
achieve our Goals using Nutch and Solr. The library was pleased with the 
project and they are more interested now to work with Open Source software 
whenever possible.

I liked working with Nutch community and it has been a great learning 
experience for me. I would like to learn and contribute back even after my 
graduation.

Few things that I have in my mind right now other than committing patches are 
to improve our documentation (Wiki), helping users to my best and also to start 
the Apache Wicket UI work soon for 2.x in Nutch.

Regards,
Kiran.




On Sat, Mar 9, 2013 at 4:06 PM, Tejas Patil tejas.patil...@gmail.com wrote:
Welcome aboard Kiran :)


On Sat, Mar 9, 2013 at 12:56 PM, lewis john mcgibbney lewi...@apache.org wrote:
Hi All,

Over the last while we have been aware of Kiran's ongoing contribution to the 
Nutch community.
It is with great pleasure that we invite Kiran to join the Nutch PMC and also 
take up Committer role.
@Kiran, please feel free to say a bit about yourself and introduce what brought 
you to Apache Nutch.
Have a great weekend.
Best
Lewis




--
Kiran Chitturi



--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


FW: [OPENING] Google Summer of Code Applications

2013-03-10 Thread Mattmann, Chris A (388J)
FYI

On 3/10/13 5:10 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

I just told a huge lie.
I got my dates mixed up...
Students have between April 22nd and May 3rd to get proposals in.
Sorry about the mix up.

Lewis

On Sun, Mar 10, 2013 at 5:09 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi All,

 We have from the 18th until the 29th to submit this year's GSoC
 proposals[0].

 Just a gentle reminder for any potential guys wanting to formally
apply...

 The idea would be to sort out any discrepancies just now and to develop
 your proposal to a comprehensive standard.

 I am interested in mentoring another project this year, so can work with
 folks who wish to progress with proposals.

 Thanks

 Lewis

 [0] http://www.google-melange.com/gsoc/events/google/gsoc2013

 --
 *Lewis*




-- 
*Lewis*



Re: Review board giving issue

2013-03-07 Thread Mattmann, Chris A (388J)
Hi Tejas,

Yeah I was having some issue at the time, but will try and see if it is working 
tomorrow. If it's still not working we can contact infra@

Cheers,
Chris


From: Tejas Patil tejas.patil...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Tuesday, March 5, 2013 9:07 PM
To: dev@nutch.apache.org
Subject: Review board giving issue

Hi all,

I am trying to use review board to upload a patch for a Jira and it is giving 
me same issue as I had before [0]. Below are the steps that I follow:
1. Generate a patch file using svn diff command.
2. On review board page, I select repository as Nutch
3. Repository as https://svn.apache.org/repos/asf/nutch/trunk (the patch is 
for 1.x)
4. Attach the diff file.

I used to follow the same steps at work and it worked out well.
But over here I get this error message:
The file 
'https://svn.apache.org/repos/asf/nutch/trunk/src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java'
 (r1453161) could not be found in the repository

There was a review request in the nutch group last month [1], after the thread 
[0]. So I have a feeling that there is something weird with my account or I am 
doing something wrong. Can anyone help me here?

[0] : 
http://mail-archives.apache.org/mod_mbox/nutch-dev/201301.mbox/%3cfa2d97dfc830824e9040174e7f89744925085...@ap-embx-sp40.res.ad.jpl%3E

[1] : https://reviews.apache.org/r/9119/


Re: [DISCUSS] Google Summer of Code

2013-03-04 Thread Mattmann, Chris A (388J)
Hey Lewis,

Great job starting this thread. +1 Giraph is welcome here. Multi-project GSoCs 
always do well.

One thing I had in mind was taking an implementation of Hubs and Authorities 
developed for
Nutch 1.3 a few years back in my USC class and then having someone integrate it 
into the
current Nutch 1.x branch to start.
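
For readers new to Hubs and Authorities (HITS), the idea can be sketched in a few lines. This is a minimal textbook-style iteration, not the class implementation mentioned above; the function name `hits`, the dict-of-lists link format, and the fixed iteration count are all assumptions made for illustration.

```python
from math import sqrt


def hits(links, iterations=20):
    """Return (hub, authority) scores for a small link graph.

    links maps each page to the list of pages it links to (hypothetical format).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

With `links = {"a": ["c"], "b": ["c"], "c": []}`, page `c` ends up with the highest authority score and `a` and `b` share the highest hub score. A Nutch integration would presumably run the same two update steps over the link database rather than an in-memory dict.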

If folks are interested I can create a JIRA.

Cheers,
Chris


From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Monday, March 4, 2013 12:23 PM
To: dev@nutch.apache.org
Subject: [DISCUSS] Google Summer of Code

Hi All,

I thought I would ask the question as to who (if anyone) is intending on 
engaging as a mentor (or student if you are one) within this year's GSoC project.
There are plenty of projects we could do within Nutch.
Obvious ones that come to mind are
- Wicket webapp for Nutch 2.x
- Integration of Giraph with Nutch
We already have one proposal which I would consider mentoring over on Apache 
Gora, but I will certainly not back down from any proposals in Nutch.
Would the Giraph project be welcomed here? If so I can head over to user@ 
Giraph in an attempt to attract interest.
Of course this is a discussion based on what folks want to do and the list 
above should be added to.
Thanks for now
Lewis

--
Lewis


Re: [DISCUSS] Google Summer of Code

2013-03-04 Thread Mattmann, Chris A (388J)
Hey Markus,

Yep, my student implemented HITS (on-the-fly) ranking and classification (I
think).

It's been sitting on my HD for 2 years :(

So if someone can pick it up it would be a nice GSoC project.

Glad to hear there is interest.

Cheers,
Chris

On 3/4/13 1:21 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Chris!

Do you mean automatic classification of hub and authority pages? If so,
we're more than interested in that. This is still an issue for our site
search platform and one that we haven't given much more attention than some
research and prototypes.

Cheers

 
 
-Original message-
 From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
 Sent: Mon 04-Mar-2013 22:02
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Google Summer of Code
 
 Hey Lewis,
 
 Great job starting this thread. +1 Giraph is welcome here.
Multi-project GSoCs always do well.
 
 One thing I had in mind was taking an implementation of Hubs and
Authorities developed for
 Nutch 1.3 a few years back in my USC class and then having someone
integrate it into the
 current Nutch 1.x branch to start.
 
 If folks are interested I can create a JIRA.
 
 Cheers,
 Chris
 
 
 From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Reply-To: dev@nutch.apache.org
 Date: Monday, March 4, 2013 12:23 PM
 To: dev@nutch.apache.org
 Subject: [DISCUSS] Google Summer of Code
 
 Hi All,
 
 I thought I would ask the question as to who (if anyone) is intending
 on engaging as a mentor (or student if you are one) within this year's
 GSoC project.
 There are plenty of projects we could do within Nutch.
 Obvious ones that come to mind are
 - Wicket webapp for Nutch 2.x
 - Integration of Giraph with Nutch
 We already have one proposal which I would consider mentoring over on
Apache Gora, but I will certainly not back down from any proposals in
Nutch.
 Would the Giraph project be welcomed here? If so I can head over to
user@ Giraph in an attempt to attract interest.
 Of course this is a discussion based on what folks want to do and the
list above should be added to.
 Thanks for now
 Lewis
 
 -- 
 Lewis
 



Re: [DISCUSS] Google Summer of Code

2013-03-04 Thread Mattmann, Chris A (388J)
Hey Markus:

https://issues.apache.org/jira/browse/NUTCH-1539


Will submit the code soon.

Cheers,
Chris

On 3/4/13 1:43 PM, Markus Jelsma markus.jel...@openindex.io wrote:

Ah yes! Please open an issue and, if you can, attach anything that matters,
such as a description of the algorithm, how it should work with
Nutch/MapReduce, or even code/tests.

If there's code, I may be able to patch it up for trunk rather quickly and
see how it performs.

Cheers,
Markus

 
 
-Original message-
 From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
 Sent: Mon 04-Mar-2013 22:27
 To: dev@nutch.apache.org
 Subject: Re: [DISCUSS] Google Summer of Code
 
 Hey Markus,
 
 Yep, my student implemented HITS (on-the-fly) ranking and classification
 (I think).
 
 It's been sitting on my HD for 2 years :(
 
 So if someone can pick it up it would be a nice GSoC project.
 
 Glad to hear there is interest.
 
 Cheers,
 Chris
 
 On 3/4/13 1:21 PM, Markus Jelsma markus.jel...@openindex.io wrote:
 
 Chris!
 
 Do you mean automatic classification of hub and authority pages? If so,
 we're more than interested in that. This is still an issue for our site
 search platform and one that we haven't given much more attention than
 some research and prototypes.
 
 Cheers
 
  
  
 -Original message-
  From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
  Sent: Mon 04-Mar-2013 22:02
  To: dev@nutch.apache.org
  Subject: Re: [DISCUSS] Google Summer of Code
  
  Hey Lewis,
  
  Great job starting this thread. +1 Giraph is welcome here.
 Multi-project GSoCs always do well.
  
  One thing I had in mind was taking an implementation of Hubs and
 Authorities developed for
  Nutch 1.3 a few years back in my USC class and then having someone
 integrate it into the
  current Nutch 1.x branch to start.
  
  If folks are interested I can create a JIRA.
  
  Cheers,
  Chris
  
  
  From: Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Reply-To: dev@nutch.apache.org
  Date: Monday, March 4, 2013 12:23 PM
  To: dev@nutch.apache.org
  Subject: [DISCUSS] Google Summer of Code
  
  Hi All,
  
  I thought I would ask the question as to who (if anyone) is intending
  on engaging as a mentor (or student if you are one) within this year's
  GSoC project.
  There are plenty of projects we could do within Nutch.
  Obvious ones that come to mind are
  - Wicket webapp for Nutch 2.x
  - Integration of Giraph with Nutch
  We already have one proposal which I would consider mentoring over on
 Apache Gora, but I will certainly not back down from any proposals in
 Nutch.
  Would the Giraph project be welcomed here? If so I can head over to
 user@ Giraph in an attempt to attract interest.
  Of course this is a discussion based on what folks want to do and the
 list above should be added to.
  Thanks for now
  Lewis
  
  -- 
  Lewis
  
 
 



Re: Nutch JAVA Application

2013-02-12 Thread Mattmann, Chris A (388J)
Hi Shann,

Thank you for reaching out! If your goal is to get your project integrated
into Apache Nutch proper, then I would recommend simply:

0. File some JIRA issues in Apache Nutch
(http://issues.apache.org/jira/browse/NUTCH). Small incremental patches and
issues are preferred, and this will let people know what your plan is so
you can get committers' and PMC members' attention.

1. svn co http://svn.apache.org/repos/asf/nutch/branches/2.x/
2. cd 2.x
3. Edit files
4. svn status (make sure the files you edited look correct)
5. svn diff > NUTCH-xxx.sleduc.yyMMdd.patch.txt for each issue you created
6. Attach patches from #5 to issues from #0
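
The patch-naming convention in the svn diff step above can be sketched as a small helper. This is only an illustration: `patch_filename` and `write_patch` are hypothetical names, the issue number in the usage note is a placeholder, and `write_patch` assumes the `svn` command-line client is installed and that it is pointed at a working copy.

```python
import subprocess
from datetime import date


def patch_filename(issue: int, user: str, day: date) -> str:
    """Build NUTCH-<issue>.<user>.<yyMMdd>.patch.txt for the given day."""
    return f"NUTCH-{issue}.{user}.{day:%y%m%d}.patch.txt"


def write_patch(issue: int, user: str, checkout_dir: str = ".") -> str:
    """Run `svn diff` inside checkout_dir and save the output as a patch file.

    The patch file is written to the current working directory.
    """
    name = patch_filename(issue, user, date.today())
    with open(name, "wb") as out:
        subprocess.run(["svn", "diff"], cwd=checkout_dir, stdout=out, check=True)
    return name
```

For example, `patch_filename(1234, "sleduc", date(2013, 2, 12))` gives `NUTCH-1234.sleduc.130212.patch.txt`, which can then be attached to the matching JIRA issue.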

Otherwise, if you go off and work on GitHub, it's going to be harder to
get your patch accepted, since it will represent a large change, when
instead you can effect the change here at the ASF incrementally, making
sure your code gets in.

ALv2 is the license to use, BTW, either way you decide.

Cheers,
Chris


On 2/12/13 12:25 PM, Shann stanislas.le...@mailoo.org wrote:

Hi,
As part of my internship, we must develop a specialized search engine using
Nutch, Solr, HBase, and Tika.

I began to develop a Java application for crawling with Nutch branch 2.x.

The functions inject, generate, fetch, parse, updatedb, and solrindex are
based on actually executing nutch via a shell command from the Java application.

As an advocate of free software, I propose therefore to give you access to
my git project.
 
Using nutch in the background, under what license should I put my
application ?



--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-JAVA-Application-tp4040050.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.



FW: [GSoC Mentors] Google Summer of Code 2013

2013-02-11 Thread Mattmann, Chris A (388J)
[Sorry for cross posting]

Guys,

FYI please note that you can participate as a mentor from a PMC via Apache as 
they are a GSoC org. ComDev will coordinate our participation but start 
thinking about what projects we may want to do.

Cheers,
Chris

From: Carol Smith car...@google.com
Date: Monday, February 11, 2013 11:02 AM
To: Google Summer of Code Mentors List google-summer-of-code-mentors-l...@googlegroups.com
Subject: [GSoC Mentors] Google Summer of Code 2013

Hi GSoC mentors and org admins,

We've announced that we're doing Google Summer of Code 2013 [1]. Yay!

If you would like to help spread the word about GSoC, we have presentations 
[2], logos [3], and flyers [4] for you to use. Please host meetups, tell your 
friends and colleagues about the program, go to conferences, talk to people 
about the program, and just generally do all the awesome word-of-mouth stuff 
you do every year to promote the program.

The GSoC calendar, FAQ, and events timeline have all been updated with this 
year's important dates, so please refer to those for the milestones for this 
year's program. NB: the normal timeline for the program has been modified for 
this year. You'll probably want to examine the dates closely to make sure you 
know when important things are happening.

Please consider translating the presentations and/or flyers into your native 
language and submitting them directly to me to post on the wiki. Localization 
for our material is integral to reaching the widest possible audience around 
the world. If you decide to translate a flyer, please fill out our form to 
request a thank you gift for your effort. [5]

If you decide to host a meetup, please email me to let me know the date, time, 
and location so I can put it on the GSoC calendar. Also, remember to take 
pictures at your meetup and write up a blog post for our blog using our 
provided template for formatting [6]. If you need promotional items for your 
attendees, please fill out our form [7] to request some; we're happy to send 
some along. We can provide up to about 25 pens, notebooks, or stickers and/or a 
few t-shirts. Please keep in mind, though, that shipping restrictions and 
timeline vary country-to-country; request items early to make sure they get 
there on time! If you have questions about hosting meetups, please see the 
section in our FAQ [8].

Please consider applying to participate as an organization again this year or 
maybe joining as a mentor for your favorite organization if they are selected 
this year.

We rely on you for your help for the success of this program, so thank you in 
advance for all the work you do!

[1] - 
http://google-opensource.blogspot.com/2013/02/flip-bits-not-burgers-google-summer-of.html
[2] - http://code.google.com/p/google-summer-of-code/wiki/ProgramPresentations
[3] - http://code.google.com/p/google-summer-of-code/wiki/GsocLogos
[4] - http://code.google.com/p/google-summer-of-code/wiki/GsocFlyers
[5] - http://goo.gl/gEHDO
[6] - http://goo.gl/wbZrt
[7] - http://goo.gl/0BsR8
[8] - http://goo.gl/2NGfp

Cheers,
Carol





Re: [DISCUSS] Nutch Policy/Opinion on Review Board

2013-01-31 Thread Mattmann, Chris A (388J)
I love it and will use it, but I don't think it needs to be a policy. To each 
their own :)

Thanks buddy

Sent from my iPhone

On Jan 31, 2013, at 3:58 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com 
wrote:

 Hi All,
 
 I thought I would create this thread as the Review Board platform has
 been floating around now for a bit and I wonder if we can leverage it
 to improve/streamline the efficiency of Nutch community contributions.
 
 So I thought I'd leave this thread nice and short.
 
 1) I am new to Review Board. I don't know much about it. I haven't
 used it before.
 2) I am interested to see if we can make contributions and
 particularly reviewing a more open and transparent process.
 3) I want to hear what you guys think.
 
 Some links which may be of interest [0][1][2]
 
 Ta
 Lewis
 
 [0] https://blogs.apache.org/infra/entry/reviewboard_instance_running_at_the
 [1] https://reviews.apache.org
 [2] http://www.reviewboard.org/
 
 -- 
 Lewis


Re: review board

2013-01-26 Thread Mattmann, Chris A (388J)
Hey Tejas,

Yeah I think this has to do with something in the repo URL on the RB server 
side. I would file an INFRA ticket, or jump on #asfinfra on IRC and ask one of 
the guys for help there.

Cheers,
Chris

From: Tejas Patil tejas.patil...@gmail.com
Reply-To: dev@nutch.apache.org
Date: Friday, January 25, 2013 10:28 PM
To: dev@nutch.apache.org
Subject: review board

Hi,

Has anyone recently faced an issue with Review Board while uploading a patch ?
I created a patch for a change and tried to upload it via web UI of review 
board. It says:
The file 
'https://svn.apache.org/repos/asf/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java'
 (r1438860) could not be found in the repository

Quite similar to the description given in [0]. HttpBase.java exists at the link 
given. My patch involves few changes to it.

I think what I did is right, but still want to confirm. I generated the patch 
file using svn diff command. I am using svn, version 1.7.5. The patch was for 
nutch trunk. For uploading, I obtained the base directory from svn info 
command.

Meanwhile I am googling for this issue, it would be great if someone can point 
out the problem here.

[0] : https://issues.apache.org/jira/browse/INFRA-5046

Thanks,
Tejas Patil



Re: 1.8 in Jira

2012-12-21 Thread Mattmann, Chris A (388J)
woot yep ;)

On 12/21/12 2:55 AM, Markus Jelsma markus.jel...@openindex.io wrote:

forget it, i meant 1.7 but it's there already!
 
-Original message-
 From:Markus Jelsma markus.jel...@openindex.io
 Sent: Fri 21-Dec-2012 11:54
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: 1.8 in Jira
 
 Anyone here with rights to add 1.8 to Jira?
 Thanks
 



Re: [VOTE] Apache Nutch 1.6 Release Candidate

2012-11-29 Thread Mattmann, Chris A (388J)
Thanks guys.

I should review this today.

Cheers,
Chris

On Nov 29, 2012, at 5:31 AM, Lewis John Mcgibbney wrote:

 Hi,
 
 On Wed, Nov 28, 2012 at 10:11 AM, Julien Nioche
 lists.digitalpeb...@gmail.com wrote:
 
   - CHANGES.txt contains dates in both MM/DD/ and DD/MM/ formats.
   Shall we write the month in text form e.g. 7th July 2012 from now on?
 
 Done
 
   - Don't we need to have signatures as part of the RC?
 
 
 Done, thanks for the attention to detail Julien.
 
 Best
 
 Lewis



Re: Strategy for Assigning Issues by Version

2012-11-29 Thread Mattmann, Chris A (388J)
Hey Lewis,

On Nov 29, 2012, at 5:54 AM, Lewis John Mcgibbney wrote:

 Hi All,
 
 Right now I found myself facing a bit of a dilemma w.r.t bumping on
 the issues for the next Nutch release.
 
 Currently due to legacy workflows, we have some 120 issues assigned
 for 1.6... however ALL issues have been addressed for 1.6 meaning that
 the 120 issues are for  1.6 however not necessarily for 1.7.

I would just set them for 1.7. I just use N+1 as the next release whether or 
not we actually plan to solve them for 1.7. Then when 1.7 comes along you 
can bump those 1.7s that we didn't get to, to 1.8, etc.

 
 A suggestion from myself, can I mark these issues as no fix version?
 This means that we can carve/manufacture the next development drive to
 what developers want to fix and to what features requests we receive
 from the community rather than sitting with a constant pile of issues
 which are always for the next development drive.

Marking them as no fix version destroys pretty important reporting that I like
to use which is pulling up a list of all the upcoming issues of relevance set
for the next release. Without setting a Fix version you have to use the other
JIRA search tools to search by things other than next version.

 
 Additionally, may I suggest (and please shoot me down here if I sound
 cheeky) that we make it a priority in the next development drive, to
 harness the issues which are marked as patch submitted? It seems to be
 a waste for such issues to be stagnating. I am conscious that this
 comment may sound wide of me, this is not the intention, I do think
 however that it would be nice to work our way towards Nutch releases
 in a more strategic manner than we have been doing. Hopefully this
 proposal is a step in the right direction.

+50. That was one of my keys to success when I had more time. I would look
for issues sitting with patches and just commit them. If I can wrangle some
Nutch time over Christmas, I'll do a bunch of this as well. :)

 
 Thanks for any feedback. The issue at the top I suppose is the most
 important one in the short term.

Cheers my friend.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Strategy for Assigning Issues by Version

2012-11-29 Thread Mattmann, Chris A (388J)
+50 :)

On Nov 29, 2012, at 8:32 AM, Lewis John Mcgibbney wrote:

 So in summary,
 
 We retain the legacy behavior and bump them ALL to 1.7
 
 In the 1.7 development drive (if and when we can) we make an effort to act on 
 patched issues in an attempt to pick the low hanging fruit so to speak... if 
 such a thing exists.
 
 best
 
 Lewis
 
 On Thu, Nov 29, 2012 at 3:56 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 Good idea! I suspect that most of them will be dating from a looong time ago 
 and it won't be such a straightforward task to apply them, however this would 
 be a good way of sorting them
 
 
 
 Additionally, may I suggest (and please shoot me down here if I sound
 cheeky) that we make it a priority in the next development drive, to
 harness the issues which are marked as patch submitted? It seems to be
 a waste for such issues to be stagnating. I am conscious that this
 comment may sound wide of me, this is not the intention, I do think
 however that it would be nice to work our way towards Nutch releases
 in a more strategic manner than we have been doing. Hopefully this
 proposal is a step in the right direction.
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 
 -- 
 Lewis 
 



Re: [DISCUSS] trunk release?

2012-11-22 Thread Mattmann, Chris A (388J)
Release early, release often :)

I'd say I'd be happy to try and spin it, but you'd beat me to it, so I'll just 
say I'll be happy to test the RC and voice my VOTE when you roll
it, Lewis :)

Happy Thanksgiving (even though you're not in the States yet)!

Cheers,
Chris

On Nov 22, 2012, at 7:15 AM, Lewis John Mcgibbney wrote:

 Hi All,
 
 A while ago I asked if it was time to get another release of trunk...
 Markus expressed the valid opinion that there were some issues with
 recently committed material that had maybe not been given the chance
 to mature enough and that could do with more testing.
 
 So far in trunk (since 1.5.1), we've resolved some 45 issues [0], but
 we have some critical issues open [1] which could do with some
 attention as well.
 None of these issues are mine therefore I don't know how those of us
 feel (with patches available) about integrating these issues
 prior/post 1.6 release... or indeed whether a 1.6 release is welcomed
 at the moment? The codebase seems to be stable and getting better so
 from my perspective I would back a 1.X release.
 
 All the best for now
 
 Lewis
 
 [0] http://tinyurl.com/cf3vcpr
 [1] http://tinyurl.com/d4omnrc
 
 -- 
 Lewis



Re: [ANNOUNCE] Apache Nutch 2.1 Released

2012-10-05 Thread Mattmann, Chris A (388J)
Great job everyone!

Cheers,
Chris

On Oct 5, 2012, at 9:29 AM, Julien Nioche wrote:

 Thanks Lewis and well done everyone!
 Enjoy your week end
 
 Julien
 
 On 5 October 2012 16:12, lewis john mcgibbney lewi...@apache.org wrote:
 Good Afternoon Everyone,
 
 The Apache Nutch PMC are very pleased to announce the release of
 Apache Nutch v2.1. This release continues to provide Nutch users with
 a simplified Nutch distribution building on the 2.x development drive
 which is growing in popularity amongst the community. As well as
 addressing ~20 bugs this release also offers improved properties for
 better Solr configuration, upgrades to various Gora dependencies and
 the introduction of the option to build indexes in Elasticsearch,
 amongst various others.
 
 A full PMC Announcement can be seen here [0]
 
 Thank you, and have a great weekend on behalf of the Nutch community.
 
 Lewis
 
 [0] http://nutch.apache.org/#05+October+2012+-+Apache+Nutch+v2.1+Released
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available

2012-10-04 Thread Mattmann, Chris A (388J)
Thanks for your VOTE!

Cheers,
Chris

On Oct 4, 2012, at 1:08 AM, j.sulli...@thomsonreuters.com
 j.sulli...@thomsonreuters.com wrote:

 A bit late but my two cents. I have done a couple of installs on Ubuntu 12.04 
 using MySQL for the backend and have noticed a couple of the improvements and 
 no regressions so +1 for releasing from my end.
 
 -Original Message-
 From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com] 
 Sent: Monday, October 01, 2012 9:18 PM
 To: dev@nutch.apache.org; u...@nutch.apache.org
 Subject: [PING] [VOTE] Apache Nutch 2.1 Release Candidate Available
 
 Hi All,
 
 Anyone else for this VOTE?
 
 Sorry to be a pest!
 
 Thanks
 
 Lewis
 
 On Fri, Sep 21, 2012 at 4:07 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:
 Hi Everyone,
 
 A candidate for Apache Nutch 2.1 is available at:
 
 http://people.apache.org/~lewismc/apache-nutch-2.1
 
 The release candidate is a src.zip and src.tar.gz ONLY archive of the 
 sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-2.1/
 
 We release Nutch 2.1 in this fashion due to the inclusion of Apache 
 Gora and the likelihood that users will regularly recompile the code 
 to suit dynamic requirements.
 
 Further, a staged Maven repository of the 2.1 jar, sources.jar and 
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-020/
 
 Please vote on releasing this package as Apache Nutch 2.1.
 The vote is open for the next 72 hours and passes if a majority of at 
 least three +1 Nutch PMC votes are cast.
 
 [ ] +1 Release this package as Apache Nutch 2.1  [ ] -1 Do not 
 release this package because...
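
The pass condition stated in the vote call above can be sketched as a small check. This is an illustration only: the function name and arguments are made up here, and reading "majority" as more +1 than -1 binding votes is an assumption about the intent of the call.

```python
def release_vote_passes(binding_plus: int, binding_minus: int,
                        minimum_plus: int = 3) -> bool:
    """Return True if a release vote passes under the assumed rule.

    Assumed rule: at least `minimum_plus` binding (PMC) +1 votes,
    and more binding +1 votes than binding -1 votes.
    """
    return binding_plus >= minimum_plus and binding_plus > binding_minus
```

Under this reading, three PMC +1 votes and no -1 votes pass, while two +1 votes do not, however many non-binding votes are cast.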
 
 Many thanks, and here's to plenty more.
 
 Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 --
 Lewis
 
 
 
 --
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Status of 2.1 release

2012-09-21 Thread Mattmann, Chris A (388J)
Take care dude! I'll give trunk a shot...

Cheers,
Chris

On Sep 21, 2012, at 7:34 AM, Lewis John Mcgibbney wrote:

 Hi All,
 
 Basically thank god it was brought to our attention that
 gora-cassandra 0.2.1 is buggy and needs some work before it is ready
 to be integrated into a stable Nutch 2.x release.
 
 For the time being I've committed a revert for gora-cassandra v0.2 to
 the 2.1 branch and to 2.x branch (the latter of which can continue
 development regardless).
 
 I'll run the RC for 2.1 just now.
 
 @Markus,
 How are your thoughts on trunk?
 
 @Chris,
 
 Depending on outcome of discussion on trunk, do you want to spin an RC?
 
 Have a great weekend everyone.
 
 Lewis
 
 -- 
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: svn commit: r1387363 - in /nutch/branches/2.1: CHANGES.txt build.xml pom.xml

2012-09-18 Thread Mattmann, Chris A (388J)
Lewis you beat me to it, you ROCK!

Cheers,
Chris

On Sep 18, 2012, at 5:11 PM, lewi...@apache.org wrote:

 Author: lewismc
 Date: Tue Sep 18 21:11:06 2012
 New Revision: 1387363
 
 URL: http://svn.apache.org/viewvc?rev=1387363&view=rev
 Log:
 forward port of NUTCH-1415
 
 Modified:
nutch/branches/2.1/CHANGES.txt
nutch/branches/2.1/build.xml
nutch/branches/2.1/pom.xml
 
 Modified: nutch/branches/2.1/CHANGES.txt
 URL: 
 http://svn.apache.org/viewvc/nutch/branches/2.1/CHANGES.txt?rev=1387363&r1=1387362&r2=1387363&view=diff
 ==
 --- nutch/branches/2.1/CHANGES.txt (original)
 +++ nutch/branches/2.1/CHANGES.txt Tue Sep 18 21:11:06 2012
 @@ -3,6 +3,8 @@ Nutch Change Log
 Release 2.1 (19/09/2012) ddmm
 Full Jira Report - 
 https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12321040
 
 +* NUTCH-1415 release packages to contain top level folder apache-nutch-x.x 
 (snagel)
 +
 * NUTCH-1432 property storage.schema does not work anymore, should be 
 storage.schema.webpage and storage.schema.host (lewismc)
 
 * NUTCH-1468 Redirects that are external links not adhering to 
 db.ignore.external.links (Matt MacDonald via ferdy)
 
 Modified: nutch/branches/2.1/build.xml
 URL: 
 http://svn.apache.org/viewvc/nutch/branches/2.1/build.xml?rev=1387363&r1=1387362&r2=1387363&view=diff
 ==
 --- nutch/branches/2.1/build.xml (original)
 +++ nutch/branches/2.1/build.xml Tue Sep 18 21:11:06 2012
 @@ -700,14 +700,13 @@
    <!-- == -->
    <target name="tar-src" depends="package-src" description="--> generate src.tar.gz distribution package">
      <tar compression="gzip" longfile="gnu"
 -      destfile="${src.dist.version.dir}.tar.gz" basedir="${src.dist.version.dir}">
 -      <tarfileset dir="${dist.dir}" mode="664">
 -        <exclude name="${src.dist.version.dir}/bin/*" />
 -        <exclude name="${src.dist.version.dir}/runtime/*" />
 -        <include name="${src.dist.version.dir}/**" />
 +      destfile="${src.dist.version.dir}.tar.gz">
 +      <tarfileset dir="${src.dist.version.dir}" mode="664" prefix="${final.name}">
 +        <exclude name="src/bin/*" />
 +        <include name="**" />
        </tarfileset>
 -      <tarfileset dir="${dist.dir}" mode="755">
 -        <include name="${src.dist.version.dir}/bin/*" />
 +      <tarfileset dir="${src.dist.version.dir}" mode="755" prefix="${final.name}">
 +        <include name="src/bin/*" />
        </tarfileset>
      </tar>
    </target>
 @@ -717,13 +716,13 @@
    <!-- == -->
    <target name="tar-bin" depends="package-bin" description="--> generate bin.tar.gz distribution package">
      <tar compression="gzip" longfile="gnu"
 -      destfile="${bin.dist.version.dir}.tar.gz" basedir="${bin.dist.version.dir}">
 -      <tarfileset dir="${dist.dir}" mode="664">
 -        <exclude name="${bin.dist.version.dir}/bin/*" />
 -        <include name="${bin.dist.version.dir}/**" />
 +      destfile="${bin.dist.version.dir}.tar.gz">
 +      <tarfileset dir="${bin.dist.version.dir}" mode="664" prefix="${final.name}">
 +        <exclude name="bin/*" />
 +        <include name="**" />
        </tarfileset>
 -      <tarfileset dir="${dist.dir}" mode="755">
 -        <include name="${bin.dist.version.dir}/bin/*" />
 +      <tarfileset dir="${bin.dist.version.dir}" mode="755" prefix="${final.name}">
 +        <include name="bin/*" />
        </tarfileset>
      </tar>
    </target>
 @@ -733,14 +732,13 @@
    <!-- == -->
    <target name="zip-src" depends="package-src" description="--> generate src.zip distribution package">
      <zip compress="true" casesensitive="yes"
 -      destfile="${src.dist.version.dir}.zip" basedir="${src.dist.version.dir}">
 -      <zipfileset dir="${dist.dir}" filemode="664">
 -        <exclude name="${src.dist.version.dir}/bin/*" />
 -        <exclude name="${src.dist.version.dir}/runtime/*" />
 -        <include name="${src.dist.version.dir}/**" />
 +      destfile="${src.dist.version.dir}.zip">
 +      <zipfileset dir="${src.dist.version.dir}" filemode="664" prefix="${final.name}">
 +        <exclude name="src/bin/*" />
 +        <include name="**" />
        </zipfileset>
 -      <zipfileset dir="${dist.dir}" filemode="755">
 -        <include name="${src.dist.version.dir}/bin/*" />
 +      <zipfileset dir="${src.dist.version.dir}" filemode="755" prefix="${final.name}">
 +        <include name="src/bin/*" />
        </zipfileset>
      </zip>
    </target>
 @@ -750,13 +748,13 @@
    <!-- == -->
    <target name="zip-bin" depends="package-bin" description="--> generate bin.zip distribution package">
      <zip compress="true" casesensitive="yes"
 -      destfile="${bin.dist.version.dir}.zip" basedir="${bin.dist.version.dir}">
 -      <zipfileset dir="${dist.dir}" filemode="664">
 -        <exclude name="${bin.dist.version.dir}/bin/*" />
 -        <include name="${bin.dist.version.dir}/**" />
 +      destfile="${bin.dist.version.dir}.zip">
 +   
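
[Editor's note: the diff is cut off here in the archive. The log message says
this is the forward port of NUTCH-1415 (release packages to contain a top-level
apache-nutch-x.x folder). As a rough illustration only, and not part of the
commit, the Python sketch below packs a directory so that every entry sits
under one top-level folder. The function names and the folder name used in the
usage note are assumptions made for this example.]

```python
import os
import tarfile


def build_release_tar(src_dir, dest_path, top_level):
    """Pack src_dir into a gzip tar whose entries all live under top_level/."""
    with tarfile.open(dest_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                rel = os.path.relpath(full, src_dir).replace(os.sep, "/")
                # arcname is where the top-level folder gets prepended
                tar.add(full, arcname=top_level + "/" + rel)
    return dest_path


def archive_names(tar_path):
    """Return the sorted entry names stored in a gzip tar."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling build_release_tar("dist/src", "out.tar.gz", "apache-nutch-2.1") would
store a file dist/src/bin/nutch as apache-nutch-2.1/bin/nutch, so the archive
unpacks into a single folder.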

Re: Nutch 2.1 Release???

2012-09-15 Thread Mattmann, Chris A (388J)
+1 I'd be happy to help!

Cheers,
Chris

On Sep 15, 2012, at 9:24 AM, Lewis John Mcgibbney wrote:

 Hi Everyone,
 
 Without me slevering on, this suggestion speaks for itself.
 
 We have resolved 32 issues, including pulling in upgrades on the Gora
 dependency. It would be nice to push these improvements in a stable
 release to the Nutch community.
 
 Any thoughts?
 
 Best
 
 Lewis
 
 -- 
 Lewis





Re: Nutch 2.1 Release???

2012-09-15 Thread Mattmann, Chris A (388J)
Awesome Lewis. I'll try and roll a 2.1 RC by mid next week if no one
beats me to it.

Cheers,
Chris

On Sep 15, 2012, at 2:18 PM, Lewis John Mcgibbney wrote:

 Actually when I look at it now we're at nearly 30 tickets for trunk as well.
 
 Up to you guys
 
 @Chris
 Nice one. Fire in my friend. If you can do RM role it would be great.
 
 Best
 
 Lewis
 
 On Sat, Sep 15, 2012 at 6:07 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 +1 I'd be happy to help!
 
 Cheers,
 Chris
 
 On Sep 15, 2012, at 9:24 AM, Lewis John Mcgibbney wrote:
 
 Hi Everyone,
 
 Without me slevering on, this suggestion speaks for itself.
 
 We have resolved 32 issues, including pulling in upgrades on the Gora
 dependency. It would be nice to push these improvements in a stable
 release to the Nutch community.
 
 Any thoughts?
 
 Best
 
 Lewis
 
 --
 Lewis
 
 
 
 
 
 
 -- 
 Lewis





Re: Nutch talk accepted at ApacheCon Europe

2012-09-13 Thread Mattmann, Chris A (388J)
Great to hear, Julien, nice!

Cheers,
Chris

On Sep 13, 2012, at 3:39 AM, Julien Nioche wrote:

 Hi, 
 
 I'd just like to mention that I will be giving a talk about Nutch at the 
 Apache Conference Europe (Sinsheim, Germany 5–8 November 2012). The Apache 
 Conference should be a good opportunity for the Nutch community (committers 
 as well as users) to get together and I hope to see many of you there. Early 
 Bird tickets are available until 1st October.
 
 The talk itself will be an overview of Nutch and will be part of the 
 Lucene/SOLR Ecosystem track. If you have an interesting use case using Nutch 
 or have something in particular that you'd like me to talk about, please do 
 get in touch and I'll try to blend that into the presentation.
 
 I look forward to seeing you in Sinsheim.
 
 Julien
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 





Re: Happy 10th Birthday Nutch!

2012-08-22 Thread Mattmann, Chris A (388J)
Awesome, Jerome! I need to get a Nutch hat!

Cheers,
Chris

On Aug 21, 2012, at 3:59 PM, Markus Jelsma wrote:

 Hehehe, nice! 
 
 Cheers
 
 -Original message-
 From:Jérôme Charron jerome.char...@gmail.com
 Sent: Tue 21-Aug-2012 23:58
 To: dev@nutch.apache.org
 Cc: u...@nutch.apache.org
 Subject: Re: Happy 10th Birthday Nutch!
 
 Oops! Sorry...
 This one should be ok: http://statigr.am/p/254365383887354210_4414285
 ;)
 
 
 On Tue, Aug 21, 2012 at 11:40 PM, Markus Jelsma markus.jel...@openindex.io wrote:
 Hi Jérôme,
 
 It asks for a login.
 
 Cheers
 
 
 
 -Original message-
 From: Jérôme Charron jerome.char...@gmail.com
 Sent: Tue 21-Aug-2012 22:22
 To: u...@nutch.apache.org
 Cc: dev@nutch.apache.org
 Subject: Re: Happy 10th Birthday Nutch!
 
 My small contribution to Nutch birthday...
 http://statigr.am/viewer.php#/detail/254365383887354210_4414285
 
 Cheers,
 Jérôme
 
 On Fri, Aug 10, 2012 at 1:44 AM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Super cool. Proud to have been around since 2005 (7 of them!)
 
 :)
 
 Cheers,
 Chris
 
 On Aug 9, 2012, at 1:31 PM, Lewis John Mcgibbney wrote:
 
 Nice one Julien
 
 I'm going to update the site with this as it's a pretty huge milestone
 @Apache and a lot of projects and current developers owe a lot to the
 great work done by all you guys over the years.
 
 Thank you for sharing.
 
 Lewis
 
 On Thu, Aug 9, 2012 at 8:56 AM, Julien Nioche
 lists.digitalpeb...@gmail.com wrote:
 Doug Cutting on twitter :
 https://twitter.com/cutting/status/233415059798372353
 
 *RT @StefanGroschupf: Happy 10th birthday#Nutch! Registered at sourceforce
 august 2002. Turned out to be quite a game changer. #Hadoop
 *
 Happy birthday Nutch and thanks to all contributors past and present!
 
 Julien
 
 --
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 --
 Lewis
 
 
 
 
 
 
 --
 
 @jcharron http://www.twitter.com/jcharron
 http://motre.ch/
 http://jcharron.posterous.com/
 http://www.shopreflex.fr/
 http://www.staragora.com/
 
 http://feeds.feedburner.com/Bligblagblog.1.gif
 http://feeds.feedburner.com/~r/Bligblagblog/~6/1
 
 Hi
 
 
 
 -- 
 
 @jcharron http://www.twitter.com/jcharron
 http://motre.ch/
 http://jcharron.posterous.com/
 http://www.shopreflex.fr/
 http://www.staragora.com/
 
 http://feeds.feedburner.com/Bligblagblog.1.gif
 http://feeds.feedburner.com/~r/Bligblagblog/~6/1
 
 



Fwd: Call for Papers for ApacheCon Europe 2012 now open!

2012-07-19 Thread Mattmann, Chris A (388J)
FYI...

Begin forwarded message:

 From: Nick Burch nick.bu...@alfresco.com
 Date: July 19, 2012 1:14:57 PM CDT
 To: committ...@apache.org
 Subject: Call for Papers for ApacheCon Europe 2012 now open!
 Reply-To: apachecon-disc...@apache.org
 
 Hi All
 
 We're pleased to announce that the Call for Papers for ApacheCon Europe 2012 
 is finally open!
 
 (For those who don't already know, ApacheCon Europe will be taking place 
 between the 5th and the 9th of November this year, in Sinsheim, Germany.)
 
 If you'd like to submit a talk proposal, please visit the conference website 
 at http://www.apachecon.eu/ and sign up for a new account. Once you've 
 signed up, use your dashboard to enter your speaker bio, then submit your 
 talk proposal(s). There's more information on the CFP page on the conference 
 website.
 
 We welcome talk proposals from all projects, from right across the breadth of 
 projects at the foundation! To make things easier for talk selection and 
 scheduling, we'd ask that you tag your proposal with the track that it most 
 closely fits within. The details of the tracks, and what projects they expect 
 to cover, are available at http://www.apachecon.eu/tracks/.
 
 (If your project/group of projects was intending to submit a track, and 
 missed the deadline, then please get in touch with us on 
 apachecon-disc...@apache.org  straight away, so we can work out if it's 
 possible to squeeze you in...)
 
 The CFP will close on Friday 3rd August, so you've a little over two weeks to 
 send in your talk proposal. Don't put it off! We'll look forward to seeing 
 some great ones shortly!
 
 Thanks
 Nick
 (On behalf of the Conferences committee)





Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation

2012-07-17 Thread Mattmann, Chris A (388J)
Hi Markus,

Great question. I am CC'ing Ruth Duerr and Ian Truslove at NSIDC -- maybe they
can provide more information?

Ruth, Ian, please consider subscribing to dev@nutch.apache.org and/or
u...@nutch.apache.org by sending blank emails to:

dev-subscr...@nutch.apache.org
user-subscr...@nutch.apache.org

To follow along in the conversation.

Thanks all!

Cheers,
Chris

On Jul 17, 2012, at 5:27 PM, Markus Jelsma wrote:

 Cool!
 
 What are they exactly doing with Apache Nutch? And, more interesting, what 
 non-standard stuff do they use?
 
 Cheers
 
 -Original message-
 From:Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
 Sent: Tue 17-Jul-2012 21:29
 To: dev@nutch.apache.org
 Subject: Apache Nutch being used at National Snow and Ice Data Center: ESIP 
 Federation
 
 Hey Folks,
 
 Ruth Duerr is presenting at today's ESIP Federation and Discovery Hackathon:
 
 http://commons.esipfed.org/node/424
 
 The U.S. National Snow and Ice Data Center (NSIDC) is deploying Apache Nutch 
 and 
 Solr to support discovery of datasets (called casting).
 
 Really interesting stuff, and worth contacting Ruth and NSIDC if you're 
 interested.
 I'm strongly suggesting to the NSIDC folks that they try to contribute any
 updates or plugins they are making to the software upstream here to the ASF.
 
 Thanks!
 
 Cheers,
 Chris
 
 
 





Re: [ANNOUNCEMENT] Apache Nutch v1.5.1 Released

2012-07-10 Thread Mattmann, Chris A (388J)
Congrats, all!

Cheers,
Chris

On Jul 10, 2012, at 8:03 AM, Julien Nioche wrote:

 Great Job Lewis! Thanks a lot
 
 On 10 July 2012 15:40, lewis john mcgibbney lewi...@apache.org wrote:
 Good Afternoon Everyone,
 
 The Apache Nutch PMC are very pleased to announce the release of
 Apache Nutch v1.5.1. This release is a maintenance release of the
 popular mainstream
 1.5.X series of the Apache Nutch web search software project.
 
 Please see the list of changes
 
 http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt
 
 made in this version for a full breakdown. A full PMC release
 statement can be found below
 
 http://nutch.apache.org/#10+July+2012+-+Apache+Nutch+v1.5.1+Released
 
 Nutch v1.5.1 is available in source and binary (zip and tar.gz) from the
 following download page: http://www.apache.org/dyn/closer.cgi/nutch/1.5.1
 
 When downloading from a mirror site, please remember to verify the
 downloads using signatures found on the Apache site:
 
 http://www.apache.org/dist/nutch/KEYS
 
 For more information on Apache Nutch, visit the project home page:
 http://nutch.apache.org
 
 Thank you very much
 
 Lewis John McGibbney (on behalf of the Apache Nutch community)
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 





Re: [PROPOSAL] Rename branch nutchgora into 2.x

2012-07-09 Thread Mattmann, Chris A (388J)
+1 from me.

Cheers,
Chris

On Jul 9, 2012, at 3:37 AM, Julien Nioche wrote:

 Guys, 
 
 Now that we've released 2.0, wouldn't it be better to rename the 'nutchgora' 
 branch into something like 'branch-2.x'? Any thoughts on this?
 
 Julien
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 





Re: [VOTE] Apache Nutch 1.5.1 RC#3

2012-07-07 Thread Mattmann, Chris A (388J)
Hi Lewis,

+1 from me!

Checksums check out:

[chipotle:~/tmp/nutch-1.5.1] mattmann% $HOME/bin/verify_md5_checksums 
md5sum: stat '*.bz2': No such file or directory
apache-nutch-1.5.1-bin.tar.gz: OK
apache-nutch-1.5.1-src.tar.gz: OK
apache-nutch-1.5.1-bin.zip: OK
apache-nutch-1.5.1-src.zip: OK
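
[Editor's note: the thread does not show what $HOME/bin/verify_md5_checksums
contains. A minimal sketch of the same kind of check, assuming the expected
digest is available as a hex string, might look like the Python below. The
function names are made up for this example; only hashlib from the standard
library is used.]

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=65536):
    """Hash a file in chunks so large release archives need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def checksum_ok(path, expected_hex, algorithm="md5"):
    """Compare the computed digest with the published one, ignoring case and padding."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

The same helper works for other digests by passing algorithm="sha1" or
algorithm="sha256".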

SIGS check out:

[chipotle:~/tmp/nutch-1.5.1] mattmann% $HOME/bin/verify_gpg_sigs 
Verifying Signature for file apache-nutch-1.5.1-bin.tar.gz.asc
gpg: Signature made Tue Jul  3 11:31:31 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
Verifying Signature for file apache-nutch-1.5.1-bin.zip.asc
gpg: Signature made Tue Jul  3 11:32:16 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
Verifying Signature for file apache-nutch-1.5.1-src.tar.gz.asc
gpg: Signature made Tue Jul  3 11:31:58 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
Verifying Signature for file apache-nutch-1.5.1-src.zip.asc
gpg: Signature made Tue Jul  3 11:32:33 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
[chipotle:~/tmp/nutch-1.5.1] mattmann% 

Builds fine!


runtime:
[mkdir] Created dir: 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime
[mkdir] Created dir: 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local
[mkdir] Created dir: 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/deploy
 [copy] Copying 1 file to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/deploy
 [copy] Copying 1 file to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/deploy/bin
 [copy] Copying 1 file to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/lib
 [copy] Copying 1 file to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/lib/native
 [copy] Copying 21 files to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/conf
 [copy] Copying 1 file to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/bin
 [copy] Copying 48 files to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/lib
 [copy] Copying 123 files to 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/plugins
 [copy] Copied 2 empty directories to 2 empty directories under 
/Users/mattmann/tmp/nutch-1.5.1/apache-nutch-1.5.1/runtime/local/test

BUILD SUCCESSFUL
Total time: 1 minute 28 seconds
[chipotle:~/tmp/nutch-1.5.1/apache-nutch-1.5.1] mattmann% 

Cheers,
Chris


On Jul 3, 2012, at 11:42 AM, Lewis John Mcgibbney wrote:

 Hi Everyone,
 
 A candidate for the Apache Nutch 1.5.1 RC#3 is available at:
 
 http://people.apache.org/~lewismc/apache-nutch-1.5.1-rc3
 
 The release candidate is a src.zip, src.tar.gz, bin.zip and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-1.5.1-rc3/
 
 This Release Candidate (and subsequent release) is a bug fix of the
 recently released Apache Nutch 1.5 and CHANGES.txt can be seen below
 
 http://people.apache.org/~lewismc/apache-nutch-1.5.1-rc3/CHANGES.txt
 
 Further, a staged Maven repository of the 1.5.1 jar, sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-023
 
 Please vote on releasing this package as Apache Nutch 1.5.1.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
 [ ] +1 Release this package as Apache Nutch 1.5.1
 [ ] -1 Do not release this package because...
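
[Editor's note: one reading of the rule stated above ("a majority of at least
three +1 Nutch PMC votes") is that the vote passes when there are at least
three binding +1 votes and more +1 than -1 votes. The Python below sketches
that reading only; the function name is hypothetical and the interpretation
is an assumption, not an official statement of the rule.]

```python
def release_vote_passes(plus_ones, minus_ones):
    """Return True when binding votes satisfy the assumed rule:
    at least three +1 votes and strictly more +1 than -1 votes."""
    if plus_ones < 0 or minus_ones < 0:
        raise ValueError("vote counts cannot be negative")
    return plus_ones >= 3 and plus_ones > minus_ones
```

Under this reading, three +1 votes with no -1 votes pass, while two +1 votes
do not, whatever the -1 count.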
 
 Many Thanks and here's to plenty more.
 
 Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 -- 
 Lewis



Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-07 Thread Mattmann, Chris A (388J)
Thanks for your hard work here, Lewis!

Cheers,
Chris

On Jul 7, 2012, at 3:44 PM, Lewis John Mcgibbney wrote:

 Hi Julien,
 
 Believe it or not I've just spent around 45 mins waiting on committing
 the site... broadband in Paris is nothing short of utterly abysmal to
 say the very best. Please see my comments below
 
 On Sat, Jul 7, 2012 at 9:58 PM, Julien Nioche
 lists.digitalpeb...@gmail.com wrote:
 Looks like you've released 2.0. If so can you make an announcement to the
 mailing list + update the website.
 
 Done
 
 It's not really something that should go
 unnoticed. I know about the press release but surely it does not mean that
 NOTHING should be said about the release then.
 
 Quite right.
 
 
 I see a 1.5 on a mirror (http://apache.mirrors.timporter.net/nutch/) with
 the same release date as 2.0. Shouldn't it be 1.5.1? Can you please clarify?
 
 This relates to the message on private@ the other night and concerns
 the rearranging (cleaning up) of the dist/nutch directory on
 people.apache.org to accommodate the additional 2.0 directory. The 1.5
 artifacts are identical to the ones we VOTE'd on, same goes with
 2.0's. The mirror will confusingly display that these have been
 mirrored at the same time, which of course is the case, but they were
 certainly not released in parallel.
 
 OK so now concerning 1.5.1, we have still to VOTE on the rc#3 so I've
 gently put out a ping for this on dev@ and user@
 
 I hope this answers all and I can only really apologise and say thanks
 to everyone who has made time and effort to VOTE over the last few
 months. There has been a very encouraging amount of work done within
 the dev community and it's been very rewarding to see us getting Nutch
 moving at a really steady pace.
 
 All for now
 
 Have a great weekend
 
 Lewis
 
 -- 
 Lewis





Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-06 Thread Mattmann, Chris A (388J)
OK, +1 from me :) 

ant runtime works:

job:
  [jar] Building jar: /Users/mattmann/tmp/nutch2/build/apache-nutch-2.0.job

runtime:
[mkdir] Created dir: /Users/mattmann/tmp/nutch2/runtime
[mkdir] Created dir: /Users/mattmann/tmp/nutch2/runtime/local
[mkdir] Created dir: /Users/mattmann/tmp/nutch2/runtime/deploy
 [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/deploy
 [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/deploy/bin
 [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/local/lib
 [copy] Copying 1 file to 
/Users/mattmann/tmp/nutch2/runtime/local/lib/native
 [copy] Copying 25 files to /Users/mattmann/tmp/nutch2/runtime/local/conf
 [copy] Copying 1 file to /Users/mattmann/tmp/nutch2/runtime/local/bin
 [copy] Copying 84 files to /Users/mattmann/tmp/nutch2/runtime/local/lib
 [copy] Copying 97 files to /Users/mattmann/tmp/nutch2/runtime/local/plugins
 [copy] Copied 2 empty directories to 2 empty directories under 
/Users/mattmann/tmp/nutch2/runtime/local/test

BUILD SUCCESSFUL
Total time: 3 minutes 24 seconds
[chipotle:~/tmp/nutch2] mattmann% 

Good enough for me!

Cheers,
Chris

On Jul 3, 2012, at 11:24 AM, Mattmann, Chris A (388J) wrote:

 Hey Lewis,
 
 I was running ant test -- sorry -- will try ant runtime now (any idea
 what's up with test?)
 
 Cheers,
 Chris
 
 On Jul 3, 2012, at 11:11 AM, Lewis John Mcgibbney wrote:
 
 What commands are you using?
 
 I just grabbed the src-tar.gz from my local area with wget
 extracted it to ~/Desktop
 rm -r ~/.ivy2
 cd ~/Desktop/$nutch_folder
 ant runtime
 
 runtime:
   [mkdir] Created dir: /home/lewismc/Desktop/nutch/runtime
   [mkdir] Created dir: /home/lewismc/Desktop/nutch/runtime/local
   [mkdir] Created dir: /home/lewismc/Desktop/nutch/runtime/deploy
[copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/deploy
[copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/deploy/bin
[copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/local/lib
[copy] Copying 1 file to
 /home/lewismc/Desktop/nutch/runtime/local/lib/native
[copy] Copying 25 files to /home/lewismc/Desktop/nutch/runtime/local/conf
[copy] Copying 1 file to /home/lewismc/Desktop/nutch/runtime/local/bin
[copy] Copying 84 files to /home/lewismc/Desktop/nutch/runtime/local/lib
[copy] Copying 97 files to
 /home/lewismc/Desktop/nutch/runtime/local/plugins
[copy] Copied 2 empty directories to 2 empty directories under
 /home/lewismc/Desktop/nutch/runtime/local/test
 
 BUILD SUCCESSFUL
 Total time: 2 minutes 40 seconds
 
 This is with every dependency being downloaded to the Ivy cache
 
 Lewis
 
 On Tue, Jul 3, 2012 at 5:12 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Julien,
 
 I ran this command: rm -rf /Users/mattmann/.ivy2/
 
 But it still failed with the below messages:
 
 [ivy:resolve] :: problems summary ::
 [ivy:resolve]  WARNINGS
 [ivy:resolve]   [FAILED ] 
 org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: invalid sha1: 
 expected=d7d8610ba4aad504475e568fd3badb412a0beae9 
 computed=f8369ff1a71e1a8febbb8e9c3a54ffbb08048f19 (1598ms)
 [ivy:resolve]   [FAILED ] 
 org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar:  (0ms)
 [ivy:resolve]    local: tried
 [ivy:resolve] 
 /Users/mattmann/.ivy2/local/org.apache.hadoop/hadoop-core/1.0.3/jars/hadoop-core.jar
 [ivy:resolve]    maven2: tried
 [ivy:resolve] 
 http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.0.3/hadoop-core-1.0.3.jar
 [ivy:resolve]   [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar: 
 invalid sha1: expected=8231a3ff71ba5889f9e2d01ce13503cbdd4038e9 
 computed=81a7e8d5d1802c7acbc8f8f81d3e4680a4b2441c (523ms)
 [ivy:resolve]   [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar:  
 (0ms)
 [ivy:resolve]    local: tried
 [ivy:resolve] 
 /Users/mattmann/.ivy2/local/org.hsqldb/hsqldb/2.2.8/jars/hsqldb.jar
 [ivy:resolve]    maven2: tried
 [ivy:resolve] 
 http://repo1.maven.org/maven2/org/hsqldb/hsqldb/2.2.8/hsqldb-2.2.8.jar
 [ivy:resolve]   [FAILED ] 
 org.apache.lucene#lucene-core;3.4.0!lucene-core.jar: invalid sha1: 
 expected=4426bf0764ec5fa634abca236b469d2519c74f65 
 computed=112d2454390cba8c7c35b34b8f7a821c6cec3f73 (775ms)
 [ivy:resolve]   [FAILED ] 
 org.apache.lucene#lucene-core;3.4.0!lucene-core.jar:  (0ms)
 [ivy:resolve]    local: tried
 [ivy:resolve] 
 /Users/mattmann/.ivy2/local/org.apache.lucene/lucene-core/3.4.0/jars/lucene-core.jar
 [ivy:resolve]    maven2: tried
 [ivy:resolve] 
 http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/3.4.0/lucene-core-3.4.0.jar
 [ivy:resolve]   [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar: 
 invalid sha1: expected=06362db7a2556bb58a04e991029196e2aad632d4 
 computed=d9862ffbc6cd6241a03c06b5911bf22a079d2cda (1544ms)
 [ivy:resolve]   [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar:  
 (0ms)
 [ivy:resolve
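
[Editor's note: the message is cut off here in the archive. The warnings
quoted above all have the form "invalid sha1: expected=... computed=...".
As an illustration only, the Python sketch below pulls the artifact name and
both digests out of such log text, tolerating the line wrapping seen in this
mail. The regular expression and function name are assumptions based on the
lines quoted in this thread, not on any documented Ivy log format.]

```python
import re

# Matches "[FAILED ] <artifact>: invalid sha1: expected=<hex> computed=<hex>",
# allowing arbitrary whitespace (including line breaks) between the parts.
_SHA1_MISMATCH = re.compile(
    r"\[FAILED\s*\]\s+(?P<artifact>\S+?):\s+invalid sha1:\s+"
    r"expected=(?P<expected>[0-9a-f]{40})\s+"
    r"computed=(?P<computed>[0-9a-f]{40})"
)


def sha1_mismatches(log_text):
    """Return (artifact, expected, computed) for every sha1 mismatch warning."""
    return [
        (m.group("artifact"), m.group("expected"), m.group("computed"))
        for m in _SHA1_MISMATCH.finditer(log_text)
    ]
```

Listing the affected artifacts this way makes it easier to see whether every
download is failing verification or only a few.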

Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-04 Thread Mattmann, Chris A (388J)
Thanks Lewis, here are mine:

[chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% ant -version
Apache Ant(TM) version 1.8.2 compiled on May 17 2012
[chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% java -version
java version 1.6.0_33
Java(TM) SE Runtime Environment (build 1.6.0_33-b03-424-10M3720)
Java HotSpot(TM) 64-Bit Server VM (build 20.8-b03-424, mixed mode)
[chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% 

[chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% uname -a
Darwin chipotle.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:32:41 
PDT 2011; root:xnu-1504.15.3~1/RELEASE_X86_64 x86_64
[chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% 

I'll try one more time today with a fresh build and see where I get :/

Thanks!

Cheers,
Chris


On Jul 4, 2012, at 3:27 AM, Lewis John Mcgibbney wrote:

 Hi Chris,
 
 lewismc@lewismc-HP-Mini-110-3100:~$ java -showversion
 java version 1.6.0_25
 Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
 Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)
 
 lewismc@lewismc-HP-Mini-110-3100:~$ ant -v
 Apache Ant(TM) version 1.8.2 compiled on August 19 2011
 Trying the default build file: build.xml
 Buildfile: build.xml does not exist!
 Build failed
 
 Lewis
 
 On Wed, Jul 4, 2012 at 7:18 AM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Lewis,
 
 Odd, I don't get that.
 
 I'll try futzing around again with it tomorrow -- what system are you on? 
 What is
 your Ant version and Java version?
 
 Cheers,
 Chris
 
 On Jul 3, 2012, at 11:49 AM, Lewis John Mcgibbney wrote:
 
 Hi Chris,
 
 I've no clue what's going on locally with you... em I just did
 
 ant test
 
 and I get
 
 copy-generated-lib:
 
 test:
[echo] Testing plugin: subcollection
   [junit] Running org.apache.nutch.collection.TestSubcollection
   [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.305 sec
 
 test:
 
 BUILD SUCCESSFUL
 Total time: 12 minutes 28 seconds
 
 
 On Tue, Jul 3, 2012 at 7:24 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Lewis,
 
 I was running ant test -- sorry -- will try ant runtime now (any idea
 what's up with test?)
 
 Cheers,
 Chris
 
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 
 
 -- 
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-03 Thread Mattmann, Chris A (388J)
Hey Julien,


On Jul 3, 2012, at 7:49 AM, Julien Nioche wrote:
[..snip..]
 
 OK, so basically signatures and checksums are fine

+1, yep they are great.

 
  
 
 Tried to build and test and got this:
 
 [ivy:resolve]   ::
 [..snip...]
 
 Try deleting your entire .ivy dir and re-run ant. Just did that on my machine 
 and Nutch compiles fine

OK will do now.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-03 Thread Mattmann, Chris A (388J)
Hey Julien,

I ran this command: rm -rf /Users/mattmann/.ivy2/

But it still failed with the below messages:

[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve]   [FAILED ] 
org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar: invalid sha1: 
expected=d7d8610ba4aad504475e568fd3badb412a0beae9 
computed=f8369ff1a71e1a8febbb8e9c3a54ffbb08048f19 (1598ms)
[ivy:resolve]   [FAILED ] 
org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/mattmann/.ivy2/local/org.apache.hadoop/hadoop-core/1.0.3/jars/hadoop-core.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/1.0.3/hadoop-core-1.0.3.jar
[ivy:resolve]   [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar: 
invalid sha1: expected=8231a3ff71ba5889f9e2d01ce13503cbdd4038e9 
computed=81a7e8d5d1802c7acbc8f8f81d3e4680a4b2441c (523ms)
[ivy:resolve]   [FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/mattmann/.ivy2/local/org.hsqldb/hsqldb/2.2.8/jars/hsqldb.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/org/hsqldb/hsqldb/2.2.8/hsqldb-2.2.8.jar
[ivy:resolve]   [FAILED ] 
org.apache.lucene#lucene-core;3.4.0!lucene-core.jar: invalid sha1: 
expected=4426bf0764ec5fa634abca236b469d2519c74f65 
computed=112d2454390cba8c7c35b34b8f7a821c6cec3f73 (775ms)
[ivy:resolve]   [FAILED ] 
org.apache.lucene#lucene-core;3.4.0!lucene-core.jar:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/mattmann/.ivy2/local/org.apache.lucene/lucene-core/3.4.0/jars/lucene-core.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/org/apache/lucene/lucene-core/3.4.0/lucene-core-3.4.0.jar
[ivy:resolve]   [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar: 
invalid sha1: expected=06362db7a2556bb58a04e991029196e2aad632d4 
computed=d9862ffbc6cd6241a03c06b5911bf22a079d2cda (1544ms)
[ivy:resolve]   [FAILED ] com.ibm.icu#icu4j;4.0.1!icu4j.jar:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/mattmann/.ivy2/local/com.ibm.icu/icu4j/4.0.1/jars/icu4j.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/com/ibm/icu/icu4j/4.0.1/icu4j-4.0.1.jar
[ivy:resolve]   [FAILED ] xerces#xercesImpl;2.9.1!xercesImpl.jar: 
invalid sha1: expected=7bc7e49ddfe4fb5f193ed37ecc96c12292c8ceb6 
computed=88931c057b31ba3ff7ac96e53817b25ff355c4a1 (393ms)
[ivy:resolve]   [FAILED ] xerces#xercesImpl;2.9.1!xercesImpl.jar:  
(0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/mattmann/.ivy2/local/xerces/xercesImpl/2.9.1/jars/xercesImpl.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/xerces/xercesImpl/2.9.1/xercesImpl-2.9.1.jar
[ivy:resolve]   [FAILED ] com.google.guava#guava;11.0.2!guava.jar: 
invalid sha1: expected=35a3c69e19d72743cac83778aecbee68680f63eb 
computed=1e8507869d7db99f60f8d949bc5ba2b5410ce2db (355ms)
[ivy:resolve]   [FAILED ] com.google.guava#guava;11.0.2!guava.jar:  
(0ms)
[ivy:resolve]    local: tried
[ivy:resolve] 
/Users/mattmann/.ivy2/local/com.google.guava/guava/11.0.2/jars/guava.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] 
http://repo1.maven.org/maven2/com/google/guava/guava/11.0.2/guava-11.0.2.jar
[ivy:resolve]   ::
[ivy:resolve]   ::  FAILED DOWNLOADS::
[ivy:resolve]   :: ^ see resolution messages for details  ^ ::
[ivy:resolve]   ::
[ivy:resolve]   :: org.apache.lucene#lucene-core;3.4.0!lucene-core.jar
[ivy:resolve]   :: org.apache.hadoop#hadoop-core;1.0.3!hadoop-core.jar
[ivy:resolve]   :: org.hsqldb#hsqldb;2.2.8!hsqldb.jar
[ivy:resolve]   :: com.ibm.icu#icu4j;4.0.1!icu4j.jar
[ivy:resolve]   :: xerces#xercesImpl;2.9.1!xercesImpl.jar
[ivy:resolve]   :: com.google.guava#guava;11.0.2!guava.jar
[ivy:resolve]   ::
[ivy:resolve] 
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

BUILD FAILED
/Users/mattmann/tmp/nutch2/apache-nutch-2.0/build.xml:431: impossible to 
resolve dependencies:
resolve failed - see output for details

Total time: 1 minute 56 seconds
[chipotle:~/tmp/nutch2/apache-nutch-2.0] mattmann% 

Any ideas?
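
One way to narrow this down is to fetch one of the failing jars by hand and compare its SHA-1 with the value Ivy expected. A minimal sketch, assuming either sha1sum or shasum is on the PATH; the helper names sha1_of and check_sha1 are hypothetical and not part of Ivy or Nutch:

```shell
# Hypothetical helper: print the SHA-1 hex digest of a file,
# using whichever common tool is installed.
sha1_of() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 1 "$1" | cut -d ' ' -f 1
  fi
}

# Hypothetical helper: compare a file's digest with an expected value.
# Returns 0 on a match and 1 on a mismatch.
check_sha1() {
  computed=$(sha1_of "$1")
  if [ "$computed" = "$2" ]; then
    echo "OK $1"
  else
    echo "MISMATCH $1: expected=$2 computed=$computed"
    return 1
  fi
}
```

For example, check_sha1 hadoop-core-1.0.3.jar d7d8610ba4aad504475e568fd3badb412a0beae9 uses the expected value from the log above. If a jar downloaded by hand also mismatches, the cause is more likely somewhere on the download path than in the local Ivy directory.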

Cheers,
Chris


On Jul 3, 2012, at 7:49 AM, Julien Nioche wrote:

 Hi Chris
 
 
 
 [chipotle:~/tmp/nutch2] mattmann% $HOME/bin/verify_gpg_sigs
 Verifying Signature for file apache-nutch-2.0-src.tar.gz.asc
 gpg: Signature made Mon Jun 25 09:28:36 2012 PDT using RSA key ID C601BCA7
 gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
 lewi...@apache.org
 gpg: WARNING: This 

Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-02 Thread Mattmann, Chris A (388J)
I'll try to scope this by tomorrow...thanks Lewis.

Cheers,
Chris

On Jul 2, 2012, at 10:49 AM, Lewis John Mcgibbney wrote:

 Anyone else for this RC?
 
 I've been slightly distracted with a number of things recently and only
 just getting round to following this one up so apologies about that.
 
 Best
 
 Lewis
 
 On Wed, Jun 27, 2012 at 10:23 AM, Ferdy Galema ferdy.gal...@kalooga.com 
 wrote:
 +1 Crawling with HBaseStore works from injecting to indexing.
 
 Great work Lewis.
 
 On Mon, Jun 25, 2012 at 6:32 PM, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:
 
 Hi Everyone,
 
 A candidate for the Apache Nutch 2.0 RC3 is available at:
 
 http://people.apache.org/~lewismc/apache-nutch-2.0rc3
 
 The release candidate is a src.zip and src.tar.gz ONLY
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc3
 
 We release Nutch 2.0 in this fashion due to the inclusion of
 Apache Gora and the likelihood that users will regularly recompile
 the code to suit dynamic requirements.
 
 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-275
 
 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
 [ ] +1 Release this package as Apache Nutch 2.0
 [ ] -1 Do not release this package because...
 
 Many Thanks and here's to plenty more.
 
 Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 --
 Lewis
 
 
 
 
 
 -- 
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: 1.5.1 release

2012-06-22 Thread Mattmann, Chris A (388J)
Hey Guys,

(sorry for the top post)

There's no reason to freeze trunk during releases. In fact, during the RC, once 
the branch (or tag for that matter)
is created, trunk can continue on, no need to stop. Heck, we can always just 
tag or branch from a specific 
revision too so it's not really a biggie.

Cheers,
Chris

On Jun 21, 2012, at 2:43 PM, Lewis John Mcgibbney wrote:

 Hi Markus,
 
 On Thu, Jun 21, 2012 at 10:02 PM, Markus Jelsma
 markus.jel...@openindex.io wrote:
 It's still not clear to me what 1.5.1 is going to look like. Will it be 
 current trunk incl. the script bugfix or just 1.5 plus the bugfix? I would 
 vote for the latter as it makes more sense for a bugfix release.
 
 I am easy on this one... I suggest we do it the normal way. Let's let
 folks chime in and see where we are on Saturday. It looks like 2.0 is
 going to be shifted with the new commits so do we wish to try and keep
 at least the minimal consistency between both releases?
 
 
 There is another debate behind this, in my opinion, about freezing trunk 
 prior to releases and thus stopping active development. This has been an 
 issue in the past. Is this something for another thread?
 
 
 Yeah I must also agree that we should branch trunk, keep the branch
 for the release then run the RC's from the branch regardless of how
 trunk comes on. My only suggestion for backporting patches from trunk
 to the release candidate branch is if it is a pretty critical bug fix
 as we've now discovered in 1.5!
 
 Additionally there is another note here as well w.r.t release
 managers. We've relied on the excellent work done by Chris (and
 others) as RM's for a number of releases but during the release period
 (on occasion, more recently) as you mention trunk has frozen
 temporarily. Of course it is the aim to prevent this happening should
 the RC not progress as we would all like. Hopefully we are moving
 towards a more adaptable and sustainable RM process within Nutch where
 the RM responsibility can be undertaken/overseen by more than one
 individual over the entire duration of the process. I think (and hope)
 we can consider the slight struggle we've had for 1.5 as an exception.
 As far back as I can remember RC's have always been efficient and
 smooth and I personally am committed to ensuring we return to the high
 precedent set by previous RM's.
 We've also seen an alternative (and in my opinion an improved)
 publication of Nutch artifacts for 1.5. For reference I direct you to
 Julien's commentary [0] on this topic. Due to this, we've had to run
 additional RC's which has taken a bit longer than usual and I must
 personally apologise to everyone for at least one RC cock up which
 could have been avoided had I been more familiar with the Nutch
 specific release process.
 
 I think I'm ranting here so I'm going to give it a bye now.
 
 Lewis
 
 [0] http://digitalpebble.blogspot.co.uk/2012/06/whats-new-in-nutch-15.html


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Nutch 1.5 Deploy Mode Doesn't Work like Nutch 1.4 Deploy Mode

2012-06-19 Thread Mattmann, Chris A (388J)
+1!

Cheers,
Chris

On Jun 19, 2012, at 2:26 AM, Julien Nioche wrote:

 Quite annoying that we did not spot this before releasing. What about a 1.5.1 
 soonish with this fix + couple smallish improvements e.g. upgrade to Hadoop 
 1.0.3?
 
 J.
 
 -- Forwarded message --
 From: Julien Nioche lists.digitalpeb...@gmail.com
 Date: 19 June 2012 08:56
 Subject: Re: Nutch 1.5 Deploy Mode Doesn't Work like Nutch 1.4 Deploy Mode
 To: u...@nutch.apache.org
 
 
 Alternatively modify the bin/nutch script to make it more robust
 
 # NUTCH_JOB
 if [ -f ${NUTCH_HOME}/*nutch*.job ]; then
   local=false
   for f in $NUTCH_HOME/*nutch*.job; do
     NUTCH_JOB=$f;
   done
 fi
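
A slightly more defensive variant of the lookup above, as a sketch only: it quotes the paths and checks each match inside the loop, so it does not depend on the glob expanding to exactly one file. The function name find_nutch_job is hypothetical and not part of bin/nutch; like the loop above, it keeps the last match.

```shell
# Hypothetical helper: print the last *nutch*.job file found in a directory.
# Returns 1 (and prints nothing) when no job file exists.
find_nutch_job() {
  job=""
  for f in "$1"/*nutch*.job; do
    # If nothing matches, the pattern is left unexpanded, so test each entry.
    if [ -f "$f" ]; then
      job=$f
    fi
  done
  if [ -n "$job" ]; then
    printf '%s\n' "$job"
  else
    return 1
  fi
}

# Possible use inside a launcher script (variable names follow the snippet above):
#   if NUTCH_JOB=$(find_nutch_job "$NUTCH_HOME"); then
#     local=false
#   fi
```

Because the pattern is *nutch*.job, both apache-nutch-1.5.job and nutch-1.5.job would be picked up.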
 
 On 19 June 2012 00:09, sidbatra siddharthaba...@gmail.com wrote:
 This turns out to be a genuine bug with an easy fix.
 
 build.xml is configured to generate a job file titled apache-nutch-1.5.job
 but the deploy binary is still looking for nutch-1.5.job
 
 
 Renaming apache-nutch-1.5.job to nutch-1.5.job fixes this bug in deploy
 mode.
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Nutch-1-5-Deploy-Mode-Doesn-t-Work-like-Nutch-1-4-Deploy-Mode-tp3990169p3990196.html
 Sent from the Nutch - User mailing list archive at Nabble.com.
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Mattmann, Chris A (388J)
OK you are just making us all look bad now Juls ;)

Super fast!

Cheers,
Chris


On Jun 15, 2012, at 2:54 AM, Julien Nioche wrote:

 see https://issues.apache.org/jira/browse/NUTCH-1396
 
 On 15 June 2012 10:43, Julien Nioche lists.digitalpeb...@gmail.com wrote:
 Before you do, could you check that NutchGora passes ant test successfully. I 
 just tried and got an error related to the parse-tika tests. Am about to open 
 a JIRA to update to the latest version of Tika for NutchGora which should fix 
 the problem and put it at the same level as trunk
 
 J
 
 On 15 June 2012 10:01, Lewis John Mcgibbney lewis.mcgibb...@gmail.com 
 wrote:
 
 I'll push this in an hour or so guys.
 
 Thanks for the input.
 
 Lewis
 
 
 On Fri, Jun 15, 2012 at 9:39 AM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 +1
 
 
 On 15 June 2012 09:00, Ferdy Galema ferdy.gal...@kalooga.com wrote:
 Agree with only releasing src.
 
 
 On Thu, Jun 14, 2012 at 11:32 PM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Or just not ship a bin release at all. Src is the only thing we really VOTE 
 on legally though bin is provided for convenience purposes. Will type more on 
 this later...
 
 Sent from my iPhone
 
 On Jun 14, 2012, at 2:18 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:
 
 Hi Julien,
 
 Do you suggest with the binary release that we simply open up all gora-* 
 deps and ship it with every jar available?
 
 Lewis
 
 On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 I disagree. You'd expect a binary release to work out of the box - which is 
 not the case. Plus we'd have to spend more time explaining the workaround, 
 answering the same questions over and over on the ML etc... Fixing this 
 should not be a big deal (i.e. add the gora-x modules for the backends to 
 the ivy deps file).
 
 Julien
 
 
 On 14 June 2012 20:27, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Guys,
 
 I think the annoyance is probably something folks can live with as they have 
 been
 waiting for an official release of 2.x for years :)
 
 My +1 to roll RC #2 with or without a solution to this and mark it as a 
 TODO. release
 early, release often :)
 
 Cheers,
 Chris
 
 On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:
 
  Aye this is no good at all. Depending on which backend you wish to use 
  with Gora, you will need to go and manually fetch the correct .jar's from 
  maven central.
 
  Does anyone else have either solution or a workaround before I push RC2 
  with just src dists?
 
  Thanks
 
  Lewis
 
  On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
  wastl.na...@googlemail.com wrote:
   We only supply src distributions...
   Does this principle apply to Nutch 2 as well?
  Maybe, yes.
  The situation with the current binary package is uncomfortable:
  I had to copy/link gora-hbase and hbase jars into lib/ to get nutch 
  running.
 
  2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Hi Guys,
 
  Whilst updating the Nutch2Tutorial I got thinking that within Gora we 
  don't supply binary distributions of the code, this is because when using 
  Gora a user may wish/require to recompile the code to accommodate config 
  changes etc. We only supply src distributions...
 
  Does this principle apply to Nutch 2 as well? I mean, what if you're using 
  the gora-sql dependency, then you wish to switch to HBase and recompile, 
  is this possible within the binary distribution?
 
  Best
 
  Lewis
 
 
  On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
  lists.digitalpeb...@gmail.com wrote:
  Ferdy
 
  The Nutch job jar is not present in the binary archive. This means 
  distributed running of jobs is not supported. I'm not sure if this is a 
  problem (since users can always build one themselves), merely pointing it 
  out. The recently released 1.5 also lacks this job jar, so at least no 
  difference there.
 
  The binary distrib corresponds to runtime/local and as such should NOT 
  have the job file there. This is now the norm since 1.5
 
  Will try and do some testing of the RC
 
  Thanks
 
  Julien
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
  --
  Lewis
 
 
 
 
 
  --
  Lewis
 
 
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++
 
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http

Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Mattmann, Chris A (388J)
Hey Guys,

I think the annoyance is probably something folks can live with as they have 
been
waiting for an official release of 2.x for years :)

My +1 to roll RC #2 with or without a solution to this and mark it as a TODO. 
release
early, release often :)

Cheers,
Chris

On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

 Aye this is no good at all. Depending on which backend you wish to use with 
 Gora, you will need to go and manually fetch the correct .jar's from maven 
 central.
 
 Does anyone else have either solution or a workaround before I push RC2 with 
 just src dists?
 
 Thanks
 
 Lewis
 
 On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel wastl.na...@googlemail.com 
 wrote:
  We only supply src distributions... 
  Does this principle apply to Nutch 2 as well?
 Maybe, yes.
 The situation with the current binary package is uncomfortable:
 I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running.
 
 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Hi Guys,
 
 Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't 
 supply binary distributions of the code, this is because when using Gora a 
 user may wish/require to recompile the code to accommodate config changes etc. 
 We only supply src distributions... 
 
 Does this principle apply to Nutch 2 as well? I mean, what if you're using the 
 gora-sql dependency, then you wish to switch to HBase and recompile, is this 
 possible within the binary distribution?
 
 Best
 
 Lewis
 
 
 On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 Ferdy
 
 The Nutch job jar is not present in the binary archive. This means 
 distributed running of jobs is not supported. I'm not sure if this is a 
 problem (since users can always build one themselves), merely pointing it 
 out. The recently released 1.5 also lacks this job jar, so at least no 
 difference there.
 
 The binary distrib corresponds to runtime/local and as such should NOT have 
 the job file there. This is now the norm since 1.5
 
 Will try and do some testing of the RC
 
 Thanks
 
 Julien
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 
 -- 
 Lewis 
 
 
 
 
 
 -- 
 Lewis 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Mattmann, Chris A (388J)
Or just not ship a bin release at all. Src is the only thing we really VOTE on 
legally though bin is provided for convenience purposes. Will type more on this 
later...

Sent from my iPhone

On Jun 14, 2012, at 2:18 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

Hi Julien,

Do you suggest with the binary release that we simply open up all gora-* deps 
and ship it with every jar available?

Lewis

On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:
I disagree. You'd expect a binary release to work out of the box - which is not 
the case. Plus we'd have to spend more time explaining the workaround, 
answering the same questions over and over on the ML etc... Fixing this should 
not be a big deal (i.e. add the gora-x modules for the backends to the ivy deps 
file).

Julien


On 14 June 2012 20:27, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hey Guys,

I think the annoyance is probably something folks can live with as they have 
been
waiting for an official release of 2.x for years :)

My +1 to roll RC #2 with or without a solution to this and mark it as a TODO. 
release
early, release often :)

Cheers,
Chris

On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

 Aye this is no good at all. Depending on which backend you wish to use with 
 Gora, you will need to go and manually fetch the correct .jar's from maven 
 central.

 Does anyone else have either solution or a workaround before I push RC2 with 
 just src dists?

 Thanks

 Lewis

 On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
  We only supply src distributions...
  Does this principle apply to Nutch 2 as well?
 Maybe, yes.
 The situation with the current binary package is uncomfortable:
 I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running.

 2012/6/13 Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com
 Hi Guys,

 Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't 
 supply binary distributions of the code, this is because when using Gora a 
 user may wish/require to recompile the code to accommodate config changes etc. 
 We only supply src distributions...

 Does this principle apply to Nutch 2 as well? I mean, what if you're using the 
 gora-sql dependency, then you wish to switch to HBase and recompile, is this 
 possible within the binary distribution?

 Best

 Lewis


 On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 Ferdy

 The Nutch job jar is not present in the binary archive. This means 
 distributed running of jobs is not supported. I'm not sure if this is a 
 problem (since users can always build one themselves), merely pointing it 
 out. The recently released 1.5 also lacks this job jar, so at least no 
 difference there.

 The binary distrib corresponds to runtime/local and as such should NOT have 
 the job file there. This is now the norm since 1.5

 Will try and do some testing of the RC

 Thanks

 Julien



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 Lewis





 --
 Lewis



++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble




--
Lewis



Re: Suitable Nutch 2.0 Project Description

2012-06-13 Thread Mattmann, Chris A (388J)
+1 to the description w/o experimental too (I agree with Ferdy).

You guys ROCK.

Cheers,
Chris

On Jun 13, 2012, at 5:29 AM, Lewis John Mcgibbney wrote:

 Hi,
 
 Seeing as we have the ball rolling with the 2.0 RC. I thought I'd ask
 about a suitable project descriptor.
 
 So far on trunk we have
 
 ** Apache Nutch is an open source web-search software project.
 Stemming from Apache Lucene, it now builds on Apache Solr adding
 web-specifics, such as a crawler, a link-graph database and parsing
 support handled by Apache Tika for HTML and an array of other document
 formats.
 
 This is merely a pot shot, but I was thinking for Nutch 2.0, something like
 
 ** Apache Nutch 2.X is an experimental branch of the Apache Nutch open
 source web-search software project. It builds on Apache Gora for data
 persistence and Apache Solr for indexing adding web-specifics, such as
 a crawler, a link-graph database and parsing support handled by Apache
 Tika for HTML and an array of other document formats.
 
 Although there are not many changes here I just wanted to run it by
 you folks...?
 
 Thanks
 Lewis
 
 -- 
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Mattmann, Chris A (388J)
Hey Lewis,

I will get to this tonight, for sure.

Thanks!

Cheers,
Chris

On Jun 12, 2012, at 1:16 PM, Lewis John Mcgibbney wrote:

 Hi Everyone,
 
 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.
 
 Thanks very much in advance
 
 Best
 
 Lewis
 
 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,
 
 A candidate for the Apache Nutch 2.0 RC1 is available at:
 
 http://people.apache.org/~lewismc/nutch-2.0
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1
 
 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-215
 
 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...
 
 Many Thanks and here's to plenty more.
 
 Have a great weekend, Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 
 
 -- 
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Mattmann, Chris A (388J)
Hey Guys,

#2 is probably reason enough for a respin. 

Lewis if you don't have time to do it before Thursday, I could probably
give it a whack. Let me know.

Cheers,
Chris

On Jun 12, 2012, at 3:33 PM, Sebastian Nagel wrote:

 Hi Lewis,
 
 my first steps with 2.0 (to be continued, still struggling).
 
 Two points (I'll try to give a final vote tomorrow):
 
 1 some guidance would be nice. README.txt points
 to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x
 (I'm using 
 http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html)
 
 2 the package contains your nutch-site.xml:
   <name>http.agent.email</name>
   <value>lewi...@apache.org</value>
 I guess that's not intended :)
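
A check along these lines could catch point 2 before an archive is rolled. This is a rough sketch, assuming the property name sits on a single line as <name>...</name> in the config file; the function name has_property is hypothetical and not part of the Nutch build.

```shell
# Hypothetical pre-release check: succeed if the given config file
# defines a property with the given name.
has_property() {
  # -F treats the pattern as a fixed string, so the dots in the name are literal.
  grep -qF "<name>$2</name>" "$1"
}

# Possible use before packaging (the path is an assumption):
#   if has_property conf/nutch-site.xml http.agent.email; then
#     echo "conf/nutch-site.xml still carries a personal http.agent.email" >&2
#   fi
```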
 
 Cheers,
 Sebastian
 
 On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote:
 Hi Everyone,
 
 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.
 
 Thanks very much in advance
 
 Best
 
 Lewis
 
 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,
 
 A candidate for the Apache Nutch 2.0 RC1 is available at:
 
 http://people.apache.org/~lewismc/nutch-2.0
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1
 
 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-215
 
 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
 [ ] +1 Release this package as Apache Nutch 2.0
 [ ] -1 Do not release this package because...
 
 Many Thanks and here's to plenty more.
 
 Have a great weekend, Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 
 
 





Re: [VOTE] Apache Nutch 1.5 release-1.5RC4

2012-06-01 Thread Mattmann, Chris A (388J)
Hey Lewis,

+1 from me!

SIGS check out:

[chipotle:nutch-dev/1.5-release/rc4] mattmann% ls
apache-nutch-1.5-bin.tar.gz  apache-nutch-1.5-bin.zip 
apache-nutch-1.5-src.tar.gz  apache-nutch-1.5-src.zip
apache-nutch-1.5-bin.tar.gz.asc  apache-nutch-1.5-bin.zip.asc 
apache-nutch-1.5-src.tar.gz.asc  apache-nutch-1.5-src.zip.asc
apache-nutch-1.5-bin.tar.gz.md5  apache-nutch-1.5-bin.zip.md5 
apache-nutch-1.5-src.tar.gz.md5  apache-nutch-1.5-src.zip.md5
apache-nutch-1.5-bin.tar.gz.sha  apache-nutch-1.5-bin.zip.sha 
apache-nutch-1.5-src.tar.gz.sha  apache-nutch-1.5-src.zip.sha
[chipotle:nutch-dev/1.5-release/rc4] mattmann% $HOME/bin/verify_gpg_sigs 
Verifying Signature for file apache-nutch-1.5-bin.tar.gz.asc
gpg: Signature made Thu May 31 13:24:55 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
Verifying Signature for file apache-nutch-1.5-bin.zip.asc
gpg: Signature made Thu May 31 13:25:57 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
Verifying Signature for file apache-nutch-1.5-src.tar.gz.asc
gpg: Signature made Thu May 31 13:25:34 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
Verifying Signature for file apache-nutch-1.5-src.zip.asc
gpg: Signature made Thu May 31 13:26:15 2012 PDT using RSA key ID C601BCA7
gpg: Good signature from Lewis John McGibbney (CODE SIGNING KEY) 
lewi...@apache.org
gpg: WARNING: This key is not certified with a trusted signature!
gpg:  There is no indication that the signature belongs to the owner.
Primary key fingerprint: 2A23 D53F 8D27 5CB6 91E1  89C1 F45E 7970 C601 BCA7
[chipotle:nutch-dev/1.5-release/rc4] mattmann% 

Checksums check out:

[chipotle:nutch-dev/1.5-release/rc4] mattmann% $HOME/bin/verify_md5_checksums 
md5sum: stat '*.bz2': No such file or directory
apache-nutch-1.5-bin.tar.gz: OK
apache-nutch-1.5-src.tar.gz: OK
apache-nutch-1.5-bin.zip: OK
apache-nutch-1.5-src.zip: OK
[chipotle:nutch-dev/1.5-release/rc4] mattmann% 
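
Editor's sketch: the verify_md5_checksums helper run above is Chris's own script and its contents are not shown in this thread. A minimal Python equivalent might look like the following, assuming each .md5 file carries the expected hex digest as its first whitespace-separated token. The function names are hypothetical and only illustrate the check.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_sidecars(directory):
    """Check every <artifact>.md5 file in directory against its artifact.

    Assumption: the first whitespace-separated token of the .md5 file is
    the expected hex digest. Returns {artifact name: True or False}.
    """
    results = {}
    for sidecar in sorted(Path(directory).glob("*.md5")):
        artifact = sidecar.with_suffix("")  # drop only the trailing ".md5"
        expected = sidecar.read_text().split()[0].lower()
        results[artifact.name] = artifact.is_file() and md5_of(artifact) == expected
    return results
```

Calling verify_md5_sidecars(".") in the directory holding the downloaded artifacts would return one True/False entry per archive. The .sha files could be handled the same way with the matching hashlib constructor, once the SHA variant used for the release is known.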

Built source. All good!

runtime:
[mkdir] Created dir: 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime
[mkdir] Created dir: 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local
[mkdir] Created dir: 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/deploy
 [copy] Copying 1 file to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/deploy
 [copy] Copying 1 file to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/deploy/bin
 [copy] Copying 1 file to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/lib
 [copy] Copying 1 file to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/lib/native
 [copy] Copying 21 files to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/conf
 [copy] Copying 1 file to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/bin
 [copy] Copying 48 files to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/lib
 [copy] Copying 123 files to 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/plugins
 [copy] Copied 2 empty directories to 2 empty directories under 
/Users/mattmann/Desktop/Apache/nutch-dev/1.5-release/rc4/apache-nutch-1.5/runtime/local/test

BUILD SUCCESSFUL
Total time: 2 minutes 17 seconds
[chipotle:1.5-release/rc4/apache-nutch-1.5] mattmann% 

Minor nit: the source package unzips into the current directory, as opposed to 
the prior practice of having it unzip into an apache-nutch-X.Y folder. No biggie 
though. Thanks for stepping up and rocking the release process!

Cheers,
Chris

On May 31, 2012, at 1:37 PM, Lewis John Mcgibbney wrote:

 Good Evening Everyone,
 
 A candidate for the Apache Nutch 1.5 RC4 is available at:
 
 http://people.apache.org/~lewismc/apache-nutch-1.5-rc4/
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 

Re: [VOTE] Apache Nutch release 1.5 RC3

2012-05-31 Thread Mattmann, Chris A (388J)
Hey Guys,

Does this warrant a respin, or are you +1 Juls?

Cheers,
Chris

On May 31, 2012, at 1:44 AM, Julien Nioche wrote:

 Hi Lewis,
 
 Minor nitpick : the directory /runtime is not necessary as it is built with 
 ANT. Removing it would massively reduce the size of the archive. Could we fix 
 it for the final release?
 
 All fine apart from this. The content of the src archive compiles fine, the 
 pom on the Maven repo looks good.
 
 Thanks a lot 
 
 Julien
 
 
 On 30 May 2012 21:59, lewis john mcgibbney lewi...@apache.org wrote:
 Good Evening Everyone,
 
 A candidate for the Apache Nutch 1.5 RC3 is available at:
 
 http://people.apache.org/~lewismc/apache-nutch-1.5-rc3/
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-1.5-rc3/
 
 Further, a staged Maven repository of the 1.5 sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-167/
 
 Please vote on releasing this package as Apache Nutch 1.5.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 1.5
  [ ] -1 Do not release this package because...
 
 Many Thanks and here's to plenty more.
 
 Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 





Re: [VOTE] Apache Nutch release 1.5 RC3

2012-05-31 Thread Mattmann, Chris A (388J)
okey dokey. 

I will try and take the time to review the RC today. Thanks for pushing
this Lewis!

Cheers,
Chris

On May 31, 2012, at 7:36 AM, Julien Nioche wrote:

 Hi, 
 
 Depends on Lewis :-) Let's say I am +1 but if it is not too much hassle it 
 would be nice to fix it 
 
 J.
 
 On 31 May 2012 15:24, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Guys,
 
 Does this warrant a respin, or are you +1 Juls?
 
 Cheers,
 Chris
 
 On May 31, 2012, at 1:44 AM, Julien Nioche wrote:
 
  Hi Lewis,
 
  Minor nitpick : the directory /runtime is not necessary as it is built with 
  ANT. Removing it would massively reduce the size of the archive. Could we 
  fix it for the final release?
 
  All fine apart from this. The content of the src archive compiles fine, the 
  pom on the Maven repo looks good.
 
  Thanks a lot
 
  Julien
 
 
  On 30 May 2012 21:59, lewis john mcgibbney lewi...@apache.org wrote:
  Good Evening Everyone,
 
  A candidate for the Apache Nutch 1.5 RC3 is available at:
 
  http://people.apache.org/~lewismc/apache-nutch-1.5-rc3/
 
  The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
  archive of the sources in:
 
  http://svn.apache.org/repos/asf/nutch/tags/release-1.5-rc3/
 
  Further, a staged Maven repository of the 1.5 sources.jar and
  javadoc.jar is available here:
 
  https://repository.apache.org/content/repositories/orgapachenutch-167/
 
  Please vote on releasing this package as Apache Nutch 1.5.
  The vote is open for the next 72 hours and passes if a majority of at
  least three +1 Nutch PMC votes are cast.
 
   [ ] +1 Release this package as Apache Nutch 1.5
   [ ] -1 Do not release this package because...
 
  Many Thanks and here's to plenty more.
 
  Kind Regards,
  Lewis
 
  P.S. Here's my +1.
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 





Re: [VOTE] Apache Nutch release 1.5 RC3

2012-05-31 Thread Mattmann, Chris A (388J)
Hey Lewis,

Actually if the bits change, in the past, I've been pushed to generate a new
RC (as the SIG files, checksum, etc. will change too).

My +1 for a new RC to accommodate that. If you don't have time today
I would be happy to help.

Cheers,
Chris (who now has more time *grin*)

On May 31, 2012, at 8:42 AM, Lewis John Mcgibbney wrote:

 If I were to change the artifacts to accommodate the removal of the
 runtime dir I don't think it would require a completely new RC.
 
 I am happy to generate the same sources via the tag, sign, then push
 them pending the VOTE result.
 
 Does this comply with release policy?
 
 Thanks
 
 Lewis
 
 On Thu, May 31, 2012 at 3:49 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 okey dokey.
 
 I will try and take the time to review the RC today. Thanks for pushing
 this Lewis!
 
 Cheers,
 Chris
 
 On May 31, 2012, at 7:36 AM, Julien Nioche wrote:
 
 Hi,
 
 Depends on Lewis :-) Let's say I am +1 but if it is not too much hassle it 
 would be nice to fix it
 
 J.
 
 On 31 May 2012 15:24, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Guys,
 
 Does this warrant a respin, or are you +1 Juls?
 
 Cheers,
 Chris
 
 On May 31, 2012, at 1:44 AM, Julien Nioche wrote:
 
 Hi Lewis,
 
 Minor nitpick : the directory /runtime is not necessary as it is built 
 with ANT. Removing it would massively reduce the size of the archive. 
 Could we fix it for the final release?
 
 All fine apart from this. The content of the src archive compiles fine, 
 the pom on the Maven repo looks good.
 
 Thanks a lot
 
 Julien
 
 
 On 30 May 2012 21:59, lewis john mcgibbney lewi...@apache.org wrote:
 Good Evening Everyone,
 
 A candidate for the Apache Nutch 1.5 RC3 is available at:
 
 http://people.apache.org/~lewismc/apache-nutch-1.5-rc3/
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-1.5-rc3/
 
 Further, a staged Maven repository of the 1.5 sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-167/
 
 Please vote on releasing this package as Apache Nutch 1.5.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 1.5
  [ ] -1 Do not release this package because...
 
 Many Thanks and here's to plenty more.
 
 Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 
 
 --
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 
 
 
 
 --
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 
 
 
 
 -- 
 Lewis





Re: 1.5 RC2

2012-05-22 Thread Mattmann, Chris A (388J)
+1, happy for Lewis to try; I've been swamped!

Sent from my iPhone

On May 22, 2012, at 2:16 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

Hi Lewis,

I am sure that Chris will have no problem with you doing the RC2. Chris? It 
would be a good thing to have more than one person who knows how to do it 
anyway :-)
Note that to generate a fresh pom.xml you need to

  *   get maven-ant-tasks-2.1.3.jar and put it in the ivy dir
  *   ant -lib ivy deploy

The resulting pom.xml file should reflect the content of the main ivy.xml. I 
have committed some minor changes to the pom template in trunk; these will need 
to be copied to the 1.5 branch as well. We recently discussed a move to Maven; 
another option would be to manage the dependencies with the Maven Ant tasks, 
which would save us the hassle of having to keep the ivy.xml and pom.xml in 
sync. We'll see.
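
Editor's sketch: until the two files are generated from one source, a small check can flag where ivy.xml and pom.xml disagree on a dependency version. The Python below is only an illustration with hypothetical function names. It assumes ivy.xml declares dependencies as dependency elements with org, name and rev attributes, and that pom.xml uses the standard Maven 4.0.0 namespace with literal (not property-based) versions.

```python
import xml.etree.ElementTree as ET

# Standard namespace of a Maven 4.0.0 POM (assumed to be declared in pom.xml).
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def ivy_versions(ivy_xml):
    """Return {(org, name): rev} for each <dependency> in an ivy.xml string."""
    root = ET.fromstring(ivy_xml)
    return {
        (dep.get("org"), dep.get("name")): dep.get("rev")
        for dep in root.iter("dependency")
    }


def pom_versions(pom_xml):
    """Return {(groupId, artifactId): version} for each dependency in a pom.xml string."""
    root = ET.fromstring(pom_xml)
    versions = {}
    for dep in root.iterfind(".//m:dependencies/m:dependency", POM_NS):
        key = (
            dep.findtext("m:groupId", namespaces=POM_NS),
            dep.findtext("m:artifactId", namespaces=POM_NS),
        )
        versions[key] = dep.findtext("m:version", namespaces=POM_NS)
    return versions


def version_mismatches(ivy_xml, pom_xml):
    """List (coordinate, ivy_rev, pom_version) where the two files disagree.

    A dependency that is missing from the POM shows up with pom_version None.
    """
    pom = pom_versions(pom_xml)
    return sorted(
        (key, rev, pom.get(key))
        for key, rev in ivy_versions(ivy_xml).items()
        if pom.get(key) != rev
    )
```

An empty list from version_mismatches(open("ivy/ivy.xml").read(), open("pom.xml").read()) would suggest the generated POM reflects the Ivy file; the two paths here are placeholders for wherever the files live in a checkout.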

Thanks

Julien


--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble



Re: 1.5 RC2

2012-05-22 Thread Mattmann, Chris A (388J)
+1

Sent from my iPhone

On May 22, 2012, at 4:43 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

Hi,

As I say, I am able to stick time in tonight to roll this RC, however does 
anyone have a problem with me rolling the 2.0 RC tonight after the 1.5RC2?

I would like to get them out the way saving me time during this week if 
possible.

Thanks

Lewis

On Tue, May 22, 2012 at 10:35 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:
OK doke this sounds fine to me then. I will make the relevant commits to the 
1.5 branch then work at it later this evening.

I'll make a new thread when the stuff is sorted out and we are ready to VOTE on 
the new RC.

Thanks

Lewis


On Tue, May 22, 2012 at 10:15 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:
Hi Lewis,

I am sure that Chris will have no problem with you doing the RC2. Chris? It 
would be a good thing to have more than one person who knows how to do it 
anyway :-)
Note that to generate a fresh pom.xml you need to

  *   get maven-ant-tasks-2.1.3.jar and put it in the ivy dir
  *   ant -lib ivy deploy

The resulting pom.xml file should reflect the content of the main ivy.xml. I 
have committed some minor changes to the pom template in trunk; these will need 
to be copied to the 1.5 branch as well. We recently discussed a move to Maven; 
another option would be to manage the dependencies with the Maven Ant tasks, 
which would save us the hassle of having to keep the ivy.xml and pom.xml in 
sync. We'll see.

Thanks

Julien


--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble




--
Lewis




--
Lewis



Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-05-09 Thread Mattmann, Chris A (388J)
Hey Julien,

On May 9, 2012, at 3:11 AM, Julien Nioche wrote:

 Hi Chris
 
 Any chance you could do a RC2 for the trunk soonish? We've been a bit stuck 
 since mid April and it would be nice to move on. If not I can try and spin a 
 RC myself but it is likely to be hilarious :-)

Haha, no worries. I will try and get one going for this weekend. And I'm sure 
you'd do fine! :)

 
 Re-Maven : I am not against moving to Maven at all : it would make it easier 
 to publish the artefacts + nice integration with Eclipse + most devs familiar 
 with it etc... not sure about the best way to deal with the plugins though - 
 treat them as modules? any thoughts on this?

Yeah this is something I would definitely like to explore for 1.6+ -- I think 
we could just do Maven pom.xml files for each plugin and then do a 
multi-aggregator core project that built core first, then all the plugins 
post facto. 

I will file an issue to explore this for 1.6.

Thanks!

Cheers,
Chris




Re: Suitable naming for Nutchgora branch?

2012-04-25 Thread Mattmann, Chris A (388J)
Great work Lewis, thanks!

Cheers,
Chris

On Apr 25, 2012, at 4:01 PM, Lewis John Mcgibbney wrote:

 Hi Everyone,
 
 As you guys will have seen I've quickly polluted our dev list again 
 (sorry!!!) with set and classify for 2.1.
 
 The open issues for 2.0 are ones which I think we could address within the 
 2.0 release. This is merely my opinion, based upon the assertion that they 
 all contain patches which could be up for review. With the exception of 
 NUTCH-879 which is pretty alarming. I'll test shortly.
 
 I'm now away to bed.
 
 Best
 
 Lewis
 
 On Wed, Apr 25, 2012 at 3:06 PM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Guys,
 
 





Re: NUTCH-1129

2012-04-17 Thread Mattmann, Chris A (388J)
Hey Lewis,

On Apr 17, 2012, at 3:35 AM, Lewis John Mcgibbney wrote:

 3) We previously discussed implementing the Any23 parser plugin as a tika 
 wrapper, therefore it would look very similar to parse-tika?

I think it would be super awesome to add the Any23 parsing functionality as a 
Tika parser, and potentially an extension to the MIME repository to detect 
microformats, etc. Then in Nutch, we could take advantage of the any23 parser 
with the existing tika-parser interface.

Thoughts?

Thanks!

Cheers,
Chris




Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-16 Thread Mattmann, Chris A (388J)
Hi Julien,

On Apr 16, 2012, at 2:02 AM, Julien Nioche wrote:

 Thanks Chris, 
 
 -1 the versions of the deps for hadoop, tika and possibly others are not 
 correct in the pom.xml found in the src archive and on the mvn repository, 
 which will be a problem for whoever tries to use the pom.xml file e.g. in 
 Eclipse or more annoyingly declare Nutch as a dependency with Ivy / Maven. 
 Did you regenerate the pom file from the ivy one?

I didn't regenerate it -- but will try and do so for RC #2.

 
 I remember that we mentioned delivering the content of runtime/local in the 
 binary archive instead of having the sources + runtime/deploy as well. 
[..snip...]
  I don't think it would take much time to do that, so what about doing it 
 now? We could rename the archive into apache-nutch-1.5-local-bin maybe to 
 make the content clearer.

+1 to the above, but I think we can just have it be apache-nutch-1.5-bin -- no 
need to rename it to local. We can just
reference this ML thread for documentation in the future.

I'll include the above 2 things when I re-roll an RC #2 hopefully in the next 
few days.

Cheers,
Chris




Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-16 Thread Mattmann, Chris A (388J)
Hey Sami,

Thanks. I'll fix the 4 license headers you mention below as part of RC #2.

Cheers,
Chris

On Apr 16, 2012, at 3:02 AM, Sami Siren wrote:

 On Mon, Apr 16, 2012 at 8:43 AM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Folks,
 
 A candidate for the Nutch 1.5 release is available at:
 
  http://people.apache.org/~mattmann/apache-nutch-1.5/rc1/
 
 The release candidate is a zip and tar.gz archive of the sources in:
 
  http://svn.apache.org/repos/asf/nutch/tags/release-1.5/
 
 And a binary build suitable for deployment.
 
 A staged Maven repository is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-054/
 
 Please vote on releasing this package as Apache Nutch 1.5.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 1.5
  [ ] -1 Do not release this package because...
 
 
 The basics are good:
 md5 and sha1 checksums for apache-nutch-1.5-bin.tar.gz and
 apache-nutch-1.5-src.tar.gz  match
 ant clean test completes successfully for the source package
 completed a simple crawl with local mode and a small hadoop 1.0.2
 cluster by using the artifacts in the binary package
 
 but it seems there are some license headers missing from source files:
 [rat:report]  
 ==/home/sam/nutch/apache-nutch-1.5/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
 [rat:report]  
 ==/home/sam/nutch/apache-nutch-1.5/src/plugin/creativecommons/src/web/web.xml
 [rat:report]  
 ==/home/sam/nutch/apache-nutch-1.5/src/plugin/protocol-httpclient/src/test/conf/httpclient-auth-test.xml
 [rat:report]  
 ==/home/sam/nutch/apache-nutch-1.5/src/plugin/protocol-httpclient/src/test/conf/nutch-site-test.xml
 
 -1 because of missing license headers
 
 --
 Sami Siren





Re: [VOTE] Apache Nutch 1.5 release rc #1

2012-04-16 Thread Mattmann, Chris A (388J)
Hey Lewis,

Hmm, not sure on the MD5 and SHA -- they seem to validate for me
and seemed to work for at least Sami (and Markus?). Guys, any idea what's
up with Lewis's verification step here? 

Lewis, you may try re-downloading and verifying them again, but wait
until RC #2 on that. I'll fix the NOTICE file for RC #2 as you mention below.
I'm not sure why the extension was .tar.gz.tar.gz; I'll fix that too.

Cheers,
Chris

On Apr 16, 2012, at 3:12 AM, Lewis John Mcgibbney wrote:

 Hi Chris,
 
 On Mon, Apr 16, 2012 at 6:43 AM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
 Hi Folks,
 
 A candidate for the Nutch 1.5 release is available at:
 
 http://people.apache.org/~mattmann/apache-nutch-1.5/rc1/
 
 
 I used the KEYS file stored on SVN under the 1.5 tag (as below), and got
 the following when verifying the above RC (stored on your p.a.o area)
 
 lewis@lewis-01:~/Desktop$ gpg --import KEYS
 gpg: key A7239D59: Doug Cutting (Lucene guy) cutt...@apache.org not
 changed
 gpg: key 7C491924: public key Piotr Kosiorowski pkosiorow...@apache.org
 imported
 gpg: key 0B7E6CFA: public key Sami Siren si...@apache.org imported
 gpg: key 57163A4D: public key Dennis E. Kubes ku...@apache.org imported
 gpg: key 24BCF054: public key Chris A. Mattmann mattm...@apache.org
 imported
 gpg: Total number processed: 5
 gpg:   imported: 4
 gpg:  unchanged: 1
 gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
 gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
 
 lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-bin.tar.tar.gz.asc
 gpg: no signed data
 gpg: can't hash datafile: file open error
 lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-bin.zip.asc
 gpg: Signature made Mon 16 Apr 2012 06:00:20 BST using DSA key ID B876884A
 gpg: Can't check signature: public key not found
 lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-src.tar.gz.asc
 gpg: Signature made Mon 16 Apr 2012 06:00:18 BST using DSA key ID B876884A
 gpg: Can't check signature: public key not found
 lewis@lewis-01:~/Desktop$ gpg --verify apache-nutch-1.5-src.zip.asc
 gpg: Signature made Mon 16 Apr 2012 06:00:22 BST using DSA key ID B876884A
 gpg: Can't check signature: public key not found
 lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-bin.tar.tar.gz.asc
 e32088205efd59ffc882c79add0bafae  apache-nutch-1.5-bin.tar.tar.gz.asc
 lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-bin.zip.asc
 ff7960b8540673a86756f6b3f53ffd79  apache-nutch-1.5-bin.zip.asc
 lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-src.tar.gz.asc
 9da161bcd5ec0de3f702a12e6bfbf9e6  apache-nutch-1.5-src.tar.gz.asc
 lewis@lewis-01:~/Desktop$ md5sum apache-nutch-1.5-src.zip.asc
 6750bbc93b028776fa888f988df3a614  apache-nutch-1.5-src.zip.asc
 
 Some comments:
 1) I don't think the tar should be appended twice for the
 apache-nutch-1.5-bin.tar.tar.gz artefact and accompanying sigs.
 2) None of my other attempts to verify the other artefacts via gpg worked!
 3) All attempts to verify via md5sum did not match the strings present in
 your p.a.o area!
 4) Really really trivial, but in our NOTICE file, it stated a date of 2009.
 I should have picked this up a while ago when I updated the other dates in
 these files, this one seems to have slipped through the net.
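
Editor's sketch: two details of the transcript above may explain part of the failure. The md5sum commands were run on the .asc signature files rather than on the archives, so those digests would not be expected to match the published .md5 values. And when gpg --verify is given only a detached signature, it looks for the signed data in a file of the same name without the .asc suffix, which would not exist if only the oddly named .tar.tar.gz.asc file had been saved. The Python below, with hypothetical function names, maps each signature or checksum file to the artifact it belongs to and reports the ones whose artifact is missing from a download directory.

```python
# Suffixes of the sidecar files published next to each release archive.
SIDECAR_SUFFIXES = (".asc", ".md5", ".sha")


def artifact_for(sidecar_name):
    """Map a signature or checksum file name to the artifact it describes."""
    for suffix in SIDECAR_SUFFIXES:
        if sidecar_name.endswith(suffix):
            return sidecar_name[: -len(suffix)]
    raise ValueError(f"not a signature or checksum file: {sidecar_name}")


def missing_artifacts(file_names):
    """Return the sidecar files whose artifact is absent from file_names."""
    present = set(file_names)
    return sorted(
        name
        for name in present
        if name.endswith(SIDECAR_SUFFIXES) and artifact_for(name) not in present
    )
```

Feeding it a directory listing (for example from os.listdir) gives the sidecar files that cannot be verified as downloaded; the digest itself should then be computed over artifact_for(name), not over the sidecar file.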
 
 
 The release candidate is a zip and tar.gz archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-1.5/
 
 
 Stuff in SVN tag looks OK apart from the stuff I mentioned above.
 
 
 
 And a binary build suitable for deployment.
 
 A staged Maven repository is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-054/
 
 
 I've not got around to checking the gpg and md5sum verifications yet, as
 I'm waiting for someone to confirm that the above failed verifications are
 correct before I do so. I'm hoping that I've made a mistake somewhere.
 
 
 
 [X ] -1 Do not release this package because...
 
 Because of the above, unless I discover that I've done something wrong
 then I can't VOTE yes. I'm open to discussion on this, if someone can
 display that I've taken a wrong turn somewhere then I might change my VOTE
 however for the time being I need to call this one down.
 
 Thanks for spinning the RC Chris.
 
 Lewis





[VOTE] Apache Nutch 1.5 release rc #1

2012-04-15 Thread Mattmann, Chris A (388J)
Hi Folks,

A candidate for the Nutch 1.5 release is available at:

  http://people.apache.org/~mattmann/apache-nutch-1.5/rc1/

The release candidate is a zip and tar.gz archive of the sources in:

  http://svn.apache.org/repos/asf/nutch/tags/release-1.5/

And a binary build suitable for deployment. 

A staged Maven repository is available here:

https://repository.apache.org/content/repositories/orgapachenutch-054/

Please vote on releasing this package as Apache Nutch 1.5.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 1.5
  [ ] -1 Do not release this package because...
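
Editor's sketch: the pass condition stated above can be written as a small predicate. This reads "a majority of at least three +1 Nutch PMC votes" as at least three +1 votes and more +1 than -1 votes, counting PMC votes only; the function and parameter names are hypothetical and this is not an official ASF tool.

```python
def vote_passes(plus_ones, minus_ones, minimum_plus_ones=3):
    """Return True if the counted PMC votes satisfy the rule quoted above."""
    return plus_ones >= minimum_plus_ones and plus_ones > minus_ones
```

For example, vote_passes(3, 0) is True, while vote_passes(2, 0) and vote_passes(3, 3) are both False.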

Thanks!

Cheers,
Chris

P.S. Here's my +1.




Re: Nutch 1.x trunk release

2012-04-10 Thread Mattmann, Chris A (388J)
Hey Julien,

Yeah my weekend flew by -- this and the SIS RC are the top items on my
opensource TODO :)

Hopefully this week...

Cheers,
Chris

On Apr 10, 2012, at 8:07 AM, Julien Nioche wrote:

 Hi guys, 
 
 Chris - any idea of if / when you'll have the time to do a RC for trunk?
 
 Thanks
 
 Julien
 
 On 3 April 2012 15:30, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Thanks Lewis!
 
 Cheers,
 Chris
 
 P.S. Hopefully by this weekend...
 
 On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:
 
  Hi,
 
  On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io 
  wrote:
 
 
  Seems fine. Only updating KEYS is no longer necessary.
 
  Now sorted.
 
  Thanks whenever you can get round to this Chris.
 
  Best
 
  Lewis
 
 
 
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 





Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Mattmann, Chris A (388J)
Hi Markus,

On Apr 3, 2012, at 5:50 AM, Markus Jelsma wrote:

 Cool! 
 
 Next time I'll ask infra to let us suppress notifications.
 
 Chris, will you RM one RC? And if possible, list the detailed steps/commands in 
 the process in case you don't have time to RM 1.6 when the time comes. The 
 wiki is dated.

Happy to RM it. 

Check the wiki here:

http://wiki.apache.org/nutch/Release_HOWTO

Lewis and I updated this after the last release. It's more or less what's
required to release the project and what I run. It's also really similar to
the OODT release process:

https://cwiki.apache.org/confluence/display/OODT/Release+Process

Was there something specific that you weren't seeing there?

 
 I'm looking forward to yet another big release with lots of fixes and 
 improvements!

Agreed, thanks everyone!

Cheers,
Chris




Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Mattmann, Chris A (388J)
Thanks Lewis!

Cheers,
Chris

P.S. Hopefully by this weekend...

On Apr 3, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:

 Hi,
 
 On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io 
 wrote:
 
 
 Seems fine. Only updating KEYS is no longer necessary.
 
 Now sorted.
 
 Thanks whenever you can get round to this Chris.
 
 Best
 
 Lewis





NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Mattmann, Chris A (388J)
Hey Guys,

I've got some cycles this weekend -- anyone up for a 1.5 release off trunk
(stable), and a NutchGora branch release? I suggested this before [1]
regarding NutchGora.
I'm inclined to say let's do the following:

1. NutchGora: apache-nutch-2.0 - release 2.x series based on this branch
2. Nutch: apache-nutch-1.x - stable trunk branch

Then, when the time comes, we can try and create a:

3. Nutch: apache-nutch-3.x - merge of 1.x and 2.x feature branches

Would this make sense? Anyways we don't have to decide anything now that
we can't undo later, but are folks OK with me doing an RC for NutchGora and for
1.x this weekend?

Cheers,
Chris

[1] http://s.apache.org/GD2




Re: NutchGora release, and Nutch 1.x trunk release

2012-03-08 Thread Mattmann, Chris A (388J)
Hey Guys,

OK, sounds good. Looks like we need to wait for the Tika 1.1 release (seems to
be going well so far), and then try and push Gora 0.2 (which I know Lewis is
pushing, and which I'm happy to RM once we're ready there). So, maybe I'll
shoot for next weekend or the weekend after to push Nutch 1.5 and 2.0 RCs.

Cheers,
Chris

On Mar 8, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:

 Yeah I agree Chris & Markus.
 
 On the Nutchgora note, I would like to see Gora 0.2 released beforehand, as 
 we have a blocking issue, NUTCH-1205, with Ivy retrieving alien Gora 
 0.2-SNAPSHOT dependencies from repository.apache.org. We should be able to 
 overcome this issue by releasing Gora 0.2 to Maven Central and then just pulling 
 those dependencies with Ivy in Nutchgora, rather than messing about with 
 chain/multiple/snapshot resolvers in the Ivy configuration.
 
 My 2 cents
 
 On Thu, Mar 8, 2012 at 3:03 PM, Markus Jelsma markus.jel...@openindex.io 
 wrote:
 +1
 
 1.5 has, again, many fixes and improvements, just as 1.4 had over 1.3. But i'd
 like to integrate Tika 1.1 after its pending release.
 
 Cheers
 
 On Thursday 08 March 2012 15:38:15 Mattmann, Chris A (388J) wrote:
  Hey Guys,
 
  I've got some cycles this weekend -- anyone up for a 1.5 release off trunk
  (stable), and a NutchGora branch release? I suggested this before [1]
  regarding NutchGora. I'm inclined to say let's do the following:
 
  1. NutchGora: apache-nutch-2.0 - release 2.x series based on this branch
  2. Nutch: apache-nutch-1.x - stable trunk branch
 
  Then, when the time comes, we can try and create a:
 
  3. Nutch: apache-nutch-3.x - merge of 1.x and 2.x feature branches
 
  Would this make sense? Anyways we don't have to decide anything now that
  we can't undo later, but are folks OK with me doing an RC for NutchGora and
  for 1.x this weekend?
 
  Cheers,
  Chris
 
  [1] http://s.apache.org/GD2
 
 
 --
 Markus Jelsma - CTO - Openindex
 
 
 
 -- 
 Lewis 
 





Fwd: Google Summer of Code 2012 upcoming

2012-03-04 Thread Mattmann, Chris A (388J)
Guys, FYI...in case anyone is thinking of GSoC, deadlines are approaching.
The process is described below...

Thanks!

Cheers,
Chris

Begin forwarded message:

 From: Ulrich Stärk u...@apache.org
 Date: March 4, 2012 9:01:07 AM PST
 To: p...@apache.org p...@apache.org
 Cc: d...@community.apache.org d...@community.apache.org
 Subject: Google Summer of Code 2012 upcoming
 Reply-To: priv...@hadoop.apache.org priv...@hadoop.apache.org
 
 Hello PMCs,
 
 Google Summer of Code is the ideal opportunity for you to attract new
 contributors to your projects.
 
 If you want to participate with your project you NOW need to
 
 - understand what it means to be a mentor [1]
 - propose your project ideas. Just label your issues with gsoc2012 in JIRA and
  they will show up at [2]. See also [1].
 - subscribe to code-awa...@apache.org (restricted to potential mentors, meant
  to be used as a private list - general discussions on the public
  d...@community.apache.org list as much as possible please)
 
 The ASF will apply as a participating organization with GSoC, your project
 doesn't need to do that. See [3] for more information. Note that the ASF isn't
 accepted yet, nevertheless you *really* should start recording your ideas now.
 
 Last year we had 38 students completing GSoC successfully, some of which are
 now active contributors to the projects they worked on. Let's make this a
 success again this year!
 
 On behalf of the GSoC 2012 admins,
 
 Uli
 
 [1] http://community.apache.org/guide-to-being-a-mentor.html
 [2] http://s.apache.org/gsoc2012tasks
 [3] http://community.apache.org/gsoc.html
 





Fwd: [blog post] Accumulo, Nutch, and Gora

2012-02-28 Thread Mattmann, Chris A (388J)
FYI...awesome!

Begin forwarded message:

 From: Jason Trost jason.tr...@gmail.com
 Date: February 28, 2012 5:41:23 PM PST
 To: common-u...@hadoop.apache.org common-u...@hadoop.apache.org
 Subject: [blog post] Accumulo, Nutch, and Gora
 Reply-To: common-u...@hadoop.apache.org common-u...@hadoop.apache.org
 
 Blog post for anyone who's interested.  I cover a basic howto for
 getting Nutch to use Apache Gora to store web crawl data in Accumulo.
 
 Let me know if you have any questions.
 
 Accumulo, Nutch, and GORA
 http://www.covert.io/post/18414889381/accumulo-nutch-and-gora
 
 --Jason





Re: [DISCUSS] Nutchgora 2.0 release

2012-02-20 Thread Mattmann, Chris A (388J)
+1 guys. Just let me know when you are ready and I can RM it.

Cheers,
Chris

On Feb 20, 2012, at 8:01 AM, Lewis John Mcgibbney wrote:

 Hi,
 
 Not ignoring Chris' comments, but addressing the points below first, please 
 see comments.
 
 On Mon, Feb 20, 2012 at 2:57 PM, Ferdy Galema ferdy.gal...@kalooga.com 
 wrote:
 Aside from the licensing issue, the only thing I really see as a blocker or 
 as something we need to deal with first is Nutch-1205 (upgrade Gora libs). 
 What are we going to do with that one? 
 I'm going to have another crack with these Ivy resolvers, really quite hard 
 to debug. I can only assume the unresolved dependencies are picked up 
 somewhere upstream! As I said I'm going to try and crack this one maybe today 
 if I get the time.
  
 
 About the Nutch API (webapp), my colleague and I have some ideas about how to 
 improve it in such a way that it is really easy to use. It definitely won't 
 be ready in an upcoming release, especially when there will be a release very 
 soon. Please see the issue[1] for details. I'm not sure what to do with the 
 current webapp implementation, but my suggestion is to just leave it as 
 it is. (Perhaps mark it as a work-in-progress.)
 
 This sounds really encouraging. Somewhere in my crazy pot of thoughts was to 
 progress with establishing this task as a GSoC project. In reflection, I 
 think it would be excellent if the work could be dev/user community driven as 
 it would cater exactly for what we need and want.
 
 Please see here for the most up-to-date work I could get in this stuff. I 
 updated it slightly to reflect some recent findings. I'll report back when I 
 get more time on the blocker you mention above.
 
 http://wiki.apache.org/nutch/NutchAdministrationUserInterface





Re: [DISCUSS] Nutchgora 2.0 release

2012-02-18 Thread Mattmann, Chris A (388J)
Hey Lewis,

I'd be +1 to roll a Nutchgora 2.0 release.

I could see dealing with this in two ways, neither of which I like better than
the other:

1. Release the nutchgora branch as apache-nutch-2.0, and then nutchgora becomes
the 2.0 branch of the system (and we could create branch-2.0). As the 1.x trunk
branch evolves and gets closer to 2.0, its last release is 1.9; then we do 3.0,
which could be either:
  - a merge or combination of 1.x features and 2.x features
  - simply the next path for 1.x, and independent of 2.x

2. Call the artifact apache-nutchgora-2.0, independent of the current trunk
artifact and its release cycle.

Either way is fine with me.

Cheers,
Chris

On Feb 17, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:

 Hi Guys,
 
 Here we are again :0)
 
 What are the perceptions with aiming for a 2.0 release? We have one blocking 
 issue, the webapp, which I got no response from the community at large about. 
 I would like to see this addressed but this is another issue.
 
 Speaking with the future in mind, we are hoping to get a Gora 0.2 release out 
 of the door, once a licensing issue is dealt with (the only blocker) and a 
 few other things. Therefore would it be realistic to aim for a Nutch 2.0 
 release shortly after that?
 
 My justification for raising this thread again, is that we are seeing (some) 
 more users interested in this branch/code, I think it is a real shame that we 
 have not been able to get a release yet. I would really like to get more 
 people using the code and hopefully getting involved in identifying bugs, and 
 fixing them if possible.
 
 The question has been open for ages, so I just wonder if anything has changed 
 now that Gora is doing better as of recent.
 
 Thanks
 
 Lewis
 
 -- 
 Lewis 
 





Fwd: [Announce] Google Summer of Code 2012

2012-02-05 Thread Mattmann, Chris A (388J)
Any Nutch Devs interested in a GSoC student?

Begin forwarded message:

 From: Luciano Resende luckbr1...@gmail.com
 Date: February 4, 2012 10:40:03 AM PST
 To: d...@community.apache.org d...@community.apache.org, code-awards 
 code-awa...@apache.org
 Subject: Fwd: [Announce] Google Summer of Code 2012
 Reply-To: d...@community.apache.org d...@community.apache.org
 
 -- Forwarded message --
 From: Carol Smith car...@google.com
 Date: Sat, Feb 4, 2012 at 8:44 AM
 Subject: [Announce] Google Summer of Code 2012
 To: Google Summer of Code Discuss
 google-summer-of-code-disc...@googlegroups.com
 
 
 Hi all,
 
 We're pleased to announce that Google Summer of Code will be happening
 for its eighth year this year. Please check out the blog post [1]
 about the program and read the FAQs [2] and Timeline [3] on Melange
 for more information.
 
 Please consider translating the presentations and/or flyers into your
 native language and submitting them directly to me to post on the
 wiki. Localization for our material is integral to reaching the widest
 possible audience around the world.
 
 [1] - 
 http://google-opensource.blogspot.com/2012/02/google-summer-of-code-2012-is-on.html
 [2] - 
 http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2012/faqs
 [3] - http://www.google-melange.com/gsoc/events/google/gsoc2012
 
 Cheers,
 Carol
 
 
 -- 
 Luciano Resende
 http://people.apache.org/~lresende
 http://twitter.com/lresende1975
 http://lresende.blogspot.com/





Fwd: [Announce] Google Summer of Code 2012

2012-02-05 Thread Mattmann, Chris A (388J)
FYI

Begin forwarded message:

 From: Ross Gardler rgard...@opendirective.com
 Date: February 5, 2012 1:45:18 PM PST
 To: d...@community.apache.org d...@community.apache.org
 Subject: RE: [Announce] Google Summer of Code 2012
 Reply-To: d...@community.apache.org d...@community.apache.org
 
 For those new to GSoC you might want to review the roles defined at
 http://community.apache.org/mentoringprogramme.html and the GSoC specific
 info at http://community.apache.org/gsoc.html (yet to be updated for 2012)
 
 Sent from my mobile device, please forgive errors and brevity.
 On Feb 5, 2012 8:31 PM, Franklin, Matthew B. mfrank...@mitre.org wrote:





Re: % of different content types out there on the web

2012-01-31 Thread Mattmann, Chris A (388J)
Hi Markus,

Thanks for the FYI. Any idea at specific %'s for those unwanted suffixes
compared to the size of the entire corpus?

Cheers,
Chris

On Jan 31, 2012, at 4:39 AM, Markus Jelsma wrote:

 We only crawl HTML and PDF files for a lot of cc-TLDs, so we only have data 
 on those two. However, we also explicitly filter out all/most unwanted suffixes. 
 We do have a lot of suffixes that we encountered so far.
 
 On Saturday 28 January 2012 03:01:26 Mattmann, Chris A (388J) wrote:
 (sorry for the cross post)
 
 Hey Guys,
 
 I'm trying to find a good citation or estimate (if anyone has done one)
 that estimates the breakout (by % or some other metric) of content types
 out there on the web (with a whole web crawl or a meaningful
 representative dataset) that are non-HTML.
 
 Anyone have any ideas about this?
 
 Thanks!
 
 Cheers,
 Chris
 
 
 -- 
 Markus Jelsma - CTO - Openindex





Re: [DISCUSS] Issues with Fetcher

2012-01-21 Thread Mattmann, Chris A (388J)
Hi Ken,

On Jan 21, 2012, at 10:33 AM, Ken Krugler wrote:
 
 My own personal favorite area would be to integrate with crawler-commons.

+1. Would you crawler-commons guys be interested in bringing that code to
Apache? How about bringing it over to Nutch? 

Would that be something you'd be interested in?

Cheers,
Chris




Re: [jira] [Commented] (NUTCH-1237) Improve javac arguements for more verbose output

2012-01-06 Thread Mattmann, Chris A (388J)
Yay, all I heard was that it's building again woo hoo!

On Jan 6, 2012, at 9:03 AM, Markus Jelsma wrote:

 Ah, I get 88 warnings now but things build fine. This is indeed quite a bit 
 more verbose :)
 
 On Tuesday 27 December 2011 17:28:31 Lewis John McGibbney (Commented) (JIRA) 
 wrote:
 [ https://issues.apache.org/jira/browse/NUTCH-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176220#comment-13176220 ]
 
 Lewis John McGibbney commented on NUTCH-1237:
 -
 
 If I can get a +1 I'll commit. Thank you
 
 Improve javac arguements for more verbose output
 -
 
 Key: NUTCH-1237
 URL: https://issues.apache.org/jira/browse/NUTCH-1237
 Project: Nutch
 Issue Type: Improvement
 Components: build
 Affects Versions: 1.4, nutchgora
 Reporter: Lewis John McGibbney
 Assignee: Lewis John McGibbney
 Fix For: nutchgora, 1.5
 Attachments: NUTCH-1237-nutchgora.patch, NUTCH-1237-trunk.patch
 
 When trying to fix another problem I stumbled across this one. I think it
 is important to ensure that the javac outputs info regarding deprecation
 and unchecked operations.
 
 --
 This message is automatically generated by JIRA.
 If you think it was sent incorrectly, please contact your JIRA
 administrators:
 https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
 For more information on JIRA, see: http://www.atlassian.com/software/jira
 
 -- 
 Markus Jelsma - CTO - Openindex





Re: Build failed in Jenkins: Nutch-trunk #1702

2011-12-25 Thread Mattmann, Chris A (388J)
Merry Christmas buddy!

Cheers,
Chris

On Dec 25, 2011, at 9:14 AM, Lewis John Mcgibbney wrote:

 Hi Guys,

 Our trunk builds have been broken since migrating to new Hadoop 0.20.2
 and migrating CrawlDBScanner to new MR API e.g. trunk build [1] 1698.
 Looking to the stack trace, I'm assuming that this has to do with how
 we are specifying the new file reads. Hopefully this shouldn't be too
 hard to solve so maybe we can get on to it at some stage in the near
 future.

 I just want to say Merry Christmas to EVERYONE celebrating and happy
 holidays to everyone else who may not be.

 Best

 Lewis

 [1] https://builds.apache.org/view/M-R/view/Nutch/job/Nutch-trunk/1698/
 On Sat, Dec 24, 2011 at 7:36 AM, Apache Jenkins Server
 jenk...@builds.apache.org wrote:
 See https://builds.apache.org/job/Nutch-trunk/1702/changes

 Changes:

 [markus] Updated pom to reflect Hadoop upgrade

 --
 [...truncated 2836 lines...]
 resolve-default:
 [ivy:resolve] :: loading settings :: file = 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

 compile:
[echo] Compiling plugin: urlfilter-validator
   [javac] 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117:
  warning: 'includeantruntime' was not set, defaulting to 
 build.sysclasspath=last; set to false for repeatable builds
   [javac] Compiling 1 source file to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-validator/classes

 jar:
 [jar] Building jar: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlfilter-validator/urlfilter-validator.jar

 deps-test:

 deploy:
[copy] Copying 1 file to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator

 copy-generated-lib:
[copy] Copying 1 file to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator

 init:
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/test
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta

 init-plugin:

 deps-jar:

 clean-lib:

 resolve-default:
 [ivy:resolve] :: loading settings :: file = 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

 compile:
[echo] Compiling plugin: urlmeta
   [javac] 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117:
  warning: 'includeantruntime' was not set, defaulting to 
 build.sysclasspath=last; set to false for repeatable builds
   [javac] Compiling 2 source files to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes

 jar:
 [jar] Building jar: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/urlmeta.jar

 deps-test:

 deploy:
[copy] Copying 1 file to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta

 copy-generated-lib:
[copy] Copying 1 file to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta

 init:
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/test
   [mkdir] Created dir: 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic

 init-plugin:

 deps-jar:

 clean-lib:

 resolve-default:
 [ivy:resolve] :: loading settings :: file = 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

 compile:
[echo] Compiling plugin: urlnormalizer-basic
   [javac] 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117:
  warning: 'includeantruntime' was not set, defaulting to 
 build.sysclasspath=last; set to false for repeatable builds
   [javac] Compiling 1 source file to 
 /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes

 jar:
 [jar] Building jar: 
 

Re: get rid of outlink code for Tika

2011-12-21 Thread Mattmann, Chris A (388J)
+1 from me -- those 3 Tika content handlers should take care of it...

Cheers,
Chris

On Dec 21, 2011, at 6:51 AM, Markus Jelsma wrote:

 Hi,
 
 For using Boilerpipe we need LinkCH, BoilerpipeCH and TeeCH in Tika. LinkCH 
 returns all URLs with some metadata such as title etc. Fixes for old parsers 
 such as Neko are then obsolete.
 
 I propose to rely on Tika for all outlinks. Right now this means not all types 
 are returned, such as area, form and whatever else. Is this a big problem? Rel 
 is also not returned, but I patched Tika to do that so we can still do 
 something with nofollow, which is important.
 
 Thanks
 
 -- 
 Markus Jelsma - CTO - Openindex





Re: Improving API Java Documentation

2011-12-12 Thread Mattmann, Chris A (388J)
Hi Lewis,

+1 from me to the update and to logging a JIRA issue. Always nice to see
an associated changelog entry for any (even non trivial) updates, short of 
typos and error corrections in docs/etc. Up to you though, since you're the one
doing the work :-)

Cheers,
Chris

On Dec 12, 2011, at 10:28 AM, Lewis John Mcgibbney wrote:

 Hi Guys,
 
 Been doing some snooping around the code recently and think that the
 API documentation [1] could do with some improving in some areas,
 please see corresponding Jira issue [2]. A real minor discrepancy I've
 encountered early on is that the ${name} variable is set in
 default.properties as ${name} and in build.xml as ${Name}, this means
 that it is not recognized within the Javadocs [1]. I propose to change
 this to ${name}, and to additionally add a Capital to the variable
 value therefore making it Nutch? Any thoughts? Does this require a
 Jira to be logged as well?
 
 Thanks
 
 [1] http://nutch.apache.org/apidocs-1.4/index.html
 [2] https://issues.apache.org/jira/browse/NUTCH-1218
 
 -- 
 Lewis





Re: Best way to get files out of segment directories

2011-11-30 Thread Mattmann, Chris A (388J)
Hey Lewis,

Makes total sense. I'll get a patch going this week.

Cheers,
Chris

On Nov 30, 2011, at 8:05 AM, Lewis John Mcgibbney wrote:

 Hi Chris,
 
 There is absolutely no doubt that this is of use, exactly for the issues 
 Markus highlights.
 I wonder if it is worth adding general options similar to that which are 
 offered by readseg [1]. This would mean that it would be possible to ignore 
 certain directories within a segments directory, therefore reducing overhead 
 on the SegmentContentDumper tool and possibly providing a more accurate 
 content dump. Does this make any sense?
 
 [1] http://wiki.apache.org/nutch/bin/nutch_readseg
 
 On Tue, Nov 29, 2011 at 8:01 AM, Markus Jelsma markus.jel...@openindex.io 
 wrote:
 Sounds useful indeed! Especially with the regex pattern. Reading files from 
 the fs is a lot faster than using readseg all the time.
 
 
 
 
 CTO - Openindex.io 
 Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote:
 
 OK, of course, I figured it out, and updated my program :-)
 
 You can see it on Github below. I'm going to clean up and 
 generalize this program because I think it's of general use.
 I'll create an issue shortly. 
 
 I'm thinking the tool could be something like:
 
 ./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
   -segmentRootDir   full file path to the root segment directory, e.g., crawl/segments
   -regexUrlPattern  a regex URL pattern to select URL keys to dump from the content DB
                     in each segment
   -outputDir        the output directory to write file names to
   -metadata         --key=value, where key is a Content Metadata key and value is a
                     value to check. If the URL and its content metadata have a matching
                     key,value pair, dump it. Allow for regex matching on the value.
 
 This would allow users to unravel the content hidden in segment directories 
 and in sequence files
 into useable files that were downloaded by Nutch.
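
The selection rule proposed for the tool can be sketched in plain Java. This is only an illustration under stated assumptions: the class name SegmentDumpFilter, the method shouldDump, and the use of a plain Map for the content metadata are hypothetical stand-ins, not the eventual Nutch implementation, and no Hadoop or Nutch classes are involved. It also assumes that the URL pattern and the metadata check must both pass when both are given, which is one possible reading of the proposal.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper: decides whether one fetched record should be dumped. */
public class SegmentDumpFilter {

    /**
     * @param url          the record key (the fetched URL)
     * @param metadata     content metadata of the record (plain map stand-in)
     * @param urlPattern   regex the whole URL must match, or null for no URL filter
     * @param metaKey      metadata key to check, or null for no metadata filter
     * @param valuePattern regex the metadata value must match, or null to only
     *                     require that the key is present
     */
    public static boolean shouldDump(String url, Map<String, String> metadata,
                                     Pattern urlPattern, String metaKey,
                                     Pattern valuePattern) {
        // URL filter: skip records whose URL does not match the pattern.
        if (urlPattern != null && !urlPattern.matcher(url).matches()) {
            return false;
        }
        // Metadata filter: the key must exist and its value must match.
        if (metaKey != null) {
            String value = metadata.get(metaKey);
            if (value == null) {
                return false;
            }
            if (valuePattern != null && !valuePattern.matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Pattern pdfUrls = Pattern.compile(".*\\.pdf");
        Pattern pdfType = Pattern.compile("application/pdf.*");
        Map<String, String> meta = Map.of("Content-Type", "application/pdf");

        // URL and metadata both match: prints true
        System.out.println(shouldDump("http://example.com/a.pdf", meta,
                pdfUrls, "Content-Type", pdfType));
        // URL does not match the pattern: prints false
        System.out.println(shouldDump("http://example.com/index.html", meta,
                pdfUrls, "Content-Type", pdfType));
    }
}
```

In a real tool this check would presumably run once per record while streaming through each segment's content data, so that only matching records are written to the output directory and nothing has to be held in memory at once.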
 
 Do you guys see this as a useful tool? If so, I'll contribute it this week 
 for 1.5.
 
 Cheers,
 Chris
 

Best way to get files out of segment directories

2011-11-28 Thread Mattmann, Chris A (388J)
Hey Guys,

So, I've completed my crawl of the vault.fbi.gov website for my class that I'm 
preparing 
for. I've got:

[chipotle:local/nutch/framework] mattmann% du -hs crawl
28G     crawl
[chipotle:local/nutch/framework] mattmann% 

[chipotle:local/nutch/framework] mattmann% ls -l crawl/segments/
total 0
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:49 2027104947/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:50 2027104955/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 10:52 2027105006/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 12:57 2027105251/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 14:46 2027125721/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 16:42 2027144648/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 18:43 2027164220/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 20:44 2027184345/
drwxr-xr-x  8 mattmann  wheel  272 Nov 27 22:48 2027204447/
drwxr-xr-x  8 mattmann  wheel  272 Nov 28 00:50 2027224816/
[chipotle:local/nutch/framework] mattmann% 

./bin/nutch readseg -list -dir crawl/segments/
NAME        GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
2027104947  1          2011-11-27T10:49:50  2011-11-27T10:49:50  1        1
2027104955  31         2011-11-27T10:49:57  2011-11-27T10:49:58  31       31
2027105006  4898       2011-11-27T10:50:08  2011-11-27T10:51:40  4898     4890
2027105251  9890       2011-11-27T10:52:52  2011-11-27T11:56:06  714      713
2027125721  9202       2011-11-27T12:57:24  2011-11-27T14:00:17  971      686
2027144648  8261       2011-11-27T14:46:50  2011-11-27T15:48:25  714      712
2027164220  7575       2011-11-27T16:42:22  2011-11-27T17:45:50  720      718
2027184345  6871       2011-11-27T18:43:48  2011-11-27T19:47:11  767      766
2027204447  6116       2011-11-27T20:44:50  2011-11-27T21:48:07  725      724
2027224816  5406       2011-11-27T22:48:18  2011-11-27T23:51:33  744      744
[chipotle:local/nutch/framework] mattmann% 

So the reality is, after crawling vault.fbi.gov, all I really wanted is the 
extracted PDF files that are housed in those segments. I've been playing 
around with ./bin/nutch readseg, and all I can say based on my initial 
impressions is that it's really hard to get it to fulfill these simple 
requirements:

1. Iterate over all the segments
  - pull out URLs that have at_download/file in them
  - for each one of those URLs, get their anchor, aka somefile.pdf (the anchor
    is the readable PDF name; the actual URL is a Plone CMS URL, with little
    meaning)

2. For each PDF file anchor name
  - create a file in output_dir with the PDF file data read from the segment

My guess is that even at the scale of data that I'm dealing with (10s of GB), 
it's impossible and impractical to do anything that's not M/R here. 
Unfortunately there isn't a tool that will simply grab me the PDF files out 
of the segment files and then output those into a directory, appropriately 
named with the anchor text. Or...is there? ;-)
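
For illustration only, the two requirements above could be sketched in plain 
Java roughly as follows. Everything here is an assumption and not Nutch code: 
the class PdfDumpSketch, the FetchedRecord holder and the file-name cleanup 
are hypothetical, and a real tool would read the URL keys and bytes from the 
segment data instead of an in-memory list.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names are Nutch classes.
class PdfDumpSketch {

    /** One fetched record: URL key, anchor text, raw bytes (all assumed inputs). */
    static final class FetchedRecord {
        final String url;
        final String anchor;
        final byte[] content;

        FetchedRecord(String url, String anchor, byte[] content) {
            this.url = url;
            this.anchor = anchor;
            this.content = content;
        }
    }

    /** Step 1: keep only URLs containing the Plone download marker. */
    static boolean isDownloadUrl(String url) {
        return url.contains("at_download/file");
    }

    /** Turn anchor text into a file name without path separators or leading dots. */
    static String toFileName(String anchor) {
        String cleaned = anchor.trim()
                .replaceAll("[^A-Za-z0-9._-]", "_")
                .replaceAll("^\\.+", "_");
        return cleaned.isEmpty() ? "unnamed.pdf" : cleaned;
    }

    /** Step 2: write each selected record to outputDir, one record at a time. */
    static List<Path> dump(Iterable<FetchedRecord> records, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        List<Path> written = new ArrayList<>();
        for (FetchedRecord r : records) {
            if (!isDownloadUrl(r.url)) {
                continue;
            }
            // Note: two records with the same anchor would overwrite each other.
            Path target = outputDir.resolve(toFileName(r.anchor));
            Files.write(target, r.content);
            written.add(target);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        List<FetchedRecord> records = Arrays.asList(
                new FetchedRecord("http://vault.fbi.gov/some-doc/at_download/file",
                        "somefile.pdf", new byte[] {'%', 'P', 'D', 'F'}),
                new FetchedRecord("http://vault.fbi.gov/index.html", "Home", new byte[0]));
        Path out = Files.createTempDirectory("pdf-dump");
        System.out.println(dump(records, out));
    }
}
```

Handling one record at a time keeps only a single document's bytes in memory, 
which matters on a laptop-sized heap; whether that is enough for this crawl 
is not something the sketch can show.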

I'm running in Local mode, with no Hadoop cluster behind me, and with a 
MacBook Pro, 4 core, 2.8 GHz, with 8 GB RAM behind me to get this working,
intentionally as I don't want it to be a requirement for folks to have a cluster
to do this assignment that I'm working on.

I was talking to Ken Krugler about this, and after picking his brain, I think 
that 
I'm going to have to end up writing a tool to do what I want. So, if that's the 
case, 
fine, but can someone point me in the right direction for a good starting point
for this? Ken also thought Andrzej might have like 10 magic solutions to make 
this happen, so here's hoping he's out there listening :-)

Thanks for the help, guys.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Best way to get files out of segment directories

2011-11-28 Thread Mattmann, Chris A (388J)
Hey Guys,

One more thing. Just to let you know I've followed this blog here:

http://www.spicylogic.com/allenday/blog/2008/08/29/using-nutch-to-download-large-binary-media-and-image-files/

And started to write a simple program to read the keys in a 
Segment file, and then dump out the byte content if the key
matches the desired URL. You can find my code here:

https://github.com/chrismattmann/CSCI-572-Code/blob/master/src/main/java/edu/usc/csci572/hw2/PDFDumper.java

Unfortunately, this code keeps dying due to OOM issues, 
clearly because the data file is too big, and because 
I likely have to M/R this. 

Just wanted to let you guys know where I'm at, and what
I've been trying.

Thanks,
Chris


Re: Best way to get files out of segment directories

2011-11-28 Thread Mattmann, Chris A (388J)
OK, of course, I figured it out, and updated my program :-)

You can see it on Github below. I'm going to clean up and 
generalize this program because I think it's of general use.
I'll create an issue shortly. 

I'm thinking the tool could be something like:

./bin/nutch org.apache.nutch.tools.SegmentContentDumper [options]
  -segmentRootDir   full file path to the root segment directory, e.g.,
                    crawl/segments
  -regexUrlPattern  a regex URL pattern to select URL keys to dump from the
                    content DB in each segment
  -outputDir        the output directory to write files to
  -metadata         --key=value, where key is a Content Metadata key and value
                    is a value to check. If the URL and its content metadata
                    have a matching key,value pair, dump it. Allow for regex
                    matching on the value.

This would allow users to unravel the content hidden in segment directories and 
in sequence files
into useable files that were downloaded by Nutch.
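
As a rough sketch of how the proposed -regexUrlPattern and -metadata options 
might combine, assuming the URL regex may match anywhere in the URL and the 
metadata regex must match the whole value (both assumptions, the proposal 
does not say). DumpSelectorSketch and its method are hypothetical names, not 
part of Nutch:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch of the proposed option semantics; not actual Nutch code.
class DumpSelectorSketch {
    private final Pattern urlPattern;     // from -regexUrlPattern
    private final String metadataKey;     // key part of -metadata, or null if unused
    private final Pattern metadataValue;  // value part of -metadata, treated as a regex

    DumpSelectorSketch(String urlRegex, String metadataKey, String metadataValueRegex) {
        this.urlPattern = Pattern.compile(urlRegex);
        this.metadataKey = metadataKey;
        this.metadataValue = metadataKey == null ? null : Pattern.compile(metadataValueRegex);
    }

    /** True if the URL matches and, when a metadata check is set, the value matches too. */
    boolean shouldDump(String url, Map<String, String> contentMetadata) {
        // find(): the pattern may match anywhere in the URL (an assumption).
        if (!urlPattern.matcher(url).find()) {
            return false;
        }
        if (metadataKey == null) {
            return true;
        }
        String actual = contentMetadata.get(metadataKey);
        // matches(): the whole metadata value must match the regex (an assumption).
        return actual != null && metadataValue.matcher(actual).matches();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/pdf");
        DumpSelectorSketch selector =
                new DumpSelectorSketch("at_download/file", "Content-Type", "application/.*pdf");
        System.out.println(selector.shouldDump(
                "http://vault.fbi.gov/some-doc/at_download/file", metadata));
    }
}
```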

Do you guys see this as a useful tool? If so, I'll contribute it this week for 
1.5.

Cheers,
Chris


[RESULT] [VOTE] Apache Nutch 1.4 release rc #2

2011-11-26 Thread Mattmann, Chris A (388J)
Hi Everyone,

This VOTE has passed:

+1 PMC

Julien Nioche
Markus Jelsma
Lewis John McGibbney
Chris Mattmann

I'll go ahead and update the website and push the release out to the mirrors. 
Thanks
for VOTE'ing and for your patience!

Cheers,
Chris




[ANNOUNCE] Apache Nutch 1.4 released

2011-11-26 Thread Mattmann, Chris A (388J)
(...apologies for the cross posting...)

The Apache Nutch project is pleased to announce the release of Apache Nutch
1.4. The release contents have been pushed out to the main Apache release
site so the releases should be available as soon as the mirrors get the
syncs. 

Apache Nutch is an extensible framework for building out large-scale
web-based search. Layered on top of fellow Apache projects Hadoop,
Lucene/Solr, and Tika, Nutch provides an out of the box platform for
fetching web pages, pdf files, word documents, and more. Nutch parses the
content and its relevant information, indexes its metadata, and makes it
available for efficient query and retrieval over modern Internet protocols.

Apache Nutch 1.4 contains a number of improvements and bug fixes. Details
can be found in the changes file:

http://www.apache.org/dist/nutch/CHANGES-1.4.txt

Apache Nutch is available in source and binary form from the following
download page: http://www.apache.org/dyn/closer.cgi/nutch/

Nutch is also available as a Jar dependency from the Central repository:

http://repo2.maven.org/maven2/org/apache/nutch/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the downloads
using signatures found on the Apache site:

http://www.apache.org/dist/nutch/KEYS

For more information on Apache Nutch, visit the project home page:
http://nutch.apache.org

-- Chris Mattmann (on behalf of the Apache Nutch community)

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: 2 things I noticed that I will file JIRA issues + fix

2011-11-25 Thread Mattmann, Chris A (388J)
Hi Markus,

Super +1. Thanks for incorporating it as part of your patch. 

1184 looks good -- my +1 to commit it, even if in progress. 
Then we can close out 1212 at that point.

Thanks!

Cheers,
Chris

On Nov 25, 2011, at 5:16 AM, Markus Jelsma wrote:

 Hi
 
 On Friday 25 November 2011 01:13:47 Mattmann, Chris A (388J) wrote:
 Hi Markus,
 
 On Nov 24, 2011, at 12:03 PM, Markus Jelsma wrote:
 So, what's the point of that initial if(...) block outside of the for
 loop. Isn't it redundant?
 
 This is trunk? I've been and still am working on some issues for a new
 feature in this part of that source file.
 https://issues.apache.org/jira/browse/NUTCH-1184
 https://issues.apache.org/jira/browse/NUTCH-1174
 
 Yep it's trunk alright. I'm fine with you making the update I suggested, or
 with me doing it. 2 questions:
 
 1. Am I right in observing that the code is redundant and should be
 removed?
 I believe so. I've tested the removal of that part with the code of NUTCH-1184 
 and all goes well.
 
 2. If I am right on #1, do you want me to make the update, or are
 you saying that you want to make it as part of NUTCH-1184 and NUTCH-1174?
 
 1174 is already committed. I've added a patch for ParseOutputFormat to 1184 
 incorporating your newly created patch.
 
 cheers
 
 
 Cheers,
 Chris
 
 
 -- 
 Markus Jelsma - CTO - Openindex





2 things I noticed that I will file JIRA issues + fix

2011-11-24 Thread Mattmann, Chris A (388J)
...after I get back from Thanksgiving dinner :-)

1. In URLFilterChecker, the cmd line tool requires URLs to be fed into it on 
STDIN, but 
that isn't documented anywhere, even in the tool help printed to STDOUT. I'll 
fix that.

2. In ParseOutputFormat, I see a code block:

{code}
      // collect outlinks for subsequent db update
      Outlink[] links = parseData.getOutlinks();
      int outlinksToStore = Math.min(maxOutlinks, links.length);
      if (ignoreExternalLinks) {
        try {
          fromHost = new URL(fromUrl).getHost().toLowerCase();
        } catch (MalformedURLException e) {
          fromHost = null;
        }
      } else {
        fromHost = null;
      }
{code}

The if(ignoreExternalLinks) part then gets subsequently set and 
reset in the ensuing for loop:

{code}
      int validCount = 0;
      CrawlDatum adjust = null;
      List<Entry<Text, CrawlDatum>> targets =
          new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
      List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
      for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
        String toUrl = links[i].getToUrl();
        // ignore links to self (or anchors within the page)
        if (fromUrl.equals(toUrl)) {
          continue;
        }
        if (ignoreExternalLinks) {
          try {
            toHost = new URL(toUrl).getHost().toLowerCase();
          } catch (MalformedURLException e) {
            toHost = null;
          }
          if (toHost == null || !toHost.equals(fromHost)) { // external links
            continue; // skip it
          }
        }
{code}

So, what's the point of that initial if(...) block outside of the for loop? 
Isn't it redundant?

If so, I'll file an issue and fix that.
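
To make the question easier to reason about, here is a self-contained 
restatement of the two excerpts as one method. It is only a sketch: 
OutlinkFilterSketch, hostOf and filter are hypothetical names, plain strings 
stand in for Outlink objects, and it says nothing about what the rest of 
ParseOutputFormat does with fromHost. In this reduced form the source host is 
derived once before the loop and only read inside it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical restatement of the quoted logic; not the Nutch source.
class OutlinkFilterSketch {

    /** Lower-cased host of a URL, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            return new URL(url).getHost().toLowerCase(Locale.ROOT);
        } catch (MalformedURLException e) {
            return null;
        }
    }

    /** Keeps up to maxOutlinks links, skipping self links and (optionally) external hosts. */
    static List<String> filter(String fromUrl, String[] toUrls,
                               boolean ignoreExternalLinks, int maxOutlinks) {
        // Computed once, outside the loop, and never reassigned below.
        String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < toUrls.length && kept.size() < maxOutlinks; i++) {
            String toUrl = toUrls[i];
            if (fromUrl.equals(toUrl)) {
                continue; // link to self
            }
            if (ignoreExternalLinks) {
                String toHost = hostOf(toUrl);
                if (toHost == null || !toHost.equals(fromHost)) {
                    continue; // external or unparsable link
                }
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] links = {
            "http://a.example/page", "http://a.example/other", "http://b.example/elsewhere"
        };
        System.out.println(filter("http://a.example/page", links, true, 10));
    }
}
```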

Cheers,
Chris

P.S. Happy Thanksgiving to Nutch'ers in the US!





Re: Dependency Injection

2011-11-22 Thread Mattmann, Chris A (388J)
Hey PJ,

On Nov 22, 2011, at 10:47 AM, PJ Herring wrote:

 Hey Chris,
 
 Thanks for the response. I looked at the documents you sent me, and I really 
 do think incorporating some kind of DI Framework could be a great addition to 
 Nutch.
 
 I have a general plan of attack, but I'll try to write that up more formally 
 and send it out to get some kind of feedback.

+1, would love to see it.

 
 One question I had when looking at this stuff is what is the status of Nutch 
 2? It looks like the architecture has shifted quite a bit from 1.3?

Nutch2 was originally slated to be the Nutch Gora branch (see here [1]). We 
ended up deciding [2] that the trunk was more akin to folks who were
maintaining the 1.x series of Nutch and thus moved the Nutch Gora branch into 
[1]. 

We still have a lot of goals though for Nutch2, which I think we're just 
working to more incrementally, rather than radically, as before. There are 
still folks here working on Nutch Gora though so if you're interested in that, 
check it out.

Cheers,
Chris

[1] http://svn.apache.org/repos/asf/nutch/branches/nutchgora
[2] http://s.apache.org/zX




Re: Dependency Injection

2011-11-21 Thread Mattmann, Chris A (388J)
Hey PJ,

You aren't being an ass at all. You're asking an important question, and 
something I've been interested in for a while.
Here are some relevant threads to take a look at:

http://wiki.apache.org/nutch/Nutch2Architecture
http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/msg12688.html
http://www.slideshare.net/chrismattmann/lessons-learned-in-the-development-of-a-webscale-search-engine-nutch2-and-beyond
https://issues.apache.org/jira/browse/NUTCH-609
http://osdir.com/ml/user.nutch.apache/2011-07/msg00080.html
http://5341.com/list/48/349985.html

If you're interested in contributing to Apache Nutch, check this great guide 
out written by Dennis Kubes:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

Before there wasn't a ton of interest in replacing the plugin system since it 
worked and we didn't get a lot of 
complaints or anything. That interest turned into the perception that a DI 
framework wouldn't be welcome. 
On the contrary, I'd say if you figured out how to get a DI framework working 
with the existing plugin system, 
I can personally say I'd dedicate some of my time towards helping you shepherd 
it in and I think the 
rest of the committers would be on board.
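
For discussion purposes, the registration idea from PJ's message (quoted 
further down) might look something like the sketch below. It uses plain 
constructor injection rather than Spring or Guice, and every type in it is 
hypothetical: the names PluginDescriptor and PluginRepo only echo the wording 
of the thread and are not the existing Nutch plugin classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types only mirror the names used in the thread.
class PluginWiringSketch {

    /** A minimal extension point: return the URL to keep it, or null to drop it. */
    interface UrlFilter {
        String filter(String url);
    }

    /** Each plugin describes itself in code instead of in a plugin.xml file. */
    abstract static class PluginDescriptor {
        abstract String id();

        abstract UrlFilter create();

        void registerWith(PluginRepo repo) {
            repo.register(this);
        }
    }

    /** Registry that plugins add themselves to. */
    static class PluginRepo {
        private final Map<String, PluginDescriptor> byId = new LinkedHashMap<>();

        void register(PluginDescriptor descriptor) {
            byId.put(descriptor.id(), descriptor);
        }

        List<UrlFilter> createFilters() {
            List<UrlFilter> filters = new ArrayList<>();
            for (PluginDescriptor d : byId.values()) {
                filters.add(d.create());
            }
            return filters;
        }
    }

    /** Example plugin: accepts only http/https URLs. */
    static class HttpOnlyPlugin extends PluginDescriptor {
        @Override
        String id() {
            return "urlfilter-http-only";
        }

        @Override
        UrlFilter create() {
            return url -> (url.startsWith("http://") || url.startsWith("https://")) ? url : null;
        }
    }

    /** The consumer receives its filters through the constructor (manual DI). */
    static class Crawler {
        private final List<UrlFilter> filters;

        Crawler(List<UrlFilter> filters) {
            this.filters = filters;
        }

        boolean accepts(String url) {
            for (UrlFilter f : filters) {
                url = f.filter(url);
                if (url == null) {
                    return false;
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        PluginRepo repo = new PluginRepo();
        new HttpOnlyPlugin().registerWith(repo);
        Crawler crawler = new Crawler(repo.createFilters());
        System.out.println(crawler.accepts("http://nutch.apache.org/"));
    }
}
```

A DI container would take over the hand-written wiring in main; the sketch 
only shows the shape of the descriptor and registry.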

Thanks for your interest. If you have any more questions, please ask!

Cheers,
Chris


On Nov 21, 2011, at 1:14 PM, PJ Herring wrote:

 Hey,
 
 So I am admittedly a noob with Nutch, but have spent some time digging 
 through the source code. I am just curious if anyone has talked about, in 
 future developments of Nutch, replacing the whole way we register plugins? I 
 ask because I am using Nutch on a project with Maven. At the moment I have to 
 copy a whole bunch of JAR files with their plugin.xml files into a certain 
 directory on build, which is fine, but is kind of breaking the whole Maven 
 paradigm. It would be nice to have some sort of Maven repository where 
 plugins lived, and then wire up which plugins I want to use using some kind 
 of DI framework, like Spring or Guice. Then instead of writing XML Plugin 
 Descriptor Files, every plugin could write a class extending PluginDescriptor 
 and register itself with the PluginRepo, or something of the sort.
 
 Also, I have never contributed to an open source project, so if I am being an 
 ass I don't mean to be. Just would love to help make a great tool better in 
 any way.
 
 Best,
 PJ Herring
 
 
 
 





Re: 0.2-SNAPSHOT now on apache repository

2011-11-19 Thread Mattmann, Chris A (388J)
+1 from me, Lewis, great work.

Cheers,
Chris

On Nov 19, 2011, at 4:11 AM, Lewis John Mcgibbney wrote:

 Hi,
 
 Please see here [1], and associated issue logged in Nutch Jira [2]. As I
 explain in the issue, although Gora trunk is not stable there is ongoing
 work to fix this.
 
 Thanks for now
 
 [1] https://repository.apache.org/index.html#nexus-search;quick~gora
 [2] https://issues.apache.org/jira/browse/NUTCH-1205
 -- 
 *Lewis*





Re: Lewis John McGibbney sent a message via SimilarPages – A web discovery and search add-on

2011-11-17 Thread Mattmann, Chris A (388J)
Awesome news, great to hear!

Cheers,
Chris

On Nov 17, 2011, at 8:57 AM, Lewis John Mcgibbney wrote:

 Hi,
 
 Some more positives here.
 
 Lewis
 
 -- Forwarded message --
 From: Pietro Borradori pietro.borrad...@similarpages.com
 Date: Thu, Nov 17, 2011 at 4:46 PM
 Subject: Fw: Lewis John McGibbney sent a message via SimilarPages – A web 
 discovery and search add-on
 To: lewi...@apache.org lewi...@apache.org
 Cc: Marco Laurita marco.laur...@similarpages.com
 
 
 Hi Lewis,
 
 Thanks for your email... I'm sorry for the late reply...
 Nutch is a fundamental piece of SimilarPages architecture, because of its 
 crawling features and for the solid base on which it is built that is Hadoop. 
 Hadoop allows us to make all the computations on the crawled data, it is 
 really a fantastic project!  Hadoop gives us some headache sometimes when we 
 need large clusters to perform the computation on the crawled data, 
 especially when there are some instances with hardware failures where Hadoop 
 is supposed to overcome such situations without problems. Marco, 
 co-founder/CTO of SimilarPages, is at your disposal for any deeper insight re 
 Nutch/Hadoop implementation if helpful.
 
 Here is the page of our site re Nutch/Hadoop
 http://www.similarpages.com/web/index.php?option=com_content&view=article&id=8&Itemid=20
 
 We liked Nutch/hadoop projects in our 2 official FB pages:
 http://www.facebook.com/pages/SimilarPagescom/303352486359786?sk=wall
 http://www.facebook.com/pages/SimilarPages-A-web-discovery-and-search-addon/149182788451193
 
 A take-a-tour video here...
 http://www.similarpages.com/web/index.php?option=com_content&view=article&id=15&Itemid=4
 
 You can follow me on twitter @MrCappuccini
 
 We've finally released the beta of the SimilarPages search engine!! Check it 
 out at www.similarpages.com and let us know what you think!! 
 
 my best
 Pietro
  
 Pietro Borradori
 Founder & CEO
 
 
 
 
 





Re: [VOTE] Apache Nutch 1.4 release rc #1

2011-11-16 Thread Mattmann, Chris A (388J)
Thanks for the FYI guys. 

I've got this on my open source radar, along with 
reviewing the Airavata release (incubating), and 
the MRUnit release (incubating) for this week. 

I'll git er' done. Also, since the release updates for rc #2
were largely aesthetic (aka packaging and naming 
of the output folder), I might not even have to create a
new code branch or entry in repository.apache.org
for the Maven artifacts. Yay!

Should be next day or two for rc #2 spin up. Also I pointed
Lewis at the OODT release guide (which is basically my 
generic Apache release guide for most Java projects), 
and he has updated the release wiki for Nutch to be 
based off of this.

Cheers,
Chris

On Nov 16, 2011, at 9:41 AM, Markus Jelsma wrote:

 
 Chris,
 
 Any idea of when you'll be able to push a new RC for 1.4?
 Note : I think some stuff marked as 1.5 has been committed - we might need
 to check the CHANGES
 
 Definitely, I've committed several items. When I made my first commit, trunk 
 was already prepared for 1.5.
 
 Here's the list of changes since 1.4, please note that CHANGES also already 
 contained the release note and date.
 
 This is the first rev. for 1.5: 1200344 (NUTCH-1153)
 This is the last rev. for 1.4: 1197319 (NUTCH-1195)
 
 If this has caused any inconvenience then I apologize.
 
 Thanks
 
 * NUTCH-1090 InvertLinks should inform when ignoring internal links (Marek 
 Backmann via markus)
 
 * NUTCH-1174 Outlinks are not properly normalized (markus)
 
 * NUTCH-1203 ParseSegment to show number of milliseconds per parse (markus)
 
 * NUTCH-1185 Decrease solr.commit.size to 250 (markus)
 
 * NUTCH-1180 UpdateDB to backup previous CrawlDB (markus)
 
 * NUTCH-1173 DomainStats doesn't count db_not_modified (markus)
 
 * NUTCH-1155 Host/domain limit in generator is generate.max.count+1 (markus)
 
 * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex 
 (markus)
 
 * NUTCH-1178 Incorrect CSV header CrawlDatumCsvOutputFormat (markus)
 
 * NUTCH-1142 Normalization and filtering in WebGraph (markus)
 
 * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS 
 file (markus)
 
 
 
 Thanks
 
 Julien
 
 On 9 November 2011 10:21, Mattmann, Chris A (388J) 
 
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Julien,
 
 Thanks. OK, so I will respin an RC for 1.4 that
 fixes the naming screw up. I already created the KEYS file
 so we're fine there.
 
 Hopefully will get it done this week while at ApacheCon NA.
 BTW, had a great time meeting Lewis in person today, nice
 to meet you dude!
 
 Cheers,
 Chris
 
 On Nov 8, 2011, at 3:27 AM, Julien Nioche wrote:
 Hi Chris
 
 
 Thanks for the review. Would you consider the below blockers, or
 would-be-nice-to-fix? If none are blockers I propose fixing them in 1.5
 and pushing 1.4. Thoughts?
 
 see below
 
 
 I agree on the naming, sorry for the screw-up.
 
 no probs. Do you think this could be fixed for 1.4?
 
 The KEYS file isn't really needed,
 since we just maintain a global keys file at
 
 http://www.apache.org/dist/nutch/KEYS.
 
 1.4? would need to modify build.xml
 
 Odd on the bin version containing the pom.xml file -- wonder why it's
 not part of the src -- I just did an SVN export?
 
 strange indeed.
 
 
 About the runtime/local thing, I think we can do that for 1.5, but I am
 totally +1 for it.
 
 OK for 1.5
 
 Thanks a lot
 
 Julien
 
 
 
 Let me know what you think. Thanks!
 
 Cheers,
 Chris
 
 On Nov 7, 2011, at 7:59 AM, Julien Nioche wrote:
 Thanks Chris,
 
 * it would be good to have the same folder name for the src and bin
 versions. They are currently 'nutch-1.4' and 'apache-nutch-1.4'
 * do we really need to include the KEYS file?
 * bin version contains pom.xml, src version does not. Either include
 in both or remove altogether
 * What about having the content of 'runtime/local' as a ready-to-use 'bin'
 distrib instead? Doesn't make sense to have runtime/deploy as the content of
 the job file (e.g. nutch-site.xml) would have to be generated from
 the source anyway.
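
The first point in the list above (same top-level folder name in the src and bin archives) could be checked along these lines. This is only a sketch: the function names and the member paths in the usage note are hypothetical, and it works on lists of archive member names rather than opening real files.

```python
from pathlib import PurePosixPath


def top_level_dirs(member_names):
    """Return the set of first path components found in an archive listing."""
    return {PurePosixPath(name).parts[0] for name in member_names if name}


def same_root_folder(src_members, bin_members):
    """True when both listings unpack into one and the same top-level folder."""
    src_roots = top_level_dirs(src_members)
    bin_roots = top_level_dirs(bin_members)
    return len(src_roots) == 1 and src_roots == bin_roots
```

With one listing rooted at 'nutch-1.4' and the other at 'apache-nutch-1.4', same_root_folder returns False, which is the mismatch described above.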
 
 Julien
 
 On 5 November 2011 01:03, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Folks,
 
 A candidate for the Nutch 1.4 release is available at:
 http://people.apache.org/~mattmann/apache-nutch-1.4/rc1/
 
 The release candidate is a zip and tar.gz archive of the sources in:
 http://svn.apache.org/repos/asf/nutch/tags/release-1.4/
 
 And a binary build suitable for deployment.
 
 A staged Maven repository is available here:
 https://repository.apache.org/content/repositories/orgapachenutch-161/
 
 Please vote on releasing this package as Apache Nutch 1.4.
 The vote is open for the next 72 hours and passes if a majority of
 at least three +1 Nutch PMC votes are cast.
 
 [ ] +1 Release this package as Apache Nutch 1.4
 [ ] -1 Do not release this package because...
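
A minimal sketch of the pass rule stated above. It assumes that "a majority of at least three +1 Nutch PMC votes" means at least three binding +1 votes and more binding +1 than binding -1 votes. The Vote type, the function name and the voter names are hypothetical, and the 72-hour window is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int     # +1, 0 or -1
    binding: bool  # True when cast by a PMC member


def release_vote_passes(votes, min_binding_plus_ones=3):
    """Apply the assumed rule: enough binding +1s, and more +1s than -1s."""
    binding = [v for v in votes if v.binding]
    plus = sum(1 for v in binding if v.value == 1)
    minus = sum(1 for v in binding if v.value == -1)
    return plus >= min_binding_plus_ones and plus > minus
```

Non-binding votes are counted by neither side here; they are advisory only under this assumption.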
 
 Thanks!
 
 Cheers,
 Chris
 
 P.S. Here's my +1.
 
 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA

Re: Community Comments

2011-11-15 Thread Mattmann, Chris A (388J)
+1 to the GUI comment, even though I haven't made one yet, it's definitely on 
my list of items should I find the cycles to do more besides releasing.

Thanks!

Cheers,
Chris

On Nov 15, 2011, at 1:01 PM, Markus Jelsma wrote:

 Hi Guys,
 
 During ApacheCon I made a point of trying to gauge how people that used
 Nutch found it. From the outset I would like to say that my reasoning
 behind this exercise was not to pick holes in the work that we put in to
 the project as a community, the great ideas, improvements and subsequently
 Apache product which we develop and maintain is a fantastic piece of
 software. I thought it could benefit us if we could get, at least a few
 comments regarding users' experience. Here's one for starters:
 ---
 Hi Lewis,
 
 Thank you for contacting us regarding Apache Nutch. Yes, we have been
 using Nutch for web crawling, and thank you for making it possible! We
 will gladly share our opinions and comments with you. Here are several
 items that we like and some that we would like to see addressed in future
 Nutch development.
 
 What we like about Nutch:
 
 1. Open source, Apache license
 2. Integrates with Solr
 3. Modular architecture, we are a development shop and value the
 extendability the most
 4. Plans for 2.0 to remove search and index from Nutch and only focus on
 crawling
 
 Clearly good points indeed.
 
 
 What we do not like about Nutch:
 
 1. Lack of incremental index update, needs twice the storage to build a
 new index (will go away in 2.0)
 
 I'm not sure what he/she means. The index is in Solr. Perhaps he/she works 
 with old Nutch?
 
 2. Integration with Hadoop FS, it takes disproportional/large amount of
 space to do segment merging or indexing
 
 Seems like old Nutch indeed with embedded Lucene. Segment merging is not 
 something that is required anymore but may be useful from a maintenance point 
 of view, not for daily operations.
 
 3. Unstable, out of memory exceptions on large crawls during segment
 merging or indexing, worker threads hang occasionally
 
 OOMs are indeed a possibility, we also sometimes suffer from this. However, 
 if one calculates the worst-case scenario, you will most likely never run OOM 
 during fetch, parse or indexing. We rely on good distribution of pages and our 
 average heap consumption is just right, except once in a while ;)
 
 The problem is that handling and recovering from OOM is extremely difficult if 
 not impossible.
 
 4. Lack of GUI/web management/reporting
 
 Well, I never have and still don't see any useful case for a GUI. It's a 
 complex package of many jobs. What would one want to manage through a GUI? 
 
 
 We hope our comments will help you to continue making Nutch an even better
 Web crawler.
 
 Interesting, I'd like to hear more if there are any.
 
 Thanks
 
 ---
 Any comments guys? I've already explained to the guy that his first point
 4. has been fully addressed in 1.3 onwards. I am curious to get you guys'
 opinions on the rest of the stuff (over and above the obvious GUI/web
 management/reporting stuff).
 
 Thank you.


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: Update to release information tutorial

2011-11-15 Thread Mattmann, Chris A (388J)
WOOT!

Lewis and I talked about updating this at ApacheCon NA and I sent him the OODT 
release guide and he's done a masterful job updating ours.

Thanks Lewis you rock man.

Cheers,
Chris

On Nov 15, 2011, at 1:56 PM, Lewis John Mcgibbney wrote:

 Hi guys,
 
 Please see here [1] for my attempt at updating the release stuff. There WILL 
 be mistakes so please correct where you find them.
 
 Thanks
 
 [1] http://wiki.apache.org/nutch/Release_HOWTO
 
 -- 
 Lewis 
 




