Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma

Nutch 1.5 now ships with Tika 1.1. Thanks Julien!

How about preparing for 1.5 and moving all but blocker issues to 1.6?


On Thu, 8 Mar 2012 07:32:56 -0800, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

Hey Guys,

OK, sounds good. Looks like we need to wait for the Tika 1.1 release (seems to
be going well so far), and then try and push Gora 0.2 (which I know Lewis is
pushing, and which I'm happy to RM once we're ready there). So, maybe I'll
shoot for next weekend or the weekend after to push Nutch 1.5 and 2.0 RCs.

Cheers,
Chris

On Mar 8, 2012, at 7:23 AM, Lewis John Mcgibbney wrote:


Yeah, I agree, Chris & Markus.

On the Nutchgora note, I would like to see Gora 0.2 released beforehand, as we
have a blocking issue, NUTCH-1205, with Ivy retrieving alien Gora 0.2-SNAPSHOT
dependencies from repository.apache.org. We should be able to overcome this by
releasing Gora 0.2 to Maven Central and then just pulling those dependencies
with Ivy in Nutchgora, rather than messing about with chain/multiple/snapshot
resolvers in the Ivy configuration.


My 2 cents

On Thu, Mar 8, 2012 at 3:03 PM, Markus Jelsma 
markus.jel...@openindex.io wrote:

+1

1.5 has, again, many fixes and improvements, just as 1.4 had over 1.3. But I'd
like to integrate Tika 1.1 after its pending release.

Cheers

On Thursday 08 March 2012 15:38:15 Mattmann, Chris A (388J) wrote:
 Hey Guys,

 I've got some cycles this weekend -- anyone up for a 1.5 release off trunk
 (stable), and a NutchGora branch release? I suggested this before [1]
 regarding NutchGora. I'm inclined to say let's do the following:

 1. NutchGora: apache-nutch-2.0 - release 2.x series based on this branch
 2. Nutch: apache-nutch-1.x - stable trunk branch

 Then, when the time comes, we can try and create a:

 3. Nutch: apache-nutch-3.x - merge of 1.x and 2.x feature branches

 Would this make sense? Anyways we don't have to decide anything now that
 we can't undo later, but are folks OK with me doing an RC for NutchGora and
 for 1.x this weekend?

 Cheers,
 Chris

 [1] http://s.apache.org/GD2

 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++

--
Markus Jelsma - CTO - Openindex



--
Lewis






--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Lewis John Mcgibbney
+1

Lewis

On Tue, Apr 3, 2012 at 11:29 AM, Markus Jelsma
markus.jel...@openindex.io wrote:

 Nutch 1.5 now ships with Tika 1.1. Thanks Julien!

 How about preparing for 1.5 and moving all but blocker issues to 1.6?







-- 
*Lewis*


Re: GSoC : Web page scraper plugin

2012-04-03 Thread Lewis John Mcgibbney
Hi Aamir,

Please excuse me not getting back to you off-list, the message is in my
drafts and I got distracted yesterday.

At this stage if you intend on applying for the issue then I would advise
you to get registered with GSoC, and begin writing up a publicly viewable
draft submission. You have until the 6th to do so, so plenty of time.

On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan syst3m.w...@gmail.com wrote:


 The web scraping project at
 https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
 understand the basic concept of the project, but as I'm new to Nutch it will
 take some time to understand it fully in the context of Nutch.


Well you have the summer to get up to speed with Nutch right? So I wouldn't
necessarily worry too much about this just now. Just get your submission
ready and we will take it from there.


 I'm looking forward to guidance from your side on how I should go about
 submitting a proposal for GSoC.


If you feel you need help with any aspect of the issue or the submission
then please get on to user@ and we will try to help out as much as we can over
there. In the meantime please see [0] for guidance on your application
submission. There is plenty of documentation and guidance there.

Thanks and again apologies for not getting back to you yesterday.

Lewis

[0] http://community.apache.org/gsoc.html



 Thanks in advance!





 --
 Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee






-- 
*Lewis*


Re: GSoC : Web page scraper plugin

2012-04-03 Thread Aamir Khan
On Tue, Apr 3, 2012 at 4:31 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Aamir,

 Please excuse me not getting back to you off-list, the message is in my
 drafts and I got distracted yesterday.


No problem.


 At this stage if you intend on applying for the issue then I would advise
 you to get registered with GSoC, and begin writing up a publicly viewable
 draft submission. You have until the 6th to do so, so plenty of time.

 On Tue, Apr 3, 2012 at 5:45 AM, Aamir Khan syst3m.w...@gmail.com wrote:


 The web scraping project at
 https://issues.apache.org/jira/browse/NUTCH-978 looks good to me. I
 understand the basic concept of the project, but as I'm new to Nutch it will
 take some time to understand it fully in the context of Nutch.


 Well you have the summer to get up to speed with Nutch right? So I
 wouldn't necessarily worry too much about this just now. Just get your
 submission ready and we will take it from there.


Exactly, I will have the full summer to understand and get up to speed. But
since my knowledge is very limited my proposal won't be too good.. :)


 I'm looking forward to guidance from your side on how I should go about
 submitting a proposal for GSoC.


 If you feel you need help with any aspect of the issue or the submission
 then please get on to user@ and we will try to help out as much as we can
 over there. In the meantime please see [0] for guidance on your application
 submission. There is plenty of documentation and guidance there.


Sure.


 Thanks and again apologies for not getting back to you yesterday.


No problem.. :)


 Lewis

 [0] http://community.apache.org/gsoc.html



 Thanks in advance!





 --
 Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee






 --
 *Lewis*




-- 
Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee


Re: GSoC : Web page scraper plugin

2012-04-03 Thread Lewis John Mcgibbney
Hi Aamir,

On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote:


 Exactly, I will have the full summer to understand and get up to speed. But
 since my knowledge is very limited my proposal won't be too good.. :)


This doesn't need to be the case. In fact it is crucial that the
submission is of a reasonable quality. The original issue was pretty well
discussed iirc, and additionally there is also some code uploaded by the
original author so you could have a look at that over the next few days
before taking a crack at the submission. I can say one thing for sure
though, this issue might need to be branded more generically... just now
Nutch would benefit more from a generically oriented plugin for scraping
various parts of html. The original author had a use case driven approach
to this issue which meant he had to extract very specific content from news
sites... this may not suit you, and certainly isn't absolutely everyone's
cup of tea within the community. It would be great if you could discuss
both in your application and on the Jira thread how the issue could be
opened up, subsequently enabling more Nutch users to benefit... as you are
stepping up to apply here, how you wish to do this is entirely your own
choice so I would take the positives from the flexibility you have here and
focus on them within your submission. Does this sound reasonable?

I look forward to seeing any progress you make and will seriously consider
stepping up to be a potential mentor, as it was me that added the issue to
the GSoC list of projects.

Thank you

Lewis


Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Julien Nioche
Good idea.

On 3 April 2012 11:29, Markus Jelsma markus.jel...@openindex.io wrote:

 Nutch 1.5 now ships with Tika 1.1. Thanks Julien!

 How about preparing for 1.5 and moving all but blocker issues to 1.6?







-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Resolved] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2012-04-03 Thread Markus Jelsma (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1222.
--

   Resolution: Won't Fix
Fix Version/s: (was: 1.5)
 Assignee: (was: Markus Jelsma)

Will open new issue when appropriate.

 Upgrade to new Hadoop 0.22.0
 

 Key: NUTCH-1222
 URL: https://issues.apache.org/jira/browse/NUTCH-1222
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Priority: Critical
 Attachments: NUTCH-1222-1.5-1.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2012-04-03 Thread Markus Jelsma (Resolved) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1225.
--

   Resolution: Won't Fix
Fix Version/s: (was: 1.5)
 Assignee: (was: Markus Jelsma)

CrawlDBScanner tool is deprecated in favor of the CrawlDBReader tool.

 Migrate CrawlDBScanner to MapReduce API
 ---

 Key: NUTCH-1225
 URL: https://issues.apache.org/jira/browse/NUTCH-1225
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
 Attachments: NUTCH-1225-1.5-1.patch, NUTCH-1225-1.5-2.patch








[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-717:


Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Make Nutch Solr integration easier
 --

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren
Priority: Critical
 Fix For: 1.6


 Erik Hatcher proposed that we provide a full Solr config dir to be used
 with Nutch-Solr. Currently we only provide the index schema. It would be
 considerably easier to set up Nutch-Solr if we provided the whole conf dir,
 which you could use with Solr like:
 java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar





[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1245:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
 and is generated over and over again
 

 Key: NUTCH-1245
 URL: https://issues.apache.org/jira/browse/NUTCH-1245
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 1.6


 A document gone with 404 after db.fetch.interval.max (90 days) has passed
 is fetched over and over again, but although its fetch status is fetch_gone,
 its status in CrawlDb remains db_unfetched. Consequently, this document will
 be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in CrawlDb whose retry interval hits
 db.fetch.interval.max (I manipulated shouldFetch() in
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to
 db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max
 (81 days)
 # this does not change with every generate-fetch-update cycle, shown here for
 two segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
 datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
 datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
 if (maxInterval < datum.fetchInterval) // necessarily true
   forceRefetch()
 forceRefetch:
 if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
   datum.fetchInterval = 0.9 * maxInterval
 datum.status = db_unfetched
 shouldFetch (called from generate / Generator.map):
 if ((datum.fetchTime - curTime) > maxInterval)
   // always true if the crawler is launched in short intervals
   // (lower than 0.35 * maxInterval)
   datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times
 db.fetch.interval.max, and the fetch time is always close to current time +
 1.35 * db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
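
The cycle described in this report can be replayed with a small standalone simulation. This is only a sketch of the pseudo-code quoted above, not the Nutch implementation: the class PageGoneCycle and the method countGenerated are hypothetical names, times are plain seconds, and every fetch is assumed to come back as gone (404).

```java
class PageGoneCycle {
    static final long DAY = 24L * 60 * 60;
    // db.fetch.interval.max, 90 days as stated in the report
    static final long MAX_INTERVAL = 90 * DAY;

    // Runs `cycles` generate-fetch-update rounds, `gapSeconds` apart, and
    // returns how often the gone URL was generated for fetching.
    static int countGenerated(int cycles, long gapSeconds) {
        long fetchInterval = MAX_INTERVAL * 9 / 10; // 0.9 * maxInterval (81 days)
        long fetchTime = 0;
        long curTime = 0;
        int generated = 0;
        for (int i = 0; i < cycles; i++) {
            // shouldFetch (generate): a fetch time too far in the future is reset
            if (fetchTime - curTime > MAX_INTERVAL) {
                fetchTime = curTime; // forces a refetch
            }
            if (fetchTime <= curTime) {
                generated++;
                // setPageGoneSchedule (update after the fetch returned 404)
                fetchInterval = fetchInterval * 3 / 2; // 1.5 * interval
                fetchTime = curTime + fetchInterval;
                if (MAX_INTERVAL < fetchInterval) {
                    // forceRefetch: the interval is cut back to 0.9 * maxInterval
                    // and the status would be reset to db_unfetched
                    fetchInterval = MAX_INTERVAL * 9 / 10;
                }
            }
            curTime += gapSeconds;
        }
        return generated;
    }

    public static void main(String[] args) {
        System.out.println("daily crawls:   " + countGenerated(5, DAY));
        System.out.println("40-day crawls:  " + countGenerated(5, 40 * DAY));
    }
}
```

With crawls one day apart the URL is generated in every round, which matches the behaviour reported here. With rounds further apart than 0.35 * maxInterval the reset in shouldFetch no longer triggers and the URL waits until its fetch time has passed.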


[jira] [Updated] (NUTCH-1318) Parse time outs crash parsing fetcher

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1318:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parse time outs crash parsing fetcher
 -

 Key: NUTCH-1318
 URL: https://issues.apache.org/jira/browse/NUTCH-1318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.6


 Some fetch lists can never be fetched and parsed successfully because a
 single record that times out can cause most, and eventually all, subsequent
 records to time out as well. Finally the mapper hangs completely, killing the
 entire fetch job and losing 99% of the records that were processed.
 I'm not sure what's going on; something may be leaking somewhere.





[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1219:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Upgrade all jobs to new MapReduce API
 -

 Key: NUTCH-1219
 URL: https://issues.apache.org/jira/browse/NUTCH-1219
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Priority: Critical
 Fix For: 1.6


 We should upgrade to the new Hadoop API for Nutch trunk, as has already been
 done for the Nutchgora branch. If I'm not mistaken we can already upgrade to
 the latest 0.20.5 version, which still carries the legacy API, so we can port
 the jobs to the new API without immediately upgrading to 0.21 or higher and
 without needing a separate branch to work on.
 To the committers who created/ported jobs in NutchGora, please write down
 your advice and experience.
 http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api





[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1251:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Deletion of duplicates fails with 
 org.apache.solr.client.solrj.SolrServerException
 --

 Key: NUTCH-1251
 URL: https://issues.apache.org/jira/browse/NUTCH-1251
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
 Environment: Any crawl where the number of URLs in Solr exceeds 1024
 (the default max number of clauses in a Lucene boolean query).
Reporter: Arkadi Kosmynin
Priority: Critical
 Fix For: 1.6


 Deletion of duplicates fails. This happens because the "get all" query used
 to get the Solr index size is id:[* TO *], which is a range query. Lucene
 tries to expand it to a Boolean query and gets as many clauses as there are
 ids in the index. This is too many in a real situation, so it throws an
 exception.
 To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to
 *:*, which is the standard Solr "get all" query.
 Indexing log extract:
 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
 executing query
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
 query
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
   ... 3 more
 Caused by: org.apache.solr.common.SolrException: Internal Server Error
 Internal Server Error
 request: http://localhost:8081/arch/select?q=id:[* TO
 *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
   ... 5 more
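
The change proposed in this report amounts to swapping one query string for another. The sketch below only illustrates that difference at the request level. The class SolrGetAllQuery and the helper selectParams are hypothetical; the constant name SOLR_GET_ALL_QUERY and the field list are taken from the report and the logged request, and no SolrJ classes are used.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SolrGetAllQuery {
    // Range query over the id field, as quoted in the report.
    static final String RANGE_QUERY = "id:[* TO *]";
    // Match-all query proposed in the report as the replacement.
    static final String SOLR_GET_ALL_QUERY = "*:*";

    // Builds the query-string part of a select request for the given query,
    // using the same field list as the request shown in the log extract.
    static String selectParams(String query, int rows) {
        return "q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&fl=id,boost,tstamp,digest&start=0&rows=" + rows;
    }

    public static void main(String[] args) {
        System.out.println(selectParams(RANGE_QUERY, 82938));
        System.out.println(selectParams(SOLR_GET_ALL_QUERY, 82938));
    }
}
```

Only the q parameter differs between the two requests; per the report, the second form is not expanded into one Boolean clause per id and so should stay clear of the 1024-clause limit.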





[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-578:


Fix Version/s: (was: 1.5)
   1.6

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, 
 NUTCH-578_v4.patch, crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, 
 urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL which is on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more times than 3):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.





Re: GSoC : Web page scraper plugin

2012-04-03 Thread Aamir Khan
On Tue, Apr 3, 2012 at 4:45 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Aamir,


 On Tue, Apr 3, 2012 at 12:05 PM, Aamir Khan syst3m.w...@gmail.com wrote:


 Exactly, I will have the full summer to understand and get up to speed. But
 since my knowledge is very limited my proposal won't be too good.. :)


 This doesn't need to be the case. In fact it is crucial that the
 submission is of a reasonable quality. The original issue was pretty well
 discussed iirc, and additionally there is also some code uploaded by the
 original author so you could have a look at that over the next few days
 before taking a crack at the submission. I can say one thing for sure
 though, this issue might need to be branded more generically... just now
 Nutch would benefit more from a generically oriented plugin for scraping
 various parts of html. The original author had a use case driven approach
 to this issue which meant he had to extract very specific content from news
 sites... this may not suit you, and certainly isn't absolutely everyone's
 cup of tea within the community. It would be great if you could discuss
 both in your application and on the Jira thread how the issue could be
 opened up, subsequently enabling more Nutch users to benefit... as you are
 stepping up to apply here, how you wish to do this is entirely your own
 choice so I would take the positives from the flexibility you have here and
 focus on them within your submission. Does this sound reasonable?


Sounds good to me. Where can I chat with the Nutch developers? Not many people
are present on the IRC channel #nutch. BTW, I created a rough draft with my
personal bio and other necessary information and submitted it to
google-melange [1]. I will update the project schedule soon, preferably
after having some discussions.

[1] =
http://google-melange.appspot.com/gsoc/proposal/review/google/gsoc2012/syst3mw0rm/9001


 I look forward to seeing any progress you have and will seriously consider
 stepping up to be a potential mentor as it was me that added the issue to
 GSoC list of projects.


that would be great!!


 Thank you

 Lewis





-- 
Aamir Khan | 3rd Year | Computer Science & Engineering | IIT Roorkee


[jira] [Updated] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1249:
-

Affects Version/s: (was: 1.5)
Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Resolve all issues flagged up by adding javac -Xlint arguement
 --

 Key: NUTCH-1249
 URL: https://issues.apache.org/jira/browse/NUTCH-1249
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.6


 There is a heap of issues flagged up by NUTCH-1237; I think over time it 
 would be great to get these addressed and resolved.
 What is interesting is that adding the same arguments to 
 /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
 Some of this is documented at the link below:
 http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options





[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1273:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix [deprecation] javac warnings
 

 Key: NUTCH-1273
 URL: https://issues.apache.org/jira/browse/NUTCH-1273
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, 
 NUTCH-1273-v2-trunk.patch


 As part of this task, these warnings should be resolved. However, this 
 particular strand of warnings can be resolved either by adding
 {code}
 @SuppressWarnings("deprecation")
 {code}
 or by actually upgrading our class usage to rely upon non-deprecated classes. 
 Which option is more appropriate for the project?
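
To make the two options concrete, here is a minimal sketch in plain Java. It does not touch any Nutch class; `DeprecationOptions` and both method names are hypothetical, and `java.util.Date.getYear()` is used only because it is a well-known deprecated JDK method.

```java
import java.util.Calendar;
import java.util.Date;

final class DeprecationOptions {
    // Option 1: keep the deprecated call and silence the warning locally.
    @SuppressWarnings("deprecation")
    static int yearViaDeprecatedApi(Date date) {
        return date.getYear() + 1900; // Date.getYear() is deprecated
    }

    // Option 2: migrate to the non-deprecated replacement.
    static int yearViaCalendar(Date date) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(date);
        return cal.get(Calendar.YEAR);
    }
}
```

Both methods return the same year for the same date. The first only hides the warning at the narrowest possible scope, while the second removes its cause.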





[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.6

 Attachments: merged_segment_output.txt, unmerged_segment_output.txt


 When I run Nutch, I use the following steps:
 nutch inject crawldb/ url.txt
 repeated 3 times:
 nutch generate crawldb/ segments/ -normalize
 nutch fetch `ls -d segments/* | tail -1`
 nutch parse `ls -d segments/* | tail -1`
 nutch update crawldb `ls -d segments/* | tail -1`
 nutch mergesegs merged/ -dir segments/
 nutch invertlinks linkdb/ -dir merged/
 nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
 indexing code from Nutch 1.1).
 When I crawl with merging segments, I lose about 20% of the URLs that wind up 
 in the index vs. when I crawl without merging the segments.  Somehow the 
 segment merger causes me to lose ~20% of my crawl database!





[jira] [Updated] (NUTCH-1116) Write JUnit tests for all plugins

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1116:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Write JUnit tests for all plugins  
 ---

 Key: NUTCH-1116
 URL: https://issues.apache.org/jira/browse/NUTCH-1116
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is a step towards covering the parts of our plugin codebase which 
 are currently missing JUnit test cases. Each plugin will have its own 
 sub-issue meaning that this parent issue should not be deemed complete until 
 all existing (and newly contributed) plugins have the appropriate test cases.





[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1084:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 ReadDB url throws exception
 ---

 Key: NUTCH-1084
 URL: https://issues.apache.org/jira/browse/NUTCH-1084
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 Readdb -url suffers from two problems:
 1. it trips over the _SUCCESS file generated by newer Hadoop versions
 2. it throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
 The first problem can be remedied by not allowing the injector or updater to 
 write the _SUCCESS file. Until now that's the solution implemented for 
 similar issues. I have not succeeded in making the Hadoop readers simply 
 skip the file.
 The second issue seems a bit strange and did not happen on a local check out. 
 I'm not yet sure whether this is a Hadoop issue or something being corrupt in 
 the CrawlDB. Here's the stack trace:
 {code}
 Exception in thread "main" java.io.IOException: can't find class: 
 org.apache.nutch.protocol.ProtocolStatus because 
 org.apache.nutch.protocol.ProtocolStatus
 at 
 org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
 at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
 at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
 at 
 org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
 at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
 at 
 org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
 at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
 at 
 org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
 at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 {code}
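
For the first problem, the "skip the file" idea could look roughly like the sketch below. It is plain Java over entry names, not the Hadoop reader API, and it assumes marker files can be recognised by a leading underscore (or a leading dot for checksum files). `DataFileFilter` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class DataFileFilter {
    // Assumption: entries starting with '_' or '.' are markers or hidden files,
    // not data (for example _SUCCESS).
    static boolean isDataEntry(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Keeps only the entries a reader should try to open.
    static List<String> dataEntries(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isDataEntry(name)) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

Whether such a filter can actually be plugged into the readers used by readdb is exactly the open question here.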





[jira] [Updated] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1150:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 http.redirect.max can lead to multiple parses of the same url
 -

 Key: NUTCH-1150
 URL: https://issues.apache.org/jira/browse/NUTCH-1150
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3, 1.4
Reporter: Markus Jelsma
 Fix For: 1.6


 With http.redirect.max > 0 it's possible that a document is parsed multiple 
 times. This is the case when several URLs from the same fetch redirect to a 
 shared location.
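
A small sketch of the situation, assuming the redirects of one fetch are available as a source-to-target map. `RedirectDedup` and `targetsToParse` are hypothetical names and nothing here is Fetcher code. Collecting the targets into a set shows how many parses are really needed.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class RedirectDedup {
    // Returns the distinct redirect targets in first-seen order, so each
    // shared location would be parsed only once.
    static Set<String> targetsToParse(Map<String, String> redirects) {
        return new LinkedHashSet<>(redirects.values());
    }
}
```

With three source URLs of which two redirect to the same location, the map holds three values but the set holds only two targets.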





[jira] [Updated] (NUTCH-1147) WebGraph nodeDumper uses only 1 reducer

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1147:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 WebGraph nodeDumper uses only 1 reducer
 ---

 Key: NUTCH-1147
 URL: https://issues.apache.org/jira/browse/NUTCH-1147
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6

 Attachments: NUTCH-1147-1.5-1.patch


 The nodeDumper is restricted to only one reducer, making it slow and 
 producing files that are too large.





[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1194:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 CrawlDB lock should be released earlier
 ---

 Key: NUTCH-1194
 URL: https://issues.apache.org/jira/browse/NUTCH-1194
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 The lock on the CrawlDB is released only when everything is finished. But when 
 generating many segments, the lock remains in place even though it is no longer 
 necessary. If GENERATE_UPDATE_DB is false we can release the lock immediately 
 after the selector has finished.





[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1201:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: CustomFetcher.java, NUTCH-1201-1.5-wip.patch


 For certain cases we need to modify parts of FetcherThread and make it 
 pluggable. This introduces a new config directive fetcher.impl that takes a 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and its inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
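
The loading mechanism can be sketched in plain Java as follows. This is not the attached patch: `BaseFetcher`, `SketchFetcher` and `FetcherLoader` are hypothetical stand-ins, and a `java.util.Properties` object stands in for the job configuration. Only the directive name fetcher.impl comes from the description above.

```java
import java.util.Properties;

class BaseFetcher {
    String describe() { return "default fetcher"; }
}

class SketchFetcher extends BaseFetcher {
    @Override
    String describe() { return "custom fetcher"; }
}

final class FetcherLoader {
    // Reads the FQCN from fetcher.impl, falls back to the default class,
    // and rejects classes that do not extend BaseFetcher.
    static BaseFetcher load(Properties conf) {
        String fqcn = conf.getProperty("fetcher.impl", BaseFetcher.class.getName());
        try {
            Class<? extends BaseFetcher> impl =
                Class.forName(fqcn).asSubclass(BaseFetcher.class);
            return impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load fetcher.impl=" + fqcn, e);
        }
    }
}
```

Leaving fetcher.impl unset keeps the default behaviour, so existing configurations would not need to change.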





[jira] [Updated] (NUTCH-1183) Summary task for adding command line usage instructions to webgraph classes

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1183:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Summary task for adding command line usage instructions to webgraph classes
 ---

 Key: NUTCH-1183
 URL: https://issues.apache.org/jira/browse/NUTCH-1183
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 The following files should provide output when called incorrectly from the 
 command line. Something similar to 
 {code}
 Usage: class -arg1, -arg2, etc etc
 {code}
 * webgraph
 * linkrank
 * scoreupdater
 * nodedumper
 * nodereader
 If anyone would like to see further classes included in this task please add 
 to the above list.
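
A minimal sketch of the pattern, using a made-up tool. `UsageSketch` and its -input/-output options are hypothetical and are not the real arguments of any webgraph class. The point is only that a missing or incomplete argument list prints the usage line and returns a non-zero code.

```java
final class UsageSketch {
    static final String USAGE = "Usage: UsageSketch -input <dir> -output <dir>";

    // Returns 0 on success, -1 after printing the usage line.
    static int run(String[] args) {
        String input = null;
        String output = null;
        for (int i = 0; i + 1 < args.length; i += 2) {
            if ("-input".equals(args[i])) {
                input = args[i + 1];
            } else if ("-output".equals(args[i])) {
                output = args[i + 1];
            }
        }
        if (input == null || output == null) {
            System.err.println(USAGE);
            return -1;
        }
        System.out.println("Would process " + input + " into " + output);
        return 0;
    }

    public static void main(String[] args) {
        System.exit(run(args));
    }
}
```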





[jira] [Updated] (NUTCH-1176) Fix all javadoc warnings from nightly builds

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1176:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix all javadoc warnings from nightly builds
 

 Key: NUTCH-1176
 URL: https://issues.apache.org/jira/browse/NUTCH-1176
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 The warnings can clearly be seen from the javadoc target (near bottom) of any 
 successful nightly build. An example is provided below.
 https://builds.apache.org/job/nutch-trunk/1638/console





[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1040:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Backport REST-API from 2.0
 --

 Key: NUTCH-1040
 URL: https://issues.apache.org/jira/browse/NUTCH-1040
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Reporter: Julien Nioche
 Fix For: 1.6


 See https://issues.apache.org/jira/browse/NUTCH-880 





[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1274:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix [cast] javac warnings
 -

 Key: NUTCH-1274
 URL: https://issues.apache.org/jira/browse/NUTCH-1274
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 A typical example of this is
 {code}
 trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] 
 redundant cast to int
 [javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 + 
 {code}
 These should all be fixed by replacing them with the correct implementations.
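
As an illustration for the CrawlDatum case, here is a sketch of packing four signature bytes into an int without any cast. Shifting a byte already yields an int in Java, which is why an (int) cast around such an expression is redundant. The explicit parentheses matter because + binds tighter than << in Java. `SignatureBits.pack` is a hypothetical helper, and the & 0xff masking is an assumption about the intended result, not the committed fix.

```java
final class SignatureBits {
    // Packs four bytes starting at offset i into one int, most significant first.
    // Each byte is masked so that negative bytes do not sign-extend.
    static int pack(byte[] signature, int i) {
        return ((signature[i] & 0xff) << 24)
             | ((signature[i + 1] & 0xff) << 16)
             | ((signature[i + 2] & 0xff) << 8)
             | (signature[i + 3] & 0xff);
    }
}
```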





[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1233:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Rely on Tika for outlink extraction
 ---

 Key: NUTCH-1233
 URL: https://issues.apache.org/jira/browse/NUTCH-1233
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1233-1.5-wip.patch


 Tika provides outlink extraction features that are not used in Nutch. To be 
 able to use them in Nutch we need Tika to return the rel attr value of each 
 link, which it currently doesn't. There's a patch for Tika 1.1. Once that patch 
 is included in Tika and we have upgraded to that new version, this issue can be 
 worked on. Here's preliminary code that does both Tika and current outlink 
 extraction. This also includes parts of the Boilerpipe code.





[jira] [Updated] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1014:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate from Apache ORO to java.util.regex
 --

 Key: NUTCH-1014
 URL: https://issues.apache.org/jira/browse/NUTCH-1014
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.6


 A separate issue tracking migration of all components from Apache ORO to 
 java.util.regex. Components involved are:
 - RegexURLNormalizer
 - OutlinkExtractor
 - JSParseFilter
 - MoreIndexingFilter
 - BasicURLNormalizer
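
To give an idea of what the java.util.regex side of such a migration might look like, here is a sketch of link extraction from plain text. The class name and the deliberately simple URL pattern are hypothetical and are not the pattern OutlinkExtractor uses. The null check is an assumption about how missing text should be handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexOutlinkSketch {
    // Hypothetical, deliberately simple URL pattern.
    private static final Pattern URL = Pattern.compile("https?://[^\\s<>\"']+");

    // Returns every substring of plainText that matches the URL pattern.
    static List<String> extract(String plainText) {
        List<String> links = new ArrayList<>();
        if (plainText == null) {
            return links; // no text, no outlinks
        }
        Matcher m = URL.matcher(plainText);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }
}
```

Compiling the pattern once into a static final field keeps it thread-safe and avoids recompiling it for every document.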





[jira] [Updated] (NUTCH-1063) OutlinkExtractor test generates an exception but does not fail

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1063:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 OutlinkExtractor test generates an exception but does not fail
 --

 Key: NUTCH-1063
 URL: https://issues.apache.org/jira/browse/NUTCH-1063
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Julien Nioche
 Fix For: 1.6


 Testsuite: org.apache.nutch.parse.TestOutlinkExtractor
 Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.043 sec
 - Standard Output ---
 2011-07-19 15:06:36,073 ERROR parse.OutlinkExtractor 
 (OutlinkExtractor.java:getOutlinks(121)) - getOutlinks
 java.lang.NullPointerException
   at org.apache.oro.text.regex.PatternMatcherInput.<init>(Unknown Source)
   at 
 org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:95)
   at 
 org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:72)
   at 
 org.apache.nutch.parse.TestOutlinkExtractor.testGetNoOutlinks(TestOutlinkExtractor.java:40)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at junit.framework.TestCase.runTest(TestCase.java:168)
   at junit.framework.TestCase.runBare(TestCase.java:134)
   at junit.framework.TestResult$1.protect(TestResult.java:110)
   at junit.framework.TestResult.runProtected(TestResult.java:128)
   at junit.framework.TestResult.run(TestResult.java:113)
   at junit.framework.TestCase.run(TestCase.java:124)
   at junit.framework.TestSuite.runTest(TestSuite.java:232)
   at junit.framework.TestSuite.run(TestSuite.java:227)
   at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
   at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
   at 
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
   at 
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
   at 
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)





[jira] [Updated] (NUTCH-1220) Upgrade Solr deps

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1220:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Upgrade Solr deps
 -

 Key: NUTCH-1220
 URL: https://issues.apache.org/jira/browse/NUTCH-1220
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 SLF4J needs to be part of the upgrade to Solr 3.5 but that breaks something else. 
 Likely Hadoop has a different SLF4J version?
 {code}
 Exception in thread "main" java.lang.NoSuchMethodError: 
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
 at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:136)
 at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:180)
 at 
 org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
 at 
 org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
 at 
 org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1418)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109)
 at 
 org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:544)
 at 
 org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:339)
 at 
 org.apache.nutch.util.domain.DomainStatistics.run(DomainStatistics.java:108)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.util.domain.DomainStatistics.main(DomainStatistics.java:215)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 {code}





[jira] [Updated] (NUTCH-1123) JUnit test for scoring-link

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1123:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for scoring-link
 ---

 Key: NUTCH-1123
 URL: https://issues.apache.org/jira/browse/NUTCH-1123
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-865) Format source code in unique style

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-865:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Format source code in unique style
 --

 Key: NUTCH-865
 URL: https://issues.apache.org/jira/browse/NUTCH-865
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Pham Tuan Minh
Assignee: Lewis John McGibbney
 Fix For: 1.6

 Attachments: NUTCH-865-nutchgora-rev1188268.patch, 
 NUTCH-865-trunk-rev1188252.patch, NUTCH-865.patch


 We should define standard formatting rules for source code/comments, then use 
 the Eclipse tool to format the whole source code in the same style. 





[jira] [Updated] (NUTCH-1120) JUnit test for microformats-reltag

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1120:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for microformats-reltag
 --

 Key: NUTCH-1120
 URL: https://issues.apache.org/jira/browse/NUTCH-1120
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1186:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 FreeGenerator always normalizes
 ---

 Key: NUTCH-1186
 URL: https://issues.apache.org/jira/browse/NUTCH-1186
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 The FreeGenerator does not honor the -normalize option; it always normalizes 
 all URLs in the input directory. The -filter option is respected.





[jira] [Updated] (NUTCH-1308) Unnecessary truncate content configuration, and logging in parse-zip/ZipParser

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1308:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Unnecessary truncate content configuration, and logging in 
 parse-zip/ZipParser  
 

 Key: NUTCH-1308
 URL: https://issues.apache.org/jira/browse/NUTCH-1308
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
 Fix For: 1.6


 Two issues here...
 1) Recently ferdy committed NUTCH-965 which skips parsing of truncated 
 documents. Parse zip has its own implementation for the same when it should 
 really draw on the aforementioned implementation.
 2) If (in the offending piece of code shown below) truncation occurs, we 
 get an incorrect log message: "Parser can't handle incomplete pdf 
 file"!!! This is incorrect, shouldn't be there, and should be removed.
 {code}
 if (contentLen != null && contentInBytes.length != len) {
   return new ParseStatus(ParseStatus.FAILED,
       ParseStatus.FAILED_TRUNCATED, "Content truncated at "
           + contentInBytes.length
           + " bytes. Parser can't handle incomplete pdf file.")
       .getEmptyParseResult(content.getUrl(), getConf());
 }
 {code}
 For clarity, the issue is present in both Nutchgora branch[1] and Nutch 
 trunk[2]
 [1] 
 https://svn.apache.org/viewvc/nutch/branches/nutchgora/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
 [2] 
 https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
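
For discussion, a format-neutral version of the truncation check might look like the sketch below. It is plain Java with hypothetical names (`TruncationCheck`, `truncationMessage`) and does not use ParseStatus or the NUTCH-965 implementation. It only shows a message that no longer mentions pdf.

```java
final class TruncationCheck {
    // Returns null when the content looks complete, otherwise a message that
    // does not name any particular file format.
    static String truncationMessage(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null) {
            return null; // nothing to compare against
        }
        int expected;
        try {
            expected = Integer.parseInt(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return null; // unusable header, treat as unknown
        }
        if (content.length >= expected) {
            return null;
        }
        return "Content truncated at " + content.length
            + " bytes (expected " + expected + ").";
    }
}
```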





[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1252:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 SegmentReader -get shows wrong data
 ---

 Key: NUTCH-1252
 URL: https://issues.apache.org/jira/browse/NUTCH-1252
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
 Fix For: 1.6

 Attachments: NUTCH-1252-v2.patch, NUTCH-1252.patch


 The command/option -get of the SegmentReader may show wrong data associated 
 with the given URL. 
 To reproduce:
 {code}
 % mkdir -p test_readseg/urls
 % echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" \
   > test_readseg/urls/seeds
 % nutch inject test_readseg/crawldb test_readseg/urls
 Injector: starting at 2012-01-18 09:32:25
 Injector: crawlDb: test_readseg/crawldb
 Injector: urlDir: test_readseg/urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
 % nutch generate test_readseg/crawldb test_readseg/segments/
 Generator: starting at 2012-01-18 09:32:30
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: test_readseg/segments/20120118093232
 Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
 % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' 
 -nocontent -noparse -nofetch -noparsedata -noparsetext
 SegmentReader: get 'http://nutch.apache.org/'
 Crawl Generate::
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Wed Jan 18 09:32:26 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 10.0
 Signature: null
 Metadata: _ngt_: 1326875550401test: AbcTest
 {code}
 The metadata and the score indicate that the CrawlDatum shown is the wrong 
 one: it is the one associated with http://abc.test/, not with http://nutch.apache.org/.





[jira] [Updated] (NUTCH-1121) JUnit test for parse-js

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1121:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for parse-js
 ---

 Key: NUTCH-1121
 URL: https://issues.apache.org/jira/browse/NUTCH-1121
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-809) Parse-metatags plugin

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-809:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.6

 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, 
 NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com
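
As a rough illustration of the configuration described above, the following sketch shows how a ';'-separated metatags.names value with a '*' default could be interpreted. The class MetatagNames and its methods are hypothetical names made up for this example; they are not taken from the plugin's source.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not the plugin's code): interprets a metatags.names value. */
final class MetatagNames {

    /** Splits a ';'-separated list; a missing or blank value falls back to the '*' default. */
    static List<String> parse(String value) {
        List<String> names = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            names.add("*");
            return names;
        }
        for (String part : value.split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** True when the metatag is listed explicitly or the wildcard is configured. */
    static boolean wanted(List<String> names, String metatag) {
        return names.contains("*") || names.contains(metatag);
    }

    public static void main(String[] args) {
        List<String> names = parse("description;keywords");
        System.out.println(names + ", author wanted: " + wanted(names, "author"));
    }
}
```

With the value from the nutch-site.xml example, only description and keywords would be selected; with no value at all, the wildcard selects every metatag.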





[jira] [Updated] (NUTCH-1046) Add tests for indexing to SOLR

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1046:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Add tests for indexing to SOLR
 --

 Key: NUTCH-1046
 URL: https://issues.apache.org/jira/browse/NUTCH-1046
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.6


 We currently have no tests for checking that the indexing to SOLR works as 
 expected. Running an embedded SOLR Server within the tests would be good.





[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1228:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Change mapred.task.timeout to mapreduce.task.timeout in fetcher
 ---

 Key: NUTCH-1228
 URL: https://issues.apache.org/jira/browse/NUTCH-1228
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6








[jira] [Updated] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1001:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 bin/nutch fetch/parse handle crawl/segments directory
 -

 Key: NUTCH-1001
 URL: https://issues.apache.org/jira/browse/NUTCH-1001
 Project: Nutch
  Issue Type: Improvement
Reporter: Gabriele Kahlout
Priority: Minor
 Fix For: 1.6

 Attachments: Fetcher.java, NUTCH-1001.patch, nutch1001v2.patch


 I'm having issues porting scripts across different systems to support the 
 step of extracting the latest/only segments resulting from the generate phase.
 Variants include:
 $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
 $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
 $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o 
 [a-zA-Z0-9/\-]* |tail -1`
 $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o 
 [a-zA-Z0-9/\-]* |tail -1`
 And I'm not sure what Windows users would have to do. Some users may also make do 
 with:
 bin/nutch fetch with crawl/segments/2*
 But I don't see why the user should have to extract or worry about the latest/only 
 segment, or why that should be a described step in every Nutch tutorial. Moreover, only 
 fetch and parse expect a single segment, while other commands are fine with the 
 directory of segments.
 Therefore, I think it would be beneficial if fetch and parse also handled directories 
 of segments. 
 [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
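
For a local (non-HDFS) crawl directory, the "latest segment" lookup that the shell variants above perform can be sketched in Java as follows. This is only an illustration, assuming that segment directories are named by timestamp so that the lexicographically greatest name is the newest; the class LatestSegment is a made-up name and not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical helper: picks the newest segment from a local segments directory. */
final class LatestSegment {

    /** Returns the sub-directory whose name sorts last, or empty if there is none. */
    static Optional<Path> find(Path segmentsDir) throws IOException {
        if (!Files.isDirectory(segmentsDir)) {
            return Optional.empty();
        }
        try (Stream<Path> children = Files.list(segmentsDir)) {
            return children
                .filter(Files::isDirectory)
                .max(Comparator.comparing((Path p) -> p.getFileName().toString()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : "crawl/segments");
        System.out.println(find(dir).map(Path::toString).orElse("no segments found"));
    }
}
```

A command that accepts the segments directory could run this kind of lookup itself, so that scripts no longer depend on ls, tail or grep.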





[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1060:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 URL filters to produce regexes to be used by OutlinkExtractor.
 --

 Key: NUTCH-1060
 URL: https://issues.apache.org/jira/browse/NUTCH-1060
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
 Fix For: 1.6


 The problem:
 OutlinkExtractor produces many URLs from plain text using an advanced 
 regular expression:
 {code}
 ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)
 {code}
 This expression does not take into account the various non-regex-based URL 
 filters such as prefix, domain and suffix, and thus produces URLs that are 
 going to be filtered out by some filter anyway. This becomes a problem 
 when parsing millions of documents that are processed by the 
 OutlinkExtractor (in case parse-html|parse-tika do not produce any 
 outlinks). Large bodies of full text usually contain a lot of sequences that 
 are extracted as URLs, many of which are mistaken for a URI scheme, 
 such as:
 id:123
 says:what
 user:doe
 update:tue-19-jul
 The above examples can easily be remedied by using a configured prefix URL 
 filter. It may, however, be an even better idea to prevent the extraction of 
 these URLs in the first place. No extraction means fewer URLs to filter and 
 potentially saves a lot of data.
 Comments? I'll see if I can produce a patch.
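
The effect described above can be reproduced with a small sketch. It applies the expression exactly as printed in this message to a line of plain text, then drops candidates that do not start with an allowed prefix. The class name and the prefix list are assumptions chosen for the example; this is not the OutlinkExtractor code or the proposed patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo: extract URL-like sequences, then apply a prefix check. */
final class OutlinkRegexDemo {

    // Expression copied as it appears in this message.
    static final Pattern URL_PATTERN = Pattern.compile(
        "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}"
        + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)");

    // Assumed prefixes, standing in for a configured prefix URL filter.
    static final List<String> ALLOWED_PREFIXES = List.of("http://", "https://");

    /** Returns every sequence in the text that the expression accepts. */
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Keeps only candidates that start with one of the allowed prefixes. */
    static List<String> keepAllowed(List<String> candidates) {
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            if (ALLOWED_PREFIXES.stream().anyMatch(c::startsWith)) {
                kept.add(c);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> all = extract("see http://nutch.apache.org/ id:123 says:what");
        System.out.println("extracted: " + all);
        System.out.println("kept:      " + keepAllowed(all));
    }
}
```

Here the filtering happens after extraction. The idea in the issue is to derive the restriction from the configured filters so that sequences like id:123 are never extracted at all.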





[jira] [Updated] (NUTCH-1100) SolrDedup broken

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1100:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 SolrDedup broken
 

 Key: NUTCH-1100
 URL: https://issues.apache.org/jira/browse/NUTCH-1100
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6


 Some Solr indices are unable to be deduped from Nutch. For unknown reasons 
 Nutch will throw the exception below. There are no peculiarities to be found 
 in the Solr logs, the queries are normal and seem to succeed.
 {code}
 java.lang.NullPointerException
 at org.apache.hadoop.io.Text.encode(Text.java:388)
 at org.apache.hadoop.io.Text.set(Text.java:178)
 at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
 at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
 at 
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
 at 
 org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 {code}





[jira] [Updated] (NUTCH-1124) JUnit test for scoring-opic

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1124:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for scoring-opic
 ---

 Key: NUTCH-1124
 URL: https://issues.apache.org/jira/browse/NUTCH-1124
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1197:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Add statically configured field values to solrindex-mapping.xml
 ---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.6

 Attachments: NUTCH-1197.patch


 In some cases it's useful to be able to add to every document sent to Solr a 
 set of predefined fields with static values. This could be implemented on the 
 Solr side (with a custom UpdateRequestProcessor), but it may be less 
 cumbersome to add them on the Nutch side.
 Example: let's say I have several Nutch configurations all indexing to the 
 same Solr instance, and I want each of them to add its identifier as a field 
 in all documents, e.g. origin=web_crawl_1, origin=file_crawl, 
 origin=unlimited_crawl, etc...
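
A minimal sketch of the idea, using plain maps instead of Nutch or Solr document classes: every outgoing document receives a fixed set of field values. The class name is hypothetical, and the choice not to overwrite a field the document already carries is an assumption made for this example; the issue does not specify that behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical demo: add statically configured fields to a document before indexing. */
final class StaticFieldsDemo {

    /** Returns a copy of the document with the static fields added (existing fields win). */
    static Map<String, String> withStaticFields(Map<String, String> doc,
                                                Map<String, String> staticFields) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        staticFields.forEach(out::putIfAbsent);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "http://nutch.apache.org/");
        Map<String, String> statics = Map.of("origin", "web_crawl_1");
        System.out.println(withStaticFields(doc, statics));
    }
}
```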





[jira] [Updated] (NUTCH-1122) JUnit test for protocol-ftp

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1122:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for protocol-ftp
 ---

 Key: NUTCH-1122
 URL: https://issues.apache.org/jira/browse/NUTCH-1122
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1127:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for urlfilter-validator
 --

 Key: NUTCH-1127
 URL: https://issues.apache.org/jira/browse/NUTCH-1127
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1247:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1247.patch_A, NUTCH-1247.patch_B


 CrawlDatum.retries is a byte and wraps around to negative values once it exceeds 127:
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
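
The negative retry counts in the log are what Java's signed byte arithmetic produces when a counter passes 127. The sketch below only illustrates that wrap-around; the class and method names are made up, and it does not reproduce how Nutch actually updates the field.

```java
/** Hypothetical demo of a retry counter stored as a byte versus an int. */
final class RetryCounterDemo {

    /** Increments a counter kept in a byte; the cast wraps 127 + 1 to -128. */
    static byte incrementAsByte(byte retries) {
        return (byte) (retries + 1);
    }

    /** Increments a counter kept in an int; no wrap-around in this range. */
    static int incrementAsInt(int retries) {
        return retries + 1;
    }

    public static void main(String[] args) {
        byte b = 127;
        System.out.println("byte: " + incrementAsByte(b)); // prints "byte: -128"
        System.out.println("int:  " + incrementAsInt(127)); // prints "int:  128"
    }
}
```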





[jira] [Updated] (NUTCH-208) http: proxy exception list:

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-208:


Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 http: proxy exception list:
 ---

 Key: NUTCH-208
 URL: https://issues.apache.org/jira/browse/NUTCH-208
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8, 1.3, nutchgora
Reporter: Matthias Günter
Assignee: Lewis John McGibbney
Priority: Trivial
  Labels: patch
 Fix For: 1.6

 Attachments: NUTCH-208-branch-1.4-20110210-v3.patch, 
 NUTCH-208-branch-1.4-20110807.patch, NUTCH-208-branch-1.4-20110809-v2.patch, 
 NUTCH-208-trunk-2.0-20110810-v2.patch, NUTCH-208-trunk-2.0-20110810.patch, 
 patch.txt, patch.txt, proxy_exception_list-0.8.diff


 I suggest adding a parameter to nutch-default.xml that allows a proxy 
 exception list to be defined. 
 <property>
   <name>http.proxy.exception.list</name>
   <value></value>
   <description>URLs and hosts that don't use the proxy (e.g. 
 intranets)</description>
 </property>
 This is useful when scanning intranet/internet combinations from behind a 
 firewall. A preliminary patch showing the changes is attached to this request. 
 We will test it and update it if necessary. This also 
 reflects the reality in web browsers, where there is in most cases an 
 exception list.
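
How such a list might be consulted can be sketched as follows. This is not the attached patch: the class name is hypothetical, and both the comma-separated format and the exact-host matching are assumptions made for the example, since the description above does not define them.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical demo: decide per URL whether the configured proxy should be used. */
final class ProxyExceptions {

    private final Set<String> hosts = new HashSet<>();

    /** Parses an assumed comma-separated list of host names. */
    ProxyExceptions(String configured) {
        if (configured != null) {
            for (String part : configured.split(",")) {
                String host = part.trim().toLowerCase(Locale.ROOT);
                if (!host.isEmpty()) {
                    hosts.add(host);
                }
            }
        }
    }

    /** Returns false when the URL's host is on the exception list. */
    boolean useProxy(String url) {
        String host = URI.create(url).getHost();
        return host == null || !hosts.contains(host.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ProxyExceptions ex = new ProxyExceptions("intranet.example.com, wiki.local");
        System.out.println(ex.useProxy("http://intranet.example.com/page"));
        System.out.println(ex.useProxy("http://nutch.apache.org/"));
    }
}
```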





[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1031:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.6


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly





[jira] [Updated] (NUTCH-1107) Log slow parse entries

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1107:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Log slow parse entries
 --

 Key: NUTCH-1107
 URL: https://issues.apache.org/jira/browse/NUTCH-1107
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6


 Parse mapper and outputformat should have a facility to log (configurable) 
 slow entries. This is useful for debugging slow parses. Logging parser keys 
 only is not good enough, especially in a distributed environment.
 Sometimes the actual parse (mapper) is very slow and sometimes the 
 normalization and filtering of an entry's outlinks is slow.





[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-585:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using Nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, perhaps with the comment 
 strings factored out as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest, or to an explanation of why what we are doing is 
 plain wrong!
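
The behaviour described above can be approximated on raw HTML text with a regular expression. This sketch is not the contributed plugin change: the class name is hypothetical, and it works on a string rather than on the DOM that the parse-html plugin processes.

```java
import java.util.regex.Pattern;

/** Hypothetical demo: drop everything between START-IGNORE and STOP-IGNORE comments. */
final class IgnoreBlocks {

    // Reluctant match so that several ignored regions on one page are removed separately.
    static final Pattern IGNORED = Pattern.compile(
        "<!--\\s*START-IGNORE\\s*-->.*?<!--\\s*STOP-IGNORE\\s*-->", Pattern.DOTALL);

    /** Returns the HTML with all ignored regions, markers included, removed. */
    static String strip(String html) {
        return IGNORED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p><!-- START-IGNORE --><a href=\"?bg=red\">red</a>"
            + "<!-- STOP-IGNORE --><p>also keep</p>";
        System.out.println(strip(html));
    }
}
```

Making the two marker strings configurable, as suggested above, would mean building the pattern from configuration values (quoted with Pattern.quote) instead of hard-coding them.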





[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1320:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDNs and throw an NPE:
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}
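
For reference, the JDK can convert an internationalized host name to its ASCII (Punycode) form with java.net.IDN. The sketch below shows only that conversion for the host used in the command above, written with Unicode escapes. Whether a conversion of this kind is what the checkers are missing is an assumption; the message does not state the cause of the NPE. The class name is hypothetical.

```java
import java.net.IDN;

/** Hypothetical demo: convert an internationalized host name to its ASCII form. */
final class IdnHostDemo {

    /** Returns the ASCII-compatible encoding of the host, label by label. */
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // Host from the parsechecker command above, as Unicode escapes.
        String host = "\u4f8b\u5b50.\u6e2c\u8a66";
        System.out.println(toAsciiHost(host));
    }
}
```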





[jira] [Updated] (NUTCH-1126) JUnit test for urlfilter-prefix

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1126:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for urlfilter-prefix
 ---

 Key: NUTCH-1126
 URL: https://issues.apache.org/jira/browse/NUTCH-1126
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1087:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Deprecate crawl command and replace with example script
 ---

 Key: NUTCH-1087
 URL: https://issues.apache.org/jira/browse/NUTCH-1087
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 * remove the crawl command
 * add basic crawl shell script
 See thread:
 http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html





[jira] [Updated] (NUTCH-1128) JUnit test for urlmeta

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1128:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for urlmeta
 --

 Key: NUTCH-1128
 URL: https://issues.apache.org/jira/browse/NUTCH-1128
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a Junit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1034) Create Solr Velocity templates

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1034:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Create Solr Velocity templates
 --

 Key: NUTCH-1034
 URL: https://issues.apache.org/jira/browse/NUTCH-1034
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: doc.vm.patch, facets.vm.patch


 Solr has Velocity integration and provides an easy method for creating HTML 
 based front-ends for the search engine. This issue tracks the development of 
 Velocity templates specifically for Nutch users.





[jira] [Updated] (NUTCH-1179) Option to restrict generated records by metadata

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1179:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Option to restrict generated records by metadata
 

 Key: NUTCH-1179
 URL: https://issues.apache.org/jira/browse/NUTCH-1179
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 The generator should be able to select entries based on a metadata key/value 
 pair.





[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1300:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Indexer to normalize URL's
 --

 Key: NUTCH-1300
 URL: https://issues.apache.org/jira/browse/NUTCH-1300
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1300-1.5-1.patch


 Indexers should be able to normalize URL's. This is useful when a new 
 normalizer is applied to the entire CrawlDB. Without it, some or all records 
 in a segment cannot be indexed at all.





[jira] [Updated] (NUTCH-1130) JUnit test for Any23 RDF plugin

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1130:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for Any23 RDF plugin
 ---

 Key: NUTCH-1130
 URL: https://issues.apache.org/jira/browse/NUTCH-1130
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 The JUnit test should be written prior to the progression of the Any23 Nutch 
 plugin





[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1047:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.6


 One possible feature would be to add a new endpoint for indexing-backends and 
 make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
 OK - but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though : we already have indexing-plugins (which are about 
 generating fields sent to the backends) and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
 could be pertaining to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far - please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.





[jira] [Updated] (NUTCH-1035) Tune Solr config for Nutch users

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1035:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Tune Solr config for Nutch users
 

 Key: NUTCH-1035
 URL: https://issues.apache.org/jira/browse/NUTCH-1035
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: solrconfig.xml


 To improve and ease integration with Solr we should provide a solrconfig.xml 
 specifically for Nutch integration including a request handler with a 
 Velocity response writer.





[jira] [Updated] (NUTCH-1226) Migrate CrawlDbReader to MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1226:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate CrawlDbReader to MapReduce API
 --

 Key: NUTCH-1226
 URL: https://issues.apache.org/jira/browse/NUTCH-1226
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1226-1.5-1.patch


 Hadoop 0.21 only!





[jira] [Updated] (NUTCH-1223) Migrate WebGraph to MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1223:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate WebGraph to MapReduce API
 -

 Key: NUTCH-1223
 URL: https://issues.apache.org/jira/browse/NUTCH-1223
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
 Fix For: 1.6








[jira] [Updated] (NUTCH-1202) Fetcher timebomb kills long waiting fetch jobs

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1202:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Fetcher timebomb kills long waiting fetch jobs
 --

 Key: NUTCH-1202
 URL: https://issues.apache.org/jira/browse/NUTCH-1202
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Markus Jelsma
 Fix For: 1.6


 The timebomb feature kills off mappers of jobs that have been waiting too long 
 in the job queue. The timebomb feature should start at mapper initialization 
 instead, not in job init.
 Thoughts?
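
 The difference between the two starting points can be shown with a small sketch. The class below is hypothetical and is not the fetcher's actual time-limit code; it only assumes that the limit is a fixed duration added to some start time, in arbitrary but consistent units.

```java
/** Hypothetical sketch of a time limit; not taken from the Nutch fetcher. */
final class TimeLimitSketch {
  private final long deadline;

  /** The start time decides when the clock begins: job init or mapper init. */
  TimeLimitSketch(long start, long limit) {
    this.deadline = start + limit;
  }

  /** True once the current time has reached the deadline. */
  boolean expired(long now) {
    return now >= deadline;
  }
}
```

 With a limit of 60 and a mapper that only starts at 50 because the job waited in the queue, a clock started at job init (0) has already expired at 70, while a clock started at mapper init (50) has not.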

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1079) StringBuffer converted to StringBuilder

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1079:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 StringBuffer converted to StringBuilder
 ---

 Key: NUTCH-1079
 URL: https://issues.apache.org/jira/browse/NUTCH-1079
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, indexer
Reporter: Karthik K
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch


 StringBuffer is used all across the codebase, even where thread-safety is 
 probably not intended. 
 This patch replaces StringBuffer with StringBuilder, as applicable. 
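
 As an illustration of the kind of change described (a sketch, not code from the attached patches): a buffer that never leaves the method is only touched by one thread, so the unsynchronized StringBuilder is a drop-in replacement for StringBuffer. The class and method names are made up for this example.

```java
/** Illustrative only; the names here are not from the Nutch codebase. */
final class StringBuilderSketch {
  /** Joins parts with a separator using a method-local StringBuilder. */
  static String join(String[] parts, char sep) {
    // Local variable: no other thread can see it, so StringBuffer's
    // synchronized methods would add overhead without adding safety.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        sb.append(sep);
      }
      sb.append(parts[i]);
    }
    return sb.toString();
  }
}
```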





[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1319:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 HostNormalizer
 --

 Key: NUTCH-1319
 URL: https://issues.apache.org/jira/browse/NUTCH-1319
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1319-1.5-1.patch


 Nutch would benefit from having a host normalizer. A host normalizer maps a 
 given host to the desired host. A basic example is to map www.apache.org to 
 apache.org. The Apache website is one of many on the internet that has a 
 duplicate website on the same domain just because it allows both www and 
 non-www to return HTTP 200 and proper content.
 It is also able to handle wildcards such as *.example.org to example.org if 
 there are multiple sub domains that actually point to the same website.
 Large internet crawls tend to get polluted very quickly due to these 
 problems. It also leads to skewed scores in the webgraph as different 
 websites link to different versions of the same duplicate website.
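
 A minimal sketch of the mapping described above, assuming exact rules are checked before wildcard rules and that unmatched hosts are returned lowercased. The class name, the addRule method and the rule syntax are hypothetical and are not taken from the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not the class from the NUTCH-1319 patch. */
final class HostNormalizerSketch {
  private final Map<String, String> exact = new LinkedHashMap<>();
  private final Map<String, String> wildcard = new LinkedHashMap<>();

  /** Registers a rule; a pattern starting with "*." matches any subdomain. */
  void addRule(String pattern, String target) {
    String p = pattern.toLowerCase(Locale.ROOT);
    if (p.startsWith("*.")) {
      wildcard.put(p.substring(1), target); // stored as ".example.org"
    } else {
      exact.put(p, target);
    }
  }

  /** Returns the desired host, or the lowercased input if no rule matches. */
  String normalize(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String hit = exact.get(h);
    if (hit != null) {
      return hit;
    }
    for (Map.Entry<String, String> e : wildcard.entrySet()) {
      if (h.endsWith(e.getKey())) {
        return e.getValue();
      }
    }
    return h;
  }
}
```

 With the rules www.apache.org to apache.org and *.example.org to example.org, the host a.b.example.org maps to example.org, while nutch.apache.org is left alone.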





[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1140:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 index-more plugin, resetTitle method creates multiple values in the Title 
 field
 ---

 Key: NUTCH-1140
 URL: https://issues.apache.org/jira/browse/NUTCH-1140
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Joe Liedtke
Priority: Minor
 Fix For: 1.6

 Attachments: MoreIndexingFilter.093011.patch


 From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
 to reset the Title field of a document if it contains a Content-Disposition 
 header. The current behavior is to add a Title regardless of whether one 
 exists or not, which can cause issues down the line with the Solr Indexing 
 process, and based on a thread in the nutch user list it appears that this is 
 causing some users to mark the title as multi-valued in the schema:
   
 http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
 The following patch removes the title field before adding a new one, which 
 has resolved the issue for me:
 --- MoreIndexingFilter.old    2011-09-30 11:44:35.0 +
 +++ MoreIndexingFilter.java   2011-09-30 09:58:48.0 +
 @@ -276,6 +276,7 @@
      for (int i=0; i<patterns.length; i++) {
        if (matcher.contains(contentDisposition,patterns[i])) {
          result = matcher.getMatch();
 +        doc.removeField("title");
          doc.add("title", result.group(1));
          break;
        }





[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1021:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate OutlinkExtractor from Apache ORO to java.util.regex 
 

 Key: NUTCH-1021
 URL: https://issues.apache.org/jira/browse/NUTCH-1021
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1021-1.4-2.patch, NUTCH-1021-1.4-4.patch, 
 NUTCH-1021-1.4.patch


 Migrate from deprecated ORO to Java util regex.
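
 For orientation, link extraction with java.util.regex generally follows the compile/matcher/find pattern sketched below. The expression is deliberately simple and is an assumption made for this example; it is not the one used by OutlinkExtractor or by the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch; the pattern is not OutlinkExtractor's expression. */
final class RegexOutlinkSketch {
  // Compiled once and reused; Pattern instances are immutable and thread-safe.
  private static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

  /** Returns every http(s) URL found in the plain text, in order. */
  static List<String> extract(String text) {
    List<String> links = new ArrayList<>();
    Matcher m = URL.matcher(text); // a Matcher is per-input and not thread-safe
    while (m.find()) {
      links.add(m.group());
    }
    return links;
  }
}
```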





[jira] [Updated] (NUTCH-1039) Fetcher fails for pages without content-length header

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1039:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Fetcher fails for pages without content-length header
 -

 Key: NUTCH-1039
 URL: https://issues.apache.org/jira/browse/NUTCH-1039
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 Fetcher fails:
 2011-07-11 14:45:34,764 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: bad content length:
 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:218)
 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:158)
 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:79)
 Both fetcher and indexing filter checker fail sometimes. I'm unsure whether 
 this is something in Nutch or whether the remote server only returns 
 content-length incidentally.
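
 The empty value after "bad content length:" in the log suggests the header was absent or blank. One defensive way to handle that, sketched here as an assumption and not as the fix adopted in Nutch, is to treat a missing or unparsable header as "length unknown" and let the caller read until the connection closes. The class and method names are hypothetical.

```java
/** Hypothetical helper; not part of protocol-http. */
final class ContentLengthSketch {
  /**
   * Returns the declared length, or -1 when the header is absent,
   * blank, negative, or not an integer.
   */
  static int parseContentLength(String headerValue) {
    if (headerValue == null) {
      return -1;
    }
    try {
      int length = Integer.parseInt(headerValue.trim());
      return length >= 0 ? length : -1;
    } catch (NumberFormatException e) {
      return -1; // caller falls back to reading until end of stream
    }
  }
}
```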





[jira] [Updated] (NUTCH-1275) Fix [unchecked] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1275:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix [unchecked] javac warnings
 --

 Key: NUTCH-1275
 URL: https://issues.apache.org/jira/browse/NUTCH-1275
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 We can simply suppress these warnings using
 {code}
 @SuppressWarnings("unchecked")
 {code}
 However, if there is another method for resolving these warnings, then it 
 should be implemented if deemed beneficial to code quality.
 Some resources 
 http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
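
 The two options can be compared side by side. The sketch below is illustrative and uses made-up names: the first method keeps a raw type and silences the warning, the second parameterizes the collection so that javac has nothing to warn about.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch; not code from the Nutch source tree. */
final class UncheckedSketch {
  // Raw type: without the annotation, javac -Xlint:unchecked reports an
  // [unchecked] warning at the add() call.
  @SuppressWarnings({"unchecked", "rawtypes"})
  static List rawUrls() {
    List urls = new ArrayList();
    urls.add("http://nutch.apache.org/");
    return urls;
  }

  // Parameterized type: the compiler can check the element type, so there
  // is no warning to suppress.
  static List<String> typedUrls() {
    List<String> urls = new ArrayList<>();
    urls.add("http://nutch.apache.org/");
    return urls;
  }
}
```

 Where the raw type comes from a third-party API and cannot be parameterized, suppression scoped to the smallest possible method or variable is the usual fallback.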





[jira] [Updated] (NUTCH-1151) Index-anchor to add numInlinks count

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1151:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Index-anchor to add numInlinks count
 

 Key: NUTCH-1151
 URL: https://issues.apache.org/jira/browse/NUTCH-1151
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6

 Attachments: NUTCH-1151-1.5-1.patch


 Improve index-anchor to add the number of inlinks per document. 
 This count is useful for calculating some authority metric in the search 
 server.





[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1053:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.6

 Attachments: nutch-1053.patch, seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it





[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler which can be used to 
 extract boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.





[jira] [Updated] (NUTCH-1118) JUnit test for index-basic

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1118:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for index-basic
 --

 Key: NUTCH-1118
 URL: https://issues.apache.org/jira/browse/NUTCH-1118
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.





[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1284:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Add site fetcher.max.crawl.delay as log output by default.
 --

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.6


 Currently, when manually scanning our log output we cannot infer which pages 
 are governed by a crawl delay between successive fetch attempts of any given 
 page within the site. The value should be made available as something like:
 {code}
 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
 http://nutch.apache.org/ (crawl.delay=XXXms)
 {code}
 This way we can easily and quickly determine whether the fetcher is having to 
 use this functionality or not. 
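
 The proposed message could be assembled along these lines. This is a hypothetical helper written for illustration; it is not the Fetcher's logging code, and only the message text follows the example above.

```java
/** Hypothetical helper; only the message format comes from the issue text. */
final class CrawlDelayLogSketch {
  /** Builds the "fetching" message with the crawl delay in milliseconds. */
  static String fetchingMessage(String url, long crawlDelayMs) {
    return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
  }
}
```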





[jira] [Updated] (NUTCH-1149) DomainStats should process numeric CrawlDB metadata

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1149:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 DomainStats should process numeric CrawlDB metadata
 ---

 Key: NUTCH-1149
 URL: https://issues.apache.org/jira/browse/NUTCH-1149
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6


 Right now the DomainStats program only outputs the sum of fetched records per 
 domain or host. It should also be able to output processed numerics of meta 
 data in order to get the average size (content length) for a given domain or 
 host. This is also useful for generating a metric for adult material (by 
 domain or host) when using a plugin that stores a probability factor of adult 
 material per URL in the Crawl DB.
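
 The averaging step can be sketched in plain Java as follows. This is not the DomainStats MapReduce job; it assumes each record has already been reduced to a host and one numeric metadata value, and the names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch; DomainStats itself runs as a MapReduce job. */
final class DomainAverageSketch {
  /** Each record is {host, numericValue}; returns the mean value per host. */
  static Map<String, Double> averagePerHost(String[][] records) {
    Map<String, long[]> acc = new HashMap<>(); // host -> {sum, count}
    for (String[] record : records) {
      long[] a = acc.get(record[0]);
      if (a == null) {
        a = new long[2];
        acc.put(record[0], a);
      }
      a[0] += Long.parseLong(record[1]);
      a[1]++;
    }
    Map<String, Double> averages = new HashMap<>();
    for (Map.Entry<String, long[]> e : acc.entrySet()) {
      averages.put(e.getKey(), (double) e.getValue()[0] / e.getValue()[1]);
    }
    return averages;
  }
}
```

 Carrying the sum and the count together is what lets the same aggregation run in a combiner as well as a reducer; an average of averages would give the wrong result.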





[jira] [Updated] (NUTCH-1181) Indexer to use webgraph inlinks

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1181:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Indexer to use webgraph inlinks
 ---

 Key: NUTCH-1181
 URL: https://issues.apache.org/jira/browse/NUTCH-1181
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 Indexers currently rely on the LinkDB for anchor indexing while the WebGraph 
 provides the same data as an inverted link DB. An inlinkDB created by the 
 WebGraph program with non-zero LinkRank scores on the nodes also provides an 
 improved set ordered by popularity.
 This issue must:
 - let IndexerMapReduce understand the new format;
 - allow for indexing only popular anchors.
 The goal is to deprecate all code associated with invertlinks and ultimately 
 remove it from the codebase.





[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1117:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for index-anchor
 ---

 Key: NUTCH-1117
 URL: https://issues.apache.org/jira/browse/NUTCH-1117
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.  





[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config
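
 The three points above can be sketched as follows. This is an assumption-laden illustration, not the attached MimeAdaptiveFetchSchedule: the interval unit (seconds), the "#" comment syntax and all names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the attached MimeAdaptiveFetchSchedule. */
final class MimeIntervalSketch {
  /** Parses "mime-type<TAB>seconds" lines; blank and "#" lines are skipped. */
  static Map<String, Integer> parse(String config) {
    Map<String, Integer> intervals = new HashMap<>();
    for (String line : config.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] kv = trimmed.split("\t");
      if (kv.length != 2) {
        continue; // ignore malformed lines
      }
      // Integer.parseInt throws NumberFormatException on a non-numeric value.
      intervals.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
    }
    return intervals;
  }

  /** Interval for a new document, capped by the configured maximum. */
  static int initialInterval(Map<String, Integer> intervals, String mimeType,
      int defaultInterval, int maxInterval) {
    Integer configured = intervals.get(mimeType);
    int interval = configured != null ? configured : defaultInterval;
    return Math.min(interval, maxInterval);
  }
}
```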





[jira] [Updated] (NUTCH-1317) Max content length by MIME-type

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1317:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Max content length by MIME-type
 ---

 Key: NUTCH-1317
 URL: https://issues.apache.org/jira/browse/NUTCH-1317
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 The good old http.content.limit directive is not sufficient in large 
 internet crawls. For example, a 5MB PDF file may be parsed without issues but 
 a 5MB HTML file may time out.





[jira] [Updated] (NUTCH-1277) Fix [fallthrough] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1277:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Fix [fallthrough] javac warnings
 

 Key: NUTCH-1277
 URL: https://issues.apache.org/jira/browse/NUTCH-1277
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.6


 This usually occurs when a switch statement falls through (that is, one or 
 more break statements are missing).
 We need to determine where a simple
 {code}
 @SuppressWarnings("fallthrough")
 {code}
 is required or whether we need to include the break statements in switch 
 blocks.
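
 Both outcomes are sketched below with made-up methods that are not from the Nutch source. In the first, the fall-through is intentional and the annotation documents that; in the second, each case is terminated by a break, so javac -Xlint:fallthrough has nothing to report.

```java
/** Illustrative sketch; the methods are not from the Nutch source tree. */
final class FallthroughSketch {
  /** Intentional fall-through: each level also collects the lower levels. */
  @SuppressWarnings("fallthrough")
  static int cumulative(int level) {
    int points = 0;
    switch (level) {
      case 3:
        points += 100; // falls through on purpose
      case 2:
        points += 10; // falls through on purpose
      case 1:
        points += 1;
        break;
      default:
        break;
    }
    return points;
  }

  /** Every case ends with a break, so no fall-through can occur. */
  static String label(int status) {
    String label;
    switch (status) {
      case 1:
        label = "fetched";
        break;
      case 2:
        label = "gone";
        break;
      default:
        label = "unknown";
        break;
    }
    return label;
  }
}
```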





[jira] [Updated] (NUTCH-1215) UpdateDB should not require segment as input

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1215:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 UpdateDB should not require segment as input
 

 Key: NUTCH-1215
 URL: https://issues.apache.org/jira/browse/NUTCH-1215
 Project: Nutch
  Issue Type: Bug
  Components: linkdb
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1215-1.5-1.patch


 UpdateDB requires an input segment. This causes the metrics for the records 
 of the segment to change, e.g. from fetched to not_modified and changes an 
 adaptive fetch schedule accordingly. This should not happen when one needs to 
 update for filtering or normalizing or other maintenance.





[jira] [Updated] (NUTCH-1103) Port protocol-sftp to 1.4

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1103:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Port protocol-sftp to 1.4
 -

 Key: NUTCH-1103
 URL: https://issues.apache.org/jira/browse/NUTCH-1103
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 Port protocol-sftp from trunk back to 1.4





[jira] [Updated] (NUTCH-1088) Write Solr XML documents

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1088:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Write Solr XML documents
 

 Key: NUTCH-1088
 URL: https://issues.apache.org/jira/browse/NUTCH-1088
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 Documents need to be reindexed when index-time analysis is modified. Indexing 
 individual segments from Nutch is tedious, especially for small segments. 
 This issue should add a feature that can write XML batches.





[jira] [Updated] (NUTCH-828) Fetch Filter

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-828:


Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fetch Filter
 

 Key: NUTCH-828
 URL: https://issues.apache.org/jira/browse/NUTCH-828
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.6

 Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch


 Adds a Nutch extension point for a fetch filter.  The fetch filter allows 
 filtering content and parse data/text after it is fetched but before it is 
 written to segments.  The filter can return true if content is to be written 
 or false if it is not.  
 Some use cases for this filter would be topical search engines that only want 
 to fetch/index certain types of content, for example a news or sports only 
 search engine.  In these types of situations the only way to determine if 
 content belongs to a particular set is to fetch the page and then analyze the 
 content.  If the content passes, meaning belongs to the set of say sports 
 pages, then we want to include it.  If it doesn't then we want to ignore it, 
 never fetch that same page in the future, and ignore any urls on that page.  
 If content is rejected due to a fetch filter then its status is written to 
 the CrawlDb as gone and its content is ignored and not written to segments.  
 This effectively stops crawling along the crawl path of that page and the urls 
 from that page.  An example filter, fetch-safe, is provided that allows 
 fetching content that does not contain a list of bad words.
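 The behaviour described above can be sketched as follows. This is a minimal
 illustration, not the code in the attached patches: the FetchFilter
 interface, its accept method, the BadWordFilter class and the status strings
 are all hypothetical names chosen for the example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a fetch filter; names are assumptions, not the patch's API.
public class FetchFilterSketch {

    /** Assumed extension point: true means "write to segments", false means "reject". */
    interface FetchFilter {
        boolean accept(String url, String parseText);
    }

    /** Toy "fetch-safe" style filter: rejects pages that contain any listed word. */
    static class BadWordFilter implements FetchFilter {
        private final Set<String> badWords = new HashSet<>();

        BadWordFilter(String... words) {
            for (String w : words) {
                badWords.add(w.toLowerCase(Locale.ROOT));
            }
        }

        @Override
        public boolean accept(String url, String parseText) {
            // Split on non-word characters and look each token up in the bad-word set.
            for (String token : parseText.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (badWords.contains(token)) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Status a fetcher could record: rejected pages are marked gone, as the issue describes. */
    static String statusFor(FetchFilter filter, String url, String parseText) {
        return filter.accept(url, parseText) ? "fetched" : "gone";
    }

    public static void main(String[] args) {
        FetchFilter filter = new BadWordFilter("spam");
        System.out.println(statusFor(filter, "http://example.org/a", "Sports news today"));
        System.out.println(statusFor(filter, "http://example.org/b", "Buy SPAM now"));
    }
}
```

 Because a rejected page is recorded as gone and its content is dropped, its
 outlinks never reach the CrawlDb, which is what cuts off the crawl path.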





Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma
Remaining issue for 1.5:
NUTCH-1208 Don't include KEYS file in bin distribution

I obviously couldn't suppress e-mail notifications. My sincere apologies for 
the deluge of e-mail!

On Tuesday 03 April 2012 13:22:17 Julien Nioche wrote:
 Good idea.
 
 On 3 April 2012 11:29, Markus Jelsma markus.jel...@openindex.io wrote:
  Nutch 1.5 now ships with Tika 1.1. Thanks Julien!
  
  How about preparing for 1.5 and moving all but blocker issues to 1.6?

-- 
Markus Jelsma - CTO - Openindex


[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched

2012-04-03 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245259#comment-13245259
 ] 

behnam nikbakht commented on NUTCH-1270:


For example, with the site:
http://www.noormags.com/view/fa/default
when I fetch the first page and dump the segment, I see that there is a problem 
with the fetch.
When I replace 
byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
with
byte[] content = DeflateUtils.inflateBestEffort(compressed, 999);
it works.

 some of Deflate encoded pages not fetched
 -

 Key: NUTCH-1270
 URL: https://issues.apache.org/jira/browse/NUTCH-1270
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht
  Labels: fetch, processDeflateEncoded
 Attachments: NUTCH-1270.patch


 Some web pages are fetched, but their content cannot be retrieved.
 After the following change, this error is fixed.
 We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
   public byte[] processDeflateEncoded(byte[] compressed, URL url)
       throws IOException {
     if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }
     byte[] content = DeflateUtils.inflateBestEffort(compressed,
         getMaxContent());
 +   if (content == null)
 +     content = DeflateUtils.inflateBestEffort(compressed, 20);
     if (content == null)
       throw new IOException("inflateBestEffort returned null");
     if (LOGGER.isTraceEnabled()) {
       LOGGER.trace("fetched " + compressed.length
           + " bytes of compressed content (expanded to "
           + content.length + " bytes) from " + url);
     }
     return content;
   }
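 The second argument that the report varies is a size limit on the inflated
 output. The sketch below shows, using only java.util.zip, what a capped
 best-effort inflate can look like. It is an assumption-laden illustration
 and not Nutch's DeflateUtils: the class CappedInflateSketch and both helper
 methods are hypothetical, it handles zlib-wrapped data only, and it returns
 an empty array rather than null when nothing could be decoded.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not Nutch's DeflateUtils.
public class CappedInflateSketch {

    /** Inflates zlib-wrapped data, returning at most sizeLimit bytes of output. */
    static byte[] inflateBestEffort(byte[] compressed, int sizeLimit) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished() && out.size() < sizeLimit) {
                int want = Math.min(buf.length, sizeLimit - out.size());
                int n = inflater.inflate(buf, 0, want);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input or preset dictionary required: stop here
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Best effort: keep whatever was decoded before the corrupt data.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Helper for the demo: compresses data in zlib format. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "hello hello hello hello".getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(raw);
        System.out.println(inflateBestEffort(compressed, 5).length);
        System.out.println(inflateBestEffort(compressed, 1000).length);
    }
}
```

 With a small limit the caller receives only a prefix of the page, so the
 limit passed in decides how much content survives.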





Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma
Cool! 

Next time I'll ask infra to allow suppressing notifications.

Chris, will you RM one RC? And, if possible, list the detailed steps/commands 
in the process in case you don't have time to RM 1.6 when the time comes. The 
wiki is dated.

I'm looking forward to yet another big release with lots of fixes and 
improvements!

Thanks all!

On Tuesday 03 April 2012 14:42:31 Julien Nioche wrote:
  Remaining issue for 1.5:
  NUTCH-1208 Don't include KEYS file in bin distribution
 
 done! thanks
 
  I obviously couldn't suppress e-mail notifications. My sincere apologies
  for the deluge of e-mail!
 
 no probs
 

-- 
Markus Jelsma - CTO - Openindex


[jira] [Commented] (NUTCH-1208) Don't include KEYS file in bin distribution

2012-04-03 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245295#comment-13245295
 ] 

Hudson commented on NUTCH-1208:
---

Integrated in nutch-trunk-maven #224 (See 
[https://builds.apache.org/job/nutch-trunk-maven/224/])
NUTCH-1208 Don't include KEYS file in bin distribution (Revision 1308865)

 Result = SUCCESS
jnioche : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/build.xml


 Don't include KEYS file in bin distribution
 ---

 Key: NUTCH-1208
 URL: https://issues.apache.org/jira/browse/NUTCH-1208
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.5


 We should get rid of the KEYS file in the bin packaging (zip/tar) in 1.5.





Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Mattmann, Chris A (388J)
Hi Markus,

On Apr 3, 2012, at 5:50 AM, Markus Jelsma wrote:

 Cool! 
 
 Next time I'll ask infra to allow suppressing notifications.
 
 Chris, will you RM one RC? And, if possible, list the detailed steps/commands 
 in the process in case you don't have time to RM 1.6 when the time comes. The 
 wiki is dated.

Happy to RM it. 

Check the wiki here:

http://wiki.apache.org/nutch/Release_HOWTO

Lewis and I updated this after the last release. It's more or less what's 
required to 
release the project and what I run. It's also really similar to the OODT 
release 
process:

https://cwiki.apache.org/confluence/display/OODT/Release+Process

Was there something specific that you weren't seeing there?

 
 I'm looking forward to yet another big release with lots of fixes and 
 improvements!

Agreed, thanks everyone!

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Markus Jelsma


On Tuesday 03 April 2012 15:58:54 you wrote:
 Hi Markus,
 
 On Apr 3, 2012, at 5:50 AM, Markus Jelsma wrote:
  Cool!
  
  Next time I'll ask infra to allow suppressing notifications.
  
  Chris, will you RM one RC? And, if possible, list the detailed
  steps/commands in the process in case you don't have time to RM 1.6 when
  the time comes. The wiki is dated.
 
 Happy to RM it.

Great!

 
 Check the wiki here:
 
 http://wiki.apache.org/nutch/Release_HOWTO
 
 Lewis and I updated this after the last release. It's more or less what's
 required to release the project and what I run. It's also really similar
 to the OODT release process:

Great!

 
 https://cwiki.apache.org/confluence/display/OODT/Release+Process
 
 Was there something specific that you weren't seeing there?

Seems fine. Only updating KEYS is no longer necessary.

Thanks!


-- 
Markus Jelsma - CTO - Openindex


[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2012-04-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The FrontPage page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=238&rev2=239

  === Tutorials ===
   * NutchTutorial - How to configure Nutch to crawl in local mode and post to 
Apache Solr for search/index.
   * [[http://hadoop.apache.org/common/docs/stable/|Hadoop Tutorial]] Nutch 
being based Hadoop, it helps to have a better understanding of Hadoop.
-  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. /!\ :This tutorial is in development: /!\
+  * [[NutchHadoopTutorial|Nutch Hadoop Tutorial]] - How to setup and run Nutch 
in deploy mode over a Hadoop cluster. 
   * RunNutchInEclipse - How to configure, build, crawl and debug Nutch within 
Eclipse
   * [[IntranetDocumentSearch|Intranet Document Search]] - Index and search 
Microsoft Office, PDF etc. documents in a file system hierarchy with a Solr 
backend.
  


[Nutch Wiki] Trivial Update of Release_HOWTO by LewisJohnMcgibbney

2012-04-03 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Nutch Wiki for change 
notification.

The Release_HOWTO page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/Release_HOWTO?action=diff&rev1=12&rev2=13

  = Preparation =
  1. Create a new release in JIRA. If you do not already have these 
privileges ask your PMC Chair.
  1. Push off all open issues to the next release; any critical or 
blocker issues should be resolved on mailing list. Discuss any issues that you 
are unsure of on the mailing list.
- 1. Update KEYS file in current trunk, make sure that you create an 
Apache Key. If you don't have a key already create one by following 
[[http://www.apache.org/dev/openpgp.html#generate-key|this]].
+ 
  = Making the Release =
1. Update version numbers (from X.Y-dev to X.Y) for release in:
* nutch-default.xml - http.agent.version property


Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Lewis John Mcgibbney
Hi,

On Tue, Apr 3, 2012 at 3:12 PM, Markus Jelsma markus.jel...@openindex.io wrote:



 Seems fine. Only updating KEYS is no longer necessary.


Now sorted.

Thanks whenever you can get round to this Chris.

Best

Lewis


Re: NutchGora release, and Nutch 1.x trunk release

2012-04-03 Thread Mattmann, Chris A (388J)
Thanks Lewis!

Cheers,
Chris

P.S. Hopefully by this weekend...



++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Commented] (NUTCH-1306) Commit after finished writing to solr index

2012-04-03 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245798#comment-13245798
 ] 

Lewis John McGibbney commented on NUTCH-1306:
-

Hi Dan. In trunk, we have a number of nice features which I would like to bring 
to your attention. Maybe you can comment on whether you would like to see some 
of them go into Nutchgora?

Namely, NUTCH-1185, NUTCH-1000, NUTCH-996, NUTCH-991 and NUTCH-799

wdyt?

 Commit after finished writing to solr index
 ---

 Key: NUTCH-1306
 URL: https://issues.apache.org/jira/browse/NUTCH-1306
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Trivial
 Fix For: nutchgora

 Attachments: NUTCH-1306.patch


 Commit after finished writing to solr index - otherwise a bit confusing not 
 seeing the number of docs we expect in solr





[jira] [Commented] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-04-03 Thread Arkadi Kosmynin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245877#comment-13245877
 ] 

Arkadi Kosmynin commented on NUTCH-1251:


Thanks Markus!



 Deletion of duplicates fails with 
 org.apache.solr.client.solrj.SolrServerException
 --

 Key: NUTCH-1251
 URL: https://issues.apache.org/jira/browse/NUTCH-1251
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
 Environment: Any crawl where the number of URLs in Solr exceeds 1024 
 (the default max number of clauses in a Lucene Boolean query).  
Reporter: Arkadi Kosmynin
Priority: Critical
 Fix For: 1.6


 Deletion of duplicates fails. This happens because the "get all" query used 
 to get the Solr index size is id:[* TO *], which is a range query. Lucene 
 tries to expand it to a Boolean query and gets as many clauses as there are 
 ids in the index. This is too many in a real situation, so it throws an 
 exception. 
 To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to 
 *:*, which is the standard Solr "get all" query.
 Indexing log extract:
 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
 executing query
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
 query
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
   ... 3 more
 Caused by: org.apache.solr.common.SolrException: Internal Server Error
 Internal Server Error
 request: http://localhost:8081/arch/select?q=id:[* TO 
 *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
   ... 5 more
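 The proposed change only swaps the query string. A small sketch of the
 difference in the resulting request follows. Apart from SOLR_GET_ALL_QUERY,
 which the description names, everything here is an assumption made for
 illustration: the class SolrGetAllQuerySketch, the selectUrl helper and the
 constant OLD_RANGE_QUERY are hypothetical, and the real code goes through
 SolrJ rather than building URLs by hand.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical sketch: compares the request built from the two "get all" queries.
public class SolrGetAllQuerySketch {

    /** Range query from the report; expanded into one clause per id in the index. */
    static final String OLD_RANGE_QUERY = "id:[* TO *]";

    /** Standard Solr match-all query proposed as the replacement. */
    static final String SOLR_GET_ALL_QUERY = "*:*";

    /** Builds a select URL like the one in the log, with the query URL-encoded. */
    static String selectUrl(String base, String query, int rows) {
        try {
            return base + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                + "&fl=id,boost,tstamp,digest&start=0&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String base = "http://localhost:8081/arch"; // base URL taken from the log extract
        System.out.println(selectUrl(base, OLD_RANGE_QUERY, 82938));
        System.out.println(selectUrl(base, SOLR_GET_ALL_QUERY, 82938));
    }
}
```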




