from:"Markus Jelsma \(Updated\) \(JIRA\)"

[jira] [Updated] (NUTCH-1341) NotModified time set to now but page not modified

2012-04-19 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1341:
-

Attachment: NUTCH-1341-1.6-1.patch

Here's a patch for 1.6. It simply resets the modifiedTime to the CrawlDatum's 
previous value right after the reducer sets a STATUS_DB_NOTMODIFIED status 
value. Since I believe the status is correct, I assume the modifiedTime value 
can be reset here as well.

Please comment. Did I overlook something? Should it be implemented differently?

Thanks

 NotModified time set to now but page not modified
 -

 Key: NUTCH-1341
 URL: https://issues.apache.org/jira/browse/NUTCH-1341
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1341-1.6-1.patch


 Servers tend to respond with incorrect or no value for LastModified. By 
 comparing signatures or when (fetch.getStatus() == 
 CrawlDatum.STATUS_FETCH_NOTMODIFIED) the reducer correctly sets the 
 db_notmodified status for the CrawlDatum. The modifiedTime value, however, is 
 not set accordingly.
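
A minimal sketch of the idea, assuming a stand-in Datum class rather than Nutch's real CrawlDatum. The field names, the numeric status value and the markNotModified helper are hypothetical and only illustrate carrying the previous modifiedTime over; this is not the code in the attached patch.

```java
// Hypothetical stand-in for a CrawlDb entry; not Nutch's CrawlDatum.
final class Datum {
    static final int STATUS_DB_NOTMODIFIED = 6; // illustrative value only
    int status;
    long modifiedTime; // epoch millis of the last real content change
}

final class NotModifiedSketch {
    // When the page is judged unchanged, flag the new entry as not modified
    // and carry over the previous modifiedTime instead of the fetch time.
    static void markNotModified(Datum previous, Datum result) {
        result.status = Datum.STATUS_DB_NOTMODIFIED;
        result.modifiedTime = previous.modifiedTime;
    }
}
```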

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (NUTCH-1336) Optionally not index db_notmodified pages

2012-04-17 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1336:
-

Attachment: NUTCH-1336-1.6-1.patch

Patch for 1.6.

 Optionally not index db_notmodified pages
 -

 Key: NUTCH-1336
 URL: https://issues.apache.org/jira/browse/NUTCH-1336
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1336-1.6-1.patch


 IndexerMapReduce already skips pages with fetch_notmodified as status. 
 However, despite the fetch status, we may still consider a page not modified 
 if status is db_notmodified.
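
The decision described above could look roughly like the following sketch. The enums and the skipDbNotModified switch are assumptions made for illustration; they are not the names used in IndexerMapReduce or in the attached patch.

```java
final class IndexFilterSketch {
    // Hypothetical status names mirroring those mentioned in the issue.
    enum FetchStatus { FETCH_SUCCESS, FETCH_NOTMODIFIED }
    enum DbStatus { DB_FETCHED, DB_NOTMODIFIED }

    // Returns true if the page should be sent to the index.
    // skipDbNotModified stands for the optional switch this issue proposes.
    static boolean shouldIndex(FetchStatus fetch, DbStatus db, boolean skipDbNotModified) {
        if (fetch == FetchStatus.FETCH_NOTMODIFIED) {
            return false; // already skipped, according to the issue
        }
        if (skipDbNotModified && db == DbStatus.DB_NOTMODIFIED) {
            return false; // the new, optional behaviour
        }
        return true;
    }
}
```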


[jira] [Updated] (NUTCH-1335) OutlinkDB to collect unique URL's only

2012-04-17 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1335:
-

Description: The aggregating code in the Outlink reducer does not take care 
of incoming duplicates. When the input segments contain duplicates of a single 
URL they are collected.  (was: The OutlinkDB may contain duplicates if a 
segment is added more than once. The aggregating code in the reducer does 
not take care of removing duplicates.

See: 
http://mail-archives.apache.org/mod_mbox/nutch-user/201204.mbox/%3c39d7bed10f572c3211c3ad91c8a37...@openindex.io%3E)
 Patch Info: Patch Available
Summary: OutlinkDB to collect unique URL's only  (was: OutlinkDB to 
emit unique URL's only)

 OutlinkDB to collect unique URL's only
 --

 Key: NUTCH-1335
 URL: https://issues.apache.org/jira/browse/NUTCH-1335
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 The aggregating code in the Outlink reducer does not take care of incoming 
 duplicates. When the input segments contain duplicates of a single URL they 
 are collected.


[jira] [Updated] (NUTCH-1335) OutlinkDB to collect unique URL's only

2012-04-17 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1335:
-

Attachment: NUTCH-1335-1.6-1.patch

Patch for 1.5. The reducer now only collects records whose timestamp is equal 
to or higher than the mostRecent timestamp. This can still result in 
duplicates in the aggregated collection, but not a significant number.

This patch seems to work, as the troubled reducer finished nicely. I'll test 
with a few more runs, each with a very large number of input records that 
also contain duplicates.
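
A rough sketch of that collection rule, assuming a hypothetical Link record with a URL and a timestamp. The class and method names are illustrative and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlinkCollectSketch {
    // Hypothetical record: a target URL plus the timestamp of its segment.
    static final class Link {
        final String url;
        final long timestamp;
        Link(String url, long timestamp) { this.url = url; this.timestamp = timestamp; }
    }

    // Collects only records at least as recent as the newest one seen so far.
    static List<Link> collect(Iterable<Link> values) {
        List<Link> collected = new ArrayList<>();
        long mostRecent = Long.MIN_VALUE;
        for (Link link : values) {
            if (link.timestamp >= mostRecent) {
                mostRecent = link.timestamp;
                collected.add(link);
            }
            // Older records are dropped. Records collected before a newer
            // timestamp arrived stay in the list, so some duplicates can remain.
        }
        return collected;
    }
}
```

If the values arrive newest first, the older duplicate is dropped; if they arrive oldest first, both copies are kept, which matches the remark that some duplicates can still get through.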

 OutlinkDB to collect unique URL's only
 --

 Key: NUTCH-1335
 URL: https://issues.apache.org/jira/browse/NUTCH-1335
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.5
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1335-1.6-1.patch


 The aggregating code in the Outlink reducer does not take care of incoming 
 duplicates. When the input segments contain duplicates of a single URL they 
 are collected.


[jira] [Updated] (NUTCH-1330) OutlinkDB to preserve back up

2012-04-10 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1330:
-

Attachment: NUTCH-1330-1.6-2.patch

Previous patch is bad and came from an old checkout. This is the proper patch.

 OutlinkDB to preserve back up
 -

 Key: NUTCH-1330
 URL: https://issues.apache.org/jira/browse/NUTCH-1330
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch


 The webgraph's outlinkDB is the single source for all scoring jobs and the 
 GBs that eventually come out. In case of disaster, which has not happened 
 yet, it should be able to preserve a backup just like the other DBs. This 
 means users with an existing outlinkdb must move it from 
 crawl/webgraphdb/outlinks/ to crawl/webgraphdb/outlinks/current/.
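
For a crawl directory on a local file system, the move could be sketched as below. The temporary directory name and the helper are assumptions, and a crawl stored on HDFS would need the Hadoop file system tooling instead of java.nio.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class OutlinkDbMoveSketch {
    // Moves the contents of <webgraphdb>/outlinks into <webgraphdb>/outlinks/current.
    // A directory cannot be moved into itself, so it steps aside first.
    static void moveToCurrent(Path webgraphDb) throws IOException {
        Path outlinks = webgraphDb.resolve("outlinks");
        Path temp = webgraphDb.resolve("outlinks.migrating"); // hypothetical temp name
        Files.move(outlinks, temp);                    // rename the existing db
        Files.createDirectory(outlinks);               // recreate the parent
        Files.move(temp, outlinks.resolve("current")); // place it under current/
    }
}
```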


[jira] [Updated] (NUTCH-1330) OutlinkDB to preserve back up

2012-04-06 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1330:
-

Attachment: NUTCH-1330-1.6-1.patch

Patch for 1.6!

 OutlinkDB to preserve back up
 -

 Key: NUTCH-1330
 URL: https://issues.apache.org/jira/browse/NUTCH-1330
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1330-1.6-1.patch


 The webgraph's outlinkDB is the single source for all scoring jobs and the 
 GBs that eventually come out. In case of disaster, which has not happened 
 yet, it should be able to preserve a backup just like the other DBs. This 
 means users with an existing outlinkdb must move it from 
 crawl/webgraphdb/outlinks/ to crawl/webgraphdb/outlinks/current/.


[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-717:


Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Make Nutch Solr integration easier
 --

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren
Priority: Critical
 Fix For: 1.6


 Erik Hatcher proposed we should provide a full Solr config dir to be used 
 with Nutch-Solr. Currently we only provide the index schema. It would be 
 considerably easier to set up Nutch-Solr if we provided the whole conf dir, 
 which you could use with Solr like:
 java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar


[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1245:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb 
 and is generated over and over again
 

 Key: NUTCH-1245
 URL: https://issues.apache.org/jira/browse/NUTCH-1245
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 1.6


 A document gone with 404 after db.fetch.interval.max (90 days) has passed
 is fetched over and over again. Although its fetch status is fetch_gone,
 its status in CrawlDb remains db_unfetched. Consequently, this document will
 be generated and fetched in every cycle from now on.
 To reproduce:
 # create a CrawlDatum in CrawlDb whose retry interval hits 
 db.fetch.interval.max (I manipulated the shouldFetch() in 
 AbstractFetchSchedule to achieve this)
 # now this URL is fetched again
 # but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to 
 db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max 
 (81 days)
 # this does not change with every generate-fetch-update cycle, shown here 
 for two segments:
 {noformat}
 /tmp/testcrawl/segments/20120105161430
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:14:21 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:14:48 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 /tmp/testcrawl/segments/20120105161631
 SegmentReader: get 'http://localhost/page_gone'
 Crawl Generate::
 Status: 1 (db_unfetched)
 Fetch time: Thu Jan 05 16:16:23 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 Crawl Fetch::
 Status: 37 (fetch_gone)
 Fetch time: Thu Jan 05 16:20:05 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: 
 http://localhost/page_gone
 {noformat}
 As far as I can see it's caused by setPageGoneSchedule() in 
 AbstractFetchSchedule. Some pseudo-code:
 {code}
 setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
     datum.fetchInterval = 1.5 * datum.fetchInterval // now 1.5 * 0.9 * maxInterval
     datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
     if (maxInterval < datum.fetchInterval) // necessarily true
         forceRefetch()
 forceRefetch:
     if (datum.fetchInterval > maxInterval) // true because it's 1.35 * maxInterval
         datum.fetchInterval = 0.9 * maxInterval
     datum.status = db_unfetched
 shouldFetch (called from generate / Generator.map):
     if ((datum.fetchTime - curTime) > maxInterval)
         // always true if the crawler is launched in short intervals
         // (lower than 0.35 * maxInterval)
         datum.fetchTime = curTime // forces a refetch
 {code}
 After setPageGoneSchedule is called via update the state is db_unfetched and 
 the retry interval 0.9 * db.fetch.interval.max (81 days). 
 Although the fetch time in the CrawlDb is far in the future
 {noformat}
 % nutch readdb testcrawl/crawldb -url http://localhost/page_gone
 URL: http://localhost/page_gone
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Sun May 06 05:20:05 CEST 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 6998400 seconds (81 days)
 Score: 1.0
 Signature: null
 Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
 {noformat}
 the URL is generated again because (fetch time - current time) is larger than 
 db.fetch.interval.max.
 The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 
 times db.fetch.interval.max, and the fetch time is always close to current 
 time + 1.35 * db.fetch.interval.max.
 It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578
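
The cycle described by the pseudo-code can be reproduced with a small standalone simulation. This is only a sketch under stated assumptions: it follows the pseudo-code above rather than the real AbstractFetchSchedule, measures time in seconds, and the class name, the countGenerated helper and the gap parameter are hypothetical.

```java
final class PageGoneCycleSketch {
    static final long DAY = 24L * 60 * 60;
    static final long MAX_INTERVAL = 90 * DAY; // db.fetch.interval.max

    long fetchInterval = MAX_INTERVAL * 9 / 10; // 81 days, as in the dumps
    long fetchTime = 0;
    String status = "db_unfetched";

    // Update step, following the pseudo-code for setPageGoneSchedule.
    void setPageGoneSchedule(long now) {
        fetchInterval = fetchInterval * 3 / 2; // 1.35 * MAX_INTERVAL
        fetchTime = now + fetchInterval;
        if (MAX_INTERVAL < fetchInterval) {
            forceRefetch();
        }
    }

    void forceRefetch() {
        if (fetchInterval > MAX_INTERVAL) {
            fetchInterval = MAX_INTERVAL * 9 / 10;
        }
        status = "db_unfetched";
    }

    // Generate step, following the pseudo-code for shouldFetch.
    boolean shouldFetch(long now) {
        if (fetchTime - now > MAX_INTERVAL) {
            fetchTime = now; // forces a refetch
        }
        return fetchTime <= now;
    }

    // Runs generate-fetch-update rounds `gap` seconds apart and counts how
    // often the gone URL is selected again.
    static int countGenerated(int cycles, long gap) {
        PageGoneCycleSketch datum = new PageGoneCycleSketch();
        int generated = 0;
        long now = 0;
        for (int i = 0; i < cycles; i++) {
            if (datum.shouldFetch(now)) {
                generated++;
                datum.setPageGoneSchedule(now); // the fetch returned fetch_gone
            }
            now += gap;
        }
        return generated;
    }
}
```

With cycles two minutes apart, countGenerated(5, 120) selects the URL in all five rounds. With cycles 40 days apart, which is above 0.35 * maxInterval, the URL is selected only twice in five rounds.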


[jira] [Updated] (NUTCH-1318) Parse time outs crash parsing fetcher

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1318:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parse time outs crash parsing fetcher
 -

 Key: NUTCH-1318
 URL: https://issues.apache.org/jira/browse/NUTCH-1318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.6


 Some fetch lists can never be fetched and parsed successfully because a 
 single timing-out record can cause most, and eventually all, subsequent 
 records to time out as well. Finally the mapper hangs completely, killing 
 the entire fetch job and losing 99% of the records that were processed.
 I'm not sure what's going on; something may be leaking somewhere.


[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1219:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Upgrade all jobs to new MapReduce API
 -

 Key: NUTCH-1219
 URL: https://issues.apache.org/jira/browse/NUTCH-1219
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Priority: Critical
 Fix For: 1.6


 We should upgrade to the new Hadoop API for Nutch trunk, as has already been 
 done for the Nutchgora branch. If I'm not mistaken, we can already upgrade to 
 the latest 0.20.5 version, which still carries the legacy API. That way we 
 can port the jobs to the new API without immediately upgrading to 0.21 or 
 higher and without needing a separate branch to work on.
 To the committers who created/ported jobs in NutchGora, please write down 
 your advice and experience.
 http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api


[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1251:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Deletion of duplicates fails with 
 org.apache.solr.client.solrj.SolrServerException
 --

 Key: NUTCH-1251
 URL: https://issues.apache.org/jira/browse/NUTCH-1251
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
 Environment: Any crawl where the number of URLs in Solr exceeds 1024 
 (the default max number of clauses in a Lucene boolean query).  
Reporter: Arkadi Kosmynin
Priority: Critical
 Fix For: 1.6


 Deletion of duplicates fails. This happens because the get all query used 
 to get Solr index size is id:[* TO *], which is a range query. Lucene is 
 trying to expand it to a Boolean query and gets as many clauses as there are 
 ids in the index. This is too many in a real situation and it throws an 
 exception. 
 To correct this problem, change the get all query (SOLR_GET_ALL_QUERY) to 
 \*:\*, which is the standard Solr get all query.
 Indexing log extract:
 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error 
 executing query
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing 
 query
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
   at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
   at 
 org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
   ... 3 more
 Caused by: org.apache.solr.common.SolrException: Internal Server Error
 Internal Server Error
 request: http://localhost:8081/arch/select?q=id:[* TO 
 *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
   at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
   at 
 org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
   ... 5 more


[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-578:


Fix Version/s: (was: 1.5)
   1.6

 URL fetched with 403 is generated over and over again
 -

 Key: NUTCH-578
 URL: https://issues.apache.org/jira/browse/NUTCH-578
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0
 Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I 
 have checked out the most recent version of the trunk as of Nov 20, 2007
Reporter: Nathaniel Powell
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, 
 NUTCH-578_v4.patch, crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, 
 urls.txt


 I have not changed the following parameter in the nutch-default.xml:
 <property>
   <name>db.fetch.retry.max</name>
   <value>3</value>
   <description>The maximum number of times a url that has encountered
   recoverable errors is generated for fetch.</description>
 </property>
 However, there is a URL which is on the site that I'm crawling, 
 www.teachertube.com, which keeps being generated over and over again for 
 almost every segment (many more times than 3):
 fetch of http://www.teachertube.com/images/ failed with: Http code=403, 
 url=http://www.teachertube.com/images/
 This is a bug, right?
 Thanks.


[jira] [Updated] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1249:
-

Affects Version/s: (was: 1.5)
Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Resolve all issues flagged up by adding javac -Xlint arguement
 --

 Key: NUTCH-1249
 URL: https://issues.apache.org/jira/browse/NUTCH-1249
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.6


 There are a heap of issues flagged up by NUTCH-1237. I think over time it 
 would be great to get these addressed and resolved.
 What is interesting is that adding the same arguments to 
 /src/plugin/plugin-build.xml actually breaks my build, as tests begin to fail.
 Some of this stuff is documented in the link below:
 http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options


[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1273:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix [deprecation] javac warnings
 

 Key: NUTCH-1273
 URL: https://issues.apache.org/jira/browse/NUTCH-1273
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, 
 NUTCH-1273-v2-trunk.patch


 As part of this task, these warnings should be resolved. This particular 
 strand of warnings can be resolved either by adding
 {code}
 @SuppressWarnings("deprecation")
 {code}
 or by actually upgrading our class usage to rely upon non-deprecated classes. 
 Which option is more appropriate for the project?
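
The two options can be compared side by side in a small sketch. LegacyApi and its methods are hypothetical stand-ins for a deprecated library call and its replacement; they are not classes from Nutch or Hadoop.

```java
// Hypothetical library class with a deprecated method and its replacement.
final class LegacyApi {
    /** @deprecated use {@link #newCall()} instead. */
    @Deprecated
    static int oldCall() { return 1; }

    static int newCall() { return 1; }
}

final class DeprecationSketch {
    // Option 1: keep the deprecated call and silence the warning locally.
    @SuppressWarnings("deprecation")
    static int callerSuppressed() {
        return LegacyApi.oldCall();
    }

    // Option 2: move to the non-deprecated replacement, so no annotation is needed.
    static int callerUpgraded() {
        return LegacyApi.newCall();
    }
}
```

The annotation only hides the warning at the call site, whereas the second option removes the dependency on the deprecated API.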


[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1113:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Merging segments causes URLs to vanish from crawldb/index?
 --

 Key: NUTCH-1113
 URL: https://issues.apache.org/jira/browse/NUTCH-1113
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Edward Drapkin
 Fix For: 1.6

 Attachments: merged_segment_output.txt, unmerged_segment_output.txt


 When I run Nutch, I use the following steps:
 nutch inject crawldb/ url.txt
 repeated 3 times:
 nutch generate crawldb/ segments/ -normalize
 nutch fetch `ls -d segments/* | tail -1`
 nutch parse `ls -d segments/* | tail -1`
 nutch update crawldb `ls -d segments/* | tail -1`
 nutch mergesegs merged/ -dir segments/
 nutch invertlinks linkdb/ -dir merged/
 nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
 indexing code from Nutch 1.1).
 When I crawl with merging segments, I lose about 20% of the URLs that wind up 
 in the index vs. when I crawl without merging the segments.  Somehow the 
 segment merger causes me to lose ~20% of my crawl database!


[jira] [Updated] (NUTCH-1116) Write JUnit tests for all plugins

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1116:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Write JUnit tests for all plugins  
 ---

 Key: NUTCH-1116
 URL: https://issues.apache.org/jira/browse/NUTCH-1116
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is a step towards covering the parts of our plugin codebase which 
 are currently missing JUnit test cases. Each plugin will have its own 
 sub-issue meaning that this parent issue should not be deemed complete until 
 all existing (and newly contributed) plugins have the appropriate test cases.


[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1084:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 ReadDB url throws exception
 ---

 Key: NUTCH-1084
 URL: https://issues.apache.org/jira/browse/NUTCH-1084
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 Readdb -url suffers from two problems:
 1. it trips over the _SUCCESS file generated by newer Hadoop versions
 2. it throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
 The first problem can be remedied by not allowing the injector or updater to 
 write the _SUCCESS file. Until now, that is the solution implemented for 
 similar issues. I have not succeeded in making the Hadoop readers simply 
 skip the file.
 The second issue seems a bit strange and did not happen on a local checkout. 
 I'm not yet sure whether this is a Hadoop issue or something being corrupt in 
 the CrawlDB. Here's the stack trace:
 {code}
 Exception in thread "main" java.io.IOException: can't find class: 
 org.apache.nutch.protocol.ProtocolStatus because 
 org.apache.nutch.protocol.ProtocolStatus
 at 
 org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
 at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
 at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
 at 
 org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
 at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
 at 
 org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
 at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
 at 
 org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
 at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 {code}
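
For the first problem, the idea of skipping marker files when listing a directory can be sketched on a local file system as follows. This uses java.nio as a stand-in and says nothing about how the Hadoop readers could be made to do the same; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SkipMarkerFilesSketch {
    // Lists the entries of a directory, leaving out marker files whose names
    // start with an underscore, such as _SUCCESS.
    static List<String> listDataEntries(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(
                dir, p -> !p.getFileName().toString().startsWith("_"))) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        Collections.sort(names); // directory order is not guaranteed
        return names;
    }
}
```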


[jira] [Updated] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1150:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 http.redirect.max can lead to multiple parses of the same url
 -

 Key: NUTCH-1150
 URL: https://issues.apache.org/jira/browse/NUTCH-1150
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.3, 1.4
Reporter: Markus Jelsma
 Fix For: 1.6


 With http.redirect.max > 0 it's possible that a document is parsed multiple 
 times. This is the case when several URLs from the same fetch redirect to a 
 shared location.
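
The issue does not propose a fix. One possible guard, shown here purely as an illustration, is to remember which redirect targets have already been parsed within a fetch. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class RedirectParseOnceSketch {
    // Thread-safe set, since a fetcher typically runs several threads.
    private final Set<String> parsed = ConcurrentHashMap.newKeySet();

    // Returns true only the first time a final (post-redirect) URL is seen,
    // so callers can skip parsing the same target again.
    boolean shouldParse(String finalUrl) {
        return parsed.add(finalUrl);
    }
}
```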


[jira] [Updated] (NUTCH-1147) WebGraph nodeDumper uses only 1 reducer

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1147:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 WebGraph nodeDumper uses only 1 reducer
 ---

 Key: NUTCH-1147
 URL: https://issues.apache.org/jira/browse/NUTCH-1147
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6

 Attachments: NUTCH-1147-1.5-1.patch


 The nodeDumper is restricted to only one reducer, making it slow and 
 producing files that are too large.


[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1194:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 CrawlDB lock should be released earlier
 ---

 Key: NUTCH-1194
 URL: https://issues.apache.org/jira/browse/NUTCH-1194
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 The lock on the CrawlDB is released only when everything is finished. But when 
 generating many segments, the lock remains in place although it is no longer 
 necessary. If GENERATE_UPDATE_DB is false we can release the lock immediately 
 after the selector has finished.


[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1201:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Allow for different FetcherThread impls
 ---

 Key: NUTCH-1201
 URL: https://issues.apache.org/jira/browse/NUTCH-1201
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: CustomFetcher.java, NUTCH-1201-1.5-wip.patch


 For certain cases we need to modify parts of FetcherThread, so it should be 
 pluggable. This introduces a new config directive fetcher.impl that takes a 
 FQCN; Fetcher.fetch uses that setting to load a class to use for 
 job.setMapRunnerClass(). This new class has to extend Fetcher and its inner 
 class FetcherThread. This allows for overriding methods in FetcherThread but 
 also methods in Fetcher itself if required.
 A follow-up on this issue would be to refactor parts of FetcherThread to make 
 it easier to override small sections instead of copying the entire method 
 body for a small change, which is now the case.
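
 The loading step described above could look roughly like the sketch below. 
 It is not taken from the attached patch: BaseFetcher and CustomFetcher are 
 hypothetical stand-ins for Fetcher and a user-supplied subclass, and a plain 
 Map stands in for the Hadoop configuration. Only the fetcher.impl key comes 
 from the issue text.

```java
import java.util.Collections;
import java.util.Map;

public class FetcherImplLoader {

    /** Hypothetical stand-in for the default Fetcher. */
    public static class BaseFetcher {
        public String name() { return "base"; }
    }

    /** Hypothetical custom implementation overriding the default behaviour. */
    public static class CustomFetcher extends BaseFetcher {
        @Override
        public String name() { return "custom"; }
    }

    /**
     * Resolves the class named by fetcher.impl, falling back to BaseFetcher
     * when the setting is absent or empty.
     */
    public static Class<? extends BaseFetcher> resolve(Map<String, String> conf)
            throws ClassNotFoundException {
        String fqcn = conf.get("fetcher.impl");
        if (fqcn == null || fqcn.isEmpty()) {
            return BaseFetcher.class;
        }
        // asSubclass throws ClassCastException if the class does not extend BaseFetcher.
        return Class.forName(fqcn).asSubclass(BaseFetcher.class);
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> conf =
            Collections.singletonMap("fetcher.impl", CustomFetcher.class.getName());
        BaseFetcher fetcher = resolve(conf).getDeclaredConstructor().newInstance();
        System.out.println(fetcher.name());
    }
}
```

 Checking the type at load time means a misconfigured FQCN fails early rather 
 than somewhere inside the job.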


[jira] [Updated] (NUTCH-1183) Summary task for adding command line usage instructions to webgraph classes

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1183:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Summary task for adding command line usage instructions to webgraph classes
 ---

 Key: NUTCH-1183
 URL: https://issues.apache.org/jira/browse/NUTCH-1183
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 The following files should print usage output when called incorrectly from the 
 command line. Something similar to 
 {code}
 Usage: class -arg1, -arg2, etc etc
 {code}
 * webgraph
 * linkrank
 * scoreupdater
 * nodedumper
 * nodereader
 If anyone would like to see further classes included in this task please add 
 to the above list.


[jira] [Updated] (NUTCH-1176) Fix all javadoc warnings from nightly builds

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1176:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix all javadoc warnings from nightly builds
 

 Key: NUTCH-1176
 URL: https://issues.apache.org/jira/browse/NUTCH-1176
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 The warnings can clearly be seen from the javadoc target (near bottom) of any 
 successful nightly build. An example is provided below.
 https://builds.apache.org/job/nutch-trunk/1638/console


[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1040:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Backport REST-API from 2.0
 --

 Key: NUTCH-1040
 URL: https://issues.apache.org/jira/browse/NUTCH-1040
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Reporter: Julien Nioche
 Fix For: 1.6


 See https://issues.apache.org/jira/browse/NUTCH-880 


[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1274:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix [cast] javac warnings
 -

 Key: NUTCH-1274
 URL: https://issues.apache.org/jira/browse/NUTCH-1274
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 A typical example of this is
 {code}
 trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] 
 redundant cast to int
 [javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 + 
 {code}
 These should all be fixed by replacing them with correct implementations.
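
 As a side note on the expression in that warning: in Java '+' binds tighter 
 than '<<', so shifts and additions written without parentheses do not group 
 byte by byte. Below is a minimal sketch of one way to pack four bytes into 
 an int. It assumes that big-endian packing is what is wanted, which the 
 issue does not state, and the class and method names are made up for 
 illustration.

```java
public class SignatureBytes {

    /** Packs four bytes starting at offset i into one big-endian int. */
    public static int pack(byte[] b, int i) {
        // Mask with 0xff so negative bytes do not sign-extend into the upper
        // bits, and parenthesise each shift because '+' binds tighter than '<<'.
        return ((b[i] & 0xff) << 24)
             | ((b[i + 1] & 0xff) << 16)
             | ((b[i + 2] & 0xff) << 8)
             |  (b[i + 3] & 0xff);
    }

    public static void main(String[] args) {
        byte[] signature = {0x01, 0x02, 0x03, 0x04};
        System.out.println(Integer.toHexString(pack(signature, 0)));
    }
}
```

 Every operand is already an int after masking, so no cast is needed and the 
 [cast] warning does not arise.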


[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1233:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Rely on Tika for outlink extraction
 ---

 Key: NUTCH-1233
 URL: https://issues.apache.org/jira/browse/NUTCH-1233
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1233-1.5-wip.patch


 Tika provides outlink extraction features that are not used in Nutch. To be 
 able to use them in Nutch we need Tika to return the rel attribute value of 
 each link, which it currently doesn't. There's a patch for Tika 1.1. Once that 
 patch is included in Tika and we have upgraded to that new version, this issue 
 can be worked on. Here's preliminary code that does both Tika and current 
 outlink extraction. This also includes parts of the Boilerpipe code.


[jira] [Updated] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1014:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate from Apache ORO to java.util.regex
 --

 Key: NUTCH-1014
 URL: https://issues.apache.org/jira/browse/NUTCH-1014
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
 Fix For: 1.6


 A separate issue tracking migration of all components from Apache ORO to 
 java.util.regex. Components involved are:
 - RegexURLNormalizer
 - OutlinkExtractor
 - JSParseFilter
 - MoreIndexingFilter
 - BasicURLNormalizer
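
 For reference, the java.util.regex side of such a migration typically looks 
 like the sketch below: compile a Pattern once and loop over Matcher.find(). 
 The pattern and the class name are simplified assumptions for illustration 
 and are not taken from any of the components listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMigrationSketch {

    // Compile once and reuse; Pattern instances are immutable and thread-safe.
    // Hypothetical, deliberately simple pattern for http(s) URLs.
    private static final Pattern HTTP_URL =
        Pattern.compile("https?://[^\\s\"'<>]+");

    /** Returns every substring of text that matches HTTP_URL, in order. */
    public static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<String>();
        if (text == null) {
            return urls; // nothing to match, avoid a NullPointerException
        }
        Matcher m = HTTP_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrls("see http://nutch.apache.org/ and https://abc.test/x"));
    }
}
```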


[jira] [Updated] (NUTCH-1063) OutlinkExtractor test generates an exception but does not fail

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1063:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 OutlinkExtractor test generates an exception but does not fail
 --

 Key: NUTCH-1063
 URL: https://issues.apache.org/jira/browse/NUTCH-1063
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Julien Nioche
 Fix For: 1.6


 Testsuite: org.apache.nutch.parse.TestOutlinkExtractor
 Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.043 sec
 - Standard Output ---
 2011-07-19 15:06:36,073 ERROR parse.OutlinkExtractor 
 (OutlinkExtractor.java:getOutlinks(121)) - getOutlinks
 java.lang.NullPointerException
   at org.apache.oro.text.regex.PatternMatcherInput.<init>(Unknown Source)
   at 
 org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:95)
   at 
 org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:72)
   at 
 org.apache.nutch.parse.TestOutlinkExtractor.testGetNoOutlinks(TestOutlinkExtractor.java:40)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at junit.framework.TestCase.runTest(TestCase.java:168)
   at junit.framework.TestCase.runBare(TestCase.java:134)
   at junit.framework.TestResult$1.protect(TestResult.java:110)
   at junit.framework.TestResult.runProtected(TestResult.java:128)
   at junit.framework.TestResult.run(TestResult.java:113)
   at junit.framework.TestCase.run(TestCase.java:124)
   at junit.framework.TestSuite.runTest(TestSuite.java:232)
   at junit.framework.TestSuite.run(TestSuite.java:227)
   at 
 org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
   at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
   at 
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
   at 
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
   at 
 org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)


[jira] [Updated] (NUTCH-1220) Upgrade Solr deps

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1220:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Upgrade Solr deps
 -

 Key: NUTCH-1220
 URL: https://issues.apache.org/jira/browse/NUTCH-1220
 Project: Nutch
  Issue Type: Task
  Components: build
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 SLF4J needs to be part of the upgrade to Solr 3.5 but that breaks something 
 else. Likely Hadoop has a different SLF4J version?
 {code}
 Exception in thread "main" java.lang.NoSuchMethodError: 
 org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
 at 
 org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
 at 
 org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:136)
 at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:180)
 at 
 org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
 at 
 org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
 at 
 org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1418)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109)
 at 
 org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:544)
 at 
 org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:339)
 at 
 org.apache.nutch.util.domain.DomainStatistics.run(DomainStatistics.java:108)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.util.domain.DomainStatistics.main(DomainStatistics.java:215)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
 {code}


[jira] [Updated] (NUTCH-1123) JUnit test for scoring-link

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1123:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for scoring-link
 ---

 Key: NUTCH-1123
 URL: https://issues.apache.org/jira/browse/NUTCH-1123
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-865) Format source code in unique style

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-865:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Format source code in unique style
 --

 Key: NUTCH-865
 URL: https://issues.apache.org/jira/browse/NUTCH-865
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Pham Tuan Minh
Assignee: Lewis John McGibbney
 Fix For: 1.6

 Attachments: NUTCH-865-nutchgora-rev1188268.patch, 
 NUTCH-865-trunk-rev1188252.patch, NUTCH-865.patch


 We should define standard formatting rules for source code and comments, then 
 use the Eclipse formatter to format the whole source code in the same style. 


[jira] [Updated] (NUTCH-1120) JUnit test for microformats-reltag

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1120:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for microformats-reltag
 --

 Key: NUTCH-1120
 URL: https://issues.apache.org/jira/browse/NUTCH-1120
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1186:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 FreeGenerator always normalizes
 ---

 Key: NUTCH-1186
 URL: https://issues.apache.org/jira/browse/NUTCH-1186
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 The FreeGenerator does not honor the -normalize option; it always normalizes 
 all URLs in the input directory. The -filter option is respected.


[jira] [Updated] (NUTCH-1308) Unnecessary truncate content configuration, and logging in parse-zip/ZipParser

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1308:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Unnecessary truncate content configuration, and logging in 
 parse-zip/ZipParser  
 

 Key: NUTCH-1308
 URL: https://issues.apache.org/jira/browse/NUTCH-1308
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
 Fix For: 1.6


 Two issues here...
 1) Recently ferdy committed NUTCH-965, which skips parsing of truncated 
 documents. Parse-zip has its own implementation of the same check when it 
 should really draw on the aforementioned implementation.
 2) If (in the offending piece of code shown below) truncation occurs, we 
 get the log message "Parser can't handle incomplete pdf file". This is 
 incorrect for a zip parser, shouldn't be there, and should be removed.
 {code}
 if (contentLen != null && contentInBytes.length != len) {
   return new ParseStatus(ParseStatus.FAILED,
       ParseStatus.FAILED_TRUNCATED, "Content truncated at "
           + contentInBytes.length
           + " bytes. Parser can't handle incomplete pdf file.")
       .getEmptyParseResult(content.getUrl(), getConf());
 }
 {code}
 For clarity, the issue is present in both Nutchgora branch[1] and Nutch 
 trunk[2]
 [1] 
 https://svn.apache.org/viewvc/nutch/branches/nutchgora/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
 [2] 
 https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup


[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1252:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 SegmentReader -get shows wrong data
 ---

 Key: NUTCH-1252
 URL: https://issues.apache.org/jira/browse/NUTCH-1252
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4, 1.5
Reporter: Sebastian Nagel
 Fix For: 1.6

 Attachments: NUTCH-1252-v2.patch, NUTCH-1252.patch


 The command/option -get of the SegmentReader may show wrong data associated 
 with the given URL. 
 To reproduce:
 {code}
 % mkdir -p test_readseg/urls
 % echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" \
     > test_readseg/urls/seeds
 % nutch inject test_readseg/crawldb test_readseg/urls
 Injector: starting at 2012-01-18 09:32:25
 Injector: crawlDb: test_readseg/crawldb
 Injector: urlDir: test_readseg/urls
 Injector: Converting injected urls to crawl db entries.
 Injector: Merging injected urls into crawl db.
 Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
 % nutch generate test_readseg/crawldb test_readseg/segments/
 Generator: starting at 2012-01-18 09:32:30
 Generator: Selecting best-scoring urls due for fetch.
 Generator: filtering: true
 Generator: normalizing: true
 Generator: jobtracker is 'local', generating exactly one partition.
 Generator: Partitioning selected urls for politeness.
 Generator: segment: test_readseg/segments/20120118093232
 Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
 % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' 
 -nocontent -noparse -nofetch -noparsedata -noparsetext
 SegmentReader: get 'http://nutch.apache.org/'
 Crawl Generate::
 Version: 7
 Status: 1 (db_unfetched)
 Fetch time: Wed Jan 18 09:32:26 CET 2012
 Modified time: Thu Jan 01 01:00:00 CET 1970
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 10.0
 Signature: null
 Metadata: _ngt_: 1326875550401test: AbcTest
 {code}
 The metadata and the score indicate that the CrawlDatum shown is the wrong 
 one (the one associated with http://abc.test/, not with 
 http://nutch.apache.org/).


[jira] [Updated] (NUTCH-1121) JUnit test for parse-js

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1121:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for parse-js
 ---

 Key: NUTCH-1121
 URL: https://issues.apache.org/jira/browse/NUTCH-1121
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-809) Parse-metatags plugin

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-809:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parse-metatags plugin
 -

 Key: NUTCH-809
 URL: https://issues.apache.org/jira/browse/NUTCH-809
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4, nutchgora
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.6

 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, 
 NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip


 h2. Parse-metatags plugin
 The parse-metatags plugin consists of a HTMLParserFilter which takes as 
 parameter a list of metatag names with '*' as default value. The values are 
 separated by ';'.
 In order to extract the values of the metatags description and keywords, you 
 must specify in nutch-site.xml
 {code:xml}
 <property>
   <name>metatags.names</name>
   <value>description;keywords</value>
 </property>
 {code}
 The MetatagIndexer uses the output of the parsing above to create two fields 
 'keywords' and 'description'. Note that keywords is multivalued.
 The query-basic plugin is used to include these fields in the search e.g. in 
 nutch-site.xml
 {code:xml}
 <property>
   <name>query.basic.description.boost</name>
   <value>2.0</value>
 </property>
 <property>
   <name>query.basic.keywords.boost</name>
   <value>2.0</value>
 </property>
 {code}
 This code has been developed by DigitalPebble Ltd and offered to the 
 community by ANT.com


[jira] [Updated] (NUTCH-1046) Add tests for indexing to SOLR

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1046:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Add tests for indexing to SOLR
 --

 Key: NUTCH-1046
 URL: https://issues.apache.org/jira/browse/NUTCH-1046
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.6


 We currently have no tests for checking that the indexing to SOLR works as 
 expected. Running an embedded SOLR Server within the tests would be good.


[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1228:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Change mapred.task.timeout to mapreduce.task.timeout in fetcher
 ---

 Key: NUTCH-1228
 URL: https://issues.apache.org/jira/browse/NUTCH-1228
 Project: Nutch
  Issue Type: Task
  Components: fetcher
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6





[jira] [Updated] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1001:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 bin/nutch fetch/parse handle crawl/segments directory
 -

 Key: NUTCH-1001
 URL: https://issues.apache.org/jira/browse/NUTCH-1001
 Project: Nutch
  Issue Type: Improvement
Reporter: Gabriele Kahlout
Priority: Minor
 Fix For: 1.6

 Attachments: Fetcher.java, NUTCH-1001.patch, nutch1001v2.patch


 I'm having issues porting scripts across different systems to support the 
 step of extracting the latest/only segments resulting from the generate phase.
 Variants include:
 $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
 $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
 $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o 
 "[a-zA-Z0-9/\-]*" |tail -1`
 $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o 
 "[a-zA-Z0-9/\-]*" |tail -1`
 And I'm not sure what Windows users would have to do. Some users may also make 
 do with:
 bin/nutch fetch with crawl/segments/2*
 But I don't see a need for the user to extract or worry about the latest/only 
 segment, or for this to be a described step in every Nutch tutorial. Moreover, 
 only fetch and parse expect a segment, while other commands are fine with the 
 directory of segments.
 Therefore, I think it's beneficial if fetch and parse also handle directories 
 of segments. 
 [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
 [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
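
 For comparison with the shell variants above, here is a minimal Java sketch 
 that picks the latest segment from a segments directory. It assumes a local 
 filesystem (not HDFS) and that segment directory names sort chronologically, 
 as timestamp names such as 20120118093232 do. The class and method names are 
 hypothetical and not part of the attached patches.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;

public class LatestSegment {

    /**
     * Returns the sub-directory of segmentsDir whose name sorts last,
     * or an empty Optional if there are no sub-directories.
     */
    public static Optional<Path> latest(Path segmentsDir) throws IOException {
        // Files.list must be closed, hence the try-with-resources block.
        try (Stream<Path> children = Files.list(segmentsDir)) {
            return children
                .filter(Files::isDirectory)
                .max((a, b) -> a.getFileName().toString()
                                .compareTo(b.getFileName().toString()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "crawl/segments");
        System.out.println(latest(dir).map(Path::toString).orElse("no segments found"));
    }
}
```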


[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1060:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 URL filters to produce regexes to be used by OutlinkExtractor.
 --

 Key: NUTCH-1060
 URL: https://issues.apache.org/jira/browse/NUTCH-1060
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
 Fix For: 1.6


 The problem:
 OutlinkExtractor produces many URLs from plain text using an advanced 
 regular expression:
 {code}
 ([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)
 {code}
 This expression does not take into account the various non-regex-based URL 
 filters such as prefix, domain and suffix and thus produces URLs that are 
 going to be filtered out by some filter. This becomes a problem when parsing 
 millions of documents that are processed by the OutlinkExtractor (in case 
 parse-html or parse-tika do not produce any outlinks). Large bodies of full 
 text usually contain a lot of sequences that are extracted as URLs, many of 
 which are mistaken for a URI scheme, such as:
 id:123
 says:what
 user:doe
 update:tue-19-jul
 The above examples can be easily remedied by using a configured prefix URL 
 filter. It may, however, be an even better idea to prevent the extraction of 
 these URL's at the first place. No extraction means filtering less URL's and 
 potentially saving a lot of data.
 Comments? I'll see if i can produce a patch.
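
The effect described above can be reproduced with a small sketch. It compiles the expression quoted in the issue with java.util.regex and then keeps only candidates that start with an allowed prefix. The class name, the helper methods and the http/https allow-list are assumptions made for illustration; this is not OutlinkExtractor's code and not the patch mentioned in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutlinkPrefixSketch {
  // The expression quoted in the issue, split over three string literals for readability.
  static final Pattern URL_PATTERN = Pattern.compile(
      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/]"
      + "(([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}"
      + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)");

  // Hypothetical allow-list; a real setup would take it from the prefix filter's configuration.
  static final List<String> ALLOWED_PREFIXES = Arrays.asList("http://", "https://");

  /** Returns every candidate the expression finds, wanted or not. */
  static List<String> extractAll(String text) {
    List<String> found = new ArrayList<>();
    Matcher m = URL_PATTERN.matcher(text);
    while (m.find()) {
      found.add(m.group(1));
    }
    return found;
  }

  /** Keeps only candidates that start with one of the allowed prefixes. */
  static List<String> extractAllowed(String text) {
    List<String> kept = new ArrayList<>();
    for (String candidate : extractAll(text)) {
      for (String prefix : ALLOWED_PREFIXES) {
        if (candidate.startsWith(prefix)) {
          kept.add(candidate);
          break;
        }
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    String text = "user:doe says:what see http://example.org/page id:123";
    System.out.println(extractAll(text));     // all candidates, including user:doe
    System.out.println(extractAllowed(text)); // only the http:// candidate
  }
}
```

Checking the prefix at extraction time means the unwanted candidates never become outlinks, so they never reach the URL filters at all.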


[jira] [Updated] (NUTCH-1100) SolrDedup broken

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1100:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 SolrDedup broken
 

 Key: NUTCH-1100
 URL: https://issues.apache.org/jira/browse/NUTCH-1100
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6


 Some Solr indices cannot be deduplicated from Nutch. For unknown reasons 
 Nutch throws the exception below. There are no peculiarities to be found 
 in the Solr logs; the queries are normal and seem to succeed.
 {code}
 java.lang.NullPointerException
 at org.apache.hadoop.io.Text.encode(Text.java:388)
 at org.apache.hadoop.io.Text.set(Text.java:178)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
 at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
 at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
 {code}


[jira] [Updated] (NUTCH-1124) JUnit test for scoring-opic

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1124:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for scoring-opic
 ---

 Key: NUTCH-1124
 URL: https://issues.apache.org/jira/browse/NUTCH-1124
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1197:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Add statically configured field values to solrindex-mapping.xml
 ---

 Key: NUTCH-1197
 URL: https://issues.apache.org/jira/browse/NUTCH-1197
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.6

 Attachments: NUTCH-1197.patch


 In some cases it's useful to be able to add to every document sent to Solr a 
 set of predefined fields with static values. This could be implemented on the 
 Solr side (with a custom UpdateRequestProcessor), but it may be less 
 cumbersome to add them on the Nutch side.
 Example: let's say I have several Nutch configurations all indexing to the 
 same Solr instance, and I want each of them to add its identifier as a field 
 in all documents, e.g. origin=web_crawl_1, origin=file_crawl, 
 origin=unlimited_crawl, etc...
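
As a rough illustration of the idea, the sketch below merges a set of static field values into a document before it is sent to the index. The document is modelled as a plain map; the class and method names are hypothetical, and the choice not to overwrite fields the document already has is an assumption, not something the issue specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StaticFieldSketch {
  /** Copies the document and adds every static field the document does not already define. */
  static Map<String, String> withStaticFields(Map<String, String> doc,
                                              Map<String, String> staticFields) {
    Map<String, String> out = new LinkedHashMap<>(doc);
    for (Map.Entry<String, String> e : staticFields.entrySet()) {
      out.putIfAbsent(e.getKey(), e.getValue());
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("url", "http://example.org/");

    Map<String, String> staticFields = new LinkedHashMap<>();
    staticFields.put("origin", "web_crawl_1"); // value taken from the example in the issue

    System.out.println(withStaticFields(doc, staticFields));
  }
}
```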


[jira] [Updated] (NUTCH-1122) JUnit test for protocol-ftp

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1122:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for protocol-ftp
 ---

 Key: NUTCH-1122
 URL: https://issues.apache.org/jira/browse/NUTCH-1122
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1127:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for urlfilter-validator
 --

 Key: NUTCH-1127
 URL: https://issues.apache.org/jira/browse/NUTCH-1127
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1247:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 CrawlDatum.retries should be int
 

 Key: NUTCH-1247
 URL: https://issues.apache.org/jira/browse/NUTCH-1247
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1247.patch_A, NUTCH-1247.patch_B


 CrawlDatum.retries is a byte and goes bad with larger values.
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1
 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
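
The negative retry values in the log are what a Java byte produces once it is incremented past 127. A minimal sketch of the wrap-around follows; the class and method names are made up for illustration and are not CrawlDatum's.

```java
class RetryCounterSketch {
  /** Increments a counter stored in a byte; the cast silently wraps at Byte.MAX_VALUE. */
  static byte nextAsByte(byte retries) {
    return (byte) (retries + 1);
  }

  /** The same increment on an int keeps counting upwards. */
  static int nextAsInt(int retries) {
    return retries + 1;
  }

  public static void main(String[] args) {
    System.out.println(nextAsByte(Byte.MAX_VALUE)); // wraps to -128
    System.out.println(nextAsInt(Byte.MAX_VALUE));  // 128
  }
}
```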


[jira] [Updated] (NUTCH-208) http: proxy exception list:

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-208:


Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 http: proxy exception list:
 ---

 Key: NUTCH-208
 URL: https://issues.apache.org/jira/browse/NUTCH-208
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8, 1.3, nutchgora
Reporter: Matthias Günter
Assignee: Lewis John McGibbney
Priority: Trivial
  Labels: patch
 Fix For: 1.6

 Attachments: NUTCH-208-branch-1.4-20110210-v3.patch, 
 NUTCH-208-branch-1.4-20110807.patch, NUTCH-208-branch-1.4-20110809-v2.patch, 
 NUTCH-208-trunk-2.0-20110810-v2.patch, NUTCH-208-trunk-2.0-20110810.patch, 
 patch.txt, patch.txt, proxy_exception_list-0.8.diff


 I suggest adding a parameter to nutch-default.xml that allows defining a 
 proxy exception list: 
 <property>
   <name>http.proxy.exception.list</name>
   <value></value>
   <description>URLs and hosts that don't use the proxy (e.g. 
 intranets)</description>
 </property>
 This is useful when scanning intranet/internet combinations from behind a 
 firewall. A preliminary patch showing the changes is attached to this 
 request. We will test it and update it if necessary. This also reflects the 
 reality in web browsers, which in most cases have an exception list.
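
How such a list might be consulted is sketched below. The comma-separated value format, the exact-host matching and all class and method names are assumptions made for illustration; the attached patches may work differently.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ProxyExceptionSketch {
  /** Parses an assumed comma-separated property value into a set of lower-cased hosts. */
  static Set<String> parseExceptionList(String value) {
    Set<String> hosts = new HashSet<>();
    for (String entry : value.split(",")) {
      String host = entry.trim().toLowerCase(Locale.ROOT);
      if (!host.isEmpty()) {
        hosts.add(host);
      }
    }
    return hosts;
  }

  /** Returns false when the URL's host is on the exception list, true otherwise. */
  static boolean useProxy(String url, Set<String> exceptions) {
    String host = URI.create(url).getHost();
    return host == null || !exceptions.contains(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> exceptions = parseExceptionList("intranet.example.org, wiki.local");
    System.out.println(useProxy("http://intranet.example.org/start", exceptions)); // false
    System.out.println(useProxy("http://www.apache.org/", exceptions));            // true
  }
}
```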


[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1031:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Delegate parsing of robots.txt to crawler-commons
 -

 Key: NUTCH-1031
 URL: https://issues.apache.org/jira/browse/NUTCH-1031
 Project: Nutch
  Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
  Labels: robots.txt
 Fix For: 1.6


 We're about to release the first version of Crawler-Commons 
 [http://code.google.com/p/crawler-commons/] which contains a parser for 
 robots.txt files. This parser should also be better than the one we currently 
 have in Nutch. I will delegate this functionality to CC as soon as it is 
 available publicly


[jira] [Updated] (NUTCH-1107) Log slow parse entries

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1107:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Log slow parse entries
 --

 Key: NUTCH-1107
 URL: https://issues.apache.org/jira/browse/NUTCH-1107
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6


 Parse mapper and outputformat should have a facility to log (configurable) 
 slow entries. This is useful for debugging slow parses. Logging parser keys 
 only is not good enough, especially in a distributed environment.
 Sometimes the actual parse (mapper) is very slow and sometimes the 
 normalization and filtering of an entry's outlinks is slow.


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-585:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using Nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, perhaps with the comment 
 strings factored out as constants in the configuration files (say 
 parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest, or to an explanation of why what we are doing is 
 plain wrong!
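
The marker idea can be sketched at the string level as follows. This is only an approximation under stated assumptions: the reporter's change lives inside the parse-html plugin, while this sketch strips the marked regions from raw HTML with a regular expression, and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class IgnoreSectionSketch {
  // (?s) lets '.' match line breaks; the reluctant '.*?' stops at the first STOP marker.
  static final Pattern IGNORED = Pattern.compile(
      "(?s)<!--\\s*START-IGNORE\\s*-->.*?<!--\\s*STOP-IGNORE\\s*-->");

  /** Removes every region enclosed by the two marker comments, markers included. */
  static String stripIgnored(String html) {
    return IGNORED.matcher(html).replaceAll("");
  }

  public static void main(String[] args) {
    String html = "<p>keep</p><!-- START-IGNORE --><a href=\"#\">skip</a>"
        + "<!-- STOP-IGNORE --><p>also keep</p>";
    System.out.println(stripIgnored(html));
  }
}
```

Making the two marker strings configurable, as the reporter suggests, would amount to building the pattern from the configured values instead of the hard-coded ones used here.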


[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1320:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDNs and throw an NPE:
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}
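
The issue does not say where the null comes from. One common way to make tools tolerant of internationalized host names is to convert the host to its ASCII (Punycode) form before using it; the sketch below shows that conversion with java.net.IDN. It is an illustration of the technique only, not the content of the attached patch, and the class name is hypothetical.

```java
import java.net.IDN;

class IdnHostSketch {
  /** Converts an internationalized host name to its ASCII-compatible encoding. */
  static String toAsciiHost(String host) {
    return IDN.toASCII(host);
  }

  public static void main(String[] args) {
    // The host from the issue, written with Unicode escapes so the source stays ASCII.
    String host = "\u4f8b\u5b50.\u6e2c\u8a66";
    System.out.println(toAsciiHost(host));
  }
}
```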


[jira] [Updated] (NUTCH-1126) JUnit test for urlfilter-prefix

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1126:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for urlfilter-prefix
 ---

 Key: NUTCH-1126
 URL: https://issues.apache.org/jira/browse/NUTCH-1126
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1087:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Deprecate crawl command and replace with example script
 ---

 Key: NUTCH-1087
 URL: https://issues.apache.org/jira/browse/NUTCH-1087
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 * remove the crawl command
 * add basic crawl shell script
 See thread:
 http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html


[jira] [Updated] (NUTCH-1128) JUnit test for urlmeta

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1128:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for urlmeta
 --

 Key: NUTCH-1128
 URL: https://issues.apache.org/jira/browse/NUTCH-1128
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1034) Create Solr Velocity templates

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1034:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Create Solr Velocity templates
 --

 Key: NUTCH-1034
 URL: https://issues.apache.org/jira/browse/NUTCH-1034
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: doc.vm.patch, facets.vm.patch


 Solr has Velocity integration and provides an easy method for creating HTML 
 based front-ends for the search engine. This issue tracks the development of 
 Velocity templates specifically for Nutch users.


[jira] [Updated] (NUTCH-1179) Option to restrict generated records by metadata

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1179:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Option to restrict generated records by metadata
 

 Key: NUTCH-1179
 URL: https://issues.apache.org/jira/browse/NUTCH-1179
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 The generator should be able to select entries based on a metadata key/value 
 pair.


[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1300:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Indexer to normalize URL's
 --

 Key: NUTCH-1300
 URL: https://issues.apache.org/jira/browse/NUTCH-1300
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1300-1.5-1.patch


 Indexers should be able to normalize URLs. This is useful when a new 
 normalizer is applied to the entire CrawlDB. Without it, some or all records 
 in a segment cannot be indexed at all.


[jira] [Updated] (NUTCH-1130) JUnit test for Any23 RDF plugin

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1130:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for Any23 RDF plugin
 ---

 Key: NUTCH-1130
 URL: https://issues.apache.org/jira/browse/NUTCH-1130
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 The JUnit test should be written prior to the progression of the Any23 Nutch 
 plugin


[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1047:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Pluggable indexing backends
 ---

 Key: NUTCH-1047
 URL: https://issues.apache.org/jira/browse/NUTCH-1047
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
  Labels: indexing
 Fix For: 1.6


 One possible feature would be to add a new endpoint for indexing backends and 
 make the indexing pluggable. At the moment we are hardwired to Solr, which is 
 OK, but as other resources like ElasticSearch are becoming more popular it 
 would be better to handle this as plugins. Not sure about the name of the 
 endpoint though: we already have indexing-plugins (which are about 
 generating fields sent to the backends), and moreover the backends are not 
 necessarily for indexing / searching but could be just an external storage, 
 e.g. CouchDB. The term backend on its own would be confusing in 2.0, as it 
 could refer to the storage in GORA. 'indexing-backend' is the best 
 name that came to my mind so far; please suggest better ones.
 We should come up with generic map/reduce jobs for indexing, deduplicating 
 and cleaning and maybe add a Nutch extension point there so we can easily 
 hook up indexing, cleaning and deduplicating for various backends.


[jira] [Updated] (NUTCH-1035) Tune Solr config for Nutch users

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1035:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Tune Solr config for Nutch users
 

 Key: NUTCH-1035
 URL: https://issues.apache.org/jira/browse/NUTCH-1035
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: solrconfig.xml


 To improve and ease integration with Solr we should provide a solrconfig.xml 
 specifically for Nutch integration including a request handler with a 
 Velocity response writer.


[jira] [Updated] (NUTCH-1226) Migrate CrawlDbReader to MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1226:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate CrawlDbReader to MapReduce API
 --

 Key: NUTCH-1226
 URL: https://issues.apache.org/jira/browse/NUTCH-1226
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1226-1.5-1.patch


 Hadoop 0.21 only!


[jira] [Updated] (NUTCH-1223) Migrate WebGraph to MapReduce API

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1223:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate WebGraph to MapReduce API
 -

 Key: NUTCH-1223
 URL: https://issues.apache.org/jira/browse/NUTCH-1223
 Project: Nutch
  Issue Type: Sub-task
Reporter: Markus Jelsma
 Fix For: 1.6





[jira] [Updated] (NUTCH-1202) Fetcher timebomb kills long waiting fetch jobs

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1202:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Fetcher timebomb kills long waiting fetch jobs
 --

 Key: NUTCH-1202
 URL: https://issues.apache.org/jira/browse/NUTCH-1202
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Reporter: Markus Jelsma
 Fix For: 1.6


 The timebomb feature kills off mappers of jobs that have been waiting too 
 long in the job queue. The timer should start at mapper initialization 
 instead, not at job initialization.
 Thoughts?
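
The difference between the two starting points is plain arithmetic, sketched below. The class, the method names and the numbers are hypothetical; the sketch only assumes that the time limit is expressed in minutes and that the deadline is the start time plus that limit.

```java
class TimeLimitSketch {
  /** Deadline in milliseconds for a limit that starts counting at startMillis. */
  static long deadline(long startMillis, long limitMinutes) {
    return startMillis + limitMinutes * 60_000L;
  }

  /** True once the current time has reached the deadline. */
  static boolean expired(long nowMillis, long deadlineMillis) {
    return nowMillis >= deadlineMillis;
  }

  public static void main(String[] args) {
    long minute = 60_000L;
    long jobInit = 0;               // job submitted
    long mapperInit = 50 * minute;  // mapper only starts after 50 minutes in the queue
    long now = 65 * minute;         // mapper has been running for 15 minutes

    // Counted from job init, a 60-minute limit has already passed.
    System.out.println(expired(now, deadline(jobInit, 60)));
    // Counted from mapper init, the mapper still has 45 minutes left.
    System.out.println(expired(now, deadline(mapperInit, 60)));
  }
}
```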


[jira] [Updated] (NUTCH-1079) StringBuffer converted to StringBuilder

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1079:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 StringBuffer converted to StringBuilder
 ---

 Key: NUTCH-1079
 URL: https://issues.apache.org/jira/browse/NUTCH-1079
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, indexer
Reporter: Karthik K
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch


 StringBuffer is used all across the codebase, even where thread safety is 
 probably not intended. 
 This patch replaces StringBuffer with StringBuilder, where applicable. 
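
For a buffer that never leaves the method that created it, the replacement is mechanical, as in this minimal sketch (the class and method are invented for the example). StringBuilder offers the same append API as StringBuffer but without synchronization.

```java
class JoinSketch {
  /** Joins the parts with the separator using a method-local, unsynchronized builder. */
  static String join(String[] parts, String separator) {
    StringBuilder sb = new StringBuilder(); // was: new StringBuffer()
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        sb.append(separator);
      }
      sb.append(parts[i]);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(join(new String[] {"fetcher", "indexer"}, ", "));
  }
}
```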


[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1319:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 HostNormalizer
 --

 Key: NUTCH-1319
 URL: https://issues.apache.org/jira/browse/NUTCH-1319
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1319-1.5-1.patch


 Nutch would benefit from having a host normalizer. A host normalizer maps a 
 given host to the desired host. A basic example is to map www.apache.org to 
 apache.org. The Apache website is one of many on the internet that has a 
 duplicate website on the same domain just because it allows both www and 
 non-www to return HTTP 200 and proper content.
 It can also handle wildcards, such as mapping *.example.org to example.org, 
 if there are multiple subdomains that actually point to the same website.
 Large internet crawls tend to get polluted very quickly due to these 
 problems. It also leads to skewed scores in the webgraph as different 
 websites link to different versions of the same duplicate website.
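
A host normalizer along these lines could look roughly like the sketch below. The rule format, the precedence of exact rules over wildcard rules and all names are assumptions for illustration and are not taken from the attached patch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class HostNormalizerSketch {
  private final Map<String, String> exactRules = new HashMap<>();
  // Keyed by the suffix including the leading dot, e.g. ".example.org".
  private final Map<String, String> wildcardRules = new LinkedHashMap<>();

  /** Registers a rule; a source of the form *.domain matches every subdomain of domain. */
  void addRule(String from, String to) {
    if (from.startsWith("*.")) {
      wildcardRules.put(from.substring(1).toLowerCase(Locale.ROOT), to);
    } else {
      exactRules.put(from.toLowerCase(Locale.ROOT), to);
    }
  }

  /** Returns the desired host, or the lower-cased input when no rule applies. */
  String normalize(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String target = exactRules.get(h);
    if (target != null) {
      return target;
    }
    for (Map.Entry<String, String> rule : wildcardRules.entrySet()) {
      if (h.endsWith(rule.getKey())) {
        return rule.getValue();
      }
    }
    return h;
  }

  public static void main(String[] args) {
    HostNormalizerSketch normalizer = new HostNormalizerSketch();
    normalizer.addRule("www.apache.org", "apache.org");
    normalizer.addRule("*.example.org", "example.org");
    System.out.println(normalizer.normalize("www.apache.org"));
    System.out.println(normalizer.normalize("shop.eu.example.org"));
  }
}
```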


[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1140:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 index-more plugin, resetTitle method creates multiple values in the Title 
 field
 ---

 Key: NUTCH-1140
 URL: https://issues.apache.org/jira/browse/NUTCH-1140
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.3
Reporter: Joe Liedtke
Priority: Minor
 Fix For: 1.6

 Attachments: MoreIndexingFilter.093011.patch


 From the comments in MoreIndexingFilter.java, the index-more plugin is meant 
 to reset the Title field of a document if it contains a Content-Disposition 
 header. The current behavior is to add a Title regardless of whether one 
 exists or not, which can cause issues down the line with the Solr Indexing 
 process, and based on a thread in the nutch user list it appears that this is 
 causing some users to mark the title as multi-valued in the schema:
   
 http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8
 The following patch removes the title field before adding a new one, which 
 has resolved the issue for me:
 --- MoreIndexingFilter.old2011-09-30 11:44:35.0 +
 +++ MoreIndexingFilter.java   2011-09-30 09:58:48.0 +
 @@ -276,6 +276,7 @@
      for (int i=0; i<patterns.length; i++) {
        if (matcher.contains(contentDisposition,patterns[i])) {
          result = matcher.getMatch();
 +        doc.removeField("title");
          doc.add("title", result.group(1));
          break;
        }
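The effect of removing the title before adding a new one can be illustrated without Nutch on the classpath. The sketch below is only an illustration: `SimpleDoc` is a hypothetical stand-in for a multi-valued index document, not the real Nutch document class, and `resetTitle` is a name made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ResetTitleSketch {
  // Hypothetical stand-in for a multi-valued index document (not the Nutch class).
  static final class SimpleDoc {
    private final Map<String, List<String>> fields = new HashMap<>();

    // Appends a value; repeated calls make the field multi-valued.
    void add(String name, String value) {
      fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Drops every value stored under the given field name.
    void removeField(String name) {
      fields.remove(name);
    }

    // Returns the stored values, or an empty list when the field is absent.
    List<String> get(String name) {
      return fields.getOrDefault(name, new ArrayList<>());
    }
  }

  // Replaces the title instead of appending a second value to it.
  static void resetTitle(SimpleDoc doc, String newTitle) {
    doc.removeField("title");
    doc.add("title", newTitle);
  }

  public static void main(String[] args) {
    SimpleDoc doc = new SimpleDoc();
    doc.add("title", "Title from the HTML head");
    resetTitle(doc, "report.pdf");
    System.out.println(doc.get("title")); // the list holds one value only
  }
}
```

Without the removal step the field would hold both values, which is what a single-valued Solr field rejects.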


[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1021:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Migrate OutlinkExtractor from Apache ORO to java.util.regex 
 

 Key: NUTCH-1021
 URL: https://issues.apache.org/jira/browse/NUTCH-1021
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1021-1.4-2.patch, NUTCH-1021-1.4-4.patch, 
 NUTCH-1021-1.4.patch


 Migrate from deprecated ORO to Java util regex.
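As a rough idea of what such a migration looks like, here is a minimal sketch of link extraction with `java.util.regex`. The class name and the deliberately simple URL pattern are assumptions made for this example; they are not taken from the attached patches or from OutlinkExtractor itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexOutlinkSketch {
  // Deliberately simple URL pattern, for illustration only.
  private static final Pattern URL_PATTERN =
      Pattern.compile("\\bhttps?://[^\\s<>\"']+");

  // Collects every substring of the text that matches the pattern, in order.
  static List<String> extract(String text) {
    List<String> urls = new ArrayList<>();
    Matcher m = URL_PATTERN.matcher(text);
    while (m.find()) {
      urls.add(m.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(
        extract("See http://nutch.apache.org/ and https://example.org/a?b=c for details"));
  }
}
```

Compiling the pattern once into a static field avoids recompiling it for every document.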


[jira] [Updated] (NUTCH-1039) Fetcher fails for pages without content-length header

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1039:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Fetcher fails for pages without content-length header
 -

 Key: NUTCH-1039
 URL: https://issues.apache.org/jira/browse/NUTCH-1039
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 Fetcher fails:
 2011-07-11 14:45:34,764 ERROR http.Http - 
 org.apache.nutch.protocol.http.api.HttpException: bad content length:
 2011-07-11 14:45:34,765 ERROR http.Http - at 
 org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:218)
 2011-07-11 14:45:34,765 ERROR http.Http - at 
 org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:158)
 2011-07-11 14:45:34,765 ERROR http.Http - at 
 org.apache.nutch.protocol.http.Http.getResponse(Http.java:64)
 2011-07-11 14:45:34,765 ERROR http.Http - at 
 org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138)
 2011-07-11 14:45:34,765 ERROR http.Http - at 
 org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:79)
 Both the fetcher and the indexing filter checker fail sometimes. I'm unsure 
 whether this is something in Nutch or whether the remote server only 
 occasionally returns a Content-Length header.
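The log message ends right after "bad content length:", which suggests the header value was empty or missing. One defensive option, sketched here under that assumption, is to treat an absent or unparsable value as "length unknown" rather than as an error. `parseContentLength` is a hypothetical helper, not a method of the Nutch HTTP plugin.

```java
import java.util.OptionalLong;

class ContentLengthSketch {
  // Returns the declared length, or empty when the header is missing or unusable.
  static OptionalLong parseContentLength(String headerValue) {
    if (headerValue == null) {
      return OptionalLong.empty();
    }
    String trimmed = headerValue.trim();
    if (trimmed.isEmpty()) {
      return OptionalLong.empty();
    }
    try {
      long length = Long.parseLong(trimmed);
      // A negative length is meaningless, so it is treated as unknown too.
      return length >= 0 ? OptionalLong.of(length) : OptionalLong.empty();
    } catch (NumberFormatException e) {
      return OptionalLong.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(parseContentLength("1024"));
    System.out.println(parseContentLength(""));
    System.out.println(parseContentLength(null));
  }
}
```

A caller could then read until the connection closes, or until its configured content limit is reached, whenever the result is empty.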


[jira] [Updated] (NUTCH-1275) Fix [unchecked] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1275:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Fix [unchecked] javac warnings
 --

 Key: NUTCH-1275
 URL: https://issues.apache.org/jira/browse/NUTCH-1275
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 We can simply suppress these warnings using
 {code}
 @SuppressWarnings("unchecked")
 {code}
 However, if there is another way to resolve these warnings, it should be 
 implemented if deemed beneficial to code quality.
 Some resources 
 http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
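The two options discussed in this issue can be contrasted in a small sketch. The class and method names are invented for this example and do not refer to any Nutch code.

```java
import java.util.ArrayList;
import java.util.List;

class UncheckedSketch {
  // Option 1: suppress the warning, scoped to the one method that needs the cast.
  @SuppressWarnings("unchecked")
  static List<String> castLegacy(Object legacy) {
    return (List<String>) legacy; // unchecked cast; the compiler cannot verify it
  }

  // Option 2: avoid the warning altogether by using generics end to end.
  static List<String> copyTyped(List<String> source) {
    return new ArrayList<>(source);
  }

  public static void main(String[] args) {
    List<String> names = new ArrayList<>();
    names.add("nutch");
    System.out.println(castLegacy(names));
    System.out.println(copyTyped(names));
  }
}
```

Keeping the annotation on the smallest possible scope (a method or a local variable rather than a whole class) limits how many genuine problems it can hide.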


[jira] [Updated] (NUTCH-1151) Index-anchor to add numInlinks count

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1151:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Index-anchor to add numInlinks count
 

 Key: NUTCH-1151
 URL: https://issues.apache.org/jira/browse/NUTCH-1151
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6

 Attachments: NUTCH-1151-1.5-1.patch


 Improve index-anchor to add the number of inlinks per document. 
 This count is useful for calculating an authority metric in the search 
 server.


[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1053:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Parsing of RSS feeds fails 
 ---

 Key: NUTCH-1053
 URL: https://issues.apache.org/jira/browse/NUTCH-1053
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.6

 Attachments: nutch-1053.patch, seed.txt


 See discussion on 
 http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
 Have been able to reproduce the problem and will look into it


[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-961:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Expose Tika's boilerpipe support
 

 Key: NUTCH-961
 URL: https://issues.apache.org/jira/browse/NUTCH-961
 Project: Nutch
  Issue Type: New Feature
  Components: parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6

 Attachments: BoilerpipeExtractorRepository.java, 
 NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, 
 NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, 
 NUTCH-961-1.5-1.patch, NUTCH-961v2.patch


 Tika 0.8 comes with the Boilerpipe content handler, which can be used to 
 strip boilerplate content from HTML pages. We should see how we can expose 
 Boilerpipe in the Nutch configuration.


[jira] [Updated] (NUTCH-1118) JUnit test for index-basic

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1118:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for index-basic
 --

 Key: NUTCH-1118
 URL: https://issues.apache.org/jira/browse/NUTCH-1118
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1284:
-

Fix Version/s: (was: 1.5)
   (was: nutchgora)
   1.6

20120304-push-1.6

 Add site fetcher.max.crawl.delay as log output by default.
 --

 Key: NUTCH-1284
 URL: https://issues.apache.org/jira/browse/NUTCH-1284
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Trivial
 Fix For: 1.6


 Currently, when manually scanning our log output we cannot infer which pages 
 are governed by a crawl delay between successive fetch attempts of any given 
 page within the site. The value should be made available as something like:
 {code}
 2012-02-19 12:33:33,031 INFO  fetcher.Fetcher - fetching 
 http://nutch.apache.org/ (crawl.delay=XXXms)
 {code}
 This way we can easily and quickly determine whether the fetcher is having to 
 use this functionality or not. 
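A minimal sketch of how the message part of such a log line could be assembled. `fetchingMessage` is a hypothetical helper; the timestamp, level and logger name in the example above would come from the logging framework, not from this method.

```java
class CrawlDelayLogSketch {
  // Builds only the message; the logger itself adds timestamp, level and category.
  static String fetchingMessage(String url, long crawlDelayMs) {
    return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
  }

  public static void main(String[] args) {
    System.out.println(fetchingMessage("http://nutch.apache.org/", 5000));
  }
}
```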


[jira] [Updated] (NUTCH-1149) DomainStats should process numeric CrawlDB metadata

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1149:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 DomainStats should process numeric CrawlDB metadata
 ---

 Key: NUTCH-1149
 URL: https://issues.apache.org/jira/browse/NUTCH-1149
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.6


 Right now the DomainStats program only outputs the sum of fetched records per 
 domain or host. It should also be able to output processed numerics of meta 
 data in order to get the average size (content length) for a given domain or 
 host. This is also useful for generating a metric for adult material (by 
 domain or host) when using a plugin that stores a probability factor of adult 
 material per URL in the Crawl DB.
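The kind of aggregation asked for here can be sketched in plain Java, outside of MapReduce. The record layout (host plus one numeric metadata value) and all names are assumptions made for this example, not the DomainStats implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class HostAverageSketch {
  // Running sum and count for one host.
  static final class Stat {
    double sum;
    long count;
  }

  // Each record is a pair: {host, numeric metadata value as a string}.
  static Map<String, Double> averageByHost(String[][] records) {
    Map<String, Stat> stats = new HashMap<>();
    for (String[] record : records) {
      Stat stat = stats.computeIfAbsent(record[0], k -> new Stat());
      stat.sum += Double.parseDouble(record[1]);
      stat.count++;
    }
    // Sorted output makes the result easier to read and compare.
    Map<String, Double> averages = new TreeMap<>();
    for (Map.Entry<String, Stat> e : stats.entrySet()) {
      averages.put(e.getKey(), e.getValue().sum / e.getValue().count);
    }
    return averages;
  }

  public static void main(String[] args) {
    String[][] records = {
      {"a.example.org", "100"},
      {"a.example.org", "300"},
      {"b.example.org", "50"},
    };
    System.out.println(averageByHost(records));
  }
}
```

In a MapReduce job the sum and count would be carried per key through the reducer in the same way, with the division done at the very end.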


[jira] [Updated] (NUTCH-1181) Indexer to use webgraph inlinks

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1181:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Indexer to use webgraph inlinks
 ---

 Key: NUTCH-1181
 URL: https://issues.apache.org/jira/browse/NUTCH-1181
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 Indexers currently rely on the LinkDB for anchor indexing while the WebGraph 
 provides the same data as an inverted link DB. An inlinkDB created by the 
 WebGraph program with non-zero LinkRank scores on the nodes also provides an 
 improved set ordered by popularity.
 This issue must:
 - let IndexerMapReduce understand the new format;
 - allow for indexing only popular anchors.
 The goal is to deprecate all code associated with invertlinks and ultimately 
 remove it from the codebase.


[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1117:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 JUnit test for index-anchor
 ---

 Key: NUTCH-1117
 URL: https://issues.apache.org/jira/browse/NUTCH-1117
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.


[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config
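The three points above could look roughly like this. It is a sketch under stated assumptions: the values are taken to be seconds, `#` comment lines are an addition of this example, "keep max fetchInterval fixed" is read as a cap, and none of the names come from the attached MimeAdaptiveFetchSchedule.java.

```java
import java.util.HashMap;
import java.util.Map;

class MimeIntervalSketch {
  // Parses "mime/type<TAB>interval" lines; blank lines and '#' comments are skipped.
  static Map<String, Integer> parse(String config) {
    Map<String, Integer> intervals = new HashMap<>();
    for (String rawLine : config.split("\n")) {
      String line = rawLine.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\t");
      if (parts.length != 2) {
        continue; // malformed lines are ignored in this sketch
      }
      intervals.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }
    return intervals;
  }

  // Interval for a NEW document: MIME-specific if configured, never above the maximum.
  static int initialInterval(Map<String, Integer> intervals, String mimeType,
                             int defaultInterval, int maxInterval) {
    int interval = intervals.getOrDefault(mimeType, defaultInterval);
    return Math.min(interval, maxInterval);
  }

  public static void main(String[] args) {
    Map<String, Integer> intervals =
        parse("# mime type\tinterval\ntext/html\t3600\napplication/pdf\t86400\n");
    System.out.println(initialInterval(intervals, "text/html", 2592000, 7776000));
    System.out.println(initialInterval(intervals, "image/png", 2592000, 7776000));
  }
}
```

Documents that already have a fetch interval would bypass `initialInterval` entirely, which is how the "only for new documents" point is reflected here.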


[jira] [Updated] (NUTCH-1317) Max content length by MIME-type

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1317:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Max content length by MIME-type
 ---

 Key: NUTCH-1317
 URL: https://issues.apache.org/jira/browse/NUTCH-1317
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.6


 The good old http.content.limit directive is not sufficient in large 
 internet crawls. For example, a 5MB PDF file may be parsed without issues but 
 a 5MB HTML file may time out.
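One possible shape for a per-type limit is a lookup table with a global fallback. The sketch below is an assumption about how this could work, not a description of any existing Nutch configuration; the class, its methods and the byte values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MimeContentLimitSketch {
  private final Map<String, Integer> limits = new HashMap<>();
  private final int defaultLimit;

  // The default applies to every MIME type without an explicit entry.
  MimeContentLimitSketch(int defaultLimit) {
    this.defaultLimit = defaultLimit;
  }

  void setLimit(String mimeType, int maxBytes) {
    limits.put(mimeType, maxBytes);
  }

  int limitFor(String mimeType) {
    return limits.getOrDefault(mimeType, defaultLimit);
  }

  // True when content of this type and size should be passed on to the parser.
  boolean accepts(String mimeType, int contentLength) {
    return contentLength <= limitFor(mimeType);
  }

  public static void main(String[] args) {
    MimeContentLimitSketch limits = new MimeContentLimitSketch(65536);
    limits.setLimit("application/pdf", 10 * 1024 * 1024);
    limits.setLimit("text/html", 1024 * 1024);
    int fiveMb = 5 * 1024 * 1024;
    System.out.println(limits.accepts("application/pdf", fiveMb));
    System.out.println(limits.accepts("text/html", fiveMb));
  }
}
```

With these example values the 5MB PDF is accepted while the 5MB HTML file is rejected, matching the scenario described in the issue.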


[jira] [Updated] (NUTCH-1277) Fix [fallthrough] javac warnings

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1277:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Fix [fallthrough] javac warnings
 

 Key: NUTCH-1277
 URL: https://issues.apache.org/jira/browse/NUTCH-1277
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
 Fix For: nutchgora, 1.6


 This usually occurs when one or more cases in a switch statement fall 
 through (that is, one or more break statements are missing).
 We need to determine where a simple
 {code}
 @SuppressWarnings("fallthrough")
 {code}
 is required or whether we need to include the break statements in switch 
 blocks.
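Both outcomes can be shown side by side. In this sketch the first method keeps an intentional fall-through and suppresses the warning, while the second one has a break in every case. The methods and their values are invented for illustration.

```java
class FallthroughSketch {
  // Intentional fall-through: the warning is suppressed and the intent is commented.
  @SuppressWarnings("fallthrough")
  static int stepsRemaining(int stage) {
    int steps = 0;
    switch (stage) {
      case 1:
        steps++; // falls through on purpose
      case 2:
        steps++; // falls through on purpose
      case 3:
        steps++;
        break;
      default:
        break;
    }
    return steps;
  }

  // Accidental fall-through is fixed by giving every case its own break.
  static String label(int status) {
    String label;
    switch (status) {
      case 1:
        label = "fetched";
        break;
      case 2:
        label = "gone";
        break;
      default:
        label = "unknown";
        break;
    }
    return label;
  }

  public static void main(String[] args) {
    System.out.println(stepsRemaining(1));
    System.out.println(label(2));
  }
}
```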


[jira] [Updated] (NUTCH-1215) UpdateDB should not require segment as input

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1215:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 UpdateDB should not require segment as input
 

 Key: NUTCH-1215
 URL: https://issues.apache.org/jira/browse/NUTCH-1215
 Project: Nutch
  Issue Type: Bug
  Components: linkdb
Affects Versions: 1.4
Reporter: Markus Jelsma
 Fix For: 1.6

 Attachments: NUTCH-1215-1.5-1.patch


 UpdateDB requires an input segment. This causes the metrics for the records 
 of the segment to change, e.g. from fetched to not_modified and changes an 
 adaptive fetch schedule accordingly. This should not happen when one needs to 
 update for filtering, normalizing, or other maintenance.


[jira] [Updated] (NUTCH-1103) Port protocol-sftp to 1.4

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1103:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Port protocol-sftp to 1.4
 -

 Key: NUTCH-1103
 URL: https://issues.apache.org/jira/browse/NUTCH-1103
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 Port protocol-sftp from trunk back to 1.4


[jira] [Updated] (NUTCH-1088) Write Solr XML documents

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1088:
-

Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 Write Solr XML documents
 

 Key: NUTCH-1088
 URL: https://issues.apache.org/jira/browse/NUTCH-1088
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6


 Documents need to be reindexed when index-time analysis is modified. Indexing 
 individual segments from Nutch is tedious, especially for small segments. 
 This issue should add a feature that can write XML batches.


[jira] [Updated] (NUTCH-828) Fetch Filter

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Markus Jelsma updated NUTCH-828:

Fix Version/s: (was: 1.5)
(was: nutchgora)
1.6

20120304-push-1.6

Fetch Filter

Key: NUTCH-828
URL: https://issues.apache.org/jira/browse/NUTCH-828
Project: Nutch
Issue Type: New Feature
Components: fetcher
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.6

Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch

Adds a Nutch extension point for a fetch filter. The fetch filter allows
filtering content and parse data/text after it is fetched but before it is
written to segments. The filter can return true if content is to be written
or false if it is not.
Some use cases for this filter would be topical search engines that only want
to fetch/index certain types of content, for example a news or sports only
search engine. In these types of situations the only way to determine if
content belongs to a particular set is to fetch the page and then analyze the
content. If the content passes, meaning belongs to the set of say sports
pages, then we want to include it. If it doesn't then we want to ignore it,
never fetch that same page in the future, and ignore any urls on that page.
If content is rejected due to a fetch filter then its status is written to
the CrawlDb as gone and its content is ignored and not written to segments.
This effectively stops crawling along the crawl path of that page and the urls
from that page. An example filter, fetch-safe, is provided that allows
fetching content that does not contain a list of bad words.
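A rough sketch of the idea, assuming a much simpler interface than a real Nutch extension point would have. `FetchFilter`, `accept` and `BadWordFilter` are hypothetical names for this example and are not taken from the attached patches.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class FetchFilterSketch {
  // Hypothetical extension-point shape: true means "write this page to the segment".
  interface FetchFilter {
    boolean accept(String url, String parsedText);
  }

  // Rejects any page whose text contains one of the configured bad words.
  static final class BadWordFilter implements FetchFilter {
    private final Set<String> badWords = new HashSet<>();

    BadWordFilter(String... words) {
      for (String word : words) {
        badWords.add(word.toLowerCase(Locale.ROOT));
      }
    }

    @Override
    public boolean accept(String url, String parsedText) {
      // Splitting on non-word characters gives a crude, case-insensitive word match.
      for (String token : parsedText.toLowerCase(Locale.ROOT).split("\\W+")) {
        if (badWords.contains(token)) {
          return false;
        }
      }
      return true;
    }
  }

  public static void main(String[] args) {
    FetchFilter filter = new BadWordFilter("casino");
    System.out.println(filter.accept("http://example.org/a", "Sports results for today"));
    System.out.println(filter.accept("http://example.org/b", "Visit our Casino today"));
  }
}
```

A page for which `accept` returns false would be the one whose status is recorded as gone and whose outlinks are dropped.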


[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-30 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Attachment: NUTCH-1024-1.5-3.patch

New patch with proper logging and configuration files.

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config


[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-30 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Attachment: NUTCH-1024-1.5-3.patch

Something went wrong here. 

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config


[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-30 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Attachment: (was: NUTCH-1024-1.5-3.patch)

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config


[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-29 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Attachment: NUTCH-1024-1.5-2.patch

New patch for 1.5 with modifications as per Julien's comments.

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
 NUTCH-1024-1.5-2.patch, Nutch.patch, adaptive-mimetypes.txt


 Add facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources for files that are known to change 
 frequently or never and everything in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config


[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-03-27 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1320:
-

Attachment: NUTCH-1320-1.5-1.patch

Patch for 1.5. URLUtil now has toASCII and toUnicode methods wrapping the 
java.net.IDN methods. These take a URL and return a normalized one.
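For readers unfamiliar with `java.net.IDN`, the sketch below shows one way such wrappers might look. It is not the patch itself: the class name, the use of String arguments and the error handling are assumptions, and only `IDN.toASCII` and `IDN.toUnicode` are the real JDK calls.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

class IdnUrlSketch {
  // Converts the host of a URL to its ASCII (Punycode) form.
  static String toASCII(String url) {
    return rewriteHost(url, true);
  }

  // Converts a Punycode host back to its Unicode form.
  static String toUnicode(String url) {
    return rewriteHost(url, false);
  }

  // Only the host is converted; protocol, port, path and query are kept.
  // A fragment (#...) is dropped because URL.getFile() does not include it.
  private static String rewriteHost(String url, boolean ascii) {
    try {
      URL parsed = new URL(url);
      String host = ascii ? IDN.toASCII(parsed.getHost()) : IDN.toUnicode(parsed.getHost());
      return new URL(parsed.getProtocol(), host, parsed.getPort(), parsed.getFile()).toString();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + url, e);
    }
  }

  public static void main(String[] args) {
    // "b\u00fccher" is "bücher"; the escape keeps the source file ASCII-only.
    System.out.println(toASCII("http://b\u00fccher.example/index.html"));
    System.out.println(toUnicode("http://xn--bcher-kva.example/index.html"));
  }
}
```

Percent-encoded paths such as the one in the issue description are passed through untouched, since only the host part is subject to IDN conversion.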

 IndexChecker and ParseChecker choke on IDN's
 

 Key: NUTCH-1320
 URL: https://issues.apache.org/jira/browse/NUTCH-1320
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1320-1.5-1.patch


 These handy debug tools do not handle IDNs and throw an NPE:
 bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
 {code}
 Exception in thread "main" java.lang.NullPointerException
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at 
 org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
 {code}


[jira] [Updated] (NUTCH-1234) Upgrade to Tika 1.1

2012-03-26 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1234:
-

Attachment: NUTCH-1234-1.5-1.patch

Patch for 1.5 upgrading to Tika-core 1.1 and upgrading Hadoop test to 1.0.0. 
All tests pass. Will commit shortly unless there are objections.

 Upgrade to Tika 1.1
 ---

 Key: NUTCH-1234
 URL: https://issues.apache.org/jira/browse/NUTCH-1234
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1234-1.5-1.patch





[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-03-22 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1319:
-

Patch Info: Patch Available

 HostNormalizer
 --

 Key: NUTCH-1319
 URL: https://issues.apache.org/jira/browse/NUTCH-1319
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1319-1.5-1.patch


 Nutch would benefit from having a host normalizer. A host normalizer maps a 
 given host to the desired host. A basic example is to map www.apache.org to 
 apache.org. The Apache website is one of many on the internet that has a 
 duplicate website on the same domain just because it allows both www and 
 non-www to return HTTP 200 and proper content.
 It is also able to handle wildcards such as *.example.org to example.org if 
 there are multiple sub domains that actually point to the same website.
 Large internet crawls tend to get polluted very quickly due to these 
 problems. It also leads to skewed scores in the webgraph as different 
 websites link to different versions of the same duplicate website.
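The mapping described above, with exact and wildcard rules, might be sketched as follows. The rule syntax and every name here are assumptions for illustration; the format actually used by the attached patch may differ.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class HostNormalizerSketch {
  private final Map<String, String> exactRules = new LinkedHashMap<>();
  // Wildcard rules are stored by suffix, e.g. ".example.org" for "*.example.org".
  private final Map<String, String> suffixRules = new LinkedHashMap<>();

  // "*.example.org" matches any sub domain; any other source must match exactly.
  void addRule(String source, String target) {
    String normalizedSource = source.toLowerCase(Locale.ROOT);
    if (normalizedSource.startsWith("*.")) {
      suffixRules.put(normalizedSource.substring(1), target);
    } else {
      exactRules.put(normalizedSource, target);
    }
  }

  // Returns the desired host, or the (lower-cased) input when no rule applies.
  String normalize(String host) {
    String lower = host.toLowerCase(Locale.ROOT);
    String target = exactRules.get(lower);
    if (target != null) {
      return target;
    }
    for (Map.Entry<String, String> rule : suffixRules.entrySet()) {
      if (lower.endsWith(rule.getKey())) {
        return rule.getValue();
      }
    }
    return lower;
  }

  public static void main(String[] args) {
    HostNormalizerSketch normalizer = new HostNormalizerSketch();
    normalizer.addRule("www.apache.org", "apache.org");
    normalizer.addRule("*.example.org", "example.org");
    System.out.println(normalizer.normalize("www.apache.org"));
    System.out.println(normalizer.normalize("shop.example.org"));
    System.out.println(normalizer.normalize("nutch.apache.org"));
  }
}
```

Exact rules are checked before wildcard rules, so a specific host can be excluded from a broader wildcard mapping by giving it its own entry.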


[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-03-22 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1319:
-

Attachment: NUTCH-1319-1.5-1.patch

Patch for 1.5.

 HostNormalizer
 --

 Key: NUTCH-1319
 URL: https://issues.apache.org/jira/browse/NUTCH-1319
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1319-1.5-1.patch


 Nutch would benefit from having a host normalizer. A host normalizer maps a 
 given host to the desired host. A basic example is to map www.apache.org to 
 apache.org. The Apache website is one of many on the internet that has a 
 duplicate website on the same domain just because it allows both www and 
 non-www to return HTTP 200 and proper content.
 It is also able to handle wildcards such as *.example.org to example.org if 
 there are multiple sub domains that actually point to the same website.
 Large internet crawls tend to get polluted very quickly due to these 
 problems. It also leads to skewed scores in the webgraph as different 
 websites link to different versions of the same duplicate website.


[jira] [Updated] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries

2012-03-08 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1305:
-

Attachment: NUTCH-1305-1.5-1.patch

Patch for 1.5. Fixes the issue.

 Domain(blacklist)URLFilter to trim entries
 --

 Key: NUTCH-1305
 URL: https://issues.apache.org/jira/browse/NUTCH-1305
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1305-1.5-1.patch


 Both filters should handle entries with trailing whitespace.
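
A minimal sketch of the kind of trimming meant here, assuming the filter entries arrive as lines of text. The class DomainListSketch and the method cleanEntries are hypothetical names, and skipping blank lines and lines starting with # is an assumption for the example, not something the issue states.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; it only illustrates trimming entries before they are matched.
final class DomainListSketch {
    static Set<String> cleanEntries(List<String> lines) {
        Set<String> entries = new LinkedHashSet<>();
        for (String line : lines) {
            // trim() drops leading and trailing whitespace, including tabs.
            String entry = line.trim();
            // Assumption: blank lines and '#' comment lines are not entries.
            if (entry.isEmpty() || entry.startsWith("#")) {
                continue;
            }
            entries.add(entry);
        }
        return entries;
    }
}
```

Without the trim, an entry such as "apache.org " (with a trailing space) would never equal the host or domain extracted from a URL, so the rule would silently have no effect.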


[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's

2012-03-07 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1300:
-

Attachment: NUTCH-1300-1.5-1.patch

Patch for 1.5.

 Indexer to normalize URL's
 --

 Key: NUTCH-1300
 URL: https://issues.apache.org/jira/browse/NUTCH-1300
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: NUTCH-1300-1.5-1.patch


 Indexers should be able to normalize URLs. This is useful when a new 
 normalizer is applied to the entire CrawlDB. Without it, some or all records 
 in a segment cannot be indexed at all.


[jira] [Updated] (NUTCH-1299) NPE in LinkRank inverter

2012-03-06 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1299:
-

Patch Info: Patch Available

 NPE in LinkRank inverter
 

 Key: NUTCH-1299
 URL: https://issues.apache.org/jira/browse/NUTCH-1299
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.5


 No Node object is passed from the inverter's mapper to the reducer, which 
 expects one, causing the following exception:
 {code}
 java.lang.NullPointerException
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)
 {code}
 This never happens unless you have a funky web graph. Our web graph changes 
 frequently, adding and deleting records. It's likely a large number of 
 records deleted from the outlink database is responsible for this. This 
 error, however, only showed up now, a great deal of time after we began 
 deleting records.


[jira] [Updated] (NUTCH-1299) NPE in LinkRank inverter

2012-03-06 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1299:
-

Attachment: NUTCH-1299-1.5-1.patch

The most likely solution is to check whether a LoopSet enters the reducer 
without an accompanying Node or LinkDatum object, both of which are mandatory.
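
As a rough illustration of that check, the sketch below guards a reduce step against a missing Node. It does not use the real Nutch or Hadoop classes: Node and LinkDatum here are simplified stand-ins, and InverterReduceSketch and its reduce signature are hypothetical. The actual change is in the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Simplified stand-in for a reducer; not the actual LinkRank.Inverter code.
final class InverterReduceSketch {
    private static final Logger LOG = Logger.getLogger("InverterReduceSketch");

    // Stand-ins for the webgraph value types (assumed shapes, illustration only).
    static final class Node {
        final int numOutlinks;
        Node(int numOutlinks) { this.numOutlinks = numOutlinks; }
    }

    static final class LinkDatum {
        final String url;
        LinkDatum(String url) { this.url = url; }
    }

    // Collects the values for one key and emits nothing if the mandatory Node is absent.
    static List<String> reduce(String key, List<Object> values) {
        Node node = null;
        List<LinkDatum> outlinks = new ArrayList<>();
        for (Object value : values) {
            if (value instanceof Node) {
                node = (Node) value;
            } else if (value instanceof LinkDatum) {
                outlinks.add((LinkDatum) value);
            }
        }
        if (node == null) {
            // Guard: without it, the dereference of node below would throw an NPE.
            LOG.warning("No Node found for " + key + ", skipping record");
            return new ArrayList<>();
        }
        List<String> inverted = new ArrayList<>();
        for (LinkDatum link : outlinks) {
            inverted.add(link.url + " <- " + key + " (" + node.numOutlinks + " outlinks)");
        }
        return inverted;
    }
}
```

Skipping the record and logging a warning keeps the job running on an inconsistent web graph, at the cost of dropping the affected links from that run.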

 NPE in LinkRank inverter
 

 Key: NUTCH-1299
 URL: https://issues.apache.org/jira/browse/NUTCH-1299
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.5

 Attachments: NUTCH-1299-1.5-1.patch


 No Node object is passed from the inverter's mapper to the reducer, which 
 expects one, causing the following exception:
 {code}
 java.lang.NullPointerException
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)
 {code}
 This never happens unless you have a funky web graph. Our web graph changes 
 frequently, adding and deleting records. It's likely a large number of 
 records deleted from the outlink database is responsible for this. This 
 error, however, only showed up now, a great deal of time after we began 
 deleting records.


[jira] [Updated] (NUTCH-1299) NPE in LinkRank inverter

2012-03-06 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1299:
-

Attachment: NUTCH-1299-1.5-2.patch

New patch logs a warning with a proper error message.

 NPE in LinkRank inverter
 

 Key: NUTCH-1299
 URL: https://issues.apache.org/jira/browse/NUTCH-1299
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Critical
 Fix For: 1.5

 Attachments: NUTCH-1299-1.5-1.patch, NUTCH-1299-1.5-2.patch


 No Node object is passed from the inverter's mapper to the reducer, which 
 expects one, causing the following exception:
 {code}
 java.lang.NullPointerException
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)
 {code}
 This never happens unless you have a funky web graph. Our web graph changes 
 frequently, adding and deleting records. It's likely a large number of 
 records deleted from the outlink database is responsible for this. This 
 error, however, only showed up now, a great deal of time after we began 
 deleting records.


[jira] [Updated] (NUTCH-1299) LinkRank inverter to ignore records without Node

2012-03-06 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1299:
-

Priority: Major  (was: Critical)
 Summary: LinkRank inverter to ignore records without Node  (was: NPE in 
LinkRank inverter)

 LinkRank inverter to ignore records without Node
 

 Key: NUTCH-1299
 URL: https://issues.apache.org/jira/browse/NUTCH-1299
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.5

 Attachments: NUTCH-1299-1.5-1.patch, NUTCH-1299-1.5-2.patch


 No Node object is passed from the inverter's mapper to the reducer, which 
 expects one, causing the following exception:
 {code}
 java.lang.NullPointerException
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
     at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:396)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
     at org.apache.hadoop.mapred.Child.main(Child.java:249)
 {code}
 This never happens unless you have a funky web graph. Our web graph changes 
 frequently, adding and deleting records. It's likely a large number of 
 records deleted from the outlink database is responsible for this. This 
 error, however, only showed up now, a great deal of time after we began 
 deleting records.


[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-03-02 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1024:
-

Attachment: NUTCH-1024-1.5-1.patch

New patch for trunk! This also includes a change to the injector where the 
injected fetchInterval is added to the CrawlDatum metadata. In 
AdaptiveFetchSchedule this injected interval overrides anything else.

 Dynamically set fetchInterval by MIME-type
 --

 Key: NUTCH-1024
 URL: https://issues.apache.org/jira/browse/NUTCH-1024
 Project: Nutch
  Issue Type: New Feature
  Components: generator
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: AdaptiveFetchSchedule.patch, 
 MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, 
 adaptive-mimetypes.txt


 Add a facility to configure default or fixed fetchInterval values by MIME-type. 
 This is useful for conserving resources on files that are known to change 
 frequently, never, or anywhere in between.
 * simple key\tvalue\n configuration file
 * only set fetchInterval for new documents
 * keep max fetchInterval fixed by current config
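
The three points above could look roughly like this. It is a sketch under assumptions, not the attached MimeAdaptiveFetchSchedule.java: the class MimeIntervalSketch, its parse and intervalFor methods, the use of seconds as the unit, and the sample values are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a per-MIME-type fetch interval lookup.
final class MimeIntervalSketch {
    // Parses "mime-type<TAB>interval" lines; the interval is assumed to be in seconds.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> byMime = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\t");
            if (parts.length != 2) {
                continue; // assumption: malformed lines are skipped
            }
            // Integer.parseInt throws NumberFormatException on a non-numeric interval.
            byMime.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
        }
        return byMime;
    }

    // Applies the configured interval to new documents only, capped at maxInterval.
    static int intervalFor(Map<String, Integer> byMime, String mimeType,
                           boolean isNewDocument, int currentInterval, int maxInterval) {
        if (!isNewDocument) {
            return currentInterval; // existing documents keep their interval
        }
        int configured = byMime.getOrDefault(mimeType, currentInterval);
        return Math.min(configured, maxInterval);
    }
}
```

For example, with a line "text/html<TAB>3600" a newly discovered HTML page would start at a 3600 second interval, while a type configured above the maximum would be capped at the maximum.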

