[jira] [Updated] (NUTCH-1341) NotModified time set to now but page not modified
[ https://issues.apache.org/jira/browse/NUTCH-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1341:
    Attachment: NUTCH-1341-1.6-1.patch

Here's a patch for 1.6. It simply resets the modifiedTime to the CrawlDatum's previous value right after the reducer sets a STATUS_DB_NOTMODIFIED status value. Since I believe the status is correct, I assume the modifiedTime value can be reset here as well. Please comment. Did I overlook something? Should it be implemented differently? Thanks

NotModified time set to now but page not modified

    Key: NUTCH-1341
    URL: https://issues.apache.org/jira/browse/NUTCH-1341
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.5
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: NUTCH-1341-1.6-1.patch

Servers tend to respond with an incorrect value, or no value at all, for LastModified. By comparing signatures, or when (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED), the reducer correctly sets the db_notmodified status for the CrawlDatum. The modifiedTime value, however, is not set accordingly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
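The idea in the NUTCH-1341 comment can be illustrated with a small stand-alone sketch. It is not the attached patch: the Datum class, the Status enum and the update() method are hypothetical stand-ins for CrawlDatum and the CrawlDb reducer, and the only point shown is that a not-modified result keeps the previous modifiedTime instead of the current time.

```java
// Hypothetical stand-ins; this is NOT org.apache.nutch.crawl.CrawlDatum
// or the real reducer, only an illustration of the idea in the comment.
final class NotModifiedSketch {
    enum Status { DB_FETCHED, DB_NOTMODIFIED }

    static final class Datum {
        Status status;
        long modifiedTime;

        Datum(Status status, long modifiedTime) {
            this.status = status;
            this.modifiedTime = modifiedTime;
        }
    }

    /**
     * Merges a fetch result into the previous CrawlDb entry.
     * contentUnchanged stands for "signatures are equal or the fetch
     * status was fetch_notmodified".
     */
    static Datum update(Datum old, boolean contentUnchanged, long now) {
        // By default the entry is treated as freshly fetched and modified now.
        Datum result = new Datum(Status.DB_FETCHED, now);
        if (contentUnchanged) {
            result.status = Status.DB_NOTMODIFIED;
            // The reset described in the comment: keep the previous value.
            result.modifiedTime = old.modifiedTime;
        }
        return result;
    }
}
```

With this shape, an unchanged page keeps reporting the time of its last real modification, which is what the issue title asks for.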
[jira] [Updated] (NUTCH-1336) Optionally not index db_notmodified pages
[ https://issues.apache.org/jira/browse/NUTCH-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1336:
    Attachment: NUTCH-1336-1.6-1.patch

Patch for 1.6.

Optionally not index db_notmodified pages

    Key: NUTCH-1336
    URL: https://issues.apache.org/jira/browse/NUTCH-1336
    Project: Nutch
    Issue Type: New Feature
    Components: indexer
    Affects Versions: 1.5
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: NUTCH-1336-1.6-1.patch

IndexerMapReduce already skips pages with fetch_notmodified as status. However, regardless of the fetch status, we may still consider a page not modified if its status is db_notmodified.
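A rough sketch of the decision described in NUTCH-1336, under stated assumptions: the enums and the skipDbNotModified flag are hypothetical names, not the ones used by the patch or by IndexerMapReduce. It only shows the existing skip on fetch_notmodified plus an optional skip on db_notmodified.

```java
// Hypothetical names throughout; not the real IndexerMapReduce code.
final class IndexFilterSketch {
    enum FetchStatus { FETCH_SUCCESS, FETCH_NOTMODIFIED }
    enum DbStatus { DB_FETCHED, DB_NOTMODIFIED }

    /** Returns true if the page should be sent to the index. */
    static boolean shouldIndex(FetchStatus fetch, DbStatus db, boolean skipDbNotModified) {
        // Behaviour the issue describes as already present.
        if (fetch == FetchStatus.FETCH_NOTMODIFIED) {
            return false;
        }
        // Optional behaviour the issue proposes, controlled by a flag.
        if (skipDbNotModified && db == DbStatus.DB_NOTMODIFIED) {
            return false;
        }
        return true;
    }
}
```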
[jira] [Updated] (NUTCH-1335) OutlinkDB to collect unique URL's only
[ https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1335:
    Description: The aggregating code in the Outlink reducer does not take care of incoming duplicates. When the input segments contain duplicates of a single URL they are all collected. (was: The OutlinkDB may contain duplicates if a segment is added more than once. The aggregating code in the reducer does not take care of removing duplicates. See: http://mail-archives.apache.org/mod_mbox/nutch-user/201204.mbox/%3c39d7bed10f572c3211c3ad91c8a37...@openindex.io%3E)
    Patch Info: Patch Available
    Summary: OutlinkDB to collect unique URL's only (was: OutlinkDB to emit unique URL's only)

OutlinkDB to collect unique URL's only

    Key: NUTCH-1335
    URL: https://issues.apache.org/jira/browse/NUTCH-1335
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.5
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6

The aggregating code in the Outlink reducer does not take care of incoming duplicates. When the input segments contain duplicates of a single URL they are all collected.
[jira] [Updated] (NUTCH-1335) OutlinkDB to collect unique URL's only
[ https://issues.apache.org/jira/browse/NUTCH-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1335:
    Attachment: NUTCH-1335-1.6-1.patch

Patch for 1.5. The reducer now only collects records whose timestamp is equal to or higher than the mostRecent timestamp. This can still result in duplicates in the aggregated collection, but not a significant number. The patch seems to work, as the troubled reducer finished nicely. I'll test with a few more runs, each with a very large number of input records that also contain duplicates.

OutlinkDB to collect unique URL's only

    Key: NUTCH-1335
    URL: https://issues.apache.org/jira/browse/NUTCH-1335
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.5
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: NUTCH-1335-1.6-1.patch

The aggregating code in the Outlink reducer does not take care of incoming duplicates. When the input segments contain duplicates of a single URL they are all collected.
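One possible reading of the filter described in the comment, as a stand-alone sketch. The Link class and the collect() method are hypothetical and the real patch may differ; the sketch assumes a single pass in which records older than the newest timestamp seen so far are dropped. It also shows why duplicates can remain: two records with the same timestamp both pass the check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the code from NUTCH-1335-1.6-1.patch.
final class OutlinkCollectSketch {
    static final class Link {
        final String toUrl;
        final long timestamp;

        Link(String toUrl, long timestamp) {
            this.toUrl = toUrl;
            this.timestamp = timestamp;
        }
    }

    /** Keeps only records whose timestamp is >= the newest one seen so far. */
    static List<Link> collect(Iterable<Link> values) {
        long mostRecent = 0L;
        List<Link> kept = new ArrayList<>();
        for (Link link : values) {
            if (link.timestamp < mostRecent) {
                continue; // older than something already seen: skip it
            }
            mostRecent = link.timestamp;
            kept.add(link);
        }
        return kept;
    }
}
```

Records collected before a newer timestamp appears are not removed in this sketch, so the result depends on input order; a stricter variant would need a second pass or a set keyed on the target URL.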
[jira] [Updated] (NUTCH-1330) OutlinkDB to preserve back up
[ https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1330:
    Attachment: NUTCH-1330-1.6-2.patch

The previous patch is bad and came from an old checkout. This is the proper patch.

OutlinkDB to preserve back up

    Key: NUTCH-1330
    URL: https://issues.apache.org/jira/browse/NUTCH-1330
    Project: Nutch
    Issue Type: Improvement
    Affects Versions: 1.4
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch

The webgraph's outlinkDB is the single source for all scoring jobs and GB's that eventually come out. In case of disaster, which has not happened yet, it should be able to preserve a backup just like the other DBs. This means users with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to crawl/webgraphdb/outlinks/current/.
[jira] [Updated] (NUTCH-1330) OutlinkDB to preserve back up
[ https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1330:
    Attachment: NUTCH-1330-1.6-1.patch

Patch for 1.6!

OutlinkDB to preserve back up

    Key: NUTCH-1330
    URL: https://issues.apache.org/jira/browse/NUTCH-1330
    Project: Nutch
    Issue Type: Improvement
    Affects Versions: 1.4
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: NUTCH-1330-1.6-1.patch

The webgraph's outlinkDB is the single source for all scoring jobs and GB's that eventually come out. In case of disaster, which has not happened yet, it should be able to preserve a backup just like the other DBs. This means users with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to crawl/webgraphdb/outlinks/current/.
[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-717:
    Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

20120304-push-1.6

Make Nutch Solr integration easier

    Key: NUTCH-717
    URL: https://issues.apache.org/jira/browse/NUTCH-717
    Project: Nutch
    Issue Type: New Feature
    Reporter: Sami Siren
    Priority: Critical
    Fix For: 1.6

Erik Hatcher proposed we should provide a full solr config dir to be used with Nutch-Solr. Now we only provide the index schema. It would be considerably easier to set up nutch-solr if we provided the whole conf dir that you could use with solr like:

    java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar
[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1245:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

    Key: NUTCH-1245
    URL: https://issues.apache.org/jira/browse/NUTCH-1245
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.4, 1.5
    Reporter: Sebastian Nagel
    Priority: Critical
    Fix For: 1.6

A document gone with 404 after db.fetch.interval.max (90 days) has passed is fetched over and over again. Although its fetch status is fetch_gone, its status in CrawlDb remains db_unfetched. Consequently, this document will be generated and fetched in every cycle from now on.

To reproduce:
# create a CrawlDatum in CrawlDb whose retry interval hits db.fetch.interval.max (I manipulated shouldFetch() in AbstractFetchSchedule to achieve this)
# now this URL is fetched again
# but when updating CrawlDb with the fetch_gone, the CrawlDatum is reset to db_unfetched and the retry interval is fixed to 0.9 * db.fetch.interval.max (81 days)
# this does not change with every generate-fetch-update cycle, here for two segments:

{noformat}
/tmp/testcrawl/segments/20120105161430
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:14:21 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:14:48 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776461784_pst_: notfound(14), lastModified=0: http://localhost/page_gone

/tmp/testcrawl/segments/20120105161631
SegmentReader: get 'http://localhost/page_gone'
Crawl Generate::
Status: 1 (db_unfetched)
Fetch time: Thu Jan 05 16:16:23 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone

Crawl Fetch::
Status: 37 (fetch_gone)
Fetch time: Thu Jan 05 16:20:05 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Metadata: _ngt_: 1325776583451_pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}

As far as I can see it's caused by setPageGoneSchedule() in AbstractFetchSchedule. Some pseudo-code:

{code}
setPageGoneSchedule (called from update / CrawlDbReducer.reduce):
    datum.fetchInterval = 1.5 * datum.fetchInterval   // now 1.5 * 0.9 * maxInterval
    datum.fetchTime = fetchTime + datum.fetchInterval // see NUTCH-516
    if (maxInterval < datum.fetchInterval)            // necessarily true
        forceRefetch()

forceRefetch:
    if (datum.fetchInterval > maxInterval)            // true because it's 1.35 * maxInterval
        datum.fetchInterval = 0.9 * maxInterval
    datum.status = db_unfetched

shouldFetch (called from generate / Generator.map):
    if ((datum.fetchTime - curTime) > maxInterval)
        // always true if the crawler is launched in short intervals
        // (lower than 0.35 * maxInterval)
        datum.fetchTime = curTime                     // forces a refetch
{code}

After setPageGoneSchedule is called via update, the state is db_unfetched and the retry interval is 0.9 * db.fetch.interval.max (81 days). Although the fetch time in the CrawlDb is far in the future

{noformat}
% nutch readdb testcrawl/crawldb -url http://localhost/page_gone
URL: http://localhost/page_gone
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun May 06 05:20:05 CEST 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 1.0
Signature: null
Metadata: _pst_: notfound(14), lastModified=0: http://localhost/page_gone
{noformat}

the URL is generated again because (fetch time - current time) is larger than db.fetch.interval.max. The retry interval (datum.fetchInterval) oscillates between 0.9 and 1.35 times db.fetch.interval.max, and the fetch time is always close to current time + 1.35 * db.fetch.interval.max. It's possibly a side effect of NUTCH-516, and may be related to NUTCH-578.
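The loop described in NUTCH-1245 can be reproduced with a small stand-alone simulation of the pseudo-code. This is a sketch, not Nutch code: the Datum class and Status enum are hypothetical stand-ins, times are plain seconds, the db_gone assignment before scheduling and the return value of shouldFetch() are assumptions, and only the three steps quoted in the report are modelled.

```java
// Stand-alone simulation of the pseudo-code in the report; the classes
// here are hypothetical and are not AbstractFetchSchedule or CrawlDatum.
final class PageGoneSketch {
    static final long DAY = 86_400L;            // seconds per day
    static final long MAX_INTERVAL = 90 * DAY;  // db.fetch.interval.max

    enum Status { DB_UNFETCHED, DB_GONE }

    static final class Datum {
        Status status = Status.DB_UNFETCHED;
        long fetchTime;                              // seconds
        long fetchInterval = MAX_INTERVAL * 9 / 10;  // 81 days, as in the report
    }

    static void forceRefetch(Datum d) {
        if (d.fetchInterval > MAX_INTERVAL) {
            d.fetchInterval = MAX_INTERVAL * 9 / 10;
        }
        d.status = Status.DB_UNFETCHED;
    }

    /** Called from the update step after a fetch_gone result. */
    static void setPageGoneSchedule(Datum d, long fetchTime) {
        d.status = Status.DB_GONE;                  // assumption: set by the caller in Nutch
        d.fetchInterval = d.fetchInterval * 3 / 2;  // 1.5 * 0.9 * max = 1.35 * max
        d.fetchTime = fetchTime + d.fetchInterval;
        if (MAX_INTERVAL < d.fetchInterval) {       // necessarily true here
            forceRefetch(d);                        // status flips back to db_unfetched
        }
    }

    /** Called from the generate step. */
    static boolean shouldFetch(Datum d, long curTime) {
        if (d.fetchTime - curTime > MAX_INTERVAL) {
            d.fetchTime = curTime;                  // forces a refetch
        }
        return d.fetchTime <= curTime;              // assumption: due when not in the future
    }
}
```

Calling setPageGoneSchedule() and then shouldFetch() an hour later returns true on every cycle, with the interval back at 81 days each time. If the gap between update and generate exceeds 0.35 * MAX_INTERVAL, the reset in shouldFetch() no longer triggers, which matches the comment in the pseudo-code.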
[jira] [Updated] (NUTCH-1318) Parse time outs crash parsing fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1318:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Parse time outs crash parsing fetcher

    Key: NUTCH-1318
    URL: https://issues.apache.org/jira/browse/NUTCH-1318
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.4
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Critical
    Fix For: 1.6

Some fetch lists can never be fetched and parsed successfully because a single record that times out can cause most, and eventually all, subsequent records to time out as well. Finally the mapper hangs completely, thereby killing the entire fetch job and losing the 99% of the records that were already processed. I'm not sure what's going on; something may be leaking somewhere.
[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1219:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Upgrade all jobs to new MapReduce API

    Key: NUTCH-1219
    URL: https://issues.apache.org/jira/browse/NUTCH-1219
    Project: Nutch
    Issue Type: Task
    Reporter: Markus Jelsma
    Priority: Critical
    Fix For: 1.6

We should upgrade to the new Hadoop API for Nutch trunk, as has already been done for the Nutchgora branch. If I'm not mistaken we can already upgrade to the latest 0.20.5 version, which still carries the legacy API. That way we can port the jobs to the new API without immediately upgrading to 0.21 or higher and without needing a separate branch to work on.

To the committers who created/ported jobs in NutchGora, please write down your advice and experience.

http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api
[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1251:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

    Key: NUTCH-1251
    URL: https://issues.apache.org/jira/browse/NUTCH-1251
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.4
    Environment: Any crawl where the number of URLs in Solr exceeds 1024 (the default max number of clauses in a Lucene boolean query).
    Reporter: Arkadi Kosmynin
    Priority: Critical
    Fix For: 1.6

Deletion of duplicates fails. This happens because the "get all" query used to get the Solr index size is id:[* TO *], which is a range query. Lucene tries to expand it to a Boolean query and gets as many clauses as there are ids in the index. This is too many in a real situation, and it throws an exception.

To correct this problem, change the "get all" query (SOLR_GET_ALL_QUERY) to *:*, which is the standard Solr "get all" query.

Indexing log extract:

java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:236)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRecordReader(SolrDeleteDuplicates.java:234)
    ... 3 more
Caused by: org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://localhost:8081/arch/select?q=id:[* TO *]&fl=id,boost,tstamp,digest&start=0&rows=82938&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
    ... 5 more
[jira] [Updated] (NUTCH-578) URL fetched with 403 is generated over and over again
[ https://issues.apache.org/jira/browse/NUTCH-578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-578:
    Fix Version/s: (was: 1.5) 1.6

URL fetched with 403 is generated over and over again

    Key: NUTCH-578
    URL: https://issues.apache.org/jira/browse/NUTCH-578
    Project: Nutch
    Issue Type: Bug
    Components: generator
    Affects Versions: 1.0.0
    Environment: Ubuntu Gutsy Gibbon (7.10) running on VMware server. I have checked out the most recent version of the trunk as of Nov 20, 2007
    Reporter: Nathaniel Powell
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: NUTCH-578.patch, NUTCH-578_v2.patch, NUTCH-578_v3.patch, NUTCH-578_v4.patch, crawl-urlfilter.txt, nutch-site.xml, regex-normalize.xml, urls.txt

I have not changed the following parameter in the nutch-default.xml:

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered recoverable errors is generated for fetch.</description>
</property>

However, there is a URL which is on the site that I'm crawling, www.teachertube.com, which keeps being generated over and over again for almost every segment (many more times than 3):

fetch of http://www.teachertube.com/images/ failed with: Http code=403, url=http://www.teachertube.com/images/

This is a bug, right? Thanks.
[jira] [Updated] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement
[ https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1249:
    Affects Version/s: (was: 1.5)
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Resolve all issues flagged up by adding javac -Xlint arguement

    Key: NUTCH-1249
    URL: https://issues.apache.org/jira/browse/NUTCH-1249
    Project: Nutch
    Issue Type: Improvement
    Components: build
    Affects Versions: nutchgora
    Reporter: Lewis John McGibbney
    Assignee: Lewis John McGibbney
    Priority: Minor
    Fix For: nutchgora, 1.6

There are a heap of issues flagged up by NUTCH-1237. I think over time it would be great to get these addressed and resolved. What is interesting is that adding the same arguments to /src/plugin/plugin-build.xml actually breaks my build, as tests begin to fail. Some of this is documented at the link below:

http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options
[jira] [Updated] (NUTCH-1273) Fix [deprecation] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1273:
    Fix Version/s: (was: 1.5) (was: nutchgora) 1.6

20120304-push-1.6

Fix [deprecation] javac warnings

    Key: NUTCH-1273
    URL: https://issues.apache.org/jira/browse/NUTCH-1273
    Project: Nutch
    Issue Type: Sub-task
    Components: build
    Affects Versions: nutchgora, 1.5
    Reporter: Lewis John McGibbney
    Priority: Minor
    Fix For: 1.6
    Attachments: NUTCH-1273-nutchgora.patch, NUTCH-1273-trunk.patch, NUTCH-1273-v2-trunk.patch

As part of this task, these warnings should be resolved. This particular strand of warnings can be resolved either by adding
{code}
@SuppressWarnings("deprecation")
{code}
or by actually upgrading our class usage to rely upon non-deprecated classes. Which option is more appropriate for the project?
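The two options raised in NUTCH-1273 can be compared in a small stand-alone sketch. LegacyApi and its methods are hypothetical and unrelated to any Nutch or Hadoop class. The deprecated method lives in a separate top-level class on purpose, because javac does not report deprecation warnings for uses inside the same outermost class.

```java
// Hypothetical legacy API, used only to trigger a [deprecation] warning.
class LegacyApi {
    /** Old entry point, kept for compatibility. */
    @Deprecated
    static int oldLength(String s) {
        return s.length();
    }

    /** Replacement for oldLength. */
    static int newLength(String s) {
        return s.length();
    }
}

final class DeprecationSketch {
    // Option 1: keep calling the deprecated method and silence the warning.
    // The annotation is scoped to this method only, not the whole class.
    @SuppressWarnings("deprecation")
    static int viaSuppression(String s) {
        return LegacyApi.oldLength(s);
    }

    // Option 2: migrate to the non-deprecated replacement; no annotation needed.
    static int viaMigration(String s) {
        return LegacyApi.newLength(s);
    }
}
```

Both methods behave the same at run time. The difference is that option 1 leaves the dependency on the deprecated call in place, while option 2 removes it.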
[jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Merging segments causes URLs to vanish from crawldb/index?

    Key: NUTCH-1113
    URL: https://issues.apache.org/jira/browse/NUTCH-1113
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Edward Drapkin
    Fix For: 1.6
    Attachments: merged_segment_output.txt, unmerged_segment_output.txt

When I run Nutch, I use the following steps:

nutch inject crawldb/ url.txt
repeated 3 times:
    nutch generate crawldb/ segments/ -normalize
    nutch fetch `ls -d segments/* | tail -1`
    nutch parse `ls -d segments/* | tail -1`
    nutch update crawldb `ls -d segments/* | tail -1`
nutch mergesegs merged/ -dir segments/
nutch invertlinks linkdb/ -dir merged/
nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene indexing code from Nutch 1.1).

When I crawl with merging segments, I lose about 20% of the URLs that wind up in the index vs. when I crawl without merging the segments. Somehow the segment merger causes me to lose ~20% of my crawl database!
[jira] [Updated] (NUTCH-1116) Write JUnit tests for all plugins
[ https://issues.apache.org/jira/browse/NUTCH-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1116:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Write JUnit tests for all plugins

    Key: NUTCH-1116
    URL: https://issues.apache.org/jira/browse/NUTCH-1116
    Project: Nutch
    Issue Type: Improvement
    Components: build
    Affects Versions: 1.4
    Reporter: Lewis John McGibbney
    Priority: Minor
    Fix For: 1.6

This issue is a step towards covering the parts of our plugin codebase which are currently missing JUnit test cases. Each plugin will have its own sub-issue, meaning that this parent issue should not be deemed complete until all existing (and newly contributed) plugins have the appropriate test cases.
[jira] [Updated] (NUTCH-1084) ReadDB url throws exception
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1084:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

ReadDB url throws exception

    Key: NUTCH-1084
    URL: https://issues.apache.org/jira/browse/NUTCH-1084
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6

Readdb -url suffers from two problems:
1. it trips over the _SUCCESS file generated by newer Hadoop versions
2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (???)

The first problem can be remedied by not allowing the injector or updater to write the _SUCCESS file. Until now that's the solution implemented for similar issues. I've not been able to make the Hadoop readers simply skip the file.

The second issue seems a bit strange and did not happen on a local checkout. I'm not yet sure whether this is a Hadoop issue or something being corrupt in the CrawlDB. Here's the stack trace:

{code}
Exception in thread "main" java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
    at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
    at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
    at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
    at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
    at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
    at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
    at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
    at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{code}
[jira] [Updated] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1150:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

http.redirect.max can lead to multiple parses of the same url

    Key: NUTCH-1150
    URL: https://issues.apache.org/jira/browse/NUTCH-1150
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.3, 1.4
    Reporter: Markus Jelsma
    Fix For: 1.6

With http.redirect.max > 0 it's possible that a document is parsed multiple times. This is the case when several URLs from the same fetch redirect to a shared location.
[jira] [Updated] (NUTCH-1147) WebGraph nodeDumper uses only 1 reducer
[ https://issues.apache.org/jira/browse/NUTCH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1147:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

WebGraph nodeDumper uses only 1 reducer

    Key: NUTCH-1147
    URL: https://issues.apache.org/jira/browse/NUTCH-1147
    Project: Nutch
    Issue Type: Improvement
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Trivial
    Fix For: 1.6
    Attachments: NUTCH-1147-1.5-1.patch

The nodeDumper is restricted to only one reducer, which makes it slow and makes it produce files that are too large.
[jira] [Updated] (NUTCH-1194) CrawlDB lock should be released earlier
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1194:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

CrawlDB lock should be released earlier

    Key: NUTCH-1194
    URL: https://issues.apache.org/jira/browse/NUTCH-1194
    Project: Nutch
    Issue Type: Improvement
    Components: generator
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Minor
    Fix For: 1.6

The lock on the CrawlDB is released when everything is finished. But when generating many segments, the lock remains in place although it is no longer necessary. If GENERATE_UPDATE_DB is false we can release the lock immediately after the selector has finished.
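The ordering proposed in NUTCH-1194 can be sketched without any Nutch or Hadoop classes. Everything here is hypothetical: the step names are plain strings recorded in a list, and updateDb stands for the GENERATE_UPDATE_DB setting. The sketch only shows where the unlock would move when the CrawlDB is not updated by the generator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the generator's step order; not Nutch code.
final class EarlyUnlockSketch {
    /** Returns the sequence of steps for one generate run. */
    static List<String> generate(boolean updateDb, int segments) {
        List<String> events = new ArrayList<>();
        boolean locked = true;
        events.add("lock");
        try {
            events.add("select");
            if (!updateDb) {
                // The CrawlDB is not touched again, so release the lock now.
                events.add("unlock");
                locked = false;
            }
            for (int i = 0; i < segments; i++) {
                events.add("partition-" + i);
            }
            if (updateDb) {
                events.add("update-crawldb");
            }
        } finally {
            if (locked) {
                events.add("unlock"); // still held: release at the very end
            }
        }
        return events;
    }
}
```

Tracking the lock state in a flag keeps the finally block from releasing twice, which matters once the early release is added.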
[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1201:
    Fix Version/s: (was: 1.5) 1.6

20120304-push-1.6

Allow for different FetcherThread impls

    Key: NUTCH-1201
    URL: https://issues.apache.org/jira/browse/NUTCH-1201
    Project: Nutch
    Issue Type: New Feature
    Components: fetcher
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.6
    Attachments: CustomFetcher.java, NUTCH-1201-1.5-wip.patch

For certain cases we need to modify parts of FetcherThread and make it pluggable. This introduces a new config directive fetcher.impl that takes a FQCN; Fetcher.fetch uses that setting to load a class to use for job.setMapRunnerClass(). This new class has to extend Fetcher and an inner class FetcherThread. This allows for overriding methods in FetcherThread, but also methods in Fetcher itself if required.

A follow-up on this issue would be to refactor parts of FetcherThread to make it easier to override small sections instead of copying the entire method body for a small change, which is now the case.
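Loading an implementation from a fully qualified class name, as the fetcher.impl directive is described, might look roughly like the sketch below. It uses a plain Map in place of the Hadoop Configuration, and BaseFetcherSketch, CustomFetcherSketch and FetcherImplLoader are hypothetical names; only the fetcher.impl key comes from the issue.

```java
import java.util.Map;

// Hypothetical base class standing in for Fetcher.
class BaseFetcherSketch {
    String name() {
        return "default";
    }
}

// Hypothetical custom implementation that overrides part of the base.
class CustomFetcherSketch extends BaseFetcherSketch {
    @Override
    String name() {
        return "custom";
    }
}

final class FetcherImplLoader {
    /** Instantiates the class named by fetcher.impl, or the base class if unset. */
    static BaseFetcherSketch load(Map<String, String> conf) {
        String fqcn = conf.getOrDefault("fetcher.impl", BaseFetcherSketch.class.getName());
        try {
            // asSubclass rejects classes that do not extend the base class.
            Class<? extends BaseFetcherSketch> impl =
                Class.forName(fqcn).asSubclass(BaseFetcherSketch.class);
            return impl.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load fetcher.impl=" + fqcn, e);
        }
    }
}
```

The asSubclass check gives an early ClassCastException when the configured class does not extend the base class, rather than a failure later inside the job.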
[jira] [Updated] (NUTCH-1183) Summary task for adding command line usage instructions to webgraph classes
[ https://issues.apache.org/jira/browse/NUTCH-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1183: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Summary task for adding command line usage instructions to webgraph classes --- Key: NUTCH-1183 URL: https://issues.apache.org/jira/browse/NUTCH-1183 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.6
The following classes should print usage instructions when called incorrectly from the command line, something similar to
{code}
Usage: class -arg1, -arg2, etc etc
{code}
* webgraph
* linkrank
* scoreupdater
* nodedumper
* nodereader
If anyone would like to see further classes included in this task, please add to the above list.
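A minimal sketch of the kind of usage output NUTCH-1183 asks for is shown below. The class name UsageDemo and the option names are made up for illustration and do not reflect the actual arguments of the webgraph classes.

```java
final class UsageDemo {

  /** Builds the usage line printed when arguments are missing or wrong. */
  static String usage(String className, String... options) {
    return "Usage: " + className + " " + String.join(" ", options);
  }

  /** Returns 0 when enough arguments are given, -1 after printing usage otherwise. */
  static int run(String[] args) {
    if (args.length < 2) {
      // Hypothetical options; each real class would list its own.
      System.err.println(usage("UsageDemo", "-input <dir>", "-output <dir>"));
      return -1;
    }
    return 0;
  }

  public static void main(String[] args) {
    System.exit(run(args));
  }
}
```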
[jira] [Updated] (NUTCH-1176) Fix all javadoc warnings from nightly builds
[ https://issues.apache.org/jira/browse/NUTCH-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1176: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Fix all javadoc warnings from nightly builds Key: NUTCH-1176 URL: https://issues.apache.org/jira/browse/NUTCH-1176 Project: Nutch Issue Type: Improvement Components: documentation Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 The warnings can clearly be seen from the javadoc target (near bottom) of any successful nightly build. An example is provided below. https://builds.apache.org/job/nutch-trunk/1638/console
[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1040: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Backport REST-API from 2.0 -- Key: NUTCH-1040 URL: https://issues.apache.org/jira/browse/NUTCH-1040 Project: Nutch Issue Type: New Feature Components: REST_api Reporter: Julien Nioche Fix For: 1.6 See https://issues.apache.org/jira/browse/NUTCH-880
[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1274: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Fix [cast] javac warnings - Key: NUTCH-1274 URL: https://issues.apache.org/jira/browse/NUTCH-1274 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6
A typical example of this is
{code}
trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java:460: warning: [cast] redundant cast to int
[javac] res ^= (int)(signature[i] << 24 + signature[i+1] << 16 +
{code}
These should all be fixed by replacing them with the correct implementations.
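Why javac reports the cast as redundant can be shown with a small, self-contained sketch. It is an illustration only and not a drop-in replacement for the CrawlDatum code; the class and method names are hypothetical. Byte operands are promoted to int before shifting, so the result is already an int.

```java
final class RedundantCastDemo {

  /**
   * Packs four bytes into one int. No (int) cast is needed: each byte is
   * promoted to int before the shift. Masking with 0xff avoids sign extension,
   * and the parentheses make the intended precedence explicit.
   */
  static int pack(byte[] b, int i) {
    return ((b[i] & 0xff) << 24)
        | ((b[i + 1] & 0xff) << 16)
        | ((b[i + 2] & 0xff) << 8)
        | (b[i + 3] & 0xff);
  }

  public static void main(String[] args) {
    System.out.println(Integer.toHexString(pack(new byte[] {1, 2, 3, 4}, 0)));
  }
}
```

Note that in Java the + operator binds more tightly than <<, so an unparenthesized expression mixing the two may not compute what it appears to; whether that matters for the original code is for the patch author to judge.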
[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Rely on Tika for outlink extraction --- Key: NUTCH-1233 URL: https://issues.apache.org/jira/browse/NUTCH-1233 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Attachments: NUTCH-1233-1.5-wip.patch Tika provides outlink extraction features that are not used in Nutch. To be able to use them in Nutch we need Tika to return the rel attribute value of each link, which it currently doesn't. There's a patch for Tika 1.1. Once that patch is included in Tika and we upgrade to that new version, this issue can be worked on. Here's preliminary code that does both Tika and current outlink extraction. This also includes parts of the Boilerpipe code.
[jira] [Updated] (NUTCH-1014) Migrate from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1014: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate from Apache ORO to java.util.regex -- Key: NUTCH-1014 URL: https://issues.apache.org/jira/browse/NUTCH-1014 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Fix For: 1.6
A separate issue tracking migration of all components from Apache ORO to java.util.regex. Components involved are:
- RegexURLNormalizer
- OutlinkExtractor
- JSParseFilter
- MoreIndexingFilter
- BasicURLNormalizer
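For reference, the java.util.regex side of such a migration typically follows the compile-once, match-many pattern sketched below. The class name and the simplified URL pattern are assumptions for illustration; they are not the expressions used by the components listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexMigrationDemo {

  // Compiled once and reused; Pattern instances are immutable and thread-safe.
  // Deliberately simple pattern, chosen only for this example.
  private static final Pattern HTTP_URL = Pattern.compile("https?://[^\\s\"'<>]+");

  /** Returns every match in text; null input yields an empty list instead of an exception. */
  static List<String> findUrls(String text) {
    List<String> urls = new ArrayList<>();
    if (text == null) {
      return urls;
    }
    Matcher matcher = HTTP_URL.matcher(text); // a Matcher is per-input and not thread-safe
    while (matcher.find()) {
      urls.add(matcher.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(findUrls("see http://nutch.apache.org/ and https://example.org/a?b=1 now"));
  }
}
```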
[jira] [Updated] (NUTCH-1063) OutlinkExtractor test generates an exception but does not fail
[ https://issues.apache.org/jira/browse/NUTCH-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1063: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 OutlinkExtractor test generates an exception but does not fail -- Key: NUTCH-1063 URL: https://issues.apache.org/jira/browse/NUTCH-1063 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Julien Nioche Fix For: 1.6
Testsuite: org.apache.nutch.parse.TestOutlinkExtractor
Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.043 sec
- Standard Output ---
2011-07-19 15:06:36,073 ERROR parse.OutlinkExtractor (OutlinkExtractor.java:getOutlinks(121)) - getOutlinks
java.lang.NullPointerException
at org.apache.oro.text.regex.PatternMatcherInput.<init>(Unknown Source)
at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:95)
at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:72)
at org.apache.nutch.parse.TestOutlinkExtractor.testGetNoOutlinks(TestOutlinkExtractor.java:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:168)
at junit.framework.TestCase.runBare(TestCase.java:134)
at junit.framework.TestResult$1.protect(TestResult.java:110)
at junit.framework.TestResult.runProtected(TestResult.java:128)
at junit.framework.TestResult.run(TestResult.java:113)
at junit.framework.TestCase.run(TestCase.java:124)
at junit.framework.TestSuite.runTest(TestSuite.java:232)
at junit.framework.TestSuite.run(TestSuite.java:227)
at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
[jira] [Updated] (NUTCH-1220) Upgrade Solr deps
[ https://issues.apache.org/jira/browse/NUTCH-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1220: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Upgrade Solr deps - Key: NUTCH-1220 URL: https://issues.apache.org/jira/browse/NUTCH-1220 Project: Nutch Issue Type: Task Components: build Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6
SLF4J needs to be part of the upgrade to Solr 3.5, but that breaks something else. Likely Hadoop has a different SLF4J version?
{code}
Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log(Lorg/slf4j/Marker;Ljava/lang/String;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
at org.apache.commons.logging.impl.SLF4JLocationAwareLog.debug(SLF4JLocationAwareLog.java:133)
at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:136)
at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:180)
at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159)
at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1418)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:544)
at org.apache.hadoop.mapred.FileInputFormat.addInputPath(FileInputFormat.java:339)
at org.apache.nutch.util.domain.DomainStatistics.run(DomainStatistics.java:108)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.util.domain.DomainStatistics.main(DomainStatistics.java:215)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{code}
[jira] [Updated] (NUTCH-1123) JUnit test for scoring-link
[ https://issues.apache.org/jira/browse/NUTCH-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1123: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for scoring-link --- Key: NUTCH-1123 URL: https://issues.apache.org/jira/browse/NUTCH-1123 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-865) Format source code in unique style
[ https://issues.apache.org/jira/browse/NUTCH-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-865: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Format source code in unique style -- Key: NUTCH-865 URL: https://issues.apache.org/jira/browse/NUTCH-865 Project: Nutch Issue Type: Improvement Components: build Reporter: Pham Tuan Minh Assignee: Lewis John McGibbney Fix For: 1.6 Attachments: NUTCH-865-nutchgora-rev1188268.patch, NUTCH-865-trunk-rev1188252.patch, NUTCH-865.patch We should define standard formatting rules for source code and comments, then use the Eclipse formatter to format the whole source code in the same style.
[jira] [Updated] (NUTCH-1120) JUnit test for microformats-reltag
[ https://issues.apache.org/jira/browse/NUTCH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1120: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for microformats-reltag -- Key: NUTCH-1120 URL: https://issues.apache.org/jira/browse/NUTCH-1120 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1186) FreeGenerator always normalizes
[ https://issues.apache.org/jira/browse/NUTCH-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1186: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 FreeGenerator always normalizes --- Key: NUTCH-1186 URL: https://issues.apache.org/jira/browse/NUTCH-1186 Project: Nutch Issue Type: Bug Components: generator Affects Versions: 1.3 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 The FreeGenerator does not honor the -normalize option; it always normalizes all URLs in the input directory. The -filter option is respected.
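The expected behaviour, where normalization happens only when -normalize is passed, can be sketched as below. GeneratorOptions is a hypothetical class and the lower-casing "normalizer" is a placeholder; neither is taken from the FreeGenerator source.

```java
import java.util.Locale;

/** Hypothetical option holder; not the FreeGenerator implementation. */
final class GeneratorOptions {
  final boolean filter;
  final boolean normalize;

  private GeneratorOptions(boolean filter, boolean normalize) {
    this.filter = filter;
    this.normalize = normalize;
  }

  /** Both options default to off and are enabled only by their flag. */
  static GeneratorOptions parse(String[] args) {
    boolean filter = false;
    boolean normalize = false;
    for (String arg : args) {
      if ("-filter".equals(arg)) {
        filter = true;
      } else if ("-normalize".equals(arg)) {
        normalize = true;
      }
    }
    return new GeneratorOptions(filter, normalize);
  }

  /** Applies the placeholder normalizer only when -normalize was given. */
  String prepare(String url) {
    return normalize ? url.trim().toLowerCase(Locale.ROOT) : url;
  }

  public static void main(String[] args) {
    System.out.println(parse(args).prepare(" HTTP://Example.ORG/ "));
  }
}
```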
[jira] [Updated] (NUTCH-1308) Unnecessary truncate content configuration, and logging in parse-zip/ZipParser
[ https://issues.apache.org/jira/browse/NUTCH-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1308: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Unnecessary truncate content configuration, and logging in parse-zip/ZipParser Key: NUTCH-1308 URL: https://issues.apache.org/jira/browse/NUTCH-1308 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4, nutchgora Reporter: Lewis John McGibbney Fix For: 1.6
Two issues here...
1) Recently ferdy committed NUTCH-965, which skips parsing of truncated documents. Parse-zip has its own implementation of the same check when it should really draw on the aforementioned implementation.
2) If truncation occurs in the offending piece of code mentioned above, we get the log message "Parser can't handle incomplete pdf file." This is incorrect for a zip parser and should be removed.
{code}
if (contentLen != null && contentInBytes.length != len) {
  return new ParseStatus(ParseStatus.FAILED,
      ParseStatus.FAILED_TRUNCATED, "Content truncated at "
          + contentInBytes.length
          + " bytes. Parser can't handle incomplete pdf file.")
      .getEmptyParseResult(content.getUrl(), getConf());
}
{code}
For clarity, the issue is present in both the Nutchgora branch [1] and Nutch trunk [2].
[1] https://svn.apache.org/viewvc/nutch/branches/nutchgora/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
[2] https://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip/ZipParser.java?diff_format=h&view=markup
[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1252: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 SegmentReader -get shows wrong data --- Key: NUTCH-1252 URL: https://issues.apache.org/jira/browse/NUTCH-1252 Project: Nutch Issue Type: Bug Affects Versions: 1.4, 1.5 Reporter: Sebastian Nagel Fix For: 1.6 Attachments: NUTCH-1252-v2.patch, NUTCH-1252.patch
The command/option -get of the SegmentReader may show wrong data associated with the given URL. To reproduce:
{code}
% mkdir -p test_readseg/urls
% echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
% nutch inject test_readseg/crawldb test_readseg/urls
Injector: starting at 2012-01-18 09:32:25
Injector: crawlDb: test_readseg/crawldb
Injector: urlDir: test_readseg/urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
% nutch generate test_readseg/crawldb test_readseg/segments/
Generator: starting at 2012-01-18 09:32:30
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: test_readseg/segments/20120118093232
Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
% nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
SegmentReader: get 'http://nutch.apache.org/'
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Wed Jan 18 09:32:26 CET 2012
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 10.0
Signature: null
Metadata: _ngt_: 1326875550401 test: AbcTest
{code}
The metadata and the score indicate that the CrawlDatum shown is the wrong one (the one associated with http://abc.test/ rather than with http://nutch.apache.org/).
[jira] [Updated] (NUTCH-1121) JUnit test for parse-js
[ https://issues.apache.org/jira/browse/NUTCH-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1121: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for parse-js --- Key: NUTCH-1121 URL: https://issues.apache.org/jira/browse/NUTCH-1121 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-809: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.4, nutchgora Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.6 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip
h2. Parse-metatags plugin
The parse-metatags plugin consists of an HTMLParserFilter which takes as parameter a list of metatag names, with '*' as default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml
{code:xml}
<property>
  <name>metatags.names</name>
  <value>description;keywords</value>
</property>
{code}
The MetatagIndexer uses the output of the parsing above to create two fields, 'keywords' and 'description'. Note that keywords is multivalued. The query-basic plugin is used to include these fields in the search, e.g. in nutch-site.xml
{code:xml}
<property>
  <name>query.basic.description.boost</name>
  <value>2.0</value>
</property>
<property>
  <name>query.basic.keywords.boost</name>
  <value>2.0</value>
</property>
{code}
This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com
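How a ';'-separated metatags.names value with '*' as default might be interpreted is sketched below. This is not the plugin's code: MetatagNames is a hypothetical class, and trimming and lower-casing the names are assumptions made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class MetatagNames {

  /** Splits a ';'-separated list of names; a missing value falls back to "*". */
  static Set<String> parse(String value) {
    Set<String> names = new LinkedHashSet<>();
    String raw = (value == null) ? "*" : value;
    for (String part : raw.split(";")) {
      String name = part.trim().toLowerCase(Locale.ROOT); // assumption: case-insensitive names
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  /** A tag is extracted when it is listed explicitly or the wildcard is configured. */
  static boolean wanted(Set<String> names, String tag) {
    return names.contains("*") || names.contains(tag.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Set<String> names = parse("description;keywords");
    System.out.println(names + " author wanted: " + wanted(names, "author"));
  }
}
```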
[jira] [Updated] (NUTCH-1046) Add tests for indexing to SOLR
[ https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1046: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Add tests for indexing to SOLR -- Key: NUTCH-1046 URL: https://issues.apache.org/jira/browse/NUTCH-1046 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For: 1.6 We currently have no tests for checking that the indexing to SOLR works as expected. Running an embedded SOLR Server within the tests would be good.
[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1228: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Change mapred.task.timeout to mapreduce.task.timeout in fetcher --- Key: NUTCH-1228 URL: https://issues.apache.org/jira/browse/NUTCH-1228 Project: Nutch Issue Type: Task Components: fetcher Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.6
[jira] [Updated] (NUTCH-1001) bin/nutch fetch/parse handle crawl/segments directory
[ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1001: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 bin/nutch fetch/parse handle crawl/segments directory - Key: NUTCH-1001 URL: https://issues.apache.org/jira/browse/NUTCH-1001 Project: Nutch Issue Type: Improvement Reporter: Gabriele Kahlout Priority: Minor Fix For: 1.6 Attachments: Fetcher.java, NUTCH-1001.patch, nutch1001v2.patch
I'm having issues porting scripts across different systems to support the step of extracting the latest/only segment resulting from the generate phase. Variants include:
$ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
$ s1=`ls -d crawl/segments/2* | tail -1` #[2]
$ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
$ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
And I'm not sure what Windows users would have to do. Some users may also make do with: bin/nutch fetch with crawl/segments/2*
But I don't see a need for having the user extract or worry about the latest/only segment, or for describing that step in every Nutch tutorial. Moreover, only fetch and parse expect a single segment, while other commands are fine with the directory of segments. Therefore, I think it would be beneficial if fetch and parse also handled directories of segments.
[1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
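Resolving the latest segment inside the tool rather than in shell could be sketched as follows. The sketch assumes segment directories are named with 14-digit timestamps, as in 20120118093232, and it only covers the local file system, not HDFS. The class LatestSegment is hypothetical and not part of Nutch.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.Optional;

final class LatestSegment {

  /** Picks the newest segment name; timestamp names sort chronologically as strings. */
  static Optional<String> latest(Collection<String> names) {
    return names.stream()
        .filter(name -> name.matches("\\d{14}")) // ignore anything that is not a segment
        .max(Comparator.naturalOrder());
  }

  /** Lists a local segments directory; empty when the directory cannot be read. */
  static Optional<String> latest(File segmentsDir) {
    String[] names = segmentsDir.list();
    return names == null ? Optional.empty() : latest(Arrays.asList(names));
  }

  public static void main(String[] args) {
    File dir = new File(args.length > 0 ? args[0] : "crawl/segments");
    System.out.println(latest(dir).orElse("no segment found"));
  }
}
```

Doing the selection in Java avoids the platform differences between the ls, grep and hadoop variants quoted in the issue.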
[jira] [Updated] (NUTCH-1060) URL filters to produce regexes to be used by OutlinkExtractor.
[ https://issues.apache.org/jira/browse/NUTCH-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1060: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 URL filters to produce regexes to be used by OutlinkExtractor. -- Key: NUTCH-1060 URL: https://issues.apache.org/jira/browse/NUTCH-1060 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Fix For: 1.6
The problem: OutlinkExtractor produces many URLs from plain text using an advanced regular expression:
{code}
([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)
{code}
This expression does not take into account the various non-regex-based URL filters such as prefix, domain and suffix, and thus produces URLs that are going to be filtered out by some filter anyway. This becomes a problem when millions of documents are processed by the OutlinkExtractor (in case parse-html|parse-tika do not produce any outlinks). Large bodies of full text usually contain a lot of sequences that are extracted as URLs, many of which are mistaken for a URI scheme, such as:
id:123
says:what
user:doe
update:tue-19-jul
The above examples can easily be remedied by using a configured prefix URL filter. It may, however, be an even better idea to prevent the extraction of these URLs in the first place. No extraction means filtering fewer URLs and potentially saving a lot of data. Comments? I'll see if I can produce a patch.
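The effect described in NUTCH-1060 can be reproduced with java.util.regex, using the expression exactly as quoted in the issue (the archived copy may have lost characters such as '&'). The class OutlinkRegexDemo and the prefix list are assumptions for illustration; the prefix check merely mimics what a prefix URL filter would do afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class OutlinkRegexDemo {

  // Expression as quoted in the issue text above.
  private static final Pattern URL = Pattern.compile(
      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@~=-])|%[A-Fa-f0-9]{2}){1,333}"
          + "(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@~=%-]{0,1000}))?)");

  /** Returns every substring of text that the expression treats as a URL. */
  static List<String> extract(String text) {
    List<String> found = new ArrayList<>();
    Matcher matcher = URL.matcher(text);
    while (matcher.find()) {
      found.add(matcher.group());
    }
    return found;
  }

  /** Keeps only candidates starting with one of the allowed prefixes. */
  static List<String> keepWithPrefix(List<String> urls, List<String> prefixes) {
    return urls.stream()
        .filter(url -> prefixes.stream().anyMatch(url::startsWith))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> found = extract("id:123 says:what user:doe update:tue-19-jul");
    System.out.println("extracted: " + found);
    System.out.println("after prefix filter: "
        + keepWithPrefix(found, List.of("http://", "https://")));
  }
}
```

All four plain-text tokens are extracted and then discarded by the prefix check, which is the wasted work the issue proposes to avoid.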
[jira] [Updated] (NUTCH-1100) SolrDedup broken
[ https://issues.apache.org/jira/browse/NUTCH-1100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1100: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 SolrDedup broken Key: NUTCH-1100 URL: https://issues.apache.org/jira/browse/NUTCH-1100 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.6
Some Solr indices cannot be deduplicated from Nutch. For unknown reasons Nutch throws the exception below. There are no peculiarities to be found in the Solr logs; the queries are normal and seem to succeed.
{code}
java.lang.NullPointerException
at org.apache.hadoop.io.Text.encode(Text.java:388)
at org.apache.hadoop.io.Text.set(Text.java:178)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:272)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:243)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
{code}
[jira] [Updated] (NUTCH-1124) JUnit test for scoring-opic
[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1124: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for scoring-opic --- Key: NUTCH-1124 URL: https://issues.apache.org/jira/browse/NUTCH-1124 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1197) Add statically configured field values to solrindex-mapping.xml
[ https://issues.apache.org/jira/browse/NUTCH-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1197: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Add statically configured field values to solrindex-mapping.xml --- Key: NUTCH-1197 URL: https://issues.apache.org/jira/browse/NUTCH-1197 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.6 Attachments: NUTCH-1197.patch In some cases it's useful to be able to add a set of predefined fields with static values to every document sent to Solr. This could be implemented on the Solr side (with a custom UpdateRequestProcessor), but it may be less cumbersome to add them on the Nutch side. Example: let's say I have several Nutch configurations all indexing to the same Solr instance, and I want each of them to add its identifier as a field in all documents, e.g. origin=web_crawl_1, origin=file_crawl, origin=unlimited_crawl, etc.
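Adding static fields on the indexing side could be sketched as below. Plain maps stand in for the index document, and letting a value already present in the document win over the static one is an assumption of this sketch, not something NUTCH-1197 specifies. StaticFieldsDemo is a hypothetical class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StaticFieldsDemo {

  /** Returns a copy of doc with the static fields added; existing document fields are kept. */
  static Map<String, String> withStaticFields(Map<String, String> doc,
                                              Map<String, String> staticFields) {
    Map<String, String> out = new LinkedHashMap<>(doc);
    staticFields.forEach(out::putIfAbsent); // assumption: document values take precedence
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("url", "http://nutch.apache.org/");
    Map<String, String> statics = new LinkedHashMap<>();
    statics.put("origin", "web_crawl_1");
    System.out.println(withStaticFields(doc, statics));
  }
}
```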
[jira] [Updated] (NUTCH-1122) JUnit test for protocol-ftp
[ https://issues.apache.org/jira/browse/NUTCH-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1122: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for protocol-ftp --- Key: NUTCH-1122 URL: https://issues.apache.org/jira/browse/NUTCH-1122 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1127: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for urlfilter-validator -- Key: NUTCH-1127 URL: https://issues.apache.org/jira/browse/NUTCH-1127 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1247) CrawlDatum.retries should be int
[ https://issues.apache.org/jira/browse/NUTCH-1247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1247: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 CrawlDatum.retries should be int Key: NUTCH-1247 URL: https://issues.apache.org/jira/browse/NUTCH-1247 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.6 Attachments: NUTCH-1247.patch_A, NUTCH-1247.patch_B CrawlDatum.retries is a byte and wraps around to negative values once the retry count exceeds 127, as this CrawlDbReader output shows: 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -127: 1 12/01/12 18:35:22 INFO crawl.CrawlDbReader: retry -128: 1
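The negative retry values reported in NUTCH-1247 are what Java's narrowing conversion produces when a counter held in a byte is incremented past 127. The following is a minimal sketch of that effect only; the class and method names are hypothetical and are not taken from CrawlDatum or from the attached patches.

```java
// Hypothetical demo class, not part of Nutch.
class RetryOverflowDemo {
    // Mimics a retry counter stored in a byte: the int result of
    // retries + 1 is narrowed back to 8 bits, so 127 + 1 wraps around.
    static byte incrementAsByte(byte retries) {
        return (byte) (retries + 1);
    }

    // The same counter stored in an int keeps counting upwards.
    static int incrementAsInt(int retries) {
        return retries + 1;
    }

    public static void main(String[] args) {
        System.out.println(incrementAsByte(Byte.MAX_VALUE)); // -128
        System.out.println(incrementAsInt(Byte.MAX_VALUE));  // 128
    }
}
```

Widening the field to int, as the issue title proposes, avoids the wrap-around for any realistic retry count.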
[jira] [Updated] (NUTCH-208) http: proxy exception list:
[ https://issues.apache.org/jira/browse/NUTCH-208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-208: Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 http: proxy exception list: --- Key: NUTCH-208 URL: https://issues.apache.org/jira/browse/NUTCH-208 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: 0.8, 1.3, nutchgora Reporter: Matthias Günter Assignee: Lewis John McGibbney Priority: Trivial Labels: patch Fix For: 1.6 Attachments: NUTCH-208-branch-1.4-20110210-v3.patch, NUTCH-208-branch-1.4-20110807.patch, NUTCH-208-branch-1.4-20110809-v2.patch, NUTCH-208-trunk-2.0-20110810-v2.patch, NUTCH-208-trunk-2.0-20110810.patch, patch.txt, patch.txt, proxy_exception_list-0.8.diff I suggest adding a parameter to nutch-default.xml that defines a proxy exception list: <property> <name>http.proxy.exception.list</name> <value></value> <description>URLs and hosts that don't use the proxy (e.g. intranets)</description> </property> This is useful when scanning intranet/internet combinations from behind a firewall. A preliminary patch is attached to this request, showing the changes. We will test it and update it if necessary. This also reflects the reality in web browsers, where in most cases there is an exception list.
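To illustrate how a value of http.proxy.exception.list from NUTCH-208 might be consulted before a fetch, here is a rough sketch. The comma-separated value format, the case-insensitive exact host matching, and the class name ProxyExceptionList are all assumptions made for this example; the attached patches may parse and match entries differently.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the code from the NUTCH-208 patches.
class ProxyExceptionList {
    private final Set<String> hosts = new HashSet<>();

    // Assumption: the property value is a comma-separated list of host names.
    ProxyExceptionList(String configValue) {
        for (String entry : configValue.split(",")) {
            String host = entry.trim().toLowerCase(Locale.ROOT);
            if (!host.isEmpty()) {
                hosts.add(host);
            }
        }
    }

    // Returns false for hosts on the exception list, i.e. fetch them directly.
    boolean useProxy(String host) {
        return !hosts.contains(host.toLowerCase(Locale.ROOT));
    }
}
```

With the value "intranet.example.org, wiki.local", a request to wiki.local would bypass the proxy while any other host would still go through it.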
[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1031: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Delegate parsing of robots.txt to crawler-commons - Key: NUTCH-1031 URL: https://issues.apache.org/jira/browse/NUTCH-1031 Project: Nutch Issue Type: Task Reporter: Julien Nioche Assignee: Julien Nioche Priority: Minor Labels: robots.txt Fix For: 1.6 We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1107) Log slow parse entries
[ https://issues.apache.org/jira/browse/NUTCH-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1107: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Log slow parse entries -- Key: NUTCH-1107 URL: https://issues.apache.org/jira/browse/NUTCH-1107 Project: Nutch Issue Type: Improvement Components: parser Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.6 Parse mapper and outputformat should have a facility to log (configurable) slow entries. This is useful for debugging slow parses. Logging parser keys only is not good enough, especially in a distributed environment. Sometimes the actual parse (mapper) is very slow and sometimes the normalization and filtering of an entry's outlinks is slow. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-585: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed --- Key: NUTCH-585 URL: https://issues.apache.org/jira/browse/NUTCH-585 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: All operating systems Reporter: Andrea Spinelli Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like <!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE -->. We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
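The idea in NUTCH-585 of dropping everything between a start and a stop marker can be sketched at the string level with java.util.regex. This is only an illustration under stated assumptions: the reporter's change was made inside the parse-html plugin and may operate on the DOM instead, and the class name IgnoreBlockStripper is hypothetical. The two markers are passed in as parameters, standing in for the proposed parser.html.ignore.start and parser.html.ignore.stop settings.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the code contributed in NUTCH-585.
class IgnoreBlockStripper {
    // Removes every region from startMarker through stopMarker, inclusive.
    static String strip(String html, String startMarker, String stopMarker) {
        // Pattern.quote treats the markers literally; ".*?" is reluctant so
        // each start pairs with the nearest stop; DOTALL lets "." span lines.
        Pattern block = Pattern.compile(
            Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
            Pattern.DOTALL);
        return block.matcher(html).replaceAll("");
    }
}
```

A start marker without a matching stop marker is left untouched in this sketch; a real implementation would need to decide how to treat that case.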
[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1320: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 IndexChecker and ParseChecker choke on IDN's Key: NUTCH-1320 URL: https://issues.apache.org/jira/browse/NUTCH-1320 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Attachments: NUTCH-1320-1.5-1.patch These handy debug tools do not handle IDNs and throw an NPE: bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81 {code} Exception in thread "main" java.lang.NullPointerException at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116) {code}
[jira] [Updated] (NUTCH-1126) JUnit test for urlfilter-prefix
[ https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1126: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for urlfilter-prefix --- Key: NUTCH-1126 URL: https://issues.apache.org/jira/browse/NUTCH-1126 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1087: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Deprecate crawl command and replace with example script --- Key: NUTCH-1087 URL: https://issues.apache.org/jira/browse/NUTCH-1087 Project: Nutch Issue Type: Task Affects Versions: 1.4 Reporter: Markus Jelsma Priority: Minor Fix For: 1.6 * remove the crawl command * add basic crawl shell script See thread: http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1128) JUnit test for urlmeta
[ https://issues.apache.org/jira/browse/NUTCH-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1128: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for urlmeta -- Key: NUTCH-1128 URL: https://issues.apache.org/jira/browse/NUTCH-1128 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1034) Create Solr Velocity templates
[ https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1034: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Create Solr Velocity templates -- Key: NUTCH-1034 URL: https://issues.apache.org/jira/browse/NUTCH-1034 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: doc.vm.patch, facets.vm.patch Solr has Velocity integration and provides an easy method for creating HTML based front-ends for the search engine. This issue tracks the development of Velocity templates specifically for Nutch users. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1179) Option to restrict generated records by metadata
[ https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1179: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Option to restrict generated records by metadata Key: NUTCH-1179 URL: https://issues.apache.org/jira/browse/NUTCH-1179 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 The generator should be able to select entries based on a metadata key/value pair. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1300: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Indexer to normalize URL's -- Key: NUTCH-1300 URL: https://issues.apache.org/jira/browse/NUTCH-1300 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: NUTCH-1300-1.5-1.patch Indexers should be able to normalize URL's. This is useful when a new normalizer is applied to the entire CrawlDB. Without it, some or all records in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1130) JUnit test for Any23 RDF plugin
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1130: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for Any23 RDF plugin --- Key: NUTCH-1130 URL: https://issues.apache.org/jira/browse/NUTCH-1130 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.6 The JUnit test should be written prior to the progression of the Any23 Nutch plugin -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1047: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.6 One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to Solr - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though: we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just an external storage e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
[jira] [Updated] (NUTCH-1035) Tune Solr config for Nutch users
[ https://issues.apache.org/jira/browse/NUTCH-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1035: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Tune Solr config for Nutch users Key: NUTCH-1035 URL: https://issues.apache.org/jira/browse/NUTCH-1035 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: solrconfig.xml To improve and ease integration with Solr we should provide a solrconfig.xml specifically for Nutch integration including a request handler with a Velocity response writer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1226) Migrate CrawlDbReader to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1226: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate CrawlDbReader to MapReduce API -- Key: NUTCH-1226 URL: https://issues.apache.org/jira/browse/NUTCH-1226 Project: Nutch Issue Type: Sub-task Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: NUTCH-1226-1.5-1.patch Hadoop 0.21 only! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1223) Migrate WebGraph to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1223: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate WebGraph to MapReduce API - Key: NUTCH-1223 URL: https://issues.apache.org/jira/browse/NUTCH-1223 Project: Nutch Issue Type: Sub-task Reporter: Markus Jelsma Fix For: 1.6 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1202) Fetcher timebomb kills long waiting fetch jobs
[ https://issues.apache.org/jira/browse/NUTCH-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1202: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Fetcher timebomb kills long waiting fetch jobs -- Key: NUTCH-1202 URL: https://issues.apache.org/jira/browse/NUTCH-1202 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Markus Jelsma Fix For: 1.6 The timebomb feature kills off mappers of jobs that have been waiting too long in the job queue. The timebomb should start counting at mapper initialization instead of at job init. Thoughts?
[jira] [Updated] (NUTCH-1079) StringBuffer converted to StringBuilder
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1079: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 StringBuffer converted to StringBuilder --- Key: NUTCH-1079 URL: https://issues.apache.org/jira/browse/NUTCH-1079 Project: Nutch Issue Type: Improvement Components: fetcher, indexer Reporter: Karthik K Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch StringBuffer is used all across the codebase, even where thread-safety is probably not intended. This patch replaces StringBuffer with StringBuilder, as applicable.
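The kind of change NUTCH-1079 describes can be shown on a small, made-up method. StringBuilder offers the same append/toString API as StringBuffer but without synchronization, so it is a drop-in replacement whenever the buffer is a local variable that never leaves the method. The class name PathJoiner and the joining logic are hypothetical and are not taken from the patch.

```java
// Hypothetical example, not code from the NUTCH-1079 patch.
class PathJoiner {
    // The builder is confined to this method call, so no other thread can
    // see it and StringBuffer's synchronized methods would buy nothing.
    static String join(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(part);
        }
        return sb.toString();
    }
}
```

Buffers that are stored in fields and shared between threads are a different case and would need a closer look before being converted.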
[jira] [Updated] (NUTCH-1319) HostNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1319: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 HostNormalizer -- Key: NUTCH-1319 URL: https://issues.apache.org/jira/browse/NUTCH-1319 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Attachments: NUTCH-1319-1.5-1.patch Nutch would benefit from having a host normalizer. A host normalizer maps a given host to the desired host. A basic example is to map www.apache.org to apache.org. The Apache website is one of many on the internet that has a duplicate website on the same domain just because it allows both www and non-www to return HTTP 200 and proper content. It is also able to handle wildcards such as *.example.org to example.org if there are multiple sub domains that actually point to the same website. Large internet crawls tend to get polluted very quickly due to these problems. It also leads to skewed scores in the webgraph as different websites link to different versions of the same duplicate website. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
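The mapping behaviour described in NUTCH-1319 (exact hosts such as www.apache.org, plus wildcards such as *.example.org) could look roughly like the sketch below. It is not the code from the attached patch: the class name, the rule syntax, and the first-match-wins ordering are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the HostNormalizer from the NUTCH-1319 patch.
class HostNormalizerSketch {
    // Insertion-ordered so that the first matching rule wins.
    private final Map<String, String> rules = new LinkedHashMap<>();

    // pattern is either an exact host or a wildcard of the form "*.domain".
    void addRule(String pattern, String target) {
        rules.put(pattern.toLowerCase(Locale.ROOT), target);
    }

    // Returns the mapped host, or the input unchanged if no rule matches.
    String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String pattern = rule.getKey();
            if (pattern.startsWith("*.")) {
                // "*.example.org" matches any host ending in ".example.org".
                if (h.endsWith(pattern.substring(1))) {
                    return rule.getValue();
                }
            } else if (h.equals(pattern)) {
                return rule.getValue();
            }
        }
        return host;
    }
}
```

With the rules www.apache.org to apache.org and *.example.org to example.org, both www.apache.org and a.b.example.org collapse to their bare domains, while nutch.apache.org is left as it is.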
[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1140: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 index-more plugin, resetTitle method creates multiple values in the Title field --- Key: NUTCH-1140 URL: https://issues.apache.org/jira/browse/NUTCH-1140 Project: Nutch Issue Type: Bug Components: indexer Affects Versions: 1.3 Reporter: Joe Liedtke Priority: Minor Fix For: 1.6 Attachments: MoreIndexingFilter.093011.patch From the comments in MoreIndexingFilter.java, the index-more plugin is meant to reset the Title field of a document if it contains a Content-Disposition header. The current behavior is to add a Title regardless of whether one exists or not, which can cause issues down the line with the Solr Indexing process, and based on a thread in the nutch user list it appears that this is causing some users to mark the title as multi-valued in the schema: http://www.lucidimagination.com/search/document/9440ff6b5deb285b/multiple_values_encountered_for_non_multivalued_field_title#17736c5807826be8 The following patch removes the title field before adding a new one, which has resolved the issue for me:
--- MoreIndexingFilter.old 2011-09-30 11:44:35.0 +
+++ MoreIndexingFilter.java 2011-09-30 09:58:48.0 +
@@ -276,6 +276,7 @@
 for (int i=0; i<patterns.length; i++) {
 if (matcher.contains(contentDisposition,patterns[i])) {
 result = matcher.getMatch();
+doc.removeField("title");
 doc.add("title", result.group(1));
 break;
 }
[jira] [Updated] (NUTCH-1021) Migrate OutlinkExtractor from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1021: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Migrate OutlinkExtractor from Apache ORO to java.util.regex Key: NUTCH-1021 URL: https://issues.apache.org/jira/browse/NUTCH-1021 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Attachments: NUTCH-1021-1.4-2.patch, NUTCH-1021-1.4-4.patch, NUTCH-1021-1.4.patch Migrate from deprecated ORO to Java util regex. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1039) Fetcher fails for pages without content-length header
[ https://issues.apache.org/jira/browse/NUTCH-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1039: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Fetcher fails for pages without content-length header - Key: NUTCH-1039 URL: https://issues.apache.org/jira/browse/NUTCH-1039 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Fetcher fails: 2011-07-11 14:45:34,764 ERROR http.Http - org.apache.nutch.protocol.http.api.HttpException: bad content length: 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.readPlainContent(HttpResponse.java:218) 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.HttpResponse.init(HttpResponse.java:158) 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.Http.getResponse(Http.java:64) 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:138) 2011-07-11 14:45:34,765 ERROR http.Http - at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:79) Both fetcher and indexing filter checker fail sometimes. I'm unsure whether this is something in Nutch or whether the remote server only returns content-length incidentally. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1275) Fix [unchecked] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1275: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Fix [unchecked] javac warnings -- Key: NUTCH-1275 URL: https://issues.apache.org/jira/browse/NUTCH-1275 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.6 We can simply suppress these warnings using {code} @SuppressWarnings("unchecked") {code} However, if there is another way to resolve these warnings, it should be implemented if deemed beneficial to code quality. Some resources: http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
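The two options mentioned in NUTCH-1275, suppressing the warning or resolving its cause, can be contrasted on a small made-up example. The class and method names are hypothetical; the sketch only shows the general pattern, not any specific Nutch code that triggers the warning.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example, not taken from the Nutch codebase.
class UncheckedWarningDemo {
    // Option 1: keep the raw type and silence the compiler. Adding to a raw
    // List is an unchecked call; "rawtypes" covers the raw declarations.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static List rawList() {
        List list = new ArrayList();
        list.add("nutch");
        return list;
    }

    // Option 2: parameterize the type, so there is nothing left to warn
    // about and the compiler checks the element type.
    static List<String> typedList() {
        List<String> list = new ArrayList<>();
        list.add("nutch");
        return list;
    }
}
```

Where the raw type comes from an API that cannot be changed, the annotation is usually placed on the smallest possible scope, such as a single local variable declaration or method.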
[jira] [Updated] (NUTCH-1151) Index-anchor to add numInlinks count
[ https://issues.apache.org/jira/browse/NUTCH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1151: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Index-anchor to add numInlinks count Key: NUTCH-1151 URL: https://issues.apache.org/jira/browse/NUTCH-1151 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.6 Attachments: NUTCH-1151-1.5-1.patch Improve index-anchor so that it adds the number of inlinks per document. This count is useful for calculating an authority metric in the search server.
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1053: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Parsing of RSS feeds fails --- Key: NUTCH-1053 URL: https://issues.apache.org/jira/browse/NUTCH-1053 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.6 Attachments: nutch-1053.patch, seed.txt See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html Have been able to reproduce the problem and will look into it -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-961: Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Expose Tika's boilerpipe support Key: NUTCH-961 URL: https://issues.apache.org/jira/browse/NUTCH-961 Project: Nutch Issue Type: New Feature Components: parser Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Attachments: BoilerpipeExtractorRepository.java, NUTCH-961-1.3-3.patch, NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch, NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch, NUTCH-961v2.patch Tika 0.8 comes with the Boilerpipe content handler, which can be used to remove boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
[jira] [Updated] (NUTCH-1118) JUnit test for index-basic
[ https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1118: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for index-basic -- Key: NUTCH-1118 URL: https://issues.apache.org/jira/browse/NUTCH-1118 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1284: - Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Add site fetcher.max.crawl.delay as log output by default. -- Key: NUTCH-1284 URL: https://issues.apache.org/jira/browse/NUTCH-1284 Project: Nutch Issue Type: New Feature Components: fetcher Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Trivial Fix For: 1.6 Currently, when manually scanning our log output we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like: {code} 2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms) {code} This way we can easily and quickly determine whether the fetcher is having to use this functionality or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
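The log format proposed in NUTCH-1284 could be produced along these lines. This is only a sketch: `FetchLogLine` and its `format` method are hypothetical names used for illustration and are not part of the Nutch fetcher.

```java
// Hypothetical helper, not part of Nutch: builds the log line proposed in NUTCH-1284.
final class FetchLogLine {
    private FetchLogLine() {}

    /** Appends the crawl delay, in milliseconds, to the usual "fetching <url>" message. */
    static String format(String url, long crawlDelayMs) {
        return "fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)";
    }
}
```

A caller would pass the resulting string to whatever logger the fetcher already uses at INFO level.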
[jira] [Updated] (NUTCH-1149) DomainStats should process numeric CrawlDB metadata
[ https://issues.apache.org/jira/browse/NUTCH-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1149: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 DomainStats should process numeric CrawlDB metadata --- Key: NUTCH-1149 URL: https://issues.apache.org/jira/browse/NUTCH-1149 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.6 Right now the DomainStats program only outputs the sum of fetched records per domain or host. It should also be able to output processed numerics of meta data in order to get the average size (content length) for a given domain or host. This is also useful for generating a metric for adult material (by domain or host) when using a plugin that stores a probability factor of adult material per URL in the Crawl DB.
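The per-host average requested in NUTCH-1149 comes down to keeping a running sum and count for each host. A minimal sketch of that aggregation follows; the class `HostAverages` is a hypothetical stand-in, not the DomainStats code, and in a real MapReduce job the sum and count would be combined in the reducer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical aggregator, not the Nutch DomainStats implementation.
final class HostAverages {
    // host -> {sum of values, number of values}
    private final Map<String, long[]> totals = new LinkedHashMap<>();

    /** Records one numeric metadata value (for example a content length) for a host. */
    void add(String host, long value) {
        long[] t = totals.computeIfAbsent(host, h -> new long[2]);
        t[0] += value;
        t[1]++;
    }

    /** Returns the average for the host, or NaN if nothing was recorded for it. */
    double average(String host) {
        long[] t = totals.get(host);
        return (t == null || t[1] == 0) ? Double.NaN : (double) t[0] / t[1];
    }
}
```

The same structure would work for a probability-like score if the sum were kept as a `double` instead of a `long`.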
[jira] [Updated] (NUTCH-1181) Indexer to use webgraph inlinks
[ https://issues.apache.org/jira/browse/NUTCH-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1181: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Indexer to use webgraph inlinks --- Key: NUTCH-1181 URL: https://issues.apache.org/jira/browse/NUTCH-1181 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 Indexers currently rely on the LinkDB for anchor indexing while the WebGraph provides the same data as an inverted link DB. An inlinkDB created by the WebGraph program with non-zero LinkRank scores on the nodes also provides an improved set ordered by popularity. This issue must: - let IndexerMapReduce understand the new format; - allow for indexing only popular anchors. The goal is to deprecate all code associated with invertlinks and ultimately remove it from the codebase.
[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor
[ https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1117: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 JUnit test for index-anchor --- Key: NUTCH-1117 URL: https://issues.apache.org/jira/browse/NUTCH-1117 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.6 This issue is part of the larger attempt to provide a Junit test case for every Nutch plugin. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Dynamically set fetchInterval by MIME-type -- Key: NUTCH-1024 URL: https://issues.apache.org/jira/browse/NUTCH-1024 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between. * simple key\tvalue\n configuration file * only set fetchInterval for new documents * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
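The three bullet points in NUTCH-1024 can be sketched as follows. This is not the attached MimeAdaptiveFetchSchedule.java: the class `MimeFetchIntervals`, the use of seconds as the unit, and the skipping of `#` comment lines are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the patch attached to NUTCH-1024.
final class MimeFetchIntervals {
    private final Map<String, Integer> intervals = new HashMap<>();
    private final int defaultInterval; // assumed to be in seconds
    private final int maxInterval;     // upper bound taken from the existing config

    /** Parses "mime-type<TAB>interval" lines; blank lines and '#' comments are skipped (assumption). */
    MimeFetchIntervals(String config, int defaultInterval, int maxInterval) {
        this.defaultInterval = defaultInterval;
        this.maxInterval = maxInterval;
        for (String line : config.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            String[] parts = trimmed.split("\t");
            if (parts.length != 2) continue; // ignore malformed lines
            intervals.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
        }
    }

    /** Interval for a newly discovered document; never exceeds the configured maximum. */
    int initialInterval(String mimeType) {
        int interval = intervals.getOrDefault(mimeType, defaultInterval);
        return Math.min(interval, maxInterval);
    }
}
```

Calling `initialInterval` only when a document is first seen matches the "only set fetchInterval for new documents" bullet; later adjustments would be left to the regular fetch schedule.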
[jira] [Updated] (NUTCH-1317) Max content length by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1317: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Max content length by MIME-type --- Key: NUTCH-1317 URL: https://issues.apache.org/jira/browse/NUTCH-1317 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.6 The good old http.content.length directive is not sufficient in large internet crawls. For example, a 5MB PDF file may be parsed without issues but a 5MB HTML file may time out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1277) Fix [fallthrough] javac warnings
[ https://issues.apache.org/jira/browse/NUTCH-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1277: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Fix [fallthrough] javac warnings Key: NUTCH-1277 URL: https://issues.apache.org/jira/browse/NUTCH-1277 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Fix For: nutchgora, 1.6 This usually occurs where a switch statement falls through (that is, one or more break statements are missing). We need to determine where a simple {code} @SuppressWarnings("fallthrough") {code} is required or whether we need to include the break statements in switch blocks.
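The two options named in NUTCH-1277 look like this in practice. The class and method names are made up for the example; with `javac -Xlint:fallthrough`, the first method would be reported without the annotation, while the second needs none.

```java
// Illustrative only; these methods do not come from the Nutch code base.
final class FallthroughDemo {
    private FallthroughDemo() {}

    /** Option 1: the fall-through is intended, so the warning is suppressed explicitly. */
    @SuppressWarnings("fallthrough")
    static int weightIntentional(int level) {
        int weight = 0;
        switch (level) {
            case 1:
                weight += 10;
                // no break here on purpose: level 1 also gets the level 2 weight
            case 2:
                weight += 1;
                break;
            default:
                break;
        }
        return weight;
    }

    /** Option 2: the fall-through was a bug, so the missing break is added. */
    static int weightWithBreaks(int level) {
        int weight = 0;
        switch (level) {
            case 1:
                weight += 10;
                break;
            case 2:
                weight += 1;
                break;
            default:
                break;
        }
        return weight;
    }
}
```

The two methods differ only for `level == 1`, which is exactly why each warning has to be reviewed rather than suppressed wholesale.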
[jira] [Updated] (NUTCH-1215) UpdateDB should not require segment as input
[ https://issues.apache.org/jira/browse/NUTCH-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1215: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 UpdateDB should not require segment as input Key: NUTCH-1215 URL: https://issues.apache.org/jira/browse/NUTCH-1215 Project: Nutch Issue Type: Bug Components: linkdb Affects Versions: 1.4 Reporter: Markus Jelsma Fix For: 1.6 Attachments: NUTCH-1215-1.5-1.patch UpdateDB requires an input segment. This causes the metrics for the records of the segment to change, e.g. from fetched to not_modified, and changes an adaptive fetch schedule accordingly. This should not happen when one needs to update for filtering or normalizing or other maintenance.
[jira] [Updated] (NUTCH-1103) Port protocol-sftp to 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1103: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Port protocol-sftp to 1.4 - Key: NUTCH-1103 URL: https://issues.apache.org/jira/browse/NUTCH-1103 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Priority: Minor Fix For: 1.6 Port protocol-sftp from trunk back to 1.4 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1088) Write Solr XML documents
[ https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1088: - Fix Version/s: (was: 1.5) 1.6 20120304-push-1.6 Write Solr XML documents Key: NUTCH-1088 URL: https://issues.apache.org/jira/browse/NUTCH-1088 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.6 Documents need to be reindexed when index-time analysis is modified. Indexing individual segments from Nutch is tedious, especially for small segments. This issue should add a feature that can write XML batches. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-828) Fetch Filter
[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-828: Fix Version/s: (was: 1.5) (was: nutchgora) 1.6 20120304-push-1.6 Fetch Filter Key: NUTCH-828 URL: https://issues.apache.org/jira/browse/NUTCH-828 Project: Nutch Issue Type: New Feature Components: fetcher Environment: All Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.6 Attachments: NUTCH-828-1-20100608.patch, NUTCH-828-2-20100608.patch Adds a Nutch extension point for a fetch filter. The fetch filter allows filtering content and parse data/text after it is fetched but before it is written to segments. The filter can return true if content is to be written or false if it is not. Some use cases for this filter would be topical search engines that only want to fetch/index certain types of content, for example a news or sports only search engine. In these types of situations the only way to determine if content belongs to a particular set is to fetch the page and then analyze the content. If the content passes, meaning belongs to the set of say sports pages, then we want to include it. If it doesn't then we want to ignore it, never fetch that same page in the future, and ignore any urls on that page. If content is rejected due to a fetch filter then its status is written to the CrawlDb as gone and its content is ignored and not written to segments. This effectively stops crawling along the crawl path of that page and the urls from that page. An example filter, fetch-safe, is provided that allows fetching content that does not contain a list of bad words.
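The true/false contract described in NUTCH-828 can be illustrated with a small bad-word filter in the spirit of the fetch-safe example. The interface `FetchFilterSketch` and the class `BadWordFetchFilter` are hypothetical; the actual extension point in the attached patches may have a different signature.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface standing in for the proposed extension point.
interface FetchFilterSketch {
    /** Returns true if the fetched content should be written to the segment. */
    boolean accept(String url, String text);
}

// Rejects any page whose text contains one of the configured words.
final class BadWordFetchFilter implements FetchFilterSketch {
    private final Set<String> badWords = new HashSet<>();

    BadWordFetchFilter(Collection<String> words) {
        for (String w : words) {
            badWords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public boolean accept(String url, String text) {
        // Split on non-word characters and compare case-insensitively.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (badWords.contains(token)) {
                return false;
            }
        }
        return true;
    }
}
```

In the design described above, a `false` result would lead the fetcher to mark the URL as gone in the CrawlDb and to drop both the content and its outlinks.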
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Attachment: NUTCH-1024-1.5-3.patch New patch with proper logging and configuration files. Dynamically set fetchInterval by MIME-type -- Key: NUTCH-1024 URL: https://issues.apache.org/jira/browse/NUTCH-1024 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between. * simple key\tvalue\n configuration file * only set fetchInterval for new documents * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Attachment: NUTCH-1024-1.5-3.patch Something went wrong here. Dynamically set fetchInterval by MIME-type -- Key: NUTCH-1024 URL: https://issues.apache.org/jira/browse/NUTCH-1024 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between. * simple key\tvalue\n configuration file * only set fetchInterval for new documents * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Attachment: (was: NUTCH-1024-1.5-3.patch) Dynamically set fetchInterval by MIME-type -- Key: NUTCH-1024 URL: https://issues.apache.org/jira/browse/NUTCH-1024 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, adaptive-mimetypes.txt Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between. * simple key\tvalue\n configuration file * only set fetchInterval for new documents * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Attachment: NUTCH-1024-1.5-2.patch New patch for 1.5 with modifications as per Julien's comments. Dynamically set fetchInterval by MIME-type -- Key: NUTCH-1024 URL: https://issues.apache.org/jira/browse/NUTCH-1024 Project: Nutch Issue Type: New Feature Components: generator Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, NUTCH-1024-1.5-2.patch, Nutch.patch, adaptive-mimetypes.txt Add facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources for files that are known to change frequently or never and everything in between. * simple key\tvalue\n configuration file * only set fetchInterval for new documents * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1320: - Attachment: NUTCH-1320-1.5-1.patch Patch for 1.5. URLUtil now has a toASCII and toUnicode method wrapping the java.net.IDN methods. These take a URL and return a normalized one. IndexChecker and ParseChecker choke on IDN's Key: NUTCH-1320 URL: https://issues.apache.org/jira/browse/NUTCH-1320 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1320-1.5-1.patch These handy debug tools do not handle IDN's and throw an NPE:
bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
{code}
Exception in thread "main" java.lang.NullPointerException
    at org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
{code}
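Wrapping `java.net.IDN` so that it accepts a whole URL, as the NUTCH-1320 comment describes, might look roughly like the sketch below. `IdnUrls` is a hypothetical class and not the URLUtil code from the patch; it converts only the host and drops any user-info part of the URL.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch, not the URLUtil methods added by the NUTCH-1320 patch.
final class IdnUrls {
    private IdnUrls() {}

    /** Returns the URL with its host converted to the ASCII (Punycode) form. */
    static String toASCII(String url) {
        URL u = parse(url);
        return rebuild(u, IDN.toASCII(u.getHost()));
    }

    /** Returns the URL with its host converted back to the Unicode form. */
    static String toUnicode(String url) {
        URL u = parse(url);
        return rebuild(u, IDN.toUnicode(u.getHost()));
    }

    private static URL parse(String url) {
        try {
            return new URL(url);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    // Reassembles protocol, host, optional port, path/query and optional fragment.
    private static String rebuild(URL u, String host) {
        StringBuilder sb = new StringBuilder();
        sb.append(u.getProtocol()).append("://").append(host);
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        sb.append(u.getFile());
        if (u.getRef() != null) {
            sb.append('#').append(u.getRef());
        }
        return sb.toString();
    }
}
```

Only the host is touched, so percent-encoded path segments such as the one in the reported URL pass through unchanged.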
[jira] [Updated] (NUTCH-1234) Upgrade to Tika 1.1
[ https://issues.apache.org/jira/browse/NUTCH-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1234: - Attachment: NUTCH-1234-1.5-1.patch Patch for 1.5 upgrading to Tika-core 1.1 and upgrading Hadoop test to 1.0.0 and all tests pass. Will commit shortly unless there are objections. Upgrade to Tika 1.1 --- Key: NUTCH-1234 URL: https://issues.apache.org/jira/browse/NUTCH-1234 Project: Nutch Issue Type: Task Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1234-1.5-1.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1319) HostNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1319: - Patch Info: Patch Available HostNormalizer -- Key: NUTCH-1319 URL: https://issues.apache.org/jira/browse/NUTCH-1319 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1319-1.5-1.patch Nutch would benefit from having a host normalizer. A host normalizer maps a given host to the desired host. A basic example is to map www.apache.org to apache.org. The Apache website is one of many on the internet that has a duplicate website on the same domain just because it allows both www and non-www to return HTTP 200 and proper content. It is also able to handle wildcards such as *.example.org to example.org if there are multiple sub domains that actually point to the same website. Large internet crawls tend to get polluted very quickly due to these problems. It also leads to skewed scores in the webgraph as different websites link to different versions of the same duplicate website. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
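The mapping behaviour described for the host normalizer, exact rules such as www.apache.org to apache.org plus wildcard rules such as *.example.org to example.org, can be sketched as below. `SimpleHostNormalizer` is a hypothetical class, not the code in NUTCH-1319-1.5-1.patch, and the rule syntax is assumed from the description.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the host mapping, not the attached patch.
final class SimpleHostNormalizer {
    private final Map<String, String> exact = new LinkedHashMap<>();
    // Keyed by suffix including the leading dot, e.g. ".example.org".
    private final Map<String, String> wildcard = new LinkedHashMap<>();

    /** Adds a rule; a pattern starting with "*." matches every sub domain of the rest. */
    void addRule(String pattern, String target) {
        String p = pattern.trim().toLowerCase(Locale.ROOT);
        String t = target.trim().toLowerCase(Locale.ROOT);
        if (p.startsWith("*.")) {
            wildcard.put(p.substring(1), t);
        } else {
            exact.put(p, t);
        }
    }

    /** Returns the desired host, or the lower-cased input if no rule applies. */
    String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        String target = exact.get(h);
        if (target != null) {
            return target;
        }
        for (Map.Entry<String, String> e : wildcard.entrySet()) {
            if (h.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return h;
    }
}
```

Exact rules are checked before wildcard rules, so a single sub domain can be exempted from a wildcard by mapping it to itself.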
[jira] [Updated] (NUTCH-1319) HostNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1319: - Attachment: NUTCH-1319-1.5-1.patch Patch for 1.5. HostNormalizer -- Key: NUTCH-1319 URL: https://issues.apache.org/jira/browse/NUTCH-1319 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1319-1.5-1.patch Nutch would benefit from having a host normalizer. A host normalizer maps a given host to the desired host. A basic example is to map www.apache.org to apache.org. The Apache website is one of many on the internet that has a duplicate website on the same domain just because it allows both www and non-www to return HTTP 200 and proper content. It is also able to handle wildcards such as *.example.org to example.org if there are multiple sub domains that actually point to the same website. Large internet crawls tend to get polluted very quickly due to these problems. It also leads to skewed scores in the webgraph as different websites link to different versions of the same duplicate website. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1305) Domain(blacklist)URLFilter to trim entries
[ https://issues.apache.org/jira/browse/NUTCH-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1305: - Attachment: NUTCH-1305-1.5-1.patch Patch for 1.5. Fixes the issue. Domain(blacklist)URLFilter to trim entries -- Key: NUTCH-1305 URL: https://issues.apache.org/jira/browse/NUTCH-1305 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1305-1.5-1.patch Both filters should handle entries with trailing whitespace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1300) Indexer to normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1300: - Attachment: NUTCH-1300-1.5-1.patch Patch for 1.5. Indexer to normalize URL's -- Key: NUTCH-1300 URL: https://issues.apache.org/jira/browse/NUTCH-1300 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1300-1.5-1.patch Indexers should be able to normalize URL's. This is useful when a new normalizer is applied to the entire CrawlDB. Without it, some or all records in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1299) NPE in LinkRank inverter
[ https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1299: - Patch Info: Patch Available NPE in LinkRank inverter Key: NUTCH-1299 URL: https://issues.apache.org/jira/browse/NUTCH-1299 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.5 No Node object is passed from the inverter's mapper to the reducer, which expects one, causing the following exception:
{code}
java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}
This never happens unless you have a funky web graph. Our web graph changes frequently, adding and deleting records. It's likely a large number of records deleted from the outlink database is responsible for this. This error, however, only showed up now, a great deal of time after we began deleting records.
[jira] [Updated] (NUTCH-1299) NPE in LinkRank inverter
[ https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1299: - Attachment: NUTCH-1299-1.5-1.patch Most likely solution is to check whether a LoopSet enters the reducer without an accompanying Node or LinkDatum object, which are mandatory. NPE in LinkRank inverter Key: NUTCH-1299 URL: https://issues.apache.org/jira/browse/NUTCH-1299 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.5 Attachments: NUTCH-1299-1.5-1.patch No Node object is passed from the inverter's mapper to the reducer, which expects one, causing the following exception:
{code}
java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}
This never happens unless you have a funky web graph. Our web graph changes frequently, adding and deleting records. It's likely a large number of records deleted from the outlink database is responsible for this. This error, however, only showed up now, a great deal of time after we began deleting records.
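The guard suggested for NUTCH-1299, skipping a key in the reducer when no Node arrived for it, is sketched below without any Hadoop dependency. `NodeStub` and `LinkStub` are simplified stand-ins for the Nutch webgraph types, the output format is invented for the example, and LoopSet handling is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free sketch of the null check; not LinkRank$Inverter itself.
final class InverterGuardSketch {
    private InverterGuardSketch() {}

    /** Stand-in for the webgraph Node; only carries an outlink count here. */
    static final class NodeStub {
        final int numOutlinks;
        NodeStub(int numOutlinks) { this.numOutlinks = numOutlinks; }
    }

    /** Stand-in for a LinkDatum; only carries the linked URL here. */
    static final class LinkStub {
        final String url;
        LinkStub(String url) { this.url = url; }
    }

    /** Emits one line per link, or nothing (with a warning) when the Node is missing. */
    static List<String> reduce(String key, List<Object> values) {
        NodeStub node = null;
        List<LinkStub> links = new ArrayList<>();
        for (Object value : values) {
            if (value instanceof NodeStub) {
                node = (NodeStub) value;
            } else if (value instanceof LinkStub) {
                links.add((LinkStub) value);
            }
        }
        List<String> out = new ArrayList<>();
        if (node == null) {
            // Without this check the code below would dereference null.
            System.err.println("WARN: no node for " + key + ", skipping record");
            return out;
        }
        for (LinkStub link : links) {
            out.add(link.url + "\t" + key + "\t" + node.numOutlinks);
        }
        return out;
    }
}
```

Logging and skipping keeps the job running on an inconsistent web graph at the cost of dropping the affected records.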
[jira] [Updated] (NUTCH-1299) NPE in LinkRank inverter
[ https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1299: - Attachment: NUTCH-1299-1.5-2.patch New patch logs warning with proper error message. NPE in LinkRank inverter Key: NUTCH-1299 URL: https://issues.apache.org/jira/browse/NUTCH-1299 Project: Nutch Issue Type: Bug Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Critical Fix For: 1.5 Attachments: NUTCH-1299-1.5-1.patch, NUTCH-1299-1.5-2.patch No Node object is passed from the inverter's mapper to the reducer, which expects one, causing the following exception:
{code}
java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}
This never happens unless you have a funky web graph. Our web graph changes frequently, adding and deleting records. It's likely a large number of records deleted from the outlink database is responsible for this. This error, however, only showed up now, a great deal of time after we began deleting records.
[jira] [Updated] (NUTCH-1299) LinkRank inverter to ignore records without Node
[ https://issues.apache.org/jira/browse/NUTCH-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1299:

    Priority: Major (was: Critical)
    Summary: LinkRank inverter to ignore records without Node (was: NPE in LinkRank inverter)

LinkRank inverter to ignore records without Node

    Key: NUTCH-1299
    URL: https://issues.apache.org/jira/browse/NUTCH-1299
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 1.4
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Fix For: 1.5
    Attachments: NUTCH-1299-1.5-1.patch, NUTCH-1299-1.5-2.patch

No Node object is passed from the inverter's mapper to the reducer, which expects one, causing the following exception:
{code}
java.lang.NullPointerException
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:409)
    at org.apache.nutch.scoring.webgraph.LinkRank$Inverter.reduce(LinkRank.java:356)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
{code}
This never happens unless you have a funky web graph. Our web graph changes frequently, adding and deleting records. It's likely that the large number of records deleted from the outlink database is responsible for this. This error, however, only showed up now, a great deal of time after we began deleting records.
[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1024:

    Attachment: NUTCH-1024-1.5-1.patch

New patch for trunk! This also includes a change to the injector, where the injected fetchInterval is added to the CrawlDatum metadata. In AdaptiveFetchSchedule this injected interval overrides anything else.

Dynamically set fetchInterval by MIME-type

    Key: NUTCH-1024
    URL: https://issues.apache.org/jira/browse/NUTCH-1024
    Project: Nutch
    Issue Type: New Feature
    Components: generator
    Reporter: Markus Jelsma
    Assignee: Markus Jelsma
    Priority: Minor
    Fix For: 1.5
    Attachments: AdaptiveFetchSchedule.patch, MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, Nutch.patch, adaptive-mimetypes.txt

Add a facility to configure default or fixed fetchInterval values by MIME-type. This is useful for conserving resources on files that are known to change frequently, never, or anything in between.
* simple key\tvalue\n configuration file
* only set fetchInterval for new documents
* keep max fetchInterval fixed by current config
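The three bullet points above can be sketched roughly as follows. This is not the attached MimeAdaptiveFetchSchedule.java: the class `MimeIntervalSketch`, both method names, the `#` comment handling, the use of seconds as the unit, and the reading of "keep max fetchInterval fixed" as a cap at the configured maximum are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a MIME-type to fetchInterval lookup.
class MimeIntervalSketch {

  /** Parses "mimeType<TAB>intervalSeconds" lines; malformed lines are skipped. */
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> intervals = new HashMap<>();
    for (String line : text.split("\n")) {
      // Skipping blank lines and '#' comments is an assumption.
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\t");
      if (parts.length != 2) {
        continue;
      }
      try {
        intervals.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
      } catch (NumberFormatException e) {
        // Ignore lines whose value is not an integer.
      }
    }
    return intervals;
  }

  /**
   * Picks the interval for a document. Known documents keep their current
   * interval; new ones get the MIME-type value (or the default), capped at
   * the configured maximum.
   */
  static int intervalFor(Map<String, Integer> intervals, String mimeType,
      boolean isNew, int currentInterval, int defaultInterval, int maxInterval) {
    if (!isNew) {
      return currentInterval;
    }
    int interval = intervals.getOrDefault(mimeType, defaultInterval);
    return Math.min(interval, maxInterval);
  }

  public static void main(String[] args) {
    Map<String, Integer> intervals = parse("text/html\t3600\napplication/pdf\t2592000\n");
    System.out.println(intervalFor(intervals, "text/html", true, 0, 86400, 604800));
  }
}
```

Leaving known documents untouched keeps the lookup out of the way of whatever interval the schedule has already adapted for them.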