[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1104:
----------------------------------------

    Description:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:
* NUTCH-809 Parse-metatags plugin
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Statis field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet

NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported

  was:
Umbrella issue for tracking issues that should be ported from 1.x trunk to the NutchGora branch. Please mark ported issues by modifying this description.

NOT YET PORTED:
* NUTCH-987 Support HTTP auth for Solr communication
* NUTCH-1028 Log parser keys
* NUTCH-1036 Solr jobs should increment counters in Reporter
* NUTCH-1057 Make fetcher thread time out configurable
* NUTCH-1067 Configure minimum throughput for fetcher
* NUTCH-1101 Options to purge db_gone records in updatedb
* NUTCH-1102 Fetcher, rely on fetcher.parse directive only
* NUTCH-1105 MaxContentLength option for index-basic
* NUTCH-940 Statis field plugin
* NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1090 InvertLinks should inform when ignoring internal links
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1203 ParseSegment to show number of milliseconds per parse
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1155 Host/domain limit in generator is generate.max.count+1
* NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
* NUTCH-1142 Normalization and filtering in WebGraph
* NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
* NUTCH-1195 Add Solr 4x (trunk) example schema
* NUTCH-1141 Configurable Fetcher queue depth
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1213 Pass additional SolrParams when indexing to Solr
* NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements
* NUTCH-1231 Upgrade to Tika 1.0
* NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
* NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
* NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
* NUTCH-1214 DomainStats tool should be named for what it's doing
* NUTCH-1207 ParserChecker to output signature
* NUTCH-1174 Outlinks are not properly normalized
* NUTCH-1173 DomainStats doesn't count db_not_modified
* NUTCH-1142 Normalization and filtering in WebGraph

PORTED:
* No issues yet

NOT GOING TO BE PORTED:
* No issues, explain why it should not be ported

> Port issues from trunk NutchGora branch
> ---------------------------------------
>
>                 Key: NUTCH-1104
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1104
>             Project: Nutch
>          Issue Type: Task
>   Affects Versions: nutchgora
>            Reporter: Markus Jelsma
>             Fix For: nutchgora
>
>
> Umbrella issue for tracking issues that should be ported from 1.x trunk to the NutchGora branch. Please mark ported issues by modifying this description.
> NOT YET PORTED:
> * NUTCH-809 Parse-metatags plugin
> * NUTCH-987 Support HTTP auth for Solr communication
> * NUTCH-1028 Log parser keys
> * NUTCH-1036 Solr jobs should increment counters in Reporter
> * NUTCH-1057 Make fetcher thread time out configurable
> * NUTCH-1067 Configure minimum throughput for fetcher
> * NUTCH-1101 Options to purge db_gone records in updatedb
> * NUTCH-1102 Fetcher, rely on fetcher.parse directive only
> * NUTCH-1105 MaxContentLength option for index-basic
> * NUTCH-940 Statis field plugin
> * NUTCH-1094 create comprehensive documentation for Nutch 2.0 trunk
> * NUTCH-1207 ParserChecker to output signature
> * NUTCH-1090 InvertLinks should inform when ignoring internal links
> * NUTCH-1174 Outlinks are not properly normalized
> * NUTCH-1203 ParseSegment to show number of milliseconds per parse
> * NUTCH-1173 DomainStats doesn't count db_not_modified
> * NUTCH-1155 Host/domain limit in generator is generate.max.count+1
> * NUTCH-1061 Migrate MoreIndexingFilter from Apache ORO to java.util.regex
> * NUTCH-1142 Normalization and filtering in WebGraph
> * NUTCH-1153 LinkRank not to log all keys and not to write Hadoop _SUCCESS file
> * NUTCH-1195 Add Solr 4x (trunk) example schema
> * NUTCH-1141 Configurable Fetcher queue depth
> * NUTCH-1214 DomainStats tool should be named for what it's doing
> * NUTCH-1213 Pass additional SolrParams when indexing to Solr
> * NUTCH-1211 URLFilterChecker command line help doesn't inform user of STDIN requirements
> * NUTCH-1231 Upgrade to Tika 1.0
> * NUTCH-1230 MimeType API deprecated and breaks with Tika 1.0
> * NUTCH-1235 Upgrade to new Hadoop 0.20.205.0
> * NUTCH-1184 Fetcher to parse and follow Nth degree outlinks
> * NUTCH-1214 DomainStats tool should be named for what it's doing
> * NUTCH-1207 ParserChecker to output signature
> * NUTCH-1174 Outlinks are not properly normalized
> * NUTCH-1173 DomainStats doesn't count db_not_modified
> * NUTCH-1142 Normalization and filtering in WebGraph
> PORTED:
> * No issues yet
> NOT GOING TO BE PORTED:
> * No issues, explain why it should not be ported

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira