[jira] [Created] (NUTCH-1269) Generate main problems
Generate main problems -- Key: NUTCH-1269 URL: https://issues.apache.org/jira/browse/NUTCH-1269 Project: Nutch Issue Type: Improvement Components: generator Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht

There are some problems with the current Generate method when the maxNumSegments and maxHostCount options are used:
1. The sizes of the generated segments differ.
2. With the maxHostCount option, it is unclear whether the limit was actually applied.
3. URLs from one host are distributed non-uniformly between segments.

We changed Generator.java as described below. In the Selector class:

    private int maxNumSegments;
    private int segmentSize;
    private int maxHostCount;

    public void configure(JobConf job) {
      ...
      maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
      segmentSize = job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
      maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
      ...
    }

    public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
        OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
        throws IOException {
      int limit2 = (int) ((limit * 3) / 2);
      while (values.hasNext()) {
        if (count == limit)
          break;
        // advance the current segment in round-robin order
        if (count % segmentSize == 0) {
          if (currentsegmentnum < maxNumSegments - 1) {
            currentsegmentnum++;
          } else
            currentsegmentnum = 0;
        }
        // stop once every segment has reached its target size
        boolean full = true;
        for (int jk = 0; jk < maxNumSegments; jk++) {
          if (segCounts[jk] < segmentSize) {
            full = false;
          }
        }
        if (full) {
          break;
        }
        SelectorEntry entry = values.next();
        Text url = entry.url;
        String urlString = url.toString();
        URL u = null;
        String hostordomain = null;
        try {
          if (normalise && normalizers != null) {
            urlString = normalizers.normalize(urlString,
                URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
          }
          u = new URL(urlString);
          if (byDomain) {
            hostordomain = URLUtil.getDomainName(u);
          } else {
            hostordomain = new URL(urlString).getHost();
          }
          hostordomain = hostordomain.toLowerCase();
          boolean countLimit = true;
          // only filter if we are counting hosts or domains;
          // a host count of {a,b,c,d} means this host has a urls in
          // segment 0, b urls in segment 1, and so on
          int[] hostCount = hostCounts.get(hostordomain);
          if (hostCount == null) {
            hostCount = new int[maxNumSegments];
            for (int kl = 0; kl < hostCount.length; kl++)
              hostCount[kl] = 0;
            hostCounts.put(hostordomain, hostCount);
          }
          // pick the segment holding the fewest urls from this host
          int selectedSeg = currentsegmentnum;
          int minCount = hostCount[selectedSeg];
          for (int jk = 0; jk < maxNumSegments; jk++) {
            if (hostCount[jk] < minCount) {
              minCount = hostCount[jk];
              selectedSeg = jk;
            }
          }
          // emit only while this host is below its per-segment cap
          if (hostCount[selectedSeg] < maxHostCount) {
            count++;
            entry.segnum = new IntWritable(selectedSeg);
            hostCount[selectedSeg]++;
            output.collect(key, entry);
          }
        } catch (Exception e) {
          LOG.warn("Malformed URL: '" + urlString + "', skipping ("
              + StringUtils.stringifyException(e) + ")");
        }
      }
    }

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
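The constant names in the patch correspond to Nutch configuration properties. As a point of reference (property names as used around Nutch 1.3, which the reporter runs; later releases replaced generate.max.per.host with generate.max.count, and all values below are purely illustrative), the options would be set in nutch-site.xml along these lines:

    <!-- nutch-site.xml (illustrative values only) -->
    <property>
      <name>generate.max.num.segments</name>
      <value>4</value> <!-- maxNumSegments: segments produced per generate pass -->
    </property>
    <property>
      <name>generate.max.per.host</name>
      <value>100</value> <!-- maxHostCount: cap on urls taken from one host -->
    </property>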
[jira] [Created] (NUTCH-1270) some of Deflate encoded pages not fetched
some of Deflate encoded pages not fetched - Key: NUTCH-1270 URL: https://issues.apache.org/jira/browse/NUTCH-1270 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.4 Environment: software Reporter: behnam nikbakht

There is a problem with some web pages that are fetched but whose content cannot be retrieved. After the change below, this error is fixed. We change lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:

    public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
      if (LOGGER.isTraceEnabled()) {
        LOGGER.trace("inflating");
      }
      byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
    + if (content == null)
    +   content = DeflateUtils.inflateBestEffort(compressed, 20);
      if (content == null)
        throw new IOException("inflateBestEffort returned null");
      if (LOGGER.isTraceEnabled()) {
        LOGGER.trace("fetched " + compressed.length
            + " bytes of compressed content (expanded to "
            + content.length + " bytes) from " + url);
      }
      return content;
    }
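For readers unfamiliar with the helper: inflateBestEffort decompresses as much as it can, up to a size limit, instead of failing hard. A self-contained sketch of that pattern with java.util.zip (an illustration only, not the actual DeflateUtils source) might look like this:

    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    public class InflateSketch {
      /** Inflate up to sizeLimit bytes; return null rather than throw on bad input. */
      public static byte[] inflateBestEffort(byte[] compressed, int sizeLimit) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] buf = new byte[sizeLimit];
        int total = 0;
        try {
          while (total < sizeLimit && !inflater.finished() && !inflater.needsInput()) {
            int n = inflater.inflate(buf, total, sizeLimit - total);
            if (n == 0) break; // truncated stream or preset dictionary needed
            total += n;
          }
        } catch (DataFormatException e) {
          // corrupt stream: best effort keeps whatever inflated so far
        } finally {
          inflater.end();
        }
        if (total == 0) return null;
        byte[] result = new byte[total];
        System.arraycopy(buf, 0, result, 0, total);
        return result;
      }
    }

A frequent reason deflate-encoded responses fail to inflate at all is servers sending raw deflate data without the zlib header; java.util.zip.Inflater accepts that variant when constructed with new Inflater(true).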
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203453#comment-13203453 ] Lewis John McGibbney commented on NUTCH-1269: Hi Behnam. Can you please package the above code as a patch against 1.5 (trunk)? That way we can try it if we get time. Thank you
[jira] [Created] (NUTCH-1271) Fix errors @ compile time
Fix errors @ compile time - Key: NUTCH-1271 URL: https://issues.apache.org/jira/browse/NUTCH-1271 Project: Nutch Issue Type: Improvement Components: build Affects Versions: nutchgora, 1.5 Reporter: Lewis John McGibbney Priority: Minor Fix For: nutchgora, 1.5 After adding the -Xlint options to build.xml, we see many warnings when compiling. These should be fixed.
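As a point of reference, lint flags are usually passed to Ant's javac task via nested compilerarg elements; a minimal sketch (target and attribute values here are illustrative, not copied from Nutch's build.xml):

    <javac srcdir="src/java" destdir="${build.classes}" debug="true">
      <!-- report all lint categories as warnings at compile time -->
      <compilerarg value="-Xlint"/>
      <!-- or enable selected categories only: -->
      <!-- <compilerarg value="-Xlint:unchecked,deprecation"/> -->
    </javac>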
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] behnam nikbakht updated NUTCH-1269: Attachment: NUTCH-1269.patch
[jira] [Updated] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] behnam nikbakht updated NUTCH-1269: Patch Info: Patch Available. Yes, thanks for your attention.
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203488#comment-13203488 ] Markus Jelsma commented on NUTCH-1269: It won't patch for trunk, all hunks fail. Anyway, this issue looks like NUTCH-1074. Segment sizes are uniform and the correct number of records per queue end up in a segment. I think this duplicates NUTCH-1074, which was fixed for 1.4. What Nutch are you using, Behnam?
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203494#comment-13203494 ] behnam nikbakht commented on NUTCH-1269: I am using Nutch 1.3, and I know about NUTCH-1074. In the uploaded patch, URLs per host are distributed uniformly between segments: for example, if there are 100 URLs from host a and 4 segments, each segment gets 25 URLs from host a. Using multiple reducers in the Selector causes problems with segment size, and setting the number of reducers for this job to 1 does not hurt performance. If we delete the variable full, there is effectively no limit on segment size after the map phase; this is the problem we have with the host count, and it causes some hosts to stop generating once all segments are full.
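To make the intended distribution concrete, here is a small self-contained Java sketch of the selection rule (hypothetical class and method names, not the patch itself): each URL goes to the segment currently holding the fewest URLs from its host, and the host stops contributing once every segment reaches the per-host cap.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class HostSpreadSketch {
      private final int maxNumSegments;
      private final int maxHostCount; // cap per host, per segment
      private final Map<String, int[]> hostCounts = new HashMap<String, int[]>();

      public HostSpreadSketch(int maxNumSegments, int maxHostCount) {
        this.maxNumSegments = maxNumSegments;
        this.maxHostCount = maxHostCount;
      }

      /** Returns the chosen segment for this host, or -1 if the host is capped. */
      public int select(String host) {
        int[] counts = hostCounts.get(host);
        if (counts == null) {
          counts = new int[maxNumSegments];
          hostCounts.put(host, counts);
        }
        int selected = 0;
        for (int i = 1; i < maxNumSegments; i++) {
          if (counts[i] < counts[selected]) selected = i; // least-loaded segment
        }
        if (counts[selected] >= maxHostCount) return -1;  // host fully capped
        counts[selected]++;
        return selected;
      }

      public static void main(String[] args) {
        // 100 urls from one host spread over 4 segments -> 25 per segment
        HostSpreadSketch s = new HostSpreadSketch(4, 100);
        int[] perSegment = new int[4];
        for (int i = 0; i < 100; i++) perSegment[s.select("host-a")]++;
        System.out.println(Arrays.toString(perSegment)); // [25, 25, 25, 25]
      }
    }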
[jira] [Commented] (NUTCH-1269) Generate main problems
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203501#comment-13203501 ] Markus Jelsma commented on NUTCH-1269: Ah, yes, I understand now. Your patch is an attempt to spread the host (or domain) limit over all generated segments. Interesting. Can you provide a patch that works with trunk and has this feature enabled via configuration?
[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched
[ https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203515#comment-13203515 ] Lewis John McGibbney commented on NUTCH-1270: Hi Behnam, again thanks for opening this ticket, but could you possibly patch this against trunk (1.5)? Thank you
Fwd: Mandatory svnpubsub migration by Jan 2013
Hi, Can anyone comment where we lie with this? I really don't have a clue. Thanks Lewis -- Forwarded message -- From: Joe Schaefer joe_schae...@yahoo.com Date: Wed, Feb 8, 2012 at 12:26 PM Subject: Mandatory svnpubsub migration by Jan 2013 To: Apache Infrastructure infrastruct...@apache.org [PLEASE DO NOT RESPOND TO THIS POST! DIRECT ALL FURTHER INQUIRIES TO infrastruct...@apache.org] FYI: infrastructure policy regarding website hosting has changed as of November 2011: we are requiring all websites and dist/ dirs to be svnpubsub or ASF CMS backed by the end of 2012. If your PMC has already met this requirement congratulations, you can ignore the remainder of this post. As stated on http://www.apache.org/dev/project-site.html#svnpubsub we are migrating our webserver infrastructure to 100% svnpubsub over the course of 2012. If your site does not currently make use of this technology, it is time to consider a migration effort, as rsync-based sites will be PERMANENTLY FROZEN in Jan 2013 due to infra disabling the hourly rsync jobs. While we recommend migrating to the ASF CMS [0] for Anakia based or Confluence based sites, and have provided tooling [1] to help facilitate this, we are only mandating svnpubsub (which the CMS uses itself). svnpubsub is a client-server system whereby a client watches an svn working copy for relevant commit notifications from the svn server. It subsequently runs svn up on the working copy, bringing in the relevant changes. sites that use static build technologies that commit the build results to svn are naturally compatible with svnpubsub; simply file a JIRA ticket with INFRA to request a migration: any commits to the resulting build tree will be instantly picked up on the live site. The CMS is a more elaborate system based on svnpubsub which provides a webgui for convenient online editing. Dozens of sites have already successfully deployed using the CMS and are quite happy with the results. The system is sufficiently flexible to accommodate a wide variety of choices regarding templating systems and storage formats, but most sites have standardized on the combination of Django and Markdown. Talk to infra if you would like to use the CMS in this or some other fashion, we'll see what we can do. NOTE: the policy for dist/ dirs for managing project releases is similar. We have setup a dedicated svn server for handling this, please contact infra when you are ready to start using it. HTH [0]: http://www.apache.org/dev/cms [1]: https://svn.apache.org/repos/infra/websites/cms/conversion-utilities/ -- *Lewis*
tika-core, tika-parser
Hi, Can anyone shed light on this? We don't have any parsers in our libs dir and we don't have the tika-parsers jar, only the tika-core jar. Where are the parsers and how does this all work? I've posted a question (same subject) on the Tika list and Nick tells me there must be parsers somewhere. Well, I have no idea how we do it in Nutch, do you? Thanks
Re: tika-core, tika-parser
Hi Markus, For starters http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup Can we pick our way through this? Thanks On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma markus.jel...@openindex.io wrote: Hi, Can anyone shed light on this? We don't have any parsers in our libs dir and we don't have the tika-parsers jar, only the tika-core jar. -- *Lewis*
Re: Mandatory svnpubsub migration by Jan 2013
The Nutch site is already based on svnpubsub. On 8 February 2012 12:40, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi, Can anyone comment where we lie with this? I really don't have a clue. Thanks Lewis -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: tika-core, tika-parser
The dependencies for the plugins are defined locally, as shown in the URL below, where you can see the ref to tika-parsers for parse-tika. Is that more clear for you, Markus? On 8 February 2012 12:58, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Markus, For starters http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup Can we pick our way through this? -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
Re: tika-core, tika-parser
Yes, it's listed there indeed! But where are the parser impls then? I'll check this out. I must be getting crazy or something! On Wednesday 08 February 2012 13:58:46 Lewis John Mcgibbney wrote: For starters http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup Can we pick our way through this? -- Markus Jelsma - CTO - Openindex
Re: tika-core, tika-parser
Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's something else. Dependencies, dependencies, dependencies :( On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote: The dependencies for the plugins are defined locally as shown in the URL below, where you can see the ref to tika-parsers for parse-tika. -- Markus Jelsma - CTO - Openindex
Re: tika-core, tika-parser
Sorry, I don't understand what your issue is. We have a dependency on tika-parsers, and the actual parser implementations (listed in tika-parsers' POM) are pulled in transitively just like any other dependency managed by Ivy. They end up being copied into runtime/local/plugins/parse-tika/ or put in the job file in runtime/deploy/. On 8 February 2012 13:03, Markus Jelsma markus.jel...@openindex.io wrote: Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's something else. -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble
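For illustration, the plugin-level Ivy declaration that triggers this transitive resolution has roughly the following shape (the rev value is a placeholder; the authoritative file is the parse-tika ivy.xml linked earlier in the thread):

    <!-- src/plugin/parse-tika/ivy.xml (shape only; rev is illustrative) -->
    <dependency org="org.apache.tika" name="tika-parsers" rev="1.0" conf="*->default"/>
    <!-- Ivy reads tika-parsers' own POM, so every parser implementation
         it lists arrives as a transitive artifact -->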
Re: tika-core, tika-parser
On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote: sorry don't understand what your issue is. We have a dependency on tika-parsers and the actual parser implementations (listed in tika-parsers' POM) are pulled transitively just like any other dependency managed by Ivy. They end up being copied in runtime/local/plugins/parse-tika/ or put in the job in runtime/deploy/

My problem is that I am working on some code for tika-parsers 1.1-SNAPSHOT that I need to use in Nutch. However, when I build tika-parsers and put it in Nutch's lib directory I still seem to be missing dependencies. Then the trouble begins:

    Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.dwg.DWGParser
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at sun.misc.Service$LazyIterator.next(Service.java:271)
        at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
        at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)

Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config file, which I did. But then other dependency issues come and go. The more parsers I remove from the config file, the better it goes, but then Tika won't build anymore because of failing tests. I asked this on the Nutch list because I wasn't sure anymore how Nutch deals with its own deps, which you explained well. I'll give up for now :)

-- Markus Jelsma - CTO - Openindex
Finding specific file types only -- *.ics files
Hi, I'm interested in using Nutch to crawl certain websites looking for only a specific file type, in my case I'm looking for any url that ends with a *.ics construct. I don't need to parse the ics files, I just need to know all the .ics files that exist. A list of links would be great. Can Nutch be configured to do this? Thanks! Pete p...@curveos.com
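One common approach (a sketch assuming a default Nutch 1.x setup, not a tested recipe) is to let the crawler keep traversing HTML pages so that links to calendars are discovered, while making sure .ics URLs are never filtered out, and then read the list of .ics URLs out of the crawldb without ever parsing the files. The include/exclude rules live in conf/regex-urlfilter.txt, applied top-down with the first matching rule winning:

    # conf/regex-urlfilter.txt (illustrative)
    # skip binary formats we don't care about
    -\.(gif|GIF|jpg|JPG|png|PNG|zip|ZIP|gz|GZ|pdf|PDF)$
    # always accept calendar files
    +\.ics$
    # accept everything else so the crawler can still follow links
    +.

Afterwards, something like bin/nutch readdb crawl/crawldb -dump dumpdir (paths illustrative) produces a text dump of all known URLs that can be grepped for \.ics$; readdb is a standard Nutch command, though the exact dump format varies between versions.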
Re: tika-core, tika-parser
On Feb 8, 2012, at 5:28am, Markus Jelsma wrote: My problem is that I am working on some code for tika-parsers 1.1-SNAPSHOT that I need to use in Nutch. However, when I build tika-parsers and put it in Nutch's lib directory I still seem to be missing dependencies. Then the trouble begins:

I don't know anything about how Nutch handles jars in its lib directory, but this sounds like you have a raw jar (tika-parsers) without its pom.xml. So then Ivy (or Maven) doesn't know about the transitive dependencies on other jars, which are needed to implement the actual parsing support. -- Ken

-- Ken Krugler http://www.scaleunlimited.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
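A common workaround for the situation Ken describes (an assumption on my part, not something proposed in this thread) is to install the locally built snapshot jar together with its pom.xml, so the transitive dependencies become resolvable again. With Maven that looks roughly like:

    mvn install:install-file \
      -Dfile=tika-parsers-1.1-SNAPSHOT.jar \
      -DpomFile=tika-parsers/pom.xml \
      -DgroupId=org.apache.tika \
      -DartifactId=tika-parsers \
      -Dversion=1.1-SNAPSHOT \
      -Dpackaging=jar

Running mvn install from the tika-parsers module achieves the same end, since it publishes both the jar and its POM to the local repository, where Maven, and Ivy if its settings include a local Maven resolver, can pick them up.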