[jira] [Created] (NUTCH-1269) Generate main problems

2012-02-08 Thread behnam nikbakht (Created) (JIRA)
Generate main problems
--

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht


There are some problems with the current Generate method when the maxNumSegments and 
maxHostCount options are used:
1. The sizes of the generated segments differ.
2. With the maxHostCount option, it is unclear whether the limit was actually applied.
3. URLs from one host are distributed non-uniformly between segments.
We changed Generator.java as described below:
In the Selector class:

private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void configure(JobConf job) {
  ...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
  ...
public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    // advance to the next segment once the current one has received its share
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1) {
        currentsegmentnum++;
      } else {
        currentsegmentnum = 0;
      }
    }

    // stop when every segment has reached its target size
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }

    SelectorEntry entry = values.next();
    Text url = entry.url;
    // logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }

      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }

      hostordomain = hostordomain.toLowerCase();

      boolean countLimit = true;
      // only filter if we are counting hosts or domains
      int[] hostCount = hostCounts.get(hostordomain);
      // hostCount {a,b,c,d} means this host has a URLs in segment 0,
      // b URLs in segment 1, and so on
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }

      // pick the segment that currently holds the fewest URLs from this host
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      // emit the entry only while the host is still under its per-segment limit
      if (hostCount[selectedSeg] <= maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }

    } catch (Exception e) {
      logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
      // continue;
    }
  }
}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1270) some of Deflate encoded pages not fetched

2012-02-08 Thread behnam nikbakht (Created) (JIRA)
some of Deflate encoded pages not fetched
-

 Key: NUTCH-1270
 URL: https://issues.apache.org/jira/browse/NUTCH-1270
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht


There is a problem with some web pages: they are fetched, but their content cannot be 
retrieved.
After the change below, this error is fixed.
We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
  public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {

    if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }

    byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
    // added lines: retry with a different size limit when the first attempt returns null
+   if (content == null)
+     content = DeflateUtils.inflateBestEffort(compressed, 20);

    if (content == null)
      throw new IOException("inflateBestEffort returned null");

    if (LOGGER.isTraceEnabled()) {
      LOGGER.trace("fetched " + compressed.length
          + " bytes of compressed content (expanded to "
          + content.length + " bytes) from " + url);
    }
    return content;
  }
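
For context, here is a minimal sketch of a best-effort inflate helper using
java.util.zip.Inflater (an illustrative assumption about what
DeflateUtils.inflateBestEffort roughly does, not the actual Nutch code): it
inflates as many bytes as it can, up to a limit, and returns null only when
nothing at all could be recovered.

import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public final class InflateSketch {

  // Hypothetical helper: inflate as much as possible, up to maxBytes,
  // returning null only if nothing could be recovered.
  static byte[] inflateBestEffort(byte[] compressed, int maxBytes) {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!inflater.finished() && out.size() < maxBytes) {
        int n = inflater.inflate(buf);
        if (n == 0) break; // needs more input or a preset dictionary
        out.write(buf, 0, Math.min(n, maxBytes - out.size()));
      }
    } catch (DataFormatException e) {
      // best effort: keep whatever was inflated before the error
    } finally {
      inflater.end();
    }
    return out.size() > 0 ? out.toByteArray() : null;
  }
}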





[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203453#comment-13203453
 ] 

Lewis John McGibbney commented on NUTCH-1269:
-

Hi Behnam. Can you please package the above code as a patch against 1.5 (trunk)? 
That way we can try it when we get time. Thank you.






[jira] [Created] (NUTCH-1271) Fix errors @ compile time

2012-02-08 Thread Lewis John McGibbney (Created) (JIRA)
Fix errors @ compile time
-

 Key: NUTCH-1271
 URL: https://issues.apache.org/jira/browse/NUTCH-1271
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: nutchgora, 1.5


After adding the -Xlint commands to build.xml, we see many errors when 
compiling. These should be fixed.





[jira] [Updated] (NUTCH-1269) Generate main problems

2012-02-08 Thread behnam nikbakht (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

behnam nikbakht updated NUTCH-1269:
---

Attachment: NUTCH-1269.patch






[jira] [Updated] (NUTCH-1269) Generate main problems

2012-02-08 Thread behnam nikbakht (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

behnam nikbakht updated NUTCH-1269:
---

Patch Info: Patch Available

Yes, thanks for your attention.






[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203488#comment-13203488
 ] 

Markus Jelsma commented on NUTCH-1269:
--

It won't apply against trunk; all hunks fail. Anyway, this issue looks like 
NUTCH-1074: segment sizes are uniform and the correct number of records per 
queue ends up in a segment. I think this duplicates NUTCH-1074, which was fixed 
for 1.4. Which Nutch version are you using, Behnam?






[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203494#comment-13203494
 ] 

behnam nikbakht commented on NUTCH-1269:


I am using Nutch 1.3, and I know about NUTCH-1074. With the uploaded patch, URLs per 
host are distributed uniformly between segments: for example, if there are 100 URLs 
from host a and 4 segments, each segment gets 25 URLs from host a.
Using multiple reducers in the selector causes problems with segment sizes, and 
setting the number of reducers for this job to 1 does not hurt performance.
If we delete the variable full, there is no limit on segment size after the map.
This is the problem we had with the host count: it stops generating URLs from some 
hosts once all segments are full.
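
A minimal, self-contained sketch of the distribution idea described above (the class
name and host are invented for illustration; this is not code from the patch): each
URL goes to whichever segment currently holds the fewest URLs from its host, so 100
URLs from one host spread evenly over 4 segments.

import java.util.HashMap;
import java.util.Map;

public class SpreadSketch {
  public static void main(String[] args) {
    int maxNumSegments = 4;
    Map<String, int[]> hostCounts = new HashMap<String, int[]>();

    for (int i = 0; i < 100; i++) {
      String host = "a.example.com";
      int[] counts = hostCounts.get(host);
      if (counts == null) {
        counts = new int[maxNumSegments];
        hostCounts.put(host, counts);
      }
      // pick the segment with the fewest URLs from this host so far
      int selectedSeg = 0;
      for (int s = 1; s < maxNumSegments; s++) {
        if (counts[s] < counts[selectedSeg]) selectedSeg = s;
      }
      counts[selectedSeg]++;
    }

    // prints: 25 25 25 25
    for (int c : hostCounts.get("a.example.com")) System.out.print(c + " ");
    System.out.println();
  }
}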


[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203501#comment-13203501
 ] 

Markus Jelsma commented on NUTCH-1269:
--

Ah, yes, I understand now. Your patch is an attempt to spread the host (or 
domain) limit over all generated segments. Interesting. Can you provide a patch 
that works with trunk, with this feature enabled via configuration?
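
A minimal sketch of how such a switch could be read in the Selector's configure()
method (the property name below is invented for illustration and is not an existing
Nutch setting):

import org.apache.hadoop.mapred.JobConf;

public class SpreadFlagSketch {
  private boolean spreadHostCountOverSegments;

  public void configure(JobConf job) {
    // hypothetical flag: fall back to the existing behaviour when it is off
    spreadHostCountOverSegments =
        job.getBoolean("generate.spread.host.count.per.segment", false);
  }
}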






[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched

2012-02-08 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203515#comment-13203515
 ] 

Lewis John McGibbney commented on NUTCH-1270:
-

Hi Behnam, again thanks for opening this ticket, but could you possibly patch 
this against trunk (1.5)? Thank you.






Fwd: Mandatory svnpubsub migration by Jan 2013

2012-02-08 Thread Lewis John Mcgibbney
Hi,

Can anyone comment where we lie with this?

I really don't have a clue.

Thanks

Lewis

-- Forwarded message --
From: Joe Schaefer joe_schae...@yahoo.com
Date: Wed, Feb 8, 2012 at 12:26 PM
Subject: Mandatory svnpubsub migration by Jan 2013
To: Apache Infrastructure infrastruct...@apache.org


[PLEASE DO NOT RESPOND TO THIS POST! DIRECT ALL FURTHER
INQUIRIES TO infrastruct...@apache.org]

FYI: infrastructure policy regarding website hosting has
changed as of November 2011: we are requiring all websites
and dist/ dirs to be svnpubsub or ASF CMS backed by the end of 2012.
If your PMC has already met this requirement congratulations,
you can ignore the remainder of this post.

As stated on http://www.apache.org/dev/project-site.html#svnpubsub
we are migrating our webserver infrastructure to 100% svnpubsub
over the course of 2012.  If your site does not currently make
use of this technology, it is time to consider a migration effort,
as rsync-based sites will be PERMANENTLY FROZEN in Jan 2013 due
to infra disabling the hourly rsync jobs.  While we recommend
migrating to the ASF CMS [0] for Anakia based or Confluence based
sites, and have provided tooling [1] to help facilitate this,
we are only mandating svnpubsub (which the CMS uses itself).

svnpubsub is a client-server system whereby a client watches an
svn working copy for relevant commit notifications from the svn
server.  It subsequently runs svn up on the working copy, bringing
in the relevant changes.  sites that use static build technologies
that commit the build results to svn are naturally compatible with
svnpubsub; simply file a JIRA ticket with INFRA to request a
migration: any commits to the resulting build tree will be
instantly picked up on the live site.


The CMS is a more elaborate system based on svnpubsub which
provides a webgui for convenient online editing.  Dozens of
sites have already successfully deployed using the CMS and
are quite happy with the results.  The system is sufficiently
flexible to accommodate a wide variety of choices regarding
templating systems and storage formats, but most sites have
standardized on the combination of Django and Markdown.  Talk
to infra if you would like to use the CMS in this or some other
fashion, we'll see what we can do.


NOTE: the policy for dist/ dirs for managing project releases is
similar.  We have setup a dedicated svn server for handling this,
please contact infra when you are ready to start using it.


HTH


[0]: http://www.apache.org/dev/cms
[1]: https://svn.apache.org/repos/infra/websites/cms/conversion-utilities/




-- 
*Lewis*


tika-core, tika-parser

2012-02-08 Thread Markus Jelsma
Hi,

Can anyone shed light on this? We don't have any parsers in our libs dir and 
we don't have a tika-parsers jar, only the tika-core jar. Where are the parsers 
and how does this all work? 

I've posted a question (same subject) on the Tika list and Nick tells me there 
must be parsers somewhere. Well, I have no idea how we do it in Nutch, do you?

Thanks


Re: tika-core, tika-parser

2012-02-08 Thread Lewis John Mcgibbney
Hi Markus,

For starters

http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup

Can we pick our way through this?

Thanks





-- 
*Lewis*


Re: Mandatory svnpubsub migration by Jan 2013

2012-02-08 Thread Julien Nioche
The Nutch site is already based on svnpubsub.




-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: tika-core, tika-parser

2012-02-08 Thread Julien Nioche
The dependencies for the plugins are defined locally, as shown in the ivy.xml
URL above, where you can see the ref to tika-parsers for parse-tika. Is that
more clear for you, Markus?




-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: tika-core, tika-parser

2012-02-08 Thread Markus Jelsma
Yes, it's listed there indeed! But where are the parser impls then? I'll check 
this out. I must be getting crazy or something!


-- 
Markus Jelsma - CTO - Openindex


Re: tika-core, tika-parser

2012-02-08 Thread Markus Jelsma
Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's 
something else.

dependencies, dependencies, dependencies :(


-- 
Markus Jelsma - CTO - Openindex


Re: tika-core, tika-parser

2012-02-08 Thread Julien Nioche
Sorry, I don't understand what your issue is. We have a dependency on
tika-parsers, and the actual parser implementations (listed in the tika-parsers
POM) are pulled in transitively, just like any other dependency managed by Ivy.
They end up being copied into runtime/local/plugins/parse-tika/ or put into
the job file in runtime/deploy/.






-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: tika-core, tika-parser

2012-02-08 Thread Markus Jelsma


On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
 sorry don't understand what your issue is. We have a dependency on
 tika-parsers and the actual parser implementations (listed in tika parsers'
 POM) are pulled transitively just like any other dependency managed by Ivy.
 They end up being copied in  runtime/local/plugins/parse-tika/ or put in
 the job in runtime/deploy/

My problem is that I am working on some code for tika-parsers 1.1-SNAPSHOT 
that I need to use in Nutch. However, when I build tika-parsers and put it in 
Nutch's lib directory I still seem to be missing dependencies. Then the trouble 
begins:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.tika.parser.dwg.DWGParser
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at sun.misc.Service$LazyIterator.next(Service.java:271)
at org.apache.nutch.parse.tika.TikaConfig.init(TikaConfig.java:149)
at 
org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)

Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config 
file, which I did. But then other dependency issues come and go. The more 
parsers I remove from the config file, the better it goes, but then Tika won't 
build anymore because of failing tests.

I asked this on the Nutch list because I wasn't sure anymore how Nutch deals 
with its own deps, which you explained well.

I'll give up for now :)



 

-- 
Markus Jelsma - CTO - Openindex


Finding specific file types only -- *.ics files

2012-02-08 Thread Peter Jameson
Hi,

I'm interested in using Nutch to crawl certain websites looking for only a 
specific file type, in my case I'm looking for any url that ends with a *.ics 
construct.  I don't need to parse the ics files, I just need to know all the 
.ics files that exist.  A list of links would be great.

Can Nutch be configured to do this?

Thanks!

Pete
p...@curveos.com




Re: tika-core, tika-parser

2012-02-08 Thread Ken Krugler

On Feb 8, 2012, at 5:28am, Markus Jelsma wrote:

 
 
 On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
 sorry don't understand what your issue is. We have a dependency on
 tika-parsers and the actual parser implementations (listed in tika parsers'
 POM) are pulled transitively just like any other dependency managed by Ivy.
 They end up being copied in  runtime/local/plugins/parse-tika/ or put in
 the job in runtime/deploy/
 
 My problem is that i am working on some code for Tika-parsers 1.1-SNAPSHOT 
 that i need to use in Nutch. However, when i build tika-parsers and put it in 
 Nutch' lib directory i still seem to be missing dependencies. Then trouble 
 begins:

I don't know anything about how Nutch handles jars in its lib directory, but 
this sounds like you have a raw jar (tika-parsers) without its pom.xml.

So then Ivy (or Maven) doesn't know about the transitive dependencies on other 
jars, which are needed to implement the actual parsing support.

-- Ken

 

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions  training
Hadoop, Cascading, Mahout  Solr