[jira] [Created] (NUTCH-1331) limit crawler to defined depth

2012-04-11 Thread behnam nikbakht (Created) (JIRA)
limit crawler to defined depth
--

 Key: NUTCH-1331
 URL: https://issues.apache.org/jira/browse/NUTCH-1331
 Project: Nutch
  Issue Type: New Feature
  Components: generator, parser, storage
Affects Versions: 1.4
Reporter: behnam nikbakht


There is a need to limit the crawler to a defined depth. The importance of this 
option is to avoid crawling the infinite loops of dynamically generated URLs 
that occur on some sites, and to help the crawler select the important URLs.
One option is to define an iteration limit on the generate/fetch/parse/updatedb 
cycle, but that works only if, in each cycle, all unfetched URLs actually 
become fetched (without recrawling them, and with some other considerations).
Instead, we can define a new parameter in CrawlDatum, named depth, compute the 
depth of a link after parse (much as the OPIC scoring algorithm propagates 
scores), and have generate select only URLs whose depth is within the limit.
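A minimal sketch of the idea, assuming the depth is carried in the CrawlDatum 
metadata map under a key chosen here purely for illustration ("_depth_" is not 
an existing Nutch key):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

/** Sketch only: track link depth in CrawlDatum metadata. */
public class DepthUtil {
  private static final Text DEPTH_KEY = new Text("_depth_"); // hypothetical key

  public static int getDepth(CrawlDatum datum) {
    IntWritable d = (IntWritable) datum.getMetaData().get(DEPTH_KEY);
    return d == null ? 0 : d.get();
  }

  /** After parse: an outlink is one level deeper than its parent. */
  public static void setChildDepth(CrawlDatum parent, CrawlDatum child) {
    child.getMetaData().put(DEPTH_KEY, new IntWritable(getDepth(parent) + 1));
  }

  /** In generate: select only URLs within the configured limit. */
  public static boolean withinLimit(CrawlDatum datum, int maxDepth) {
    return getDepth(datum) <= maxDepth;
  }
}

Generate would then keep only entries for which withinLimit() returns true.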






[jira] [Created] (NUTCH-1329) parser does not extract outlinks to external web sites

2012-04-04 Thread behnam nikbakht (Created) (JIRA)
parser does not extract outlinks to external web sites
-

 Key: NUTCH-1329
 URL: https://issues.apache.org/jira/browse/NUTCH-1329
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht


Found a bug in 
/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: 
outlinks such as www.example2.com found on www.example1.com are inserted as 
www.example1.com/www.example2.com.
I corrected this bug by testing whether the outlink (www.example2.com) is a 
valid absolute URL on its own; if not, it is resolved against its base URL.
So I replaced these lines:
URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));

with:
String hostTemp = null;
try {
  // succeeds only when "target" is already an absolute URL
  hostTemp = URLUtil.getDomainName(new URL(target));
} catch (Exception e) {
  hostTemp = null;
}
URL url = null;
if (hostTemp == null) {
  // internal outlink: resolve it against the base URL
  url = URLUtil.resolveURL(base, target);
} else {
  // external outlink: use it as-is
  url = new URL(target);
}
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));






[jira] [Created] (NUTCH-1328) a problem with regex-normalize.xml

2012-04-02 Thread behnam nikbakht (Created) (JIRA)
a problem with regex-normalize.xml
--

 Key: NUTCH-1328
 URL: https://issues.apache.org/jira/browse/NUTCH-1328
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht


There is a regex pattern in regex-normalize.xml:
<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
that removes session IDs from URLs. But some sites, like:
http://www.mehrnews.com/fa
have URLs such as:
http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539
and with this pattern such a URL is converted to an invalid URL:
http://www.mehrnews.com/fa/newsdetail.aspx?New
("NewsID=1567539" ends in "sID=1567539", which matches the case-insensitive 
"sid" alternative).
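A small demonstration of the false positive (assuming the stock substitution 
for this rule is $4, which keeps only the trailing delimiter):

public class SessionIdRegexDemo {
  public static void main(String[] args) {
    // the pattern from regex-normalize.xml, with &amp; decoded to &
    String pattern =
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)";
    String url = "http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539";
    // "NewsID=" ends in "sID=", so the (?i)sid alternative matches
    System.out.println(url.replaceAll(pattern, "$4"));
    // prints: http://www.mehrnews.com/fa/newsdetail.aspx?New
  }
}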





[jira] [Created] (NUTCH-1309) fetch queue management

2012-03-12 Thread behnam nikbakht (Created) (JIRA)
fetch queue management
--

 Key: NUTCH-1309
 URL: https://issues.apache.org/jira/browse/NUTCH-1309
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht


When fetch runs in Hadoop with multiple concurrent mappers, there are multiple 
independent fetchQueues, which makes them hard to manage. I suggest 
constructing the fetchQueues before the run begins, with this line:
feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
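An illustrative sketch (not the actual Nutch classes) of the shared-queue 
idea: a single feeder fills one set of per-host queues that all fetch threads 
draw from, instead of each mapper managing its own independent queues:

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Sketch only: one shared set of per-host fetch queues. */
class SharedFetchQueues {
  private final Map<String, Queue<String>> byHost =
      new ConcurrentHashMap<String, Queue<String>>();

  /** Called by the feeder before the fetch threads start. */
  void add(String host, String url) {
    byHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<String>())
        .add(url);
  }

  /** Called concurrently by every fetch thread. */
  String poll(String host) {
    Queue<String> q = byHost.get(host);
    return q == null ? null : q.poll();
  }
}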






[jira] [Created] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage

2012-03-06 Thread behnam nikbakht (Created) (JIRA)
Fetcher to skip queues for URLS getting repeated exceptions, based on percentage


 Key: NUTCH-1303
 URL: https://issues.apache.org/jira/browse/NUTCH-1303
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht


As described in https://issues.apache.org/jira/browse/NUTCH-769, it is a good 
solution to skip queues with a high exception count, but it is not easy to set 
a value for fetcher.max.exceptions.per.queue when the sizes of the queues 
differ.
I suggest defining a ratio instead of an absolute value, so that a queue is 
cleared when its ratio of exceptions to requests exceeds the threshold.
Also, that alone is not sufficient to protect the fetcher from high exception 
rates: fetcher.throughput.threshold.pages ensures that a worthwhile fetch 
throughput is maintained against slow hosts, but when it triggers it clears 
all queues, not just the slow ones. I suggest that this factor, like 
fetcher.max.exceptions.per.queue, be enforced per queue rather than across 
all of them.
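A minimal sketch of the per-queue ratio check (the ratio threshold and the 
minimum-sample guard are illustrative assumptions, not existing Nutch 
properties):

/** Sketch only: clear a fetch queue when its exception ratio gets too high. */
class QueueExceptionTracker {
  private int requests;
  private int exceptions;

  void recordRequest() { requests++; }
  void recordException() { exceptions++; }

  /** e.g. maxRatio = 0.3f: drop the queue once 30% of its requests failed. */
  boolean shouldClear(float maxRatio, int minRequests) {
    // require a minimum sample so one early failure does not kill the queue
    return requests >= minRequests
        && (float) exceptions / requests > maxRatio;
  }
}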





[jira] [Created] (NUTCH-1297) it is better for FetchItemQueues to select items from larger queues first

2012-03-03 Thread behnam nikbakht (Created) (JIRA)
it is better for FetchItemQueues to select items from larger queues first
--

 Key: NUTCH-1297
 URL: https://issues.apache.org/jira/browse/NUTCH-1297
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht


There is a situation where, if we fetch from multiple hosts and the sizes of 
the hosts differ, the large hosts see a long delay until getFetchItem() in the 
FetchItemQueues class selects a URL from them, so we can give them more 
priority.
For example, with 10 URLs from host1, 1000 URLs from host2, and 5 threads: if 
all threads first select from host1, the fetch takes longer than if the 
threads first selected from host2 and only picked from host1 while host2 was 
busy.
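An illustrative sketch of the selection rule (not the actual FetchItemQueues 
code): among the non-empty queues, pick the one with the largest backlog first:

import java.util.List;
import java.util.Queue;

/** Sketch only: prefer the largest eligible queue. */
class LargestQueueFirst {
  static Queue<String> pick(List<Queue<String>> queues) {
    Queue<String> best = null;
    for (Queue<String> q : queues) {
      if (!q.isEmpty() && (best == null || q.size() > best.size())) {
        best = q; // the largest backlog goes first
      }
    }
    return best; // null when every queue is empty
  }
}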





[jira] [Created] (NUTCH-1288) Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages

2012-02-21 Thread behnam nikbakht (Created) (JIRA)
Generator should not generate filtered, not-found, denied, gone, or 
permanently-moved pages
--

 Key: NUTCH-1288
 URL: https://issues.apache.org/jira/browse/NUTCH-1288
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator
Affects Versions: 1.4
Reporter: behnam nikbakht


The generator should not generate filtered, not-found, denied, gone, or 
permanently-moved pages.
In the shouldFetch method of AbstractFetchSchedule, the CrawlDatum must be 
checked against the special fetch states, such as not found, so that those 
pages are not generated again.
To do this we can add a status to CrawlDatum that marks invalid URLs, and set 
that status during fetch.
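A sketch of the check, assuming the existing CrawlDatum DB status codes are 
used instead of a brand-new status (a real patch would set these during 
updatedb from the corresponding fetch statuses):

import org.apache.nutch.crawl.CrawlDatum;

/** Sketch only: skip URLs whose last fetch marked them invalid. */
class GenerateFilter {
  static boolean shouldGenerate(CrawlDatum datum) {
    switch (datum.getStatus()) {
      case CrawlDatum.STATUS_DB_GONE:       // not found / denied / gone
      case CrawlDatum.STATUS_DB_REDIR_PERM: // permanently moved
        return false;
      default:
        return true;
    }
  }
}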





[jira] [Created] (NUTCH-1282) linkdb scalability

2012-02-19 Thread behnam nikbakht (Created) (JIRA)
linkdb scalability
--

 Key: NUTCH-1282
 URL: https://issues.apache.org/jira/browse/NUTCH-1282
 Project: Nutch
  Issue Type: Improvement
  Components: linkdb
Affects Versions: 1.4
Reporter: behnam nikbakht


As described in NUTCH-1054, the linkdb is optional in solrindex; it is used 
only for anchors and has no impact on scoring.
It seems that the size of the linkdb grows very fast in incremental crawls, 
which makes it unscalable for huge web sites.
So there are two choices: one, drop invertlinks and the linkdb from the crawl; 
two, make them scalable.
invertlinks consists of two jobs: the first constructs a new linkdb from the 
newly parsed segments, and the second merges the new linkdb with the old one. 
The second job is the unscalable one, and we can drop it with this change in 
solrindex:
in the IndexerMapReduce class, reduce method: if fetchDatum == null or dbDatum 
== null or parseText == null or parseData == null, then add the anchors to the 
doc and update Solr (an update, not an insert).
Some changes would also be required in NutchDocument.






[jira] [Created] (NUTCH-1281) tika parser does not work properly with unwanted file types that pass through the filters in nutch

2012-02-18 Thread behnam nikbakht (Created) (JIRA)
tika parser does not work properly with unwanted file types that pass through 
the filters in nutch


 Key: NUTCH-1281
 URL: https://issues.apache.org/jira/browse/NUTCH-1281
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: behnam nikbakht


When this mapping is set in parse-plugins.xml:
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
all unwanted files that pass through all the filters are referred to Tika.
But for some file types, such as .flv, the Tika parser has problems: it hangs 
and causes the parse job to fail.
So if these file types get past regex-urlfilter and the other filters, the 
parse job fails.
For this problem I suggest adding a property listing the valid file types, and 
using code like this in TikaParser.java:


public ParseResult getParse(Content content) {
    String mimeType = content.getContentType();

+   String[] validTypes = new String[] {
+       "application/pdf", "application/x-tika-msoffice",
+       "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text",
+       "text/plain", "application/rtf", "application/rss+xml",
+       "application/x-bzip2", "application/x-gzip",
+       "application/x-javascript", "application/javascript",
+       "text/javascript", "application/x-shockwave-flash",
+       "application/zip", "text/xml", "application/xml" };
+   boolean valid = false;
+   for (int k = 0; k < validTypes.length; k++) {
+     if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
+       valid = true;
+   }
+   if (!valid)
+     return new ParseStatus(ParseStatus.NOTPARSED,
+         "Can't parse for unwanted filetype " + mimeType)
+         .getEmptyParseResult(content.getUrl(), getConf());

    URL base;




[jira] [Created] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-14 Thread behnam nikbakht (Created) (JIRA)
Fetch Improvement in threads per host
-

 Key: NUTCH-1278
 URL: https://issues.apache.org/jira/browse/NUTCH-1278
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht


The value of maxThreads is equal to fetcher.threads.per.host and is constant 
for every host.
There is an opportunity to use a dynamic value for every host, influenced by 
the number of blocked requests: if the number of blocked requests for one host 
increases, we should decrease this value and increase http.timeout.
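An illustrative sketch of such an adaptation rule (the thresholds and the 
halving policy are invented for illustration, not taken from Nutch):

import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: shrink a host's thread cap as blocked requests accumulate. */
class AdaptiveHostThreads {
  private final int configuredMax;                // fetcher.threads.per.host
  private final AtomicInteger blocked = new AtomicInteger();

  AdaptiveHostThreads(int configuredMax) {
    this.configuredMax = configuredMax;
  }

  void recordBlocked() {
    blocked.incrementAndGet();
  }

  /** Halve the cap for every 10 blocked requests, but keep at least 1 thread. */
  int currentMax() {
    int halvings = Math.min(blocked.get() / 10, 30);
    return Math.max(1, configuredMax >> halvings);
  }
}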





[jira] [Created] (NUTCH-1269) Generate main problems

2012-02-08 Thread behnam nikbakht (Created) (JIRA)
Generate main problems
--

 Key: NUTCH-1269
 URL: https://issues.apache.org/jira/browse/NUTCH-1269
 Project: Nutch
  Issue Type: Improvement
  Components: generator
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht


There are some problems with the current generate method when the 
maxNumSegments and maxHostCount options are used:
1. the generated segments differ in size
2. with the maxHostCount option, it is unclear whether the limit was actually 
applied
3. URLs from one host are distributed non-uniformly across the segments
We changed Generator.java as described below:
in the Selector class:

private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void configure(JobConf job) {
...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt(GENERATE_MAX_PER_HOST, 100);
...
}

public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
    OutputCollector<FloatWritable, SelectorEntry> output, Reporter reporter)
    throws IOException {
  while (values.hasNext()) {
    if (count == limit)
      break;
    // rotate through the segments so they fill at the same rate
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1)
        currentsegmentnum++;
      else
        currentsegmentnum = 0;
    }

    // stop once every segment is full
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      if (segCounts[jk] < segmentSize) {
        full = false;
      }
    }
    if (full) {
      break;
    }
    SelectorEntry entry = values.next();
    Text url = entry.url;
    // logWrite("Generated3:" + limit + "-" + count + "-" + url.toString());
    String urlString = url.toString();
    URL u = null;
    String hostordomain = null;
    try {
      if (normalise && normalizers != null) {
        urlString = normalizers.normalize(urlString,
            URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
      }
      u = new URL(urlString);
      if (byDomain) {
        hostordomain = URLUtil.getDomainName(u);
      } else {
        hostordomain = new URL(urlString).getHost();
      }
      hostordomain = hostordomain.toLowerCase();

      // hostCount {a,b,c,d} means this host has a urls in segment 0,
      // b urls in segment 1, and so on
      int[] hostCount = hostCounts.get(hostordomain);
      if (hostCount == null) {
        hostCount = new int[maxNumSegments];
        for (int kl = 0; kl < hostCount.length; kl++)
          hostCount[kl] = 0;
        hostCounts.put(hostordomain, hostCount);
      }
      // place the url in the segment holding the fewest urls from this host
      int selectedSeg = currentsegmentnum;
      int minCount = hostCount[selectedSeg];
      for (int jk = 0; jk < maxNumSegments; jk++) {
        if (hostCount[jk] < minCount) {
          minCount = hostCount[jk];
          selectedSeg = jk;
        }
      }
      if (hostCount[selectedSeg] < maxHostCount) {
        count++;
        entry.segnum = new IntWritable(selectedSeg);
        hostCount[selectedSeg]++;
        output.collect(key, entry);
      }
    } catch (Exception e) {
      logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
      LOG.warn("Malformed URL: '" + urlString + "', skipping ("
          + StringUtils.stringifyException(e) + ")");
    }
  }
}






[jira] [Created] (NUTCH-1270) some Deflate-encoded pages are not fetched

2012-02-08 Thread behnam nikbakht (Created) (JIRA)
some Deflate-encoded pages are not fetched
-

 Key: NUTCH-1270
 URL: https://issues.apache.org/jira/browse/NUTCH-1270
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.4
 Environment: software
Reporter: behnam nikbakht


There is a problem with some web pages that are fetched but whose content 
cannot be retrieved.
After the following change, the error is fixed.
We changed lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
  public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {

    if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }

    byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
+   if (content == null)
+     content = DeflateUtils.inflateBestEffort(compressed, 20);

    if (content == null)
      throw new IOException("inflateBestEffort returned null");

    if (LOGGER.isTraceEnabled()) {
      LOGGER.trace("fetched " + compressed.length
          + " bytes of compressed content (expanded to "
          + content.length + " bytes) from " + url);
    }
    return content;
  }





[jira] [Created] (NUTCH-1204) not all pages are parsed

2011-11-11 Thread behnam nikbakht (Created) (JIRA)
not all pages are parsed
---

 Key: NUTCH-1204
 URL: https://issues.apache.org/jira/browse/NUTCH-1204
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
Reporter: behnam nikbakht
Priority: Critical


When we fetch a site in multiple segments and dump the crawldb with readdb, 
the output says that some pages are unfetched; when we check, we find that 
these pages were fetched and stored but never parsed.
We tried crawling a site with only HTML pages, adjusted suffix-urlfilter.txt 
and the parser.timeout property, tested it, and found that still only some of 
the HTML pages were parsed.
This is a critical situation for performance: the fetching of these sites 
works well, but failing to parse them causes the same sites to be refetched in 
later iterations.





[jira] [Created] (NUTCH-1199) unfetched URLs problem

2011-11-07 Thread behnam nikbakht (Created) (JIRA)
unfetched URLs problem
--

 Key: NUTCH-1199
 URL: https://issues.apache.org/jira/browse/NUTCH-1199
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, generator
Reporter: behnam nikbakht
Priority: Critical


We wrote a script to fetch the unfetched URLs:

#first, dump the crawldb to a text file and extract the unfetched urls:
bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
cat $SITE_DIR/tmp/dump_urls.txt/part-0 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
unfetched_urls_file=$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt
#field 2 of the quoted csv dump is the url
cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file

unfetched_count=`cat $unfetched_urls_file | wc -l`
#next, we have a list of unfetched urls in unfetched_urls.txt; we use the
#freegen command to create segments for these urls (we cannot use generate,
#because these urls were already generated previously)
if [[ $unfetched_count -lt $it_size ]]
then
    echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
    ((J++))
    bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
    s2=`ls -d $crawlseg/2* | tail -1`
    bin/nutch fetch $s2
    bin/nutch parse $s2
    bin/nutch updatedb $crawldb $s2
    echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
    get_new_links
    exit
fi
#if the number of urls is greater than it_size, split them into packages
ij=1
while read line
do
    let ind=$ij/$it_size
    mkdir -p $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
    echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
    echo $ind
    ((ij++))
    let completed=$ij%$it_size
    if [[ $completed -eq 0 ]]
    then
        echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
        ((J++))
        bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
        #finally, fetch, parse and updatedb the new segment
        s2=`ls -d $crawlseg/2* | tail -1`
        bin/nutch fetch $s2
        bin/nutch parse $s2
        rm $crawldb/.locked
        bin/nutch updatedb $crawldb $s2
        echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
    fi
done < $unfetched_urls_file

