[jira] [Commented] (NUTCH-1270) some of Deflate encoded pages not fetched

2012-04-03 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245259#comment-13245259
 ] 

behnam nikbakht commented on NUTCH-1270:


For example, with the site
http://www.noormags.com/view/fa/default
when we fetch the first page and dump the segment, we see that the fetch has a problem.
When I replace
byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
with
byte[] content = DeflateUtils.inflateBestEffort(compressed, 999);
it works.
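
A rough sketch of a less hard-coded variant: the fallback limit for the second inflate attempt could be read from the configuration instead of being fixed. The property name http.deflate.fallback.limit is invented for illustration only; getMaxContent() and DeflateUtils.inflateBestEffort() are the existing calls used in the patch below, and the snippet is meant to sit inside processDeflateEncoded().

    // Sketch: retry inflation with a configurable fallback limit.
    // "http.deflate.fallback.limit" is a hypothetical property name.
    int fallbackLimit = getConf().getInt("http.deflate.fallback.limit", 999);
    byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
    if (content == null) {
      content = DeflateUtils.inflateBestEffort(compressed, fallbackLimit);
    }
    if (content == null) {
      throw new IOException("inflateBestEffort returned null");
    }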

> some of Deflate encoded pages not fetched
> -
>
> Key: NUTCH-1270
> URL: https://issues.apache.org/jira/browse/NUTCH-1270
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.4
> Environment: software
>Reporter: behnam nikbakht
>  Labels: fetch, processDeflateEncoded
> Attachments: NUTCH-1270.patch
>
>
> There is a problem with some web pages that are fetched but whose content cannot be
> retrieved.
> After the following change, this error is fixed.
> We change lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:
>   public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
>     if (LOGGER.isTraceEnabled()) { LOGGER.trace("inflating"); }
>     byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
> +   if (content == null)
> +     content = DeflateUtils.inflateBestEffort(compressed, 20);
>     if (content == null)
>       throw new IOException("inflateBestEffort returned null");
>     if (LOGGER.isTraceEnabled()) {
>       LOGGER.trace("fetched " + compressed.length
>           + " bytes of compressed content (expanded to "
>           + content.length + " bytes) from " + url);
>     }
>     return content;
>   }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

2012-03-06 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223157#comment-13223157
 ] 

behnam nikbakht commented on NUTCH-1067:


I cannot understand why the threshold checker is disabled with
throughputThresholdPages = -1;
which causes this factor to be enforced only once.
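
For illustration, a minimal sketch of how such a self-disabling check can work (the counters are assumed fields of the fetcher and the names only follow the description below; this is not the actual patch code):

    // Sketch: per-second throughput check that switches itself off.
    // pagesLastSecond          = pages fetched during the previous second
    // throughputThresholdPages = configured minimum pages/second (-1 = disabled)
    if (throughputThresholdPages != -1
        && pagesLastSecond < throughputThresholdPages) {
      thresholdExceededCount++;
      if (thresholdExceededCount >= throughputThresholdMaxRetries) {
        LOG.warn("Throughput below threshold too often, emptying slow queues");
        // Setting the threshold to -1 means the guard above never matches again,
        // so the check is enforced at most once per fetch job.
        throughputThresholdPages = -1;
      }
    }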

> Configure minimum throughput for fetcher
> 
>
> Key: NUTCH-1067
> URL: https://issues.apache.org/jira/browse/NUTCH-1067
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4
>
> Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, 
> NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>
> Large fetches can contain a lot of URLs for the same domain. These can be
> very slow to crawl due to politeness from robots.txt, e.g. 10s per URL. If
> all other URLs have been fetched, these queues can stall the entire
> fetcher; 60 URLs can then take 10 minutes or even more. This can usually be
> dealt with using the time bomb, but the time bomb value is hard to determine.
> This patch adds a fetcher.throughput.threshold setting, meaning the minimum
> number of pages per second before the fetcher gives up. It doesn't use the
> global number of pages / running time but records the actual pages processed
> in the previous second. This value is compared with the configured threshold.
> Besides the check, the fetcher's status is also updated with the actual number
> of pages per second and bytes per second.





[jira] [Commented] (NUTCH-1282) linkdb scalability

2012-03-03 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221580#comment-13221580
 ] 

behnam nikbakht commented on NUTCH-1282:


Another option: when we construct the web graph to implement advanced scoring
methods, we can then extract the anchors from the inlinks in the web graph, and there
is no need for the linkdb.

> linkdb scalability
> --
>
> Key: NUTCH-1282
> URL: https://issues.apache.org/jira/browse/NUTCH-1282
> Project: Nutch
>  Issue Type: Improvement
>  Components: linkdb
>Affects Versions: 1.4
>Reporter: behnam nikbakht
>
> As described in NUTCH-1054, the linkdb is optional in solrindex; it is only
> used for anchors and has no impact on scoring.
> It seems the size of the linkdb grows very fast in an incremental crawl, which makes it
> unscalable for huge web sites.
> So there are two choices: one, drop invertlinks and the linkdb from the crawl;
> two, make them scalable.
> invertlinks runs two jobs: the first constructs a new linkdb from the newly
> parsed segments, and the second merges the new linkdb with the old one. The second
> job is the unscalable one, and we can skip it with these changes in solrindex:
> in the IndexerMapReduce class, reduce method: if fetchDatum == null or
> dbDatum == null or parseText == null or parseData == null, then add the anchors to the
> doc and update Solr (no insert).
> Some changes to NutchDocument are also required here.
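
A rough sketch of the reduce-side guard described above (illustration only, not a patch; fetchDatum, dbDatum, parseText, parseData, inlinks, key and output are assumed to be the usual locals of IndexerMapReduce.reduce(), and Inlinks.getAnchors() is used to recover the anchor texts):

    // Sketch: instead of silently dropping records that only have inlinks,
    // emit an anchors-only document that would be applied as a Solr update.
    if (fetchDatum == null || dbDatum == null
        || parseText == null || parseData == null) {
      if (inlinks == null || inlinks.size() == 0) {
        return;                              // nothing useful for this URL
      }
      NutchDocument doc = new NutchDocument();
      doc.add("id", key.toString());
      for (String anchor : inlinks.getAnchors()) {
        doc.add("anchor", anchor);           // anchors recovered from inlinks
      }
      output.collect(key, doc);              // update, not insert, on the Solr side
      return;
    }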





[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-03-03 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221568#comment-13221568
 ] 

behnam nikbakht commented on NUTCH-1278:


I edited the patch to make these changes:
A class named HostsUtil manages an XML file, named hosts_conf.xml, that maintains
permanent and temporary per-host information in a multi-threaded environment. This class
maintains variables such as the timeout for fetch, the host count for generate, etc.
These variables can be used in fetch, generate, and other parts of Nutch.
For an adaptive http.timeout in fetch, we simply change some parts of Fetcher.java
and make some changes in Protocol.java and its implementations without disturbing them.
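
HostsUtil and hosts_conf.xml are the names used in the comment above; the class below is only a sketch of the idea, not the attached patch. It keeps per-host fetch timeouts in a thread-safe map and persists them as a java.util.Properties XML file.

    import java.io.*;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: thread-safe store for per-host settings, persisted to hosts_conf.xml.
    public class HostsUtil {
      private final File file;
      private final ConcurrentHashMap<String, Integer> timeouts =
          new ConcurrentHashMap<String, Integer>();

      public HostsUtil(File file) throws IOException {
        this.file = file;
        if (file.exists()) {
          Properties props = new Properties();
          InputStream in = new FileInputStream(file);
          try {
            props.loadFromXML(in);
          } finally {
            in.close();
          }
          for (String host : props.stringPropertyNames()) {
            timeouts.put(host, Integer.valueOf(props.getProperty(host)));
          }
        }
      }

      /** Fetch timeout for a host, falling back to the global default. */
      public int getTimeout(String host, int defaultTimeout) {
        Integer t = timeouts.get(host);
        return t == null ? defaultTimeout : t.intValue();
      }

      public void setTimeout(String host, int timeoutMs) {
        timeouts.put(host, Integer.valueOf(timeoutMs));
      }

      /** Write the current table back to hosts_conf.xml. */
      public synchronized void save() throws IOException {
        Properties props = new Properties();
        for (String host : timeouts.keySet()) {
          props.setProperty(host, String.valueOf(timeouts.get(host)));
        }
        OutputStream out = new FileOutputStream(file);
        try {
          props.storeToXML(out, "per-host fetch settings");
        } finally {
          out.close();
        }
      }
    }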


> Fetch Improvement in threads per host
> -
>
> Key: NUTCH-1278
> URL: https://issues.apache.org/jira/browse/NUTCH-1278
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.4
>Reporter: behnam nikbakht
> Attachments: NUTCH-1278-v.2.zip, NUTCH-1278.zip
>
>
> The value of maxThreads is equal to fetcher.threads.per.host and is constant
> for every host.
> It would be possible to use a dynamic value for every host, influenced by the
> number of blocked requests.
> This means that if the number of blocked requests for one host increases, we
> should decrease this value and increase http.timeout.





[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2012-02-19 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211316#comment-13211316
 ] 

behnam nikbakht commented on NUTCH-1281:


The problem is that the actual mime types cannot be filtered properly until the fetch or
parse starts. There are many file types, so we cannot filter all of them up front,
and there may be bugs in the Tika parser for some file types.
So we can filter them in TikaParser against a list of valid file types.

> tika parser not work properly with unwanted file types that passed from 
> filters in nutch
> 
>
> Key: NUTCH-1281
> URL: https://issues.apache.org/jira/browse/NUTCH-1281
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: behnam nikbakht
>
> When this property is set in parse-plugins.xml:
>
>
>
> all unwanted files that pass all the filters are referred to Tika.
> But for some file types, like .flv, the Tika parser has problems and hangs,
> causing the parse job to fail.
> If these file types pass regex-urlfilter and the other filters, the parse job fails.
> For this problem I suggest adding a property for valid file types and
> using code like this in TikaParser.java:
> public ParseResult getParse(Content content) {
>   String mimeType = content.getContentType();
> + String[] validTypes = new String[] {
> +     "application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml",
> +     "application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
> +     "application/rss+xml", "application/x-bzip2", "application/x-gzip",
> +     "application/x-javascript", "application/javascript", "text/javascript",
> +     "application/x-shockwave-flash", "application/zip", "text/xml",
> +     "application/xml" };
> + boolean valid = false;
> + for (int k = 0; k < validTypes.length; k++) {
> +   if (validTypes[k].compareTo(mimeType.toLowerCase()) == 0)
> +     valid = true;
> + }
> + if (!valid)
> +   return new ParseStatus(ParseStatus.NOTPARSED,
> +       "Can't parse for unwanted filetype " + mimeType)
> +       .getEmptyParseResult(content.getUrl(), getConf());
>
>   URL base;
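
As a variation on the snippet above, the whitelist could come from a configuration property instead of being hard-coded. This is only a sketch meant to sit inside getParse() after mimeType is read; the property name tika.parse.valid.mimetypes is invented for illustration, while Configuration.getStrings() is the standard Hadoop call for comma-separated values.

    // Sketch: configurable whitelist of mime types handled by parse-tika.
    // "tika.parse.valid.mimetypes" is a hypothetical property name.
    java.util.Set<String> validTypes = new java.util.HashSet<String>();
    for (String type : getConf().getStrings("tika.parse.valid.mimetypes", new String[0])) {
      validTypes.add(type.trim().toLowerCase());
    }
    if (!validTypes.isEmpty() && !validTypes.contains(mimeType.toLowerCase())) {
      return new ParseStatus(ParseStatus.NOTPARSED,
          "Can't parse for unwanted filetype " + mimeType)
          .getEmptyParseResult(content.getUrl(), getConf());
    }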





[jira] [Commented] (NUTCH-1278) Fetch Improvement in threads per host

2012-02-19 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211294#comment-13211294
 ] 

behnam nikbakht commented on NUTCH-1278:


Here is a preliminary patch with some changes in Fetcher.java, Protocol.java,
and its plugins such as lib-http.
I use a file on the local filesystem to maintain a hashtable that maps hosts
to their http.timeout.
For each blocked response the timeout is incremented, and for each success it is
decremented.
We can use different increment and decrement rates to balance the total time of the
fetch job against the ratio of fetched to blocked requests. For example, it could be
made configurable so that if 90% of the requests for some host succeed, there is no
need to increase the timeout.
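
For illustration, a minimal sketch of such an adaptive per-host timeout table. The step sizes and bounds are placeholders, not values from the attached patch, and the persistence to a local file mentioned above is omitted.

    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: per-host http.timeout that grows on blocked responses
    // and shrinks on successful ones, within fixed bounds.
    public class AdaptiveTimeouts {
      private static final int MIN_MS = 10000, MAX_MS = 60000;  // placeholder bounds
      private static final int UP_MS = 5000, DOWN_MS = 1000;    // placeholder steps
      private final ConcurrentHashMap<String, Integer> timeouts =
          new ConcurrentHashMap<String, Integer>();

      public int get(String host, int defaultMs) {
        Integer t = timeouts.get(host);
        return t == null ? defaultMs : t.intValue();
      }

      public void onBlocked(String host, int defaultMs) {
        // a real implementation would make this read-modify-write atomic
        timeouts.put(host, Integer.valueOf(Math.min(MAX_MS, get(host, defaultMs) + UP_MS)));
      }

      public void onSuccess(String host, int defaultMs) {
        timeouts.put(host, Integer.valueOf(Math.max(MIN_MS, get(host, defaultMs) - DOWN_MS)));
      }
    }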

> Fetch Improvement in threads per host
> -
>
> Key: NUTCH-1278
> URL: https://issues.apache.org/jira/browse/NUTCH-1278
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Affects Versions: 1.4
>Reporter: behnam nikbakht
> Attachments: NUTCH-1278.zip
>
>
> The value of maxThreads is equal to fetcher.threads.per.host and is constant
> for every host.
> It would be possible to use a dynamic value for every host, influenced by the
> number of blocked requests.
> This means that if the number of blocked requests for one host increases, we
> should decrease this value and increase http.timeout.





[jira] [Commented] (NUTCH-1269) Generate main problems

2012-02-08 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13203494#comment-13203494
 ] 

behnam nikbakht commented on NUTCH-1269:


I am using Nutch 1.3 and I know about NUTCH-1074. In the uploaded patch, the URLs per
host are distributed uniformly between segments; for example, if there are 100 URLs
from host a and 4 segments, there are 25 URLs from host a in each segment.
Using multiple reducers in the selector causes problems with segment size, and setting
the number of reducers of this job to 1 has no effect on performance.
If we delete the variable full, we can say that there is no limit on segments after
the map.
The remaining problem is with the host count: it stops URLs from being generated for
some hosts once all segments are full.

> Generate main problems
> --
>
> Key: NUTCH-1269
> URL: https://issues.apache.org/jira/browse/NUTCH-1269
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.4
> Environment: software
>Reporter: behnam nikbakht
>  Labels: Generate, MaxHostCount, MaxNumSegments
> Attachments: NUTCH-1269.patch
>
>
> There are some problems with the current generate method, with the maxNumSegments and
> maxHostCount options:
> 1. the sizes of the generated segments differ
> 2. with the maxHostCount option, it is unclear whether it was applied or not
> 3. URLs from one host are distributed non-uniformly between segments
> We change Generator.java as described below:
> in the Selector class:
> private int maxNumSegments;
> private int segmentSize;
> private int maxHostCount;
> public void configure(JobConf job) {
>   ...
>   maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
>   segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
>   maxHostCount = job.getInt("GENERATE_MAX_PER_HOST", 100);
>   ...
> }
> public void reduce(FloatWritable key, Iterator<SelectorEntry> values,
>     OutputCollector<FloatWritable,SelectorEntry> output, Reporter reporter)
>     throws IOException {
>   int limit2 = (int) ((limit * 3) / 2);
>   while (values.hasNext()) {
>     if (count == limit)
>       break;
>     if (count % segmentSize == 0) {
>       if (currentsegmentnum < maxNumSegments - 1) {
>         currentsegmentnum++;
>       } else
>         currentsegmentnum = 0;
>     }
>     boolean full = true;
>     for (int jk = 0; jk < maxNumSegments; jk++) {
>       if (segCounts[jk] < segmentSize) {
>         full = false;
>       }
>     }
>     if (full) {
>       break;
>     }
>     SelectorEntry entry = values.next();
>     Text url = entry.url;
>     // logWrite("Generated3:"+limit+"-"+count+"-"+url.toString());
>     String urlString = url.toString();
>     URL u = null;
>     String hostordomain = null;
>     try {
>       if (normalise && normalizers != null) {
>         urlString = normalizers.normalize(urlString,
>             URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>       }
>
>       u = new URL(urlString);
>       if (byDomain) {
>         hostordomain = URLUtil.getDomainName(u);
>       } else {
>         hostordomain = new URL(urlString).getHost();
>       }
>
>       hostordomain = hostordomain.toLowerCase();
>       boolean countLimit = true;
>       // only filter if we are counting hosts or domains
>       int[] hostCount = hostCounts.get(hostordomain);
>       // host count {a,b,c,d} means that from this host there are a urls
>       // in segment 0, b urls in segment 1, and so on
>       if (hostCount == null) {
>         hostCount = new int[maxNumSegments];
>         for (int kl = 0; kl < maxNumSegments; kl++)
>           hostCount[kl] = 0;
>         hostCounts.put(hostordomain, hostCount);
>       }
>       int selectedSeg = currentsegmentnum;
>       int minCount = hostCount[selectedSeg];
>       for (int jk = 0; jk < maxNumSegments; jk++) {
>         if (hostCount[jk] < minCount) {
>           minCount = hostCount[jk];
>           selectedSeg = jk;
>         }
>       }
>       if (hostCount[selectedSeg] <= maxHostCount) {
>         count++;
>         entry.segnum = new IntWritable(selectedSeg);
>         hostCount[selectedSeg]++;
>         output.collect(key, entry);
>       }
>     } catch (Exception e) {
>       LOG.warn("Malformed URL: '" + urlString + "', skipping ("
>           + StringUtils.stringifyException(e) + ")");
>       logWrite("Generate-malform:" + hostordomain + "-" + url.toString());
>       // continue;
>     }
>   }
> }
> 


[jira] [Commented] (NUTCH-1199) unfetched URLs problem

2011-11-08 Thread behnam nikbakht (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146163#comment-13146163
 ] 

behnam nikbakht commented on NUTCH-1199:


The problem is the huge number of unfetched URLs; for example, we have only 2000
fetched URLs from a site with 4 URLs.
With the generate command we cannot regenerate them and assign segments to them,
so we use the freegen command, which creates segments for the unfetched URLs, then
fetch them and update the crawldb.
Is this a good or a bad solution?

> unfetched URLs problem
> --
>
> Key: NUTCH-1199
> URL: https://issues.apache.org/jira/browse/NUTCH-1199
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, generator
>Reporter: behnam nikbakht
>Priority: Critical
>  Labels: db_unfetched, fetch, freegen, generate, unfetched, 
> updatedb
>
> we write a script to fetch unfetched urls:
> # first dump from readdb to a text file, and extract unfetched urls to a text file:
> bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
> cat $SITE_DIR/tmp/dump_urls.txt/part-0 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
> unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
> cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file
> unfetched_count=`cat $unfetched_urls_file | wc -l`
> # next, we have a list of unfetched urls in unfetched_urls.txt; then we use command
> # freegen to create segments for these urls, we can not use command generate
> # because these urls were generated previously
> if [[ $unfetched_count -lt $it_size ]]
> then
>   echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
>   ((J++))
>   bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
>   s2=`ls -d $crawlseg/2* | tail -1`
>   bin/nutch fetch $s2
>   bin/nutch parse $s2
>   bin/nutch updatedb $crawldb $s2
>   echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
>   get_new_links
>   exit
> fi
> # if number of urls is greater than it_size, then package them
> ij=1
> while read line
> do
>   let "ind = $ij / $it_size"
>   mkdir $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
>   echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
>   echo $ind
>   ((ij++))
>   let "completed=$ij % $it_size"
>   if [[ $completed -eq 0 ]]
>   then
>     echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
>     ((J++))
>     bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
>     # finally fetch, parse and update with the new segment
>     s2=`ls -d $crawlseg/2* | tail -1`
>     bin/nutch fetch $s2
>     bin/nutch parse $s2
>     rm $crawldb/.locked
>     bin/nutch updatedb $crawldb $s2
>     echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
>   fi
> done < $unfetched_urls_file
