Re: index web
Yes, you are right: every page on the site has those two links. But I did not build the site. If I get the opportunity, I will try your suggestion.

Thank you very much for the help. It really helped me a lot. :)

2009/3/20 yanky young yanky.yo...@gmail.com

Not really. I guess any page on this website can have the two links generated by a JavaScript function. That is why Nutch can't find those URLs: Nutch will not click the link to trigger the JS function as a human does. I suggest generating the multilingual links on the server side, for example in JSP, so that the pages contain static links that Nutch can find.

For example, in your JSP page the two links currently look like this:

    <a href="javascript:jump('en')">English</a>
    <a href="javascript:jump('la')">La</a>

Nutch cannot find these two links, so you can change your JSP like this:

    <%
        // getRequestURI() returns the path without the query string,
        // so the locale parameter starts the query here.
        String pageUrl = request.getRequestURI();
        String enUrl = pageUrl + "?request_locale=en";
        String laUrl = pageUrl + "?request_locale=la";
    %>
    <a href="<%=enUrl%>">English</a>
    <a href="<%=laUrl%>">La</a>

Then you get static URLs in your pages when you browse.

good luck
yanky
Re: index web
Thanks. You can visit http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome and look at the upper right corner: there are two translation links, and they lead to those two URLs. So I am worried.

2009/3/20 yanky young yanky.yo...@gmail.com

That must work, but it seems weird. You know, Nutch crawls starting from the seed URL you give it, and the whole set of crawled pages is actually a tree whose root node is the seed URL. If you cannot reach those two URLs from the seed URL yourself, Nutch cannot either.

yanky
Re: index web
I think my guess is right. I just looked at the source of that page. Those two URLs are generated by a JavaScript function, function jump(lan). In this case Nutch may not be smart enough to recognize this kind of generated URL. But if you generate the two links on the server side, so that the URLs in the pages are static links, Nutch can crawl them as usual.

good luck
yanky

2009/3/20 陈琛 kylin.chc...@gmail.com

Thanks. You can visit http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome and look at the upper right corner: there are two translation links, and they lead to those two URLs. So I am worried.
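The point made in this reply, that a link whose href is a javascript: call gives a link-following crawler nothing to fetch, can be illustrated with a small sketch. This is not Nutch's parser: the class name StaticLinkExtractor, the method followableLinks, and the sample HTML are assumptions made up for illustration, and the regular expression only handles double-quoted href attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticLinkExtractor {
    // Simplification: only double-quoted href attributes on <a> tags are recognized.
    private static final Pattern HREF =
        Pattern.compile("<a\\b[^>]*\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href values that can be fetched without running any script. */
    static List<String> followableLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            // A javascript: pseudo-URL only has meaning when a browser executes it.
            if (href.isEmpty() || href.toLowerCase().startsWith("javascript:")) {
                continue;
            }
            links.add(href);
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical page: two script-driven links and one static link.
        String html =
            "<a href=\"javascript:jump('en')\">English</a>"
          + "<a href=\"javascript:jump('la')\">La</a>"
          + "<a href=\"detail.action?request_locale=en\">English (static)</a>";
        System.out.println(followableLinks(html));
    }
}
```

Only the static link survives the filter; the two jump(...) links are dropped because their target URL exists only after the script runs.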
Re: index web
Not really. I guess any page on this website can have the two links generated by a JavaScript function. That is why Nutch can't find those URLs: Nutch will not click the link to trigger the JS function as a human does. I suggest generating the multilingual links on the server side, for example in JSP, so that the pages contain static links that Nutch can find.

For example, in your JSP page the two links currently look like this:

    <a href="javascript:jump('en')">English</a>
    <a href="javascript:jump('la')">La</a>

Nutch cannot find these two links, so you can change your JSP like this:

    <%
        // getRequestURI() returns the path without the query string,
        // so the locale parameter starts the query here.
        String pageUrl = request.getRequestURI();
        String enUrl = pageUrl + "?request_locale=en";
        String laUrl = pageUrl + "?request_locale=la";
    %>
    <a href="<%=enUrl%>">English</a>
    <a href="<%=laUrl%>">La</a>

Then you get static URLs in your pages when you browse.

good luck
yanky

2009/3/20 陈琛 kylin.chc...@gmail.com

Thanks very much!!! In other words, for now I only need to put
http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome
and
http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
in url.txt?
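The server-side approach suggested in this message can be sketched in plain Java so it can be run outside a servlet container. The class LocaleLinks and the method withLocale are hypothetical names, not part of the site or of Nutch. In a JSP the two inputs would presumably come from request.getRequestURI() and request.getQueryString(). The sketch assumes the other parameters (id, from) should be kept and any existing request_locale replaced.

```java
import java.util.ArrayList;
import java.util.List;

public class LocaleLinks {
    /**
     * Builds a static URL for the given locale from a request path and its
     * query string (which may be null), keeping the other parameters.
     */
    static String withLocale(String requestUri, String queryString, String locale) {
        List<String> params = new ArrayList<>();
        params.add("request_locale=" + locale);
        if (queryString != null && !queryString.isEmpty()) {
            for (String p : queryString.split("&")) {
                // Drop any previous locale so the parameter is not duplicated.
                if (!p.isEmpty() && !p.startsWith("request_locale=")) {
                    params.add(p);
                }
            }
        }
        return requestUri + "?" + String.join("&", params);
    }

    public static void main(String[] args) {
        String uri = "/ePortal/news/detail.action";
        String query = "id=10110&from=ePortal_NewsDetail_FromHome";
        System.out.println(withLocale(uri, query, "en_US"));
        System.out.println(withLocale(uri, query, "lo_LA"));
    }
}
```

Written into the page as ordinary href values, links of this shape can be discovered by a crawler without executing any script. When the value is placed inside an HTML attribute, each & should be written as &amp;.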
Re: index web
Hi,

I guess the URLs you mentioned all point to the same JSP or servlet; apparently they all begin with http://app02.laopdr.gov.la/ePortal/news/detail.action. The only difference is the request_locale parameter. I have no idea how the two URLs with different request_locale values are generated, but I guess Nutch just doesn't know about the request_locale parameter, because it may be added by JavaScript or by the backend content management system. Maybe you can write these links into a page that Nutch can crawl. The point is that these links must appear somewhere in the pages of your website; if not, Nutch cannot find them.

good luck
yanky

2009/3/19 陈琛 kylin.chc...@gmail.com

please help me, it is urgent and important, thanks

-- Forwarded message --
From: 陈琛 kylin.chc...@gmail.com
Date: 2009/3/19
Subject: index web
To: nutch-user@lucene.apache.org

Hi all,

I can index a URL like
http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome
but cannot index URLs like
http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome
and
http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome

Why are they not indexed? Are the pages any different? Please notice request_locale=.

Thanks
Re: index web
Thanks. The seed URL is http://www.laopdr.gov.la/... with depth 15 and topN 1200 ... It seems I must put http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome in the urls directory.

2009/3/19 yanky young yanky.yo...@gmail.com

Hi,

I guess the URLs you mentioned all point to the same JSP or servlet; apparently they all begin with http://app02.laopdr.gov.la/ePortal/news/detail.action. The only difference is the request_locale parameter. I have no idea how the two URLs with different request_locale values are generated, but I guess Nutch just doesn't know about the request_locale parameter, because it may be added by JavaScript or by the backend content management system. Maybe you can write these links into a page that Nutch can crawl. The point is that these links must appear somewhere in the pages of your website; if not, Nutch cannot find them.

good luck
yanky
Re: index web
That must work, but it seems weird. You know, Nutch crawls starting from the seed URL you give it, and the whole set of crawled pages is actually a tree whose root node is the seed URL. If you cannot reach those two URLs from the seed URL yourself, Nutch cannot either.

yanky

2009/3/20 陈琛 kylin.chc...@gmail.com

Thanks. The seed URL is http://www.laopdr.gov.la/... with depth 15 and topN 1200 ... It seems I must put http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome in the urls directory.
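The "tree rooted at the seed URL" picture from this message can be shown with a toy model. This is only a sketch of breadth-first link following with a depth limit, not Nutch's generate/fetch/update cycle; the class SeedReachability, the method crawl, and the page names in the link graph are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SeedReachability {
    /** Follows static links breadth-first from the seeds for at most maxDepth rounds. */
    static Set<String> crawl(Map<String, List<String>> links, List<String> seeds, int maxDepth) {
        Set<String> seen = new LinkedHashSet<>(seeds);
        List<String> frontier = new ArrayList<>(seeds);
        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            List<String> next = new ArrayList<>();
            for (String page : frontier) {
                for (String target : links.getOrDefault(page, List.of())) {
                    if (seen.add(target)) {
                        next.add(target); // first time this page is discovered
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical link graph: no page links statically to the locale variant.
        Map<String, List<String>> links = Map.of(
            "home", List.of("news"),
            "news", List.of("detail"));
        String localized = "detail?request_locale=lo_LA";

        System.out.println(crawl(links, List.of("home"), 15).contains(localized));
        System.out.println(crawl(links, List.of("home", localized), 15).contains(localized));
    }
}
```

In this model the localized page is never discovered from the home page, however large the depth, because no static link leads to it. Listing it as an extra seed makes it reachable, which matches the workaround of adding the URL to the urls directory.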