Re: index web

2009-03-22 Thread 陈琛
Yes, you are right: every page on the site has those two links.

But I did not create the site. If I get the opportunity, I will try it.

Thank you very much for the help. It really helped me a lot. :)



Re: index web

2009-03-20 Thread 陈琛
Thanks.

You can visit
http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome

and look at the upper right corner: there are two translation links, and they lead to those
two URLs.

So I am worried.


Re: index web

2009-03-20 Thread yanky young
I think my guess was right. I just looked at the source of that page.

Those two URLs are generated by a JavaScript function:

function jump(lan)

In this case, Nutch is probably not smart enough to recognize this kind of
generated URL.

But if you generate these two links on the server side, so that the
URLs in the web pages are static links, then Nutch can crawl them as usual.

good luck

yanky





Re: index web

2009-03-20 Thread yanky young
Not really.

I guess every page on this website has two links generated by a JavaScript
function. That is why Nutch can't find those URLs: Nutch will not click
the link to trigger the JS function as a human does.
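
To make the point concrete, here is a minimal sketch of a link extractor that only reads href values and never runs scripts. It is a hypothetical illustration, not Nutch's actual parser: the class name LinkSketch, the method crawlableLinks and the simplified regular expression are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT Nutch's HTML parser.
class LinkSketch {
    // Simplified pattern: captures the quoted href value of an <a> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+[^>]*href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the href values a script-less extractor could follow.
    static List<String> crawlableLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            // A javascript: pseudo-URL only yields a target when the script runs,
            // so there is nothing here to fetch.
            if (!href.toLowerCase().startsWith("javascript:")) {
                links.add(href);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:jump('en')\">English</a>"
                + "<a href=\"detail.action?request_locale=en\">English</a>";
        // Only the static link is reported.
        System.out.println(crawlableLinks(html));
    }
}
```

With the two sample anchors, only the static detail.action link is returned; the jump('en') link gives the extractor no URL to queue.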

I suggest that you generate those multilingual links on the server side,
for example in JSP, so that the web pages contain static links that can be
found by Nutch.

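For example, in your JSP page those two links currently look like this:

<a href="javascript:jump('en')">English</a>
<a href="javascript:jump('la')">La</a>

These two links cannot be found by Nutch, so you can change your JSP like
this:
<%
// getRequestURI() returns the path without the query string, so the
// locale is added as a query parameter; add any other parameters the
// page needs (such as id) in the same way
String pageUrl = request.getRequestURI();
String enUrl = pageUrl + "?request_locale=en";
String laUrl = pageUrl + "?request_locale=la";
%>
<a href="<%=enUrl%>">English</a>
<a href="<%=laUrl%>">La</a>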

Then you get static URLs in your pages when you browse them.

good luck

yanky

2009/3/20 陈琛 kylin.chc...@gmail.com

 Thanks very much!

 In other words, for now I can only put

 http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome
 and
 http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
 in url.txt?



Re: index web

2009-03-19 Thread yanky young
Hi:

I guess the URLs you mentioned are all handled by the same JSP or servlet;
apparently they all begin with
http://app02.laopdr.gov.la/ePortal/news/detail.action.
The difference is the request_locale parameter. I have no idea how these two
URLs with different request_locale parameters are generated, but I guess
Nutch just doesn't know about the request_locale parameter because it
may be added by JavaScript or by the backend content management system. Maybe you can
write these links in a page that can be crawled by Nutch. The point is that
these links must appear somewhere in the pages of your website. If not,
they cannot be found by Nutch.

good luck

yanky



2009/3/19 陈琛 kylin.chc...@gmail.com

 Please help me, it is urgent and important. Thanks.

 -- Forwarded message --
 From: 陈琛 kylin.chc...@gmail.com
 Date: 2009/3/19
 Subject: index web
 To: nutch-user@lucene.apache.org


 Hi, all:

 I can get a URL like this one indexed:

 http://app02.laopdr.gov.la/ePortal/news/detail.action?id=10110&from=ePortal_NewsDetail_FromHome

 but I cannot get URLs like these indexed:

 http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=en_US&id=10110&from=ePortal_NewsDetail_FromHome
 and
 http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome


 Why are they not indexed?
 Is there anything different about these pages?

 Please notice request_locale=


 Thanks



Re: index web

2009-03-19 Thread 陈琛
Thanks.
The seed URL is http://www.laopdr.gov.la/...
with depth 15 and topN 1200 ...

It seems I must put
http://app02.laopdr.gov.la/ePortal/news/detail.action?request_locale=lo_LA&id=10110&from=ePortal_NewsDetail_FromHome
in
the urls directory.



 



Re: index web

2009-03-19 Thread yanky young
That should work, but it seems odd. Starting from the seed URL you gave,
Nutch follows links outward, so the crawled pages effectively form a
tree whose root node is the seed URL. If you cannot reach those two URLs from
the seed URL yourself by following links, Nutch cannot either.
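
The tree idea can be sketched as a breadth-first walk over a link graph. This is only a rough model under stated assumptions, not Nutch's generate/fetch cycle: the class CrawlReachSketch, the method reachable, the sample page names and the reading of "depth" as a number of link-following rounds are all invented for the illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of "crawl outward from a seed"; not Nutch code.
class CrawlReachSketch {
    // links maps each page to the static links found on it.
    static Set<String> reachable(Map<String, List<String>> links, String seed, int depth) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(seed);
        List<String> frontier = List.of(seed);
        // Each round follows the links of the pages discovered in the previous round.
        for (int round = 0; round < depth && !frontier.isEmpty(); round++) {
            List<String> next = new ArrayList<>();
            for (String page : frontier) {
                for (String out : links.getOrDefault(page, List.of())) {
                    if (seen.add(out)) {
                        next.add(out);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("seed", List.of("news"));
        links.put("news", List.of("detail"));
        // "detail?request_locale=en_US" appears on no page, so no depth reaches it.
        System.out.println(reachable(links, "seed", 15));
    }
}
```

However large the depth is, a URL that no page links to statically never enters the result; it has to be added as a seed or exposed as a static link.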

yanky

