http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26 [html] 2006.3.15
...автосиденье UNP СПОРТ Latos Group "); document.write(" Lordflex "); document.write(" Magniflex "); document.write(" Bedding 12% "); document.write(" Primavera "); document.write(" HUKLA...
http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26 (cached) (explain) (anchors)
-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 11:21 AM
To: [email protected]
Subject: Re: javascript in summaries [nutch-0.7.1]

Hi there

Can you fetch only one page, say
http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26
and check whether the code is working or not?

Good luck!

On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> Yes, I see that. But in fact I see javascript in my summaries too and don't
> know how to remove it :)
>
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 15, 2006 11:14 AM
> To: [email protected]
> Subject: Re: javascript in summaries [nutch-0.7.1]
>
> On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> >
> > This script is present in the html page inside <script>//<!-- code //--></script>
>
> Really?
> In the html parser I think DOMContentUtils skips the element.
>
> private static final boolean getTextHelper(StringBuffer sb, Node node,
>                                            boolean abortOnNestedAnchors,
>                                            int anchorDepth) {
>   if ("script".equalsIgnoreCase(node.getNodeName())) {
>     return false;
>   }
>   if ("style".equalsIgnoreCase(node.getNodeName())) {
>     return false;
>   }
>
> >
> > -----Original Message-----
> > From: Jack Tang [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, March 15, 2006 10:58 AM
> > To: [email protected]
> > Subject: Re: javascript in summaries [nutch-0.7.1]
> >
> > Maybe you can filter javascript files (*.js) using a url filter.
> >
> > /Jack
> >
> > On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> > > Hello
> > >
> > > Sorry for my poor English
> > >
> > > I use nutch-0.7.1 and have an issue with the html parser
> > >
> > > I get javascript code in the summary and don't know how to remove it. For
> > > example
> > >
> > > . \n'); } if (plugin) { document.write(' '); document.write(' ');
> > > document.write(' '); document.write(' '); document.write(' ');
> > > document.write ...
> > >
> > > Or http://62.141.52.208:8080/dual/search.jsp?query=document.write :)
> > >
> > > This is my nutch-site.plugin line:
> > >
> > > <property>
> > >
> > > <value>nutch-extensionpoints|protocol-(http|httpclient)|urlfilter-regex|parse-html|index-(basic|more)|query-(more|stemmer|site|url)</value>
> > >
> > > </property>
> > >
> > > Can anybody help me?
> > >
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
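
For reference, the node-skipping idea behind the quoted getTextHelper excerpt can be tried outside of Nutch. The sketch below is not the Nutch code: the class name ScriptFreeText and the methods collectText and extractText are made up for illustration, and it assumes well-formed XHTML input parsed with the JDK's built-in XML parser rather than the HTML parser that parse-html uses. It drops script and style subtrees and also ignores comment nodes, so the old <script>//<!-- code //--></script> wrapping does not leak into the extracted text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; not part of Nutch.
public class ScriptFreeText {

    /** Appends the text under node to sb, skipping script/style elements and comments. */
    static void collectText(StringBuilder sb, Node node) {
        short type = node.getNodeType();
        if (type == Node.COMMENT_NODE) {
            return; // e.g. the <!-- ... --> wrapper that older pages put around scripts
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
                return; // drop the whole subtree, as in the quoted excerpt
            }
        }
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // single space between text fragments
                }
                sb.append(text);
            }
            return;
        }
        // Recurse into the children of any other node (elements, document root).
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    /** Parses well-formed XHTML (an assumption of this sketch) and returns its text. */
    static String extractText(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        StringBuilder sb = new StringBuilder();
        collectText(sb, doc.getDocumentElement());
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Hello</p>"
                + "<script>//<!-- document.write('x'); //--></script>"
                + "<p>World</p></body></html>";
        System.out.println(extractText(page)); // prints "Hello World"
    }
}
```

If document.write still shows up in summaries with a check like this in place, one thing worth verifying is how the HTML parser actually builds the tree for that page, for example whether the script text ends up outside a script element because of broken markup. That is only a guess and would need checking against the fetched page.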
