Thanks for your help.

-----Original Message-----
From: Jerome Charron [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 15, 2006 12:00 PM
To: [email protected]
Subject: Re: javascript in summaries [nutch-0.7.1]

Hi,

I can reproduce this with nutch-0.8 using the neko html parser (it seems that script tags are not removed).
You can switch the html parser implementation to tagsoup (property parser.html.impl). In my tests with tagsoup, all is ok.

Regards
Jerome

On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
>
> http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26
> [html] 2006.3.15
> ...car seat UNP SPORT Latos Group "); document.write(" Lordflex ");
> document.write(" Magniflex "); document.write(" Bedding 12% ");
> document.write(" Primavera "); document.write(" HUKLA...
> http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26 (cached)
> (explain) (anchors)
>
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 15, 2006 11:21 AM
> To: [email protected]
> Subject: Re: javascript in summaries [nutch-0.7.1]
>
> Hi there
>
> Can you fetch only one page, say
> http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26
>
> and check whether the code is working or not?
>
> Good luck!
>
> On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> > Yes, I see that. But in fact I see javascript in my summaries too and
> > don't know how to remove it :)
> >
> > -----Original Message-----
> > From: Jack Tang [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, March 15, 2006 11:14 AM
> > To: [email protected]
> > Subject: Re: javascript in summaries [nutch-0.7.1]
> >
> > On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> > >
> > > This script is present in the html page inside
> > > <script>//<!-- code //--></script>
> >
> > Really?
> > In the html parser I think DOMContentUtils skips the element.
> >
> >   private static final boolean getTextHelper(StringBuffer sb, Node node,
> >                                              boolean abortOnNestedAnchors,
> >                                              int anchorDepth) {
> >     if ("script".equalsIgnoreCase(node.getNodeName())) {
> >       return false;
> >     }
> >     if ("style".equalsIgnoreCase(node.getNodeName())) {
> >       return false;
> >     }
> >
> > > -----Original Message-----
> > > From: Jack Tang [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, March 15, 2006 10:58 AM
> > > To: [email protected]
> > > Subject: Re: javascript in summaries [nutch-0.7.1]
> > >
> > > Maybe you can filter javascript files (*.js) using the url filter.
> > >
> > > /Jack
> > >
> > > On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> > > > Hello
> > > >
> > > > Sorry for my poor English.
> > > >
> > > > I use nutch-0.7.1 and have an issue with the html parser.
> > > >
> > > > I get javascript code in the summary and don't know how to remove it.
> > > > For example:
> > > >
> > > > . \n'); } if (plugin) { document.write(' '); document.write(' ');
> > > > document.write(' '); document.write(' '); document.write(' ');
> > > > document.write ...
> > > >
> > > > Or http://62.141.52.208:8080/dual/search.jsp?query=document.write :)
> > > >
> > > > This is my nutch-site.plugin line:
> > > >
> > > > <property>
> > > > <value>nutch-extensionpoints|protocol-(http|httpclient)|urlfilter-regex|parse-html|index-(basic|more)|query-(more|stemmer|site|url)</value>
> > > > </property>
> > > >
> > > > Can anybody help me?
> > >
> > > --
> > > Keep Discovering ... ...
> > > http://www.jroller.com/page/jmars
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars

--
http://motrech.free.fr/
http://www.frutch.org/
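
For readers following the thread, the script/style check quoted above can be tried in isolation. The following is only a minimal sketch, not the Nutch code: the names SkipScriptText, collectText and extractText are hypothetical, and it uses the JDK's built-in XML parser on well-formed markup instead of neko or tagsoup. It also shows why the check depends on the parser: the filter can only skip the JavaScript if the parser actually delivers a node named "script". If a parser hands the script body over as ordinary text, as Jerome suspects for neko, a check of this kind never sees it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone demo class; not part of Nutch.
class SkipScriptText {

    // Appends the text found under `node` to `sb`,
    // ignoring whole <script> and <style> subtrees.
    static void collectText(StringBuilder sb, Node node) {
        String name = node.getNodeName();
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return; // same idea as the check in the quoted getTextHelper
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    // Parses well-formed markup with the JDK parser (an assumption made for
    // this demo; real pages need an HTML parser) and returns the visible text.
    static String extractText(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            StringBuilder sb = new StringBuilder();
            collectText(sb, root);
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Car seat</p>"
                + "<script>document.write(\" Lordflex \");</script>"
                + "<p>Latos Group</p></body></html>";
        System.out.println(extractText(page)); // prints "Car seat Latos Group"
    }
}
```

Running extractText on a single fetched page, as Jack suggested, is a quick way to see whether document.write text survives extraction before looking at the parser setting.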
