http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26
[html] 2006.3.15
...автосиденье UNP СПОРТ Latos Group "); document.write(" Lordflex ");
document.write(" Magniflex "); document.write(" Bedding 12% ");
document.write(" Primavera "); document.write(" HUKLA...
http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26 (cached)
(explain) (anchors)

-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 15, 2006 11:21 AM
To: [email protected]
Subject: Re: javascript in summaries [nutch-0.7.1]

Hi there

Can you fetch just one page, say
http://www.pozvonok.ru/shop/vp.php?id=377&size=-1&idtype=26

and then check whether the code is working or not?
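
One way to check the idea in isolation is a small standalone test. The sketch below is not the Nutch code: it only mirrors the script/style check from the getTextHelper excerpt quoted further down, and it assumes the input is well-formed XHTML so that the JDK's own XML parser can build the DOM (Nutch uses a real HTML parser instead). The class name ScriptSkippingText and the methods collectText and extract are hypothetical names made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects visible text from a DOM
// while skipping <script> and <style> subtrees.
class ScriptSkippingText {

    // Appends the text found under `node` to `sb`, separated by single spaces.
    static void collectText(StringBuilder sb, Node node) {
        String name = node.getNodeName();
        // Same idea as the quoted check: do not descend into script or style.
        if ("script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name)) {
            return;
        }
        // Comments never contribute text to a summary.
        if (node.getNodeType() == Node.COMMENT_NODE) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(sb, children.item(i));
        }
    }

    // Assumption: `xhtml` is well-formed XML; real-world HTML usually is not.
    static String extract(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        StringBuilder sb = new StringBuilder();
        collectText(sb, doc.getDocumentElement());
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up page fragment in the same shape as the reported one.
        String page = "<html><body><p>Mattresses</p>"
            + "<script>//<!-- document.write(' Magniflex '); //--></script>"
            + "<p>Pillows</p></body></html>";
        System.out.println(extract(page)); // prints "Mattresses Pillows"
    }
}
```

If a test like this drops the script text but the summaries still contain document.write, the text is probably reaching the summary by some other route than this check, for example because the DOM built from the real page does not look the way one expects.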

Good luck!

On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> Yes, I see that. But in fact I still see JavaScript in my summaries and don't
> know how to remove it :)
>
> -----Original Message-----
> From: Jack Tang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 15, 2006 11:14 AM
> To: [email protected]
> Subject: Re: javascript in summaries [nutch-0.7.1]
>
> On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> >
> > This script is present in the HTML page inside <script>//<!-- code //--></script>
> Really?
> In the HTML parser, I think DOMContentUtils skips that element:
>
> private static final boolean getTextHelper(StringBuffer sb, Node node,
>                                              boolean abortOnNestedAnchors,
>                                              int anchorDepth) {
>     if ("script".equalsIgnoreCase(node.getNodeName())) {
>       return false;
>     }
>     if ("style".equalsIgnoreCase(node.getNodeName())) {
>       return false;
>     }
>
>
> >
> >
> > -----Original Message-----
> > From: Jack Tang [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, March 15, 2006 10:58 AM
> > To: [email protected]
> > Subject: Re: javascript in summaries [nutch-0.7.1]
> >
> > Maybe you can filter out JavaScript files (*.js) using the URL filter.
> >
> > /Jack
> >
> > On 3/15/06, Ilia S. Yatsenko <[EMAIL PROTECTED]> wrote:
> > > Hello
> > >
> > >
> > >
> > > Sorry for my poor English.
> > >
> > >
> > >
> > > I use nutch-0.7.1 and have an issue with the HTML parser.
> > >
> > >
> > >
> > > I get JavaScript code in my summaries and don't know how to remove it.
> > > For example:
> > >
> > >
> > >
> > > . \n'); } if (plugin) { document.write(' '); document.write(' ');
> > > document.write(' '); document.write(' '); document.write(' ');
> > > document.write ...
> > >
> > >
> > >
> > > Or see http://62.141.52.208:8080/dual/search.jsp?query=document.write :)
> > >
> > >
> > >
> > > This is my nutch-site.plugin line:
> > >
> > > <property>
> > >
> > > <value>nutch-extensionpoints|protocol-(http|httpclient)|urlfilter-regex|parse-html|index-(basic|more)|query-(more|stemmer|site|url)</value>
> > >
> > > </property>
> > >
> > >
> > >
> > > Can anybody help me?
> > >
> > >
> > >
> >
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
> >
>
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
