I have observed the same thing. On my site the HTML source is for the most part around 160 kByte per page, and the parser definitely has problems with it (whether or not JavaScript is used on a page).
Before deciding on Nutch I tested Oxyus ( http://sourceforge.net/projects/oxyus/ ), a Java/Lucene based open source solution. Its parser has no problems with the site. I am not a developer, but perhaps one of the Nutch developers could take a look at the source code of the parser used there. Perhaps it helps.

Elwin <[EMAIL PROTECTED]> wrote on 17.02.2006 08:51:06:

> It seems that the parse-html plugin may not process many pages well,
> because I have found that it cannot extract all valid links in a page
> when I test it in my code.
> I guess that this may be caused by the style of an HTML page? When I
> "view source" of an HTML page I tried to parse, I saw that some
> elements in the source are split up by unnecessary spaces. This
> situation is quite common on the pages of large portal and news sites.
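In case it helps to reproduce the problem: below is a minimal, standalone sketch (the class name and the sample markup are mine, purely for illustration) that extracts href attributes with the JDK's own lenient HTML parser (javax.swing.text.html.parser.ParserDelegator). Running something like this against a saved copy of a problem page and comparing the output with what parse-html reports should show whether the stray whitespace inside the tags is really what trips the plugin up.

    import java.io.StringReader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Quick cross-check: list all <a href> values found by the JDK parser.
    public class LinkCheck {
        public static void main(String[] args) throws Exception {
            // Sample markup with stray whitespace inside the tags,
            // as described in Elwin's mail.
            String html = "<a\n  href = \"http://example.com/a\" >one</a>"
                        + "<a href='http://example.com/b'>two</a>";

            HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
                public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                    if (tag == HTML.Tag.A) {
                        Object href = attrs.getAttribute(HTML.Attribute.HREF);
                        if (href != null) System.out.println(href);
                    }
                }
            };
            // The 'true' flag tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        }
    }

Both links print here despite the extra whitespace; if parse-html drops one of them on the same input, that would narrow the bug down to its tag/attribute handling rather than the page size.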
