Naess, Ronny wrote:
> Thanks, Ann.
>
> You gave me some good pointers.
>
> I see that the navigation menu is giving me all the trouble with
> ranking. Does somebody know a way to make the parser skip some content?
> I would like the parser to skip the global header and navigation menu so
> the content contains only the unique stuff, not everything. Guess this is
> not a simple thing.
No, it's not. Do a Google search for "template detection".
A crude approach, which still might be sufficient in your case, is to do
the following:
* remove all font/color/style formatting elements, and coalesce their
  text children with their parents. E.g.

      this is <span style="abc">a text</span>
      <b>with bold</b> fragment

  becomes:

      this is a text with bold fragment

* do the same with all remaining non-divisional (non-structural) tags,
  i.e. every tag except for divs, tables and iframes.
* sort the remaining text blocks by size
* drop a certain number (or percentage) of the smallest of the text blocks.
* put the blocks back in order, and extract only their text content.
This is the "main body" text.
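
The steps above could be sketched roughly as follows. This is only an
illustration using Python's standard html.parser, not Nutch code: the
names BlockCollector, main_body_text and drop_fraction are made up for
the example, the 0.5 default is an arbitrary guess, and skipping
script/style content is an extra assumption not listed in the steps.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (from the steps above).
BLOCK_TAGS = {"div", "table", "iframe"}
# Assumption: script/style bodies are never page text, so ignore them.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text blocks, coalescing all non-block tags into their parent."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def _flush(self):
        # Join the buffered fragments and normalize whitespace.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def main_body_text(html, drop_fraction=0.5):
    """Drop the smallest blocks, then return the rest in document order."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    n_drop = int(len(blocks) * drop_fraction)
    # Indices sorted by block size, smallest first.
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    # Keep the larger blocks and restore their original order.
    keep = sorted(by_size[n_drop:])
    return "\n".join(blocks[i] for i in keep)
```

The right drop_fraction (or a fixed count instead) depends on the site
and would need tuning against real pages.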
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general