On 6/27/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Naess, Ronny wrote:
> > Thanks, Ann.
> >
> > You gave me some good pointers.
> >
> > I see that the navigation menu is giving me all the trouble with
> > ranking. Does somebody know a way to make the parser skip some content?
> > I would like the parser to skip global header and navigation menu so the
> > content contains the unique stuff not everything. Guess this is not a
> > simple thing.
> >
>
> No, it's not. Do a Google search for "template detection".
>
> A crude approach, which still might be sufficient in your case, is to do
> the following:
>
> * remove all font/color/style formatting elements, and coalesce their
>   text children with their parents. E.g.
>
>     this is <span style="abc">a text</span>
>     <b>with bold</b> fragment
>
>   becomes:
>
>     this is a text with bold fragment
>
> * do the same with all non-divisional (structural) tags, i.e. any
>   formatting tags except for div-s, tables and iframe-s.
>
> * sort the remaining text blocks by size
>
> * drop a certain number (or percentage) of the smallest of the text blocks.
>
> * put the blocks back in order, and extract only their text content.
>   This is the "main body" text.
>
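
The crude approach quoted above could be sketched roughly as follows. This is only an illustration in Python using the standard html.parser module, not Nutch code; the names BLOCK_TAGS, SKIP_TAGS, BlockCollector, main_body and the drop_fraction parameter are all made up for the example, and the set of "divisional" tags is an assumption.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as divisional; every other tag is
# treated as inline formatting and its text is merged into the parent block.
BLOCK_TAGS = {"div", "table", "tr", "td", "iframe"}
# Text inside these elements is never page content.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks at divisional tag boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished text blocks, in document order
        self._buf = []     # text pieces of the block being built
        self._skip = 0     # nesting depth inside script/style

    def _flush(self):
        # Join the pieces and normalise whitespace; ignore empty blocks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def main_body(html, drop_fraction=0.5):
    """Drops the smallest blocks and returns the rest in document order."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    # Rank block indices by text length, smallest first.
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    dropped = set(by_size[:int(len(blocks) * drop_fraction)])
    return "\n".join(b for i, b in enumerate(blocks) if i not in dropped)
```

The right drop_fraction depends heavily on the site's templates, so it would need tuning against real pages.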
Alternatively, for any given divisional tag, you might measure the amount of anchor text versus non-anchor text. If a table/div/... contains mostly anchor text (and all anchor texts consist of a couple of words), you can assume that it is a menu and not relevant content.

>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>

--
Doğacan Güney
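
A minimal sketch of the anchor-text heuristic above, again in plain Python with html.parser rather than Nutch code. DIVISIONAL, DensityParser, menu_blocks and the 0.8 threshold are hypothetical names and values chosen for the example. Text is measured in non-whitespace characters, each block's counts include its nested blocks, and the "couple of words per anchor" check is left out for brevity.

```python
from html.parser import HTMLParser

# Assumption: the tags we treat as divisional containers.
DIVISIONAL = {"div", "table", "td", "iframe"}


class DensityParser(HTMLParser):
    """Counts anchor vs. total text characters per divisional element."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open divisional elements
        self.closed = []       # finished elements, in closing order
        self.anchor_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in DIVISIONAL:
            self.stack.append({"tag": tag, "anchor": 0, "total": 0})

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchor_depth = max(0, self.anchor_depth - 1)
        elif tag in DIVISIONAL and self.stack:
            # Simplification: assumes well-nested markup and pops the
            # innermost open block without checking that the tag matches.
            block = self.stack.pop()
            self.closed.append(block)
            if self.stack:
                # Fold the child's counts into its parent block.
                self.stack[-1]["anchor"] += block["anchor"]
                self.stack[-1]["total"] += block["total"]

    def handle_data(self, data):
        n = len("".join(data.split()))  # non-whitespace characters only
        if n and self.stack:
            self.stack[-1]["total"] += n
            if self.anchor_depth:
                self.stack[-1]["anchor"] += n


def menu_blocks(html, threshold=0.8):
    """Returns (tag, is_menu) for each divisional element, in closing order."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    return [
        (b["tag"], b["total"] > 0 and b["anchor"] / b["total"] >= threshold)
        for b in parser.closed
    ]
```

Blocks flagged as menus could then be skipped before the text is handed to the indexer.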
