Unfortunately Tika currently has the same issue of not inserting
spaces between menu list items, which gives you these types of
It's a trivial patch, I just need a few minutes of spare time :(
On Apr 18, 2010, at 10:31pm, Alexander Aristov wrote:
I would suggest you try the Tika parser if you are not using it
might give you better parsed HTML. Next - if you want to exclude some
text from HTML then you must rewrite the HTML parser and adapt it to your
To reduce relevancy of your page - you may play with boost values
are configured in the nutch-site.xml. But they will apply to every
page and so if you want to apply it to only one (or several) pages then
might want to look at this patch.
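
As a rough illustration of the kind of parser change being discussed (this is not Nutch's or Tika's actual code), here is a minimal Python sketch using only the standard library's html.parser. The class TextExtractor, the function extract_text, the skip_tags parameter, and the sample markup are all hypothetical names made up for this example. It shows two things: joining text chunks with a space so adjacent menu items do not run together, and optionally dropping everything inside a chosen tag such as <select>.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, optionally skipping the contents of some tags.

    Hypothetical helper for illustration only; not part of Nutch or Tika.
    """

    def __init__(self, skip_tags=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

    def text(self):
        # Joining with a space keeps adjacent items from running together.
        return " ".join(self.chunks)


def extract_text(html, skip_tags=()):
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample markup, loosely modelled on the page described in this thread.
sample = (
    "<p>Resources by</p>"
    "<select><option>Scientific Name</option>"
    "<option>Common Name</option></select>"
    "<p>Site Map</p>"
)

print(extract_text(sample))                        # menu items kept, space-separated
print(extract_text(sample, skip_tags=("select",))) # menu items dropped
```

In Nutch itself the equivalent change would live in the Java HTML parse plugin, so treat this only as a picture of the logic, not as something to drop in.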
On 15 April 2010 21:09, tsmori <tim_m...@ncsu.edu> wrote:
I have an old page on my site that Nutch is fetching. The results
Nutch web app look like this:
... INSECT SYSTEMATIC RESOURCES Home : Site Map search Resources by
Scientific Name ... Common Name Select
Site Map Also see the Taxonomic ...
The big block of text there is the option values in a drop-down
What's weird is that this page has 2 drop down menus, but Nutch
one of them and does this kind of thing. It's an ancient page and
unfortunately each option value is actually a URL, which I'm
Nutch is indexing the values, but it's odd that it only does it for
and not the other.
Another problem is that in the web app, this wall of text doesn't
completely messes up any formatting to our custom search application.
Lastly, this page is the first hit in the list of results for a
the term "map" entirely due to the page being called "sitemap". I
probably fix this easily by just calling it something else, but I
to know the best way to handle something like this in Nutch.
First, how do I get nutch to stop pulling all the menu option
the results list. Second, how do I reduce the relevancy of this page
changing the page itself?