Hi,

I would suggest trying the Tika parser if you are not already using it. It might give you better parsed HTML. Next, if you want to exclude some parts of the text from the HTML, then you must rewrite the HTML parser and adapt it to your needs.
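To illustrate what "excluding parts of the text" amounts to, here is a minimal standalone sketch in Python. It is not Nutch code and does not use the Nutch parser plugin API; the names `MenuSkippingExtractor` and `visible_text` and the sample markup are made up for this example. It assumes the only unwanted text is whatever sits inside `<select>` elements.

```python
from html.parser import HTMLParser


class MenuSkippingExtractor(HTMLParser):
    """Collects page text but ignores anything nested inside <select>."""

    # Assumption for this sketch: only drop-down menus are unwanted.
    SKIPPED_TAGS = {"select"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of `html` with <select> contents left out."""
    parser = MenuSkippingExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Invented sample that mimics a page with a drop-down menu of links.
sample = (
    "<html><body><h1>Site Map</h1>"
    "<select><option value='/ants'>Ants</option>"
    "<option value='/bees'>Bees</option></select>"
    "<p>Also see the Taxonomic index</p></body></html>"
)
print(visible_text(sample))  # prints: Site Map Also see the Taxonomic index
```

In Nutch itself the same idea would have to live in the HTML parse plugin, which is Java, so treat this only as a description of the filtering logic.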
To reduce the relevancy of your page, you can play with the boost values that are configured in nutch-site.xml. They apply to every crawled page, though, so if you want to change the score of only one page (or several), you might want to look at this patch:

https://issues.apache.org/jira/browse/NUTCH-16

Best Regards
Alexander Aristov

On 15 April 2010 21:09, tsmori <tim_m...@ncsu.edu> wrote:

> I have an old page on my site that Nutch is fetching. The results in the
> Nutch web app look like this:
>
> Site Map
> ... INSECT SYSTEMATIC RESOURCES Home : Site Map search Resources by
> Scientific Name ... Common Name Select
>
> NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugsButterfliesCaddisfliesCicadasCockroachesCricketsDamselfliesDobsonfliesDragonfliesEarwigsFleasFliesFishfliesFootspinnersGladiatorsGnatsGrasshoppersHangingfliesHoppersJumping_BristletailsFirebratsKatydidsLacewingsLeaf_InsectsLiceMantidsMayfliesMitesMothsMosquitoesOwlfliesPsyllidsRock_CrawlersSawfliesScale_InsectsScorpionfliesSilverfishSkippersSnakefliesSpidersSpringtailsStonefliesTermitesThripsTicksTwisted_Wing_ParasitesWalking_SticksWaspsWebspinnersWhiteflies
> Site Map Also see the Taxonomic ...
>
> The big block of text there are the option values in a drop down menu.
> What's weird is that this page has 2 drop down menus, but Nutch only grabs
> one of them and does this kind of thing. It's an ancient page and
> unfortunately each option value is actually a URL, which I'm guessing is
> why Nutch is indexing the values, but it's odd that it only does it for
> one menu and not the other.
>
> Another problem is that in the web app, this wall of text doesn't wrap so
> it completely messes up any formatting to our custom search application.
>
> Lastly, this page is the first hit in the list of results for a search on
> the term "map" entirely due to the page being called "sitemap".
> I could probably fix this easy by just calling it something else, but I
> would like to know the best way to handle something like this in Nutch.
>
> First, how do I get nutch to stop pulling all the menu option values into
> the results list. Second, how do I reduce the relevancy of this page
> without changing the page itself?
> --
> View this message in context:
> http://n3.nabble.com/Weird-crawl-issue-Nutch-picking-up-drop-down-menu-options-tp721751p721751.html
> Sent from the Nutch - User mailing list archive at Nabble.com.