I have an old page on my site that Nutch is fetching. The results in the Nutch web app look like this:
Site Map ... INSECT SYSTEMATIC RESOURCES Home : Site Map search Resources by Scientific Name ... Common Name Select NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugsButterfliesCaddisfliesCicadasCockroachesCricketsDamselfliesDobsonfliesDragonfliesEarwigsFleasFliesFishfliesFootspinnersGladiatorsGnatsGrasshoppersHangingfliesHoppersJumping_BristletailsFirebratsKatydidsLacewingsLeaf_InsectsLiceMantidsMayfliesMitesMothsMosquitoesOwlfliesPsyllidsRock_CrawlersSawfliesScale_InsectsScorpionfliesSilverfishSkippersSnakefliesSpidersSpringtailsStonefliesTermitesThripsTicksTwisted_Wing_ParasitesWalking_SticksWaspsWebspinnersWhiteflies Site Map Also see the Taxonomic ...

The big block of text is the set of option values from a drop-down menu. What's weird is that this page has two drop-down menus, but Nutch only picks up one of them this way. It's an ancient page, and unfortunately each option value is actually a URL, which I'm guessing is why Nutch is indexing the values, but it's odd that it does this for only one menu and not the other.

Another problem is that in the web app this wall of text doesn't wrap, so it completely breaks the formatting of our custom search application.

Lastly, this page is the first hit in the results for a search on the term "map", entirely because the page is called "sitemap". I could probably fix this easily by just calling it something else, but I would like to know the best way to handle something like this in Nutch.

First, how do I get Nutch to stop pulling all the menu option values into the results list? Second, how do I reduce the relevancy of this page without changing the page itself?
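
To make the first question concrete, here is a rough Python sketch of the behaviour I think I'm seeing. It is not Nutch code and makes no claim about how Nutch's HTML parser actually works; `TextExtractor`, `extract_text`, and the sample markup are all made up for illustration. It only shows how adjacent `<option>` elements collapse into one unbroken string when the text is extracted, and how skipping everything inside `<select>` would avoid that.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, optionally ignoring everything inside given tags.

    Hypothetical helper for illustration only; unrelated to Nutch internals.
    """

    def __init__(self, skip_tags=()):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html, skip_tags=()):
    """Return the page text with runs of whitespace collapsed to one space."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    # Text nodes are joined with no separator, so options that sit next to
    # each other in the markup end up glued together.
    return " ".join("".join(parser.chunks).split())


# Made-up markup resembling the menu on my page (option values are URLs).
SAMPLE_HTML = """
<p>Common Name</p>
<select name="common">
<option value="">Select Name</option><option value="alderflies.html">Alderflies</option><option value="ants.html">Ants</option>
</select>
<p>Site Map</p>
"""

print(extract_text(SAMPLE_HTML))
# Common Name Select NameAlderfliesAnts Site Map
print(extract_text(SAMPLE_HTML, skip_tags=("select",)))
# Common Name Site Map
```

What I'm after is the second kind of output, but produced by Nutch itself at parse or index time rather than by editing the page.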