Hi

I would suggest trying the Tika parser if you are not using it already. It
might give you better-parsed HTML. Next, if you want to exclude some parts of
the text from the HTML, then you must rewrite the HTML parser and adapt it to
your needs.
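
As a rough illustration of what such an adaptation could look like, here is a
minimal sketch that walks a DOM tree and collects text while skipping every
<select> element, so drop-down option values never reach the extracted text.
It is not Nutch's actual parser code: the class name MenuTextFilter and both
methods are hypothetical, and it uses only the JDK's XML parser, so the input
must be well-formed XHTML. In a real plugin you would apply the same check to
the DOM that the HTML parser has already built.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: extracts text but ignores <select> menus.
public class MenuTextFilter {

    // Recursively collect text, skipping any <select> element and its children.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "select".equalsIgnoreCase(node.getNodeName())) {
            return; // drop the whole drop-down menu, options included
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' '); // separate text fragments with one space
                }
                out.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Parse a well-formed XHTML string (an assumption of this sketch) and extract its text.
    static String extractText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Site Map</p>"
                + "<select><option>Ants</option><option>Bees</option></select>"
                + "<p>Also see</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

For the sample page in main, the option values "Ants" and "Bees" are left out
and only the surrounding paragraph text is kept.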

To reduce the relevancy of your page, you can play with the boost values
which are configured in nutch-site.xml. But they apply to every crawled
page, so if you want to adjust only one (or several) pages, then you
might want to look at this patch.

https://issues.apache.org/jira/browse/NUTCH-16


Best Regards
Alexander Aristov


On 15 April 2010 21:09, tsmori <tim_m...@ncsu.edu> wrote:

>
> I have an old page on my site that Nutch is fetching. The results in the
> Nutch web app look like this:
>
> Site Map
> ... INSECT SYSTEMATIC RESOURCES Home : Site Map   search Resources by
> Scientific Name ... Common Name Select
>
> NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugsButterfliesCaddisfliesCicadasCockroachesCricketsDamselfliesDobsonfliesDragonfliesEarwigsFleasFliesFishfliesFootspinnersGladiatorsGnatsGrasshoppersHangingfliesHoppersJumping_BristletailsFirebratsKatydidsLacewingsLeaf_InsectsLiceMantidsMayfliesMitesMothsMosquitoesOwlfliesPsyllidsRock_CrawlersSawfliesScale_InsectsScorpionfliesSilverfishSkippersSnakefliesSpidersSpringtailsStonefliesTermitesThripsTicksTwisted_Wing_ParasitesWalking_SticksWaspsWebspinnersWhiteflies
> Site Map Also see the Taxonomic ...
>
>
> The big block of text there is the set of option values in a drop-down
> menu. What's weird is that this page has 2 drop-down menus, but Nutch only
> grabs one of them and does this kind of thing. It's an ancient page and
> unfortunately each option value is actually a URL, which I'm guessing is
> why Nutch is indexing the values, but it's odd that it only does it for
> one menu and not the other.
>
> Another problem is that in the web app, this wall of text doesn't wrap, so
> it completely messes up the formatting of our custom search application.
>
> Lastly, this page is the first hit in the list of results for a search on
> the term "map", entirely due to the page being called "sitemap". I could
> probably fix this easily by just calling it something else, but I would
> like to know the best way to handle something like this in Nutch.
>
> First, how do I get Nutch to stop pulling all the menu option values into
> the results list? Second, how do I reduce the relevancy of this page
> without changing the page itself?
> --
> View this message in context:
> http://n3.nabble.com/Weird-crawl-issue-Nutch-picking-up-drop-down-menu-options-tp721751p721751.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
