[jira] Updated: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

N. Hira updated NUTCH-585:
--------------------------

    Attachment: nutch-585-jostens-excludeDIVs.patch

We use Solr/Nutch on our corporate web site and are very happy with the results. Thank you.

We have struggled with something similar to NUTCH-585 for a few months now. Although it is different from the original intent, here's a quick/short patch that might help get this feature going again.

h4. Intended use:
- Let's assume you're crawling a set of internal web sites and would like to exclude certain HTML fragments (from indexing) like the navigation and other common content.
- If these fragments are contained in DIVs with IDs like menuNav, footerNav, etc., then you can now add a new property to nutch-site.xml to exclude these DIVs.
- If you don't set this property, the normal behavior remains (backward compatible)

{code:xml}
<property>
  <name>parser.html.divIDsToExclude</name>
  <value>account_menu_container,footer_menu_container,legal,main_menu_container</value>
  <description>
    A comma-delimited list of DIV IDs whose content will not be indexed.
    Use this to tell the HTML parser to ignore, for example, site navigation text.
    Note that DIVs with these IDs, and their children, will be silently ignored by the
    parser so verify the indexed content with Luke to confirm results.
  </description>
</property>
{code}

h4. Inclusion/growth:
- This code was written against nutch 1.2 and is backward compatible in that the new behavior is only present if configured.
- In future, it might be good to have different strategy patterns for how exclusions are determined; some might need algorithmic detection (whole web crawls), others might prefer jquery-selectors for HTML fragments, etc.

Best regards,
-h

Hira, N.R. (Jostens, Inc.)
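For readers without the patch at hand, the behavior described above (skip any DIV whose ID is listed in the property, together with its children) can be sketched roughly as follows. This is not the attached patch: the class and method names (DivExclusionSketch, parseIds, collectText, extractText) are hypothetical, and the JDK's XML parser stands in for the HTML parser Nutch actually uses, so the sample input has to be well-formed XHTML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; names are not taken from the patch.
public class DivExclusionSketch {

    // Turn the comma-delimited property value into a set of DIV IDs.
    static Set<String> parseIds(String propertyValue) {
        Set<String> ids = new HashSet<>();
        if (propertyValue == null) {
            return ids;
        }
        for (String id : propertyValue.split(",")) {
            String trimmed = id.trim();
            if (!trimmed.isEmpty()) {
                ids.add(trimmed);
            }
        }
        return ids;
    }

    // Depth-first walk that appends text nodes, skipping excluded DIV subtrees.
    static void collectText(Node node, Set<String> excluded, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            if ("div".equalsIgnoreCase(el.getTagName())
                    && excluded.contains(el.getAttribute("id"))) {
                return; // the DIV and all of its children are ignored
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), excluded, out);
        }
    }

    // Assumption: input is well-formed XHTML, since the JDK parser is strict.
    static String extractText(String xhtml, String divIdsToExclude) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), parseIds(divIdsToExclude), out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"main_menu_container\">Home | Products</div>"
                + "<div id=\"content\">Class rings</div>"
                + "<div id=\"legal\">Terms</div>"
                + "</body></html>";
        System.out.println(extractText(page, "main_menu_container,legal"));
    }
}
```

With an empty or missing property value the set of excluded IDs is empty and every DIV is kept, which mirrors the backward-compatible default mentioned in the list above.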
[PARSE-HTML plugin] Block certain parts of HTML code from being indexed
------------------------------------------------------------------------

                Key: NUTCH-585
                URL: https://issues.apache.org/jira/browse/NUTCH-585
            Project: Nutch
         Issue Type: Improvement
  Affects Versions: 0.9.0
        Environment: All operating systems
           Reporter: Andrea Spinelli
           Priority: Minor
        Attachments: nutch-585-jostens-excludeDIVs.patch

We are using nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches.

We have modified the plugin so that it ignores HTML code between certain HTML comments, like

<!-- START-IGNORE -->
... ignored part ...
<!-- STOP-IGNORE -->

We feel this might be useful to someone else, perhaps after factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).

We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
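The comment-delimited approach in the issue description can be illustrated with a small string-level sketch. This is not the reporter's code, which was never attached here; IgnoreMarkerSketch and stripIgnored are hypothetical names, and the choice to drop everything after an unterminated start marker is an assumption made only for this example.

```java
// Hypothetical illustration only; not the reporter's implementation.
public class IgnoreMarkerSketch {

    // Remove every region from a start marker through the matching stop marker.
    static String stripIgnored(String html, String start, String stop) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int s = html.indexOf(start, pos);
            if (s < 0) {
                // No further start marker: keep the remainder unchanged.
                out.append(html, pos, html.length());
                break;
            }
            out.append(html, pos, s);
            int e = html.indexOf(stop, s + start.length());
            if (e < 0) {
                // Assumption: an unterminated region is ignored to the end.
                break;
            }
            pos = e + stop.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The marker strings would come from configuration in the proposal
        // (parser.html.ignore.start / parser.html.ignore.stop).
        String start = "<!-- START-IGNORE -->";
        String stop = "<!-- STOP-IGNORE -->";
        String page = "<p>Article text</p>" + start
                + "<a href=\"?bg=blue\">blue background</a>" + stop
                + "<p>More text</p>";
        System.out.println(stripIgnored(page, start, stop));
    }
}
```

Stripping at the string level before parsing keeps the sketch short; a plugin could equally act on comment nodes while walking the parsed DOM.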
Build failed in Hudson: Nutch-trunk #1353
See https://hudson.apache.org/hudson/job/Nutch-trunk/1353/

------------------------------------------
[...truncated 1007 lines...]
A    src/plugin/subcollection/src/java/org/apache/nutch/collection
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A    src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A    src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A    src/plugin/subcollection/README.txt
A    src/plugin/subcollection/plugin.xml
A    src/plugin/subcollection/build.xml
A    src/plugin/index-more
A    src/plugin/index-more/ivy.xml
A    src/plugin/index-more/src
A    src/plugin/index-more/src/test
A    src/plugin/index-more/src/test/org
A    src/plugin/index-more/src/test/org/apache
A    src/plugin/index-more/src/test/org/apache/nutch
A    src/plugin/index-more/src/test/org/apache/nutch/indexer
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A    src/plugin/index-more/src/java
A    src/plugin/index-more/src/java/org
A    src/plugin/index-more/src/java/org/apache
A    src/plugin/index-more/src/java/org/apache/nutch
A    src/plugin/index-more/src/java/org/apache/nutch/indexer
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A    src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A    src/plugin/index-more/plugin.xml
A    src/plugin/index-more/build.xml
AU   src/plugin/plugin.dtd
A    src/plugin/parse-ext
A    src/plugin/parse-ext/ivy.xml
A    src/plugin/parse-ext/src
A    src/plugin/parse-ext/src/test
A    src/plugin/parse-ext/src/test/org
A    src/plugin/parse-ext/src/test/org/apache
A    src/plugin/parse-ext/src/test/org/apache/nutch
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A    src/plugin/parse-ext/src/java
A    src/plugin/parse-ext/src/java/org
A    src/plugin/parse-ext/src/java/org/apache
A    src/plugin/parse-ext/src/java/org/apache/nutch
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A    src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A    src/plugin/parse-ext/plugin.xml
A    src/plugin/parse-ext/build.xml
A    src/plugin/parse-ext/command
A    src/plugin/urlnormalizer-pass
A    src/plugin/urlnormalizer-pass/ivy.xml
A    src/plugin/urlnormalizer-pass/src
A    src/plugin/urlnormalizer-pass/src/test
A    src/plugin/urlnormalizer-pass/src/test/org
A    src/plugin/urlnormalizer-pass/src/test/org/apache
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A    src/plugin/urlnormalizer-pass/src/java
A    src/plugin/urlnormalizer-pass/src/java/org
A    src/plugin/urlnormalizer-pass/src/java/org/apache
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A    src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU   src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU   src/plugin/urlnormalizer-pass/plugin.xml
AU   src/plugin/urlnormalizer-pass/build.xml
A    src/plugin/parse-html
A    src/plugin/parse-html/ivy.xml
A    src/plugin/parse-html/lib
A    src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A    src/plugin/parse-html/src
A    src/plugin/parse-html/src/test
A    src/plugin/parse-html/src/test/org
A    src/plugin/parse-html/src/test/org/apache
A    src/plugin/parse-html/src/test/org/apache/nutch
A    src/plugin/parse-html/src/test/org/apache/nutch/parse
A