[RESULT] [VOTE] Move 2.0 out of trunk
Hi Folks,

Okey dok, this VOTE has passed with the following tallies:

+1 PMC
Markus Jelsma
Sami Siren
Chris Mattmann
Lewis John McGibbney
Dennis Kubes
Julien Nioche
Andrzej Bialecki

-1 PMC
Alexis de Tréglodé

-1 Community
Radim Kola

Accordingly we will move the current Nutch trunk to a new branch, nutchgora (http://svn.apache.org/repos/asf/nutch/branches/nutchgora), and then move the current 1.4 development branch into trunk.

I assume the two commands below would do the trick?

svn mv https://svn.apache.org/repos/asf/nutch/trunk https://svn.apache.org/repos/asf/nutch/branches/nutchgora

svn mv https://svn.apache.org/repos/asf/nutch/branches/branch-1.4/ https://svn.apache.org/repos/asf/nutch/trunk

Thanks

Julien

On 18 September 2011 10:21, Julien Nioche lists.digitalpeb...@gmail.com wrote:

> Hi,
>
> Following the discussions [1] on the dev-list about the future of Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained.
>
> The arguments for / against can be found in the thread I mentioned.
>
> The vote is open for the next 72 hours.
>
> [ ] +1 : Shelve 2.0 and move 1.4 to trunk
> [ ] 0 : No opinion
> [ ] -1 : Bad idea. Please give justification.
>
> Thanks
>
> Julien
>
> [1] http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201109.mbox/%3cca+-fm0tj2kvuco0wwkxbj6hsamxx5819ujv7lco2vo2kd2z...@mail.gmail.com%3E

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
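Editorial sketch of the two moves proposed above, written as a dry run that only builds and prints the commands. A URL-to-URL `svn mv` commits immediately on the server, so each invocation needs a log message (`-m`). The log messages and the helper name `build_move_commands` are assumptions for illustration, not what was actually run on the Nutch repository.

```python
import shlex

# Repository root taken from the URLs quoted in the message above.
REPO = "https://svn.apache.org/repos/asf/nutch"


def build_move_commands(repo=REPO):
    """Return the two svn invocations as argv lists, in the order they must run."""
    return [
        # 1. Park the current trunk (Nutch 2.0) on a new branch. This must run
        #    first, otherwise the second move would target an existing path.
        ["svn", "mv", f"{repo}/trunk", f"{repo}/branches/nutchgora",
         "-m", "Move Nutch 2.0 from trunk to branches/nutchgora"],  # hypothetical log message
        # 2. Promote the 1.4 development branch to trunk.
        ["svn", "mv", f"{repo}/branches/branch-1.4", f"{repo}/trunk",
         "-m", "Promote branch-1.4 to trunk"],  # hypothetical log message
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in build_move_commands():
        print(shlex.join(argv))
```

The order matters: the second command reuses the `trunk` path that the first one frees up.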
[jira] [Commented] (NUTCH-1005) Index headings plugin
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109406#comment-13109406 ]

Markus Jelsma commented on NUTCH-1005:

Comments?

> Index headings plugin
>
> Key: NUTCH-1005
> URL: https://issues.apache.org/jira/browse/NUTCH-1005
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4
> Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
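Editorial sketch of what "extracting a comma separated list of headings" could look like. This is not the attached HeadingsParseFilter.java: it uses only the Python standard library, it assumes the directive holds heading tag names such as "h1,h2", and the names `HeadingCollector` and `extract_headings` are hypothetical.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of the heading tags named in a comma separated directive."""

    def __init__(self, headings_directive):
        super().__init__()
        # "h1, h2" -> {"h1", "h2"}
        self.wanted = {t.strip().lower() for t in headings_directive.split(",") if t.strip()}
        self.found = {}        # tag name -> list of heading texts
        self._current = None   # heading tag currently being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse whitespace so inline markup does not leave odd gaps.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.found.setdefault(tag, []).append(text)
            self._current = None


def extract_headings(html, headings_directive="h1,h2"):
    collector = HeadingCollector(headings_directive)
    collector.feed(html)
    collector.close()
    return collector.found
```

Each collected list could then be written to an index field named after the tag, which is roughly the split between a parse filter and an indexing filter that the two attachments suggest.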
[jira] [Commented] (NUTCH-1005) Index headings plugin
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109435#comment-13109435 ]

Julien Nioche commented on NUTCH-1005:

Let's try and come up with a single plugin for index-extra, urlmeta, NUTCH-809 and this one. Many of these things are related and could be done in a generic way.

> Index headings plugin
>
> Key: NUTCH-1005
> URL: https://issues.apache.org/jira/browse/NUTCH-1005
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4
> Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
> Very simple plugin for extracting and indexing a comma separated list of headings via the headings configuration directive.
Extension of NUTCH-585 - blacklist whitelist plugin
Hi,

Based on the suggestions/code from https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin to blacklist or whitelist HTML elements.

This was based on the need for not indexing header/footer/navigation, so that the user gets only really relevant results, e.g. even if the term shows up in the navigation.

The elements to be parsed (or not) can be defined by using CSS-like selectors. A new field called strippedContent is available in the index which can be used for searching. Links are still crawled and parsed from the content field, allowing all pages to be parsed.

The full documentation is in the README.txt in the patch. The patch can be found on: http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch

Maybe it is of help to someone :)

Best,
Elisabeth
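Editorial sketch of the blacklist idea described above, using only the Python standard library. It is not the code from the patch: it covers only the blacklist mode, supports only a tiny selector subset ("tag", "#id", ".class", "tag.class", "tag#id"), assumes reasonably well-formed HTML, and all names (`matches`, `BlacklistStripper`, `stripped_content`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match a small CSS-like subset: 'tag', '#id', '.class', 'tag.class', 'tag#id'."""
    attrs = dict(attrs)
    classes = (attrs.get("class") or "").split()
    if "#" in selector:
        name, _, ident = selector.partition("#")
        return (not name or name == tag) and attrs.get("id") == ident
    if "." in selector:
        name, _, cls = selector.partition(".")
        return (not name or name == tag) and cls in classes
    return selector == tag


class BlacklistStripper(HTMLParser):
    """Collect text, skipping everything inside blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__()
        self.blacklist = blacklist
        self.skip_depth = 0   # > 0 while inside a blacklisted element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1
        elif any(matches(s, tag, attrs) for s in self.blacklist):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def stripped_content(html, blacklist):
    """Return the text a field like strippedContent might hold."""
    parser = BlacklistStripper(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Link extraction would keep working on the unstripped content, which matches the note above that links are still parsed from the content field.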
Re: Extension of NUTCH-585 - blacklist whitelist plugin
Elisabeth,

Great. Could you attach your patch to the original issue in JIRA instead and check the box "Grant license to ASF for inclusion in ASF works"?

Julien

On 21 September 2011 16:47, Elisabeth Adler elisabeth.ad...@gmail.com wrote:

> Hi,
>
> Based on the suggestions/code from https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin to blacklist or whitelist HTML elements.
>
> This was based on the need for not indexing header/footer/navigation, so that the user gets only really relevant results, e.g. even if the term shows up in the navigation.
>
> The elements to be parsed (or not) can be defined by using CSS-like selectors. A new field called strippedContent is available in the index which can be used for searching. Links are still crawled and parsed from the content field, allowing all pages to be parsed.
>
> The full documentation is in the README.txt in the patch. The patch can be found on: http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch
>
> Maybe it is of help to someone :)
>
> Best,
> Elisabeth
Re: Extension of NUTCH-585 - blacklist whitelist plugin
Done. The patch is available at https://issues.apache.org/jira/browse/NUTCH-585

Best,
Elisabeth

On 21.09.2011 17:51, Julien Nioche wrote:

> Elisabeth,
>
> Great. Could you attach your patch to the original issue in JIRA instead and check the box "Grant license to ASF for inclusion in ASF works"?
>
> Julien
>
> On 21 September 2011 16:47, Elisabeth Adler elisabeth.ad...@gmail.com wrote:
>
> > Hi,
> >
> > Based on the suggestions/code from https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin to blacklist or whitelist HTML elements.
> >
> > This was based on the need for not indexing header/footer/navigation, so that the user gets only really relevant results, e.g. even if the term shows up in the navigation.
> >
> > The elements to be parsed (or not) can be defined by using CSS-like selectors. A new field called strippedContent is available in the index which can be used for searching. Links are still crawled and parsed from the content field, allowing all pages to be parsed.
> >
> > The full documentation is in the README.txt in the patch. The patch can be found on: http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch
> >
> > Maybe it is of help to someone :)
> >
> > Best,
> > Elisabeth
[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elisabeth Adler updated NUTCH-585:

Attachment: blacklist_whitelist_plugin.patch

Based on the suggestions/code above, I have created a plugin to blacklist or whitelist HTML elements (blacklist_whitelist_plugin.patch).

This was based on the need for not indexing header/footer/navigation, so that the user gets only really relevant results, e.g. even if the term shows up in the navigation.

The elements to be parsed (or not) can be defined by using CSS-like selectors. A new field called strippedContent is available in the index which can be used for searching. Links are still crawled and parsed from the content field, allowing all pages to be parsed.

The full documentation is in the README.txt in the patch.

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Environment: All operating systems
> Reporter: Andrea Spinelli
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4
> Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
> We are using Nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches.
>
> We have modified the plugin so that it ignores HTML code between certain HTML comments, like
>
> <!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE -->
>
> We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
>
> We are almost ready to contribute our code snippet. Looking forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!
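Editorial sketch of the comment-marker approach from the original issue description. It is not the reporter's code: it is a regular-expression approximation in Python, and the function name `strip_ignored` and its `start`/`stop` parameters are hypothetical stand-ins for the proposed parser.html.ignore.start and parser.html.ignore.stop settings.

```python
import re


def strip_ignored(html, start="START-IGNORE", stop="STOP-IGNORE"):
    """Remove everything between <!-- start --> and <!-- stop --> comments.

    The marker strings are parameters so they could be read from
    configuration instead of being hard-coded.
    """
    pattern = re.compile(
        r"<!--\s*" + re.escape(start) + r"\s*-->"   # opening marker comment
        r".*?"                                      # shortest possible ignored part
        r"<!--\s*" + re.escape(stop) + r"\s*-->",   # closing marker comment
        re.DOTALL,                                  # ignored part may span lines
    )
    return pattern.sub("", html)
```

With this sketch, a start marker that has no matching stop marker leaves the rest of the page untouched; a real parser plugin would have to decide whether that is the desired behaviour.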
Re: [RESULT] [VOTE] Move 2.0 out of trunk
Guys,

If no one objects, I will execute the move Friday by 12pm PDT. Will that work?

Cheers,
Chris

On Sep 21, 2011, at 3:09 AM, Julien Nioche wrote:

> Hi Folks,
>
> Okey dok, this VOTE has passed with the following tallies:
>
> +1 PMC
> Markus Jelsma
> Sami Siren
> Chris Mattmann
> Lewis John McGibbney
> Dennis Kubes
> Julien Nioche
> Andrzej Bialecki
>
> -1 PMC
> Alexis de Tréglodé
>
> -1 Community
> Radim Kola
>
> Accordingly we will move the current Nutch trunk to a new branch, nutchgora, and then move the current 1.4 development branch into trunk.
>
> I assume the two commands below would do the trick?
>
> svn mv https://svn.apache.org/repos/asf/nutch/trunk https://svn.apache.org/repos/asf/nutch/branches/nutchgora
>
> svn mv https://svn.apache.org/repos/asf/nutch/branches/branch-1.4/ https://svn.apache.org/repos/asf/nutch/trunk
>
> Thanks
>
> Julien
>
> On 18 September 2011 10:21, Julien Nioche lists.digitalpeb...@gmail.com wrote:
>
> > Hi,
> >
> > Following the discussions [1] on the dev-list about the future of Nutch 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained.
> >
> > The arguments for / against can be found in the thread I mentioned.
> >
> > The vote is open for the next 72 hours.
> >
> > [ ] +1 : Shelve 2.0 and move 1.4 to trunk
> > [ ] 0 : No opinion
> > [ ] -1 : Bad idea. Please give justification.
> >
> > Thanks
> >
> > Julien
> >
> > [1] http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html

++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++