[RESULT] [VOTE] Move 2.0 out of trunk

2011-09-21 Thread Julien Nioche
Hi Folks,

Okey dok, this VOTE has passed with the following tallies:

+1 PMC
Markus Jelsma
Sami Siren
Chris Mattmann
Lewis John McGibbney
Dennis Kubes
Julien Nioche
Andrzej Bialecki

-1 PMC
Alexis de Tréglodé

-1 Community
Radim Kola


Accordingly we will move the current Nutch trunk to a new branch
nutchgora (http://svn.apache.org/repos/asf/nutch/branches/nutchgora) and
then will move the current 1.4-development branch into trunk.

I assume the two commands below would do the trick?

svn mv https://svn.apache.org/repos/asf/nutch/trunk https://svn.apache.org/repos/asf/nutch/branches/nutchgora
svn mv https://svn.apache.org/repos/asf/nutch/branches/branch-1.4/ https://svn.apache.org/repos/asf/nutch/trunk
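
As a dry run of the two moves proposed above, the commands could be assembled and printed before anyone executes them. This is only a sketch: the commit messages and the helper name plan_moves are assumptions made for illustration, and nothing here touches the repository. The order matters, since trunk has to be vacated before branch-1.4 can take its place.

```python
import shlex

REPO = "https://svn.apache.org/repos/asf/nutch"


def plan_moves(repo=REPO):
    """Return the two server-side moves, in the order they must run."""
    return [
        # 1. Shelve the 2.0 code: trunk becomes branches/nutchgora.
        ["svn", "mv", "-m", "Move Nutch 2.0 from trunk to the nutchgora branch",
         f"{repo}/trunk", f"{repo}/branches/nutchgora"],
        # 2. Promote 1.4: branches/branch-1.4 becomes the new trunk.
        ["svn", "mv", "-m", "Promote branch-1.4 to trunk",
         f"{repo}/branches/branch-1.4", f"{repo}/trunk"],
    ]


# Print only; run the commands by hand once the plan has been reviewed.
for cmd in plan_moves():
    print(shlex.join(cmd))
```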


Thanks

Julien




On 18 September 2011 10:21, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

 Hi,

 Following the discussions [1] on the dev-list about the future of Nutch
 2.0, I would like to call for a vote on moving Nutch 2.0 from the trunk to a
 separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained. The
 arguments for / against can be found in the thread I mentioned.

 The vote is open for the next 72 hours.

 [ ] +1 : Shelve 2.0 and move 1.4 to trunk
 [ ] 0 : No opinion
 [ ] -1 : Bad idea.  Please give justification.

 Thanks

 Julien

 [1] http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html
 http://mail-archives.apache.org/mod_mbox/nutch-dev/201109.mbox/%3cca+-fm0tj2kvuco0wwkxbj6hsamxx5819ujv7lco2vo2kd2z...@mail.gmail.com%3E

 --
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Commented] (NUTCH-1005) Index headings plugin

2011-09-21 Thread Markus Jelsma (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109406#comment-13109406 ]

Markus Jelsma commented on NUTCH-1005:
--

Comments?

 Index headings plugin
 -

 Key: NUTCH-1005
 URL: https://issues.apache.org/jira/browse/NUTCH-1005
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4

 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, 
 NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch


 Very simple plugin for extracting and indexing a comma-separated list of
 headings via the headings configuration directive.
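
 To illustrate the idea only (this is not the attached Java code), here is a
 minimal Python sketch that reads a comma-separated headings directive and
 collects the text of the matching elements. The names HeadingCollector and
 extract_headings, and the "h1,h2" default, are assumptions made for this
 example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of the heading tags named in a comma-separated directive."""

    def __init__(self, headings_directive):
        super().__init__()
        # "h1, h2" -> {"h1", "h2"}
        self.wanted = {h.strip().lower() for h in headings_directive.split(",") if h.strip()}
        self.current = None   # heading tag currently open, if any
        self.buffer = []      # text pieces seen inside that heading
        self.found = {}       # tag -> list of heading texts

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted and self.current is None:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            # Collapse runs of whitespace inside the heading.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.found.setdefault(tag, []).append(text)
            self.current = None


def extract_headings(html, headings_directive="h1,h2"):
    collector = HeadingCollector(headings_directive)
    collector.feed(html)
    collector.close()
    return collector.found


doc = "<h1>Apache <em>Nutch</em></h1><p>text</p><h2>Crawling</h2><h3>skipped</h3>"
print(extract_headings(doc, "h1, h2"))
```

 Each key of the returned mapping could then become one index field holding
 the collected heading texts.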

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1005) Index headings plugin

2011-09-21 Thread Julien Nioche (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109435#comment-13109435 ]

Julien Nioche commented on NUTCH-1005:
--

Let's try to come up with a single plugin for index-extra, urlmeta, NUTCH-809
and this one. Many of these things are related and could be done in a generic
way.

 Index headings plugin
 -

 Key: NUTCH-1005
 URL: https://issues.apache.org/jira/browse/NUTCH-1005
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4

 Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java, 
 NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch


 Very simple plugin for extracting and indexing a comma-separated list of
 headings via the headings configuration directive.





Extension of NUTCH-585 - blacklist whitelist plugin

2011-09-21 Thread Elisabeth Adler

Hi,

Based on the suggestions/code from
https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin
to blacklist or whitelist HTML elements. This was based on the need to
keep header/footer/navigation out of the index, so that the user gets only
really relevant results and not pages that match merely because the term
shows up in the navigation.


The elements to be parsed (or not) can be defined by using CSS-like 
selectors. A new field called strippedContent is available in the 
index which can be used for searching. Links are still crawled and 
parsed from the content field, allowing all pages to be parsed. The 
full documentation is in the README.txt in the patch.
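
For readers who want a feel for the blacklist side of this, here is a
minimal Python sketch. It is not the plugin's code (the patch is Java) and
only handles the simplest selectors, "tag", "#id", ".class", "tag#id" and
"tag.class"; the names parse_selector, BlacklistTextExtractor and
stripped_content are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def parse_selector(selector):
    """Split 'div#footer' or 'ul.nav' into (tag, id, class); missing parts are None."""
    tag, elem_id, cls = selector, None, None
    if "#" in selector:
        tag, elem_id = selector.split("#", 1)
    elif "." in selector:
        tag, cls = selector.split(".", 1)
    return (tag or None, elem_id, cls)


class BlacklistTextExtractor(HTMLParser):
    """Collect page text, skipping everything inside blacklisted elements."""

    def __init__(self, selectors):
        super().__init__()
        self.rules = [parse_selector(s) for s in selectors]
        self.skip_depth = 0   # > 0 while inside a blacklisted element
        self.chunks = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        for rule_tag, rule_id, rule_cls in self.rules:
            if rule_tag and rule_tag != tag:
                continue
            if rule_id and attrs.get("id") != rule_id:
                continue
            if rule_cls and rule_cls not in classes:
                continue
            return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._matches(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def stripped_content(html, blacklist):
    """Return the text that would go into a strippedContent-style field."""
    parser = BlacklistTextExtractor(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ('<div id="header">Nutch home</div>'
        '<p>Nutch is a crawler.<br>Really.</p>'
        '<ul class="nav main"><li>Nutch docs</li></ul>')
print(stripped_content(page, ["div#header", "ul.nav"]))
```

The sketch assumes well-nested HTML; a real parse filter would work on the
DOM that the HTML parser has already repaired, and a whitelist would invert
the matching so that only text inside selected elements is kept.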


The patch can be found on: 
http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch


Maybe it is of help to someone:)
Best,
Elisabeth


Re: Extension of NUTCH-585 - blacklist whitelist plugin

2011-09-21 Thread Julien Nioche
Elisabeth,

Great. Could you attach your patch to the original issue in JIRA instead and
check the box : Grant license to ASF for inclusion in ASF works?

Julien

On 21 September 2011 16:47, Elisabeth Adler <elisabeth.ad...@gmail.com> wrote:

 Hi,

 Based on the suggestions/code from
 https://issues.apache.org/jira/browse/NUTCH-585, I have created a plugin
 to blacklist or whitelist HTML elements. This was based on the need to
 keep header/footer/navigation out of the index, so that the user gets only
 really relevant results and not pages that match merely because the term
 shows up in the navigation.

 The elements to be parsed (or not) can be defined by using CSS-like
 selectors. A new field called strippedContent is available in the index
 which can be used for searching. Links are still crawled and parsed from the
 content field, allowing all pages to be parsed. The full documentation is
 in the README.txt in the patch.

 The patch can be found on:
 http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch

 Maybe it is of help to someone:)
 Best,
 Elisabeth




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


Re: Extension of NUTCH-585 - blacklist whitelist plugin

2011-09-21 Thread Elisabeth Adler

Done. Patch is available in https://issues.apache.org/jira/browse/NUTCH-585
Best,
Elisabeth

On 21.09.2011 17:51, Julien Nioche wrote:

Elisabeth,

Great. Could you attach your patch to the original issue in JIRA 
instead and check the box : Grant license to ASF for inclusion in ASF 
works?


Julien

On 21 September 2011 16:47, Elisabeth Adler <elisabeth.ad...@gmail.com> wrote:


Hi,

Based on the suggestions/code from
https://issues.apache.org/jira/browse/NUTCH-585, I have created a
plugin to blacklist or whitelist HTML elements. This was based on
the need to keep header/footer/navigation out of the index, so that
the user gets only really relevant results and not pages that match
merely because the term shows up in the navigation.

The elements to be parsed (or not) can be defined by using
CSS-like selectors. A new field called strippedContent is
available in the index which can be used for searching. Links are
still crawled and parsed from the content field, allowing all
pages to be parsed. The full documentation is in the README.txt in
the patch.

The patch can be found on:

http://www.scintillation.at/files/nutwe03mnyzwb/blacklist_whitelist_plugin.patch

Maybe it is of help to someone:)
Best,
Elisabeth




--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-21 Thread Elisabeth Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elisabeth Adler updated NUTCH-585:
--

Attachment: blacklist_whitelist_plugin.patch

Based on the suggestions/code above, I have created a plugin to blacklist or
whitelist HTML elements (blacklist_whitelist_plugin.patch). This was based on
the need to keep header/footer/navigation out of the index, so that the user
gets only really relevant results and not pages that match merely because the
term shows up in the navigation.

The elements to be parsed (or not) can be defined by using CSS-like selectors. 
A new field called strippedContent is available in the index which can be 
used for searching. Links are still crawled and parsed from the content 
field, allowing all pages to be parsed. The full documentation is in the 
README.txt in the patch.

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factorizing the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any
 expression of interest - or to an explanation of why what we are doing is
 plain wrong!
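
 A minimal sketch of the comment-based approach described above, written in
 Python rather than as a patch to the parse-html plugin. The function name
 strip_ignored is an assumption for this example; the marker strings default
 to the ones shown above and are parameters so that they could later come
 from configuration (the issue suggests parser.html.ignore.start and
 parser.html.ignore.stop).

```python
import re

# Default markers as shown in the issue; in a configurable version these
# would be read from the site configuration instead of being hard-coded.
IGNORE_START = "<!-- START-IGNORE -->"
IGNORE_STOP = "<!-- STOP-IGNORE -->"


def strip_ignored(html, start=IGNORE_START, stop=IGNORE_STOP):
    """Remove every start...stop block, markers included, before indexing."""
    # Non-greedy match so that several ignore blocks on one page are
    # removed separately; DOTALL lets a block span multiple lines.
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(stop), re.DOTALL)
    return pattern.sub("", html)


page = ("<p>keep</p>"
        "<!-- START-IGNORE --><a href='#'>change background</a><!-- STOP-IGNORE -->"
        "<p>also keep</p>")
print(strip_ignored(page))
```

 A start marker without a matching stop marker is left untouched by this
 sketch, so a forgotten STOP-IGNORE does not silently drop the rest of the
 page.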





Re: [RESULT] [VOTE] Move 2.0 out of trunk

2011-09-21 Thread Mattmann, Chris A (388J)
Guys,

If no one objects, I will execute the move Friday by 12pm PDT.

Will that work?

Cheers,
Chris

On Sep 21, 2011, at 3:09 AM, Julien Nioche wrote:

 Hi Folks,
 
 Okey dok, this VOTE has passed with the following tallies:
 
 +1 PMC
 Markus Jelsma
 Sami Siren
 Chris Mattmann
 Lewis John McGibbney
 Dennis Kubes
 Julien Nioche
 Andrzej Bialecki
 
 -1 PMC
 Alexis de Tréglodé
 
 -1 Community
 Radim Kola
 
 
 Accordingly we will move the current Nutch trunk to a new branch nutchgora 
 and then will move the current 1.4-development branch into trunk.
 
 I assume the two commands below would do the trick?
 
 svn mv https://svn.apache.org/repos/asf/nutch/trunk 
 https://svn.apache.org/repos/asf/nutch/branches/nutchgora
 svn mv https://svn.apache.org/repos/asf/nutch/branches/branch-1.4/ 
 https://svn.apache.org/repos/asf/nutch/trunk
 
 
 Thanks
 
 Julien
 
 
 
 
 On 18 September 2011 10:21, Julien Nioche lists.digitalpeb...@gmail.com 
 wrote:
 Hi, 
 
 Following the discussions [1] on the dev-list about the future of Nutch 2.0, 
 I would like to call for a vote on moving Nutch 2.0 from the trunk to a 
 separate branch, promote 1.4 to trunk and consider 2.0 as unmaintained. The 
 arguments for / against can be found in the thread I mentioned.
 
 The vote is open for the next 72 hours. 
 
 [ ] +1 : Shelve 2.0 and move 1.4 to trunk
 [ ] 0 : No opinion
 [ ] -1 : Bad idea.  Please give justification.
 
 Thanks
 
 Julien
 
 [1] http://www.mail-archive.com/gora-dev@incubator.apache.org/msg00483.html
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++