[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-585:
--
Priority: Major  (was: Minor)

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>  Components: HTML, parse-filter, parser, plugin
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using Nutch to index our own web sites; we would like not to index 
> certain parts of our pages, because we know they are not relevant (for 
> instance, there are several links to change the background color) and 
> generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML 
> comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, maybe factoring out the comment 
> strings as constants in the configuration files (say parser.html.ignore.start 
> and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. Looking forward to any 
> expression of interest - or to an explanation of why what we are doing is 
> plain wrong!
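
The comment-marker idea described above could be sketched roughly as follows. This is only an illustration, not the reporter's patch: the class name IgnoreBlockStripper and the strip method are hypothetical, and the marker strings are simply the START-IGNORE / STOP-IGNORE comments from the description, which the issue suggests making configurable via parser.html.ignore.start and parser.html.ignore.stop.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes marked regions from raw HTML
// before the page text is extracted for indexing.
class IgnoreBlockStripper {
    // Assumed defaults taken from the issue description.
    static final String START = "<!-- START-IGNORE -->";
    static final String STOP = "<!-- STOP-IGNORE -->";

    /** Removes every region from a start marker through the next stop marker. */
    static String strip(String html, String start, String stop) {
        // Quote the markers so they are matched literally; DOTALL lets the
        // ignored region span several lines; ".*?" stops at the first stop marker.
        Pattern region = Pattern.compile(
                Pattern.quote(start) + ".*?" + Pattern.quote(stop), Pattern.DOTALL);
        return region.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>keep</p>" + START
                + "<a href=\"#\">red background</a>" + STOP
                + "<p>also keep</p>";
        System.out.println(strip(html, START, STOP));
    }
}
```

In this sketch a start marker without a matching stop marker is left untouched, so a forgotten STOP-IGNORE does not silently drop the rest of the page.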



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-585:
--
Component/s: parse-filter
 HTML
 parser
 plugin

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>  Components: HTML, parse-filter, parser, plugin
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-585:
--
Fix Version/s: 1.20

> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> ---
>
> Key: NUTCH-585
> URL: https://issues.apache.org/jira/browse/NUTCH-585
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: All operating systems
>Reporter: Andrea Spinelli
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.20
>
> Attachments: blacklist_whitelist_plugin.patch, 
> nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-10-29 Thread Roberto Gardenier (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roberto Gardenier updated NUTCH-585:


Comment: was deleted

(was: I compiled Nutch 1.5.1 with the provided plugin and used the 
configuration described above, without success. 
Could anyone help me troubleshoot?

Nutch crawls and Solr indexes successfully, but the content field still 
includes content that is supposed to be blacklisted.

Steps:
1. Patched Nutch 1.5.1 with the blacklist_whitelist_plugin.patch above.
2. Enabled the plugin in nutch-default.xml via plugin.includes: 
index-blacklist-whitelist.
3. Added the new field strippedContent to schema.xml (both Nutch and Solr): 
<!-- fields for the blacklist/whitelist plugin --> <field name="strippedContent" 
type="text" stored="true" indexed="true"/>.
4. Configured parser.html.blacklist to blacklist div.kruimelspoor in 
nutch-default.xml.

I pointed Nutch at my site and started a crawl. I don't get warnings, errors, or 
any other showstoppers; the crawl goes well and the index is filled, but it 
still contains everything inside div.kruimelspoor.
)

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch


 We are using Nutch to index our own web sites; we would like not to index 
 certain parts of our pages, because we know they are not relevant (for 
 instance, there are several links to change the background color) and 
 generate spurious matches.
 We have modified the plugin so that it ignores HTML code between certain HTML 
 comments, like
 <!-- START-IGNORE -->
 ... ignored part ...
 <!-- STOP-IGNORE -->
 We feel this might be useful to someone else, maybe factoring out the comment 
 strings as constants in the configuration files (say parser.html.ignore.start 
 and parser.html.ignore.stop in nutch-site.xml).
 We are almost ready to contribute our code snippet. Looking forward to any 
 expression of interest - or to an explanation of why what we are doing is 
 plain wrong!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2012-04-03 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-585:


Fix Version/s: (was: 1.5)
   1.6

20120304-push-1.6

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch






[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-28 Thread Julien Nioche (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-585:


Fix Version/s: (was: 1.4)
   1.5

Marking for 1.5. Needs reviewing and won't make it into 1.4

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.5

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch






[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-21 Thread Elisabeth Adler (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elisabeth Adler updated NUTCH-585:
--

Attachment: blacklist_whitelist_plugin.patch

Based on the suggestions/code above, I have created a plugin to blacklist or 
whitelist HTML elements (blacklist_whitelist_plugin.patch). It grew out of the 
need to keep headers, footers, and navigation out of the index, so that a page 
is not returned merely because the search term appears in its navigation.

The elements to be parsed (or not) can be defined by using CSS-like selectors. 
A new field called strippedContent is available in the index which can be 
used for searching. Links are still crawled and parsed from the content 
field, allowing all pages to be parsed. The full documentation is in the 
README.txt in the patch.
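
To illustrate the blacklist idea, here is a minimal sketch and not the code from the patch. The class SelectorBlacklist and its methods are hypothetical, the supported selector forms (tag, tag.class, tag#id, .class, #id) are an assumption about what "CSS-like" could mean, and the input is assumed to be well-formed XHTML so the JDK's XML parser can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch, not the plugin's implementation.
class SelectorBlacklist {
    /** True if the element matches a simple selector with at most one qualifier. */
    static boolean matches(Element e, String selector) {
        int cut = Math.max(selector.indexOf('.'), selector.indexOf('#'));
        String tag = cut < 0 ? selector : selector.substring(0, cut);
        if (!tag.isEmpty() && !tag.equalsIgnoreCase(e.getTagName())) return false;
        if (cut < 0) return true;                       // bare tag selector
        String name = selector.substring(cut + 1);
        if (name.isEmpty()) return false;               // malformed, e.g. "div."
        if (selector.charAt(cut) == '#') return name.equals(e.getAttribute("id"));
        // class attribute may hold several space-separated class names
        return Arrays.asList(e.getAttribute("class").trim().split("\\s+")).contains(name);
    }

    /** Appends text nodes to out, skipping blacklisted elements and their children. */
    static void collectText(Node node, List<String> blacklist, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) out.append(out.length() > 0 ? " " : "").append(t);
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (String s : blacklist) {
                if (matches((Element) node, s)) return;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, blacklist, out);
        }
    }

    /** Extracts the text of a well-formed XHTML string, minus blacklisted parts. */
    static String text(String xhtml, List<String> blacklist) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            collectText(root, blacklist, out);
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XHTML", ex);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div class=\"kruimelspoor\">Home / Products</div>"
                + "<p>Actual article text</p></body></html>";
        System.out.println(text(page, Arrays.asList()));                   // full content
        System.out.println(text(page, Arrays.asList("div.kruimelspoor"))); // stripped content
    }
}
```

The two calls in main mirror the two fields described above: the unfiltered text stands in for content, the filtered text for strippedContent. Link extraction is not covered by this sketch.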

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4

 Attachments: blacklist_whitelist_plugin.patch, 
 nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch






[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2011-09-17 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-585:


   Patch Info: [Patch Available]
Fix Version/s: 1.4
 Assignee: Markus Jelsma

Marked for 1.4. Thanks!

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4

 Attachments: nutch-585-excludeNodes.patch, 
 nutch-585-jostens-excludeDIVs.patch






[jira] Updated: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2010-12-30 Thread N. Hira (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

N. Hira updated NUTCH-585:
--

Attachment: nutch-585-jostens-excludeDIVs.patch

We use Solr/Nutch on our corporate web site and are very happy with the 
results.  Thank you.  We have struggled with something similar to NUTCH-585 for 
a few months now.

Although it is different from the original intent, here's a quick/short patch 
that might help get this feature going again.


h4.Intended use:
- Let's assume you're crawling a set of internal web sites and would like to 
exclude certain HTML fragments (from indexing) like the navigation and other 
common content.
- If these fragments are contained in DIVs with IDs like menuNav, 
footerNav, etc., then you can now add a new property to nutch-site.xml to 
exclude these DIVs.
- If you don't set this property, the normal behavior remains (backward 
compatible)
{code:xml}
<property>
  <name>parser.html.divIDsToExclude</name>
  <value>account_menu_container,footer_menu_container,legal,main_menu_container</value>
  <description>
  A comma-delimited list of DIV IDs whose content will not be indexed.  Use this to tell
  the HTML parser to ignore, for example, site navigation text.
  Note that DIVs with these IDs, and their children, will be silently ignored by the parser
  so verify the indexed content with Luke to confirm results.
  </description>
</property>
{code}
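
How such a property might be applied can be sketched as below. This is a hedged illustration, not the attached patch: the class DivIdExcluder and its methods are hypothetical, the page is assumed to be well-formed XHTML with lowercase div tags, and the property value is passed in as a plain string rather than read from the Nutch configuration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of ID-based exclusion; names are not from the patch.
class DivIdExcluder {
    /** Splits the comma-delimited property value into trimmed, non-empty IDs. */
    static Set<String> parseIds(String value) {
        if (value == null) return Collections.emptySet(); // property unset: exclude nothing
        return Arrays.stream(value.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    /** Detaches every div whose id is listed, together with its children. */
    static void prune(Document doc, Set<String> ids) {
        NodeList divs = doc.getElementsByTagName("div");
        // The NodeList is live, so collect matches first and remove afterwards.
        List<Element> excluded = new ArrayList<>();
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (ids.contains(div.getAttribute("id"))) excluded.add(div);
        }
        for (Element div : excluded) {
            if (div.getParentNode() != null) div.getParentNode().removeChild(div);
        }
    }

    /** Returns the page text that would remain after the listed DIVs are dropped. */
    static String indexableText(String xhtml, String divIdsToExclude) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            prune(doc, parseIds(divIdsToExclude));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception ex) {
            throw new IllegalArgumentException("input is not well-formed XHTML", ex);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div id=\"main_menu_container\">Home Shop</div>"
                + "<div id=\"content\">Class rings</div>"
                + "<div id=\"legal\">Terms</div></body></html>";
        System.out.println(indexableText(page, "main_menu_container, legal"));
    }
}
```

Passing null for the property value leaves the document untouched, which corresponds to the backward-compatible behavior described in the list above.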


h4.Inclusion/growth:
- This code was written against nutch 1.2 and is backward compatible in that 
the new behavior is only present if configured.
- In future, it might be good to have different strategy patterns for how 
exclusions are determined; some might need algorithmic detection (whole web 
crawls), others might prefer jquery-selectors for HTML fragments, etc.



Best regards,

-h

Hira, N.R.  (Jostens, Inc.)

 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
 ---

 Key: NUTCH-585
 URL: https://issues.apache.org/jira/browse/NUTCH-585
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: All operating systems
Reporter: Andrea Spinelli
Priority: Minor
 Attachments: nutch-585-jostens-excludeDIVs.patch

