[jira] [Commented] (NUTCH-2806) Nutch can't parse links

2020-07-10 Thread Jorge Luis Betancourt Gonzalez (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17155785#comment-17155785
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2806:
---

Hi [~immobilier-dz] can you check the value of the {{db.ignore.external.links}} 
setting in your configuration? By default, it is set to false, which means that 
Nutch should be able to at least detect/add the external links for crawling in 
a future crawl. See 
[https://github.com/apache/nutch/blob/2.x/conf/nutch-default.xml#L498-L505]
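
To make the effect of the setting concrete, here is a minimal Java sketch, not Nutch code. The class {{ExternalLinkSketch}} and the method {{keepOutlink}} are hypothetical names, and comparing hosts via {{java.net.URI}} is an assumption made only for illustration; the actual check in Nutch may differ.

```java
import java.net.URI;

public class ExternalLinkSketch {

  // Hypothetical helper: decides whether an outlink survives when a setting
  // like db.ignore.external.links is applied. With the setting off (false),
  // every outlink is kept, including links to other hosts.
  static boolean keepOutlink(String fromUrl, String toUrl, boolean ignoreExternalLinks) {
    if (!ignoreExternalLinks) {
      return true;
    }
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    // Assumption: "external" means the two hosts differ (case-insensitive).
    return fromHost != null && fromHost.equalsIgnoreCase(toHost);
  }

  public static void main(String[] args) {
    String page = "https://www.example.com/index.html";
    System.out.println(keepOutlink(page, "https://other.example.org/a", false));
    System.out.println(keepOutlink(page, "https://other.example.org/a", true));
    System.out.println(keepOutlink(page, "https://www.example.com/b", true));
  }
}
```

If external links are missing from the outlinks, a value of true for this setting would be one possible explanation under this reading.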

Finally, keep in mind that it is normally best to send this type of inquiry to 
the users/developers mailing lists 
([https://nutch.apache.org/mailing_lists.html]).

> Nutch can't parse links 
> 
>
> Key: NUTCH-2806
> URL: https://issues.apache.org/jira/browse/NUTCH-2806
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4
>Reporter: lina dziri
>Priority: Major
> Fix For: 2.4
>
>
> Testing with the following site: 
> [https://www.algeriahome.com|https://www.algeriahome.com/], Nutch only parses 
> links that contain the base URL. 
>  I tried Tika as the parser, tried setting db.max.outlinks.per.page to -1, and tried 
> practically every comment about detecting all the links. I suspected urlfilter 
> or regex-normalizer, so it was disabled, but the results are the same. 
>  Each time I rebuild Nutch and test the parser, it gives the same URL count, 
> around 378. 
>  Can somebody help fix this?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-1749) Optionally exclude title from content field

2019-09-04 Thread Jorge Luis Betancourt Gonzalez (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922388#comment-16922388
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1749:
---

Do we want to put this into the upcoming release? I've added some comments to 
the PR, but I will take a closer look at a later time.

> Optionally exclude title from content field
> ---
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.7
>Reporter: Greg Padiasek
>Priority: Major
> Fix For: 1.16
>
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts document title into document content. Since 
> the title alone can be retrieved via DOMContentUtils.getTitle() and content 
> is retrieved via DOMContentUtils.getText(), there is no need to duplicate 
> title in the content. When title is included in the content it becomes 
> difficult/impossible to extract document body without title. A need to 
> extract document body without title is visible when user wants to index or 
> display body and title separately.
> Attached is a patch which prevents including title in document content in the 
> HTML parser plugin.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (NUTCH-2665) Upgrade to Apache Tika 1.19.1

2018-10-24 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662072#comment-16662072
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2665:
---

+1 [~markus17] I think it's safe to update the test.

> Upgrade to Apache Tika 1.19.1
> -
>
> Key: NUTCH-2665
> URL: https://issues.apache.org/jira/browse/NUTCH-2665
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Fix For: 2.4
>
> Attachments: NUTCH-2665.patch, NUTCH-2665.patch
>
>
> Borrowing from [~wastl-nagel]'s efforts on NUTCH-2651, 2.x can be upgraded to 
> Apache Tika 1.19.1 as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work started] (NUTCH-2663) Improve index-jexl-filter syntax for scripts

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2663 started by Jorge Luis Betancourt Gonzalez.
-
> Improve index-jexl-filter syntax for scripts
> 
>
> Key: NUTCH-2663
> URL: https://issues.apache.org/jira/browse/NUTCH-2663
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.16
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> JEXL scripts need to be written using the array syntax to get the actual 
> value (for instance, example extracted from the tests):
> {code}
> doc.lang[0]=='en'
> {code}
> Ideally, this would only be required if the actual value is really an array, 
> and not for single value elements.





[jira] [Created] (NUTCH-2663) Improve index-jexl-filter syntax for scripts

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2663:
-

 Summary: Improve index-jexl-filter syntax for scripts
 Key: NUTCH-2663
 URL: https://issues.apache.org/jira/browse/NUTCH-2663
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.16
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez


JEXL scripts need to be written using the array syntax to get the actual value 
(for instance, example extracted from the tests):

{code}
doc.lang[0]=='en'
{code}

Ideally, this would only be required if the actual value is really an array, 
and not for single value elements.
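
One way to picture the proposed behaviour is the following Java sketch. It does not use the JEXL API or the plugin's real code: {{JexlContextSketch}} and {{flatten}} are hypothetical names, and the input is assumed to be a plain map from field name to a list of values. Single-valued fields are unwrapped so that an expression could compare {{doc.lang}} directly, while multi-valued fields stay lists.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JexlContextSketch {

  // Hypothetical helper: builds the map that would be exposed to the script.
  // A field with exactly one value is exposed as that value; any other field
  // (zero or several values) is exposed as the original list.
  static Map<String, Object> flatten(Map<String, List<Object>> fields) {
    Map<String, Object> context = new LinkedHashMap<>();
    for (Map.Entry<String, List<Object>> entry : fields.entrySet()) {
      List<Object> values = entry.getValue();
      context.put(entry.getKey(), values.size() == 1 ? values.get(0) : values);
    }
    return context;
  }

  public static void main(String[] args) {
    Map<String, List<Object>> fields = new LinkedHashMap<>();
    fields.put("lang", Arrays.<Object>asList("en"));
    fields.put("tags", Arrays.<Object>asList("news", "sports"));
    System.out.println(flatten(fields));
  }
}
```

The trade-off of this approach is that the type of a field then depends on the data, so scripts that sometimes see one value and sometimes several would need to handle both shapes.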





[jira] [Commented] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655180#comment-16655180
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2662:
---

[~yossi] Yes, I know :). At the time, I didn't notice that the actual 
recommendation (the message shown to the user) is to pick an expression 
(true/false), which can be done automatically while logging the default value. The 
second validation, which checks whether the expression is syntactically correct, 
is still valid.

> index-jexl-filter plugin throws a RuntimeException if its enabled but not 
> configured
> 
>
> Key: NUTCH-2662
> URL: https://issues.apache.org/jira/browse/NUTCH-2662
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.16
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> If the index-jexl-filter plugin is enabled but no configuration is provided 
> in the {{index.jexl.filter}} property the plugin throws a RuntimeException. 
> In the same exception message, we advise to either set true or false to index 
> all/none. 
> This is a case where we can just select a sane default and log a warning, but 
> not stop the entire process. I think this is more consistent with how we 
> approach configuration in general: Only fail if there is an actual error in 
> the configuration (i.e., a parse error in the expression).





[jira] [Created] (NUTCH-2662) index-jexl-filter plugin throws a RuntimeException if its enabled but not configured

2018-10-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2662:
-

 Summary: index-jexl-filter plugin throws a RuntimeException if its 
enabled but not configured
 Key: NUTCH-2662
 URL: https://issues.apache.org/jira/browse/NUTCH-2662
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.16
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez


If the index-jexl-filter plugin is enabled but no configuration is provided in 
the {{index.jexl.filter}} property the plugin throws a RuntimeException. In the 
same exception message, we advise to either set true or false to index 
all/none. 

This is a case where we can just select a sane default and log a warning, but 
not stop the entire process. I think this is more consistent with how we 
approach configuration in general: Only fail if there is an actual error in the 
configuration (i.e., a parse error in the expression).
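
A minimal Java sketch of the fallback described above, under stated assumptions: {{JexlFilterDefaultSketch}} and {{resolveExpression}} are hypothetical names, the choice of {{true}} (index everything) as the default is an assumption since the issue does not say which value to pick, and {{java.util.logging}} stands in for whatever logger the plugin actually uses.

```java
import java.util.logging.Logger;

public class JexlFilterDefaultSketch {

  private static final Logger LOG =
      Logger.getLogger(JexlFilterDefaultSketch.class.getName());

  // Assumption: "true" (index all documents) is the sane default.
  static final String DEFAULT_EXPRESSION = "true";

  // Hypothetical helper: returns the configured expression, or the default
  // plus a warning when the property is missing or blank. It never throws
  // for a missing value; syntax errors would still be reported elsewhere.
  static String resolveExpression(String configured) {
    if (configured == null || configured.trim().isEmpty()) {
      LOG.warning("index.jexl.filter is not set; falling back to '"
          + DEFAULT_EXPRESSION + "'");
      return DEFAULT_EXPRESSION;
    }
    return configured.trim();
  }

  public static void main(String[] args) {
    System.out.println(resolveExpression(null));
    System.out.println(resolveExpression("doc.lang[0]=='en'"));
  }
}
```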





[jira] [Created] (NUTCH-2661) Move TestOutlinks to the proper path

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2661:
-

 Summary: Move TestOutlinks to the proper path
 Key: NUTCH-2661
 URL: https://issues.apache.org/jira/browse/NUTCH-2661
 Project: Nutch
  Issue Type: Improvement
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
 Fix For: 1.16


Initially, I placed the {{TestOutlinks}} class in the index-links plugin, 
although this was when I found the bug with the {{hashCode}}. I now realise 
that this test belongs in the {{test/org/apache/nutch/nutch/parse}} 
directory. 

Even more so since this test does not cover any plugin-specific code.





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653495#comment-16653495
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2658:
---

[~wastl-nagel] exactly what I was thinking. Right now in order to configure a 
given plugin you need to look at the nutch-default.xml to see what options are 
available, and read the documentation from there. If it's an indexing plugin 
you need to check the schema, or in the worst case the actual code to figure 
out what fields are going to be added. 

I think that at least these 2 components should be made more visible to the 
users. The advantage of the README is that it lives right next to the code, so it's 
easier to "remember" to update it.

[~yossi] I agree that having the documentation also on the Wiki is very helpful, 
and the README is not intended to replace that.

+1 on generating the wiki from the README (or something else); this will at 
least guarantee that it is updated with each release. 

We can also add a check/step to the release procedure to check if any new 
plugins have been added and if the README is there. Of course, there is always 
the risk that the README contains dummy/not useful data. But through PRs we can 
keep an eye on that.

As a side note, I kind of like how Elasticsearch has its documentation 
versioned and updated per release. Not sure how to integrate this with our wiki.

> Add README file to all plugins in src/plugin
> 
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Trivial
>
> Since we've migrated a good portion of our workflow to GitHub, we could 
> consider adding a {{README.md}} file to the root of each plugin in 
> {{src/plugin}}. 
> This is a good place to have plugin-specific documentation: which fields the 
> plugin adds to the indexer, which configuration options it has, etc. Also, since the 
> README.md is rendered by GitHub automatically, it is a good link to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
> that, it's a good source of information to point users to when they ask questions 
> regarding a specific plugin.





[jira] [Commented] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653252#comment-16653252
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2658:
---

I'm thinking of having at least 2 general sections:

* Configuration: Covers all parameters that are included in the 
nutch-default.xml (although could be a bit of a repetition)
* Fields: Includes information about which fields should be added to your 
storage backend configuration (if applicable). 

Including documentation on how to configure Solr fields would be a nice default 
configuration, although we support different backends.



> Add README file to all plugins in src/plugin
> 
>
> Key: NUTCH-2658
> URL: https://issues.apache.org/jira/browse/NUTCH-2658
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Trivial
>
> Since we've migrated a good portion of our workflow to GitHub, we could 
> consider adding a {{README.md}} file to the root of each plugin in 
> {{src/plugin}}. 
> This is a good place to have plugin-specific documentation: which fields the 
> plugin adds to the indexer, which configuration options it has, etc. Also, since the 
> README.md is rendered by GitHub automatically, it is a good link to point users to.
> I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
> that, it's a good source of information to point users to when they ask questions 
> regarding a specific plugin.





[jira] [Created] (NUTCH-2658) Add README file to all plugins in src/plugin

2018-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2658:
-

 Summary: Add README file to all plugins in src/plugin
 Key: NUTCH-2658
 URL: https://issues.apache.org/jira/browse/NUTCH-2658
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, plugin
Reporter: Jorge Luis Betancourt Gonzalez


Since we've migrated a good portion of our workflow to GitHub, we could consider 
adding a {{README.md}} file to the root of each plugin in {{src/plugin}}. 

This is a good place to have plugin-specific documentation: which fields the 
plugin adds to the indexer, which configuration options it has, etc. Also, since the 
README.md is rendered by GitHub automatically, it is a good link to point users to.

I think that a good example is the {{indexer-cloudsearch}} plugin. On top of 
that, it's a good source of information to point users to when they ask questions 
regarding a specific plugin.





[jira] [Commented] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408075#comment-16408075
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2541:
---

I've tested against master, I still see the same issue:

{code}
➜  local (master) ✔ bin/nutch parsechecker 
"http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html"
fetching: 
http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html
robots.txt whitelist not configured.
Fetch failed with protocol status: exception(16), lastModified=0: 
java.lang.IllegalArgumentException: Invalid uri 
'http://agahi.safirak.com/ads/850/پیچ-بند-بادی-هفتیری-1800-دور-بادی-جیسون.html':
 escaped absolute path not valid
{code}


> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters Nutch will complain due 
> to an {{IllegalArgumentException}}. This happens because the HTTP client 
> library internally uses {{java.net.URI}}, which does not support these 
> characters unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Comment Edited] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407880#comment-16407880
 ] 

Jorge Luis Betancourt Gonzalez edited comment on NUTCH-2541 at 3/21/18 3:18 PM:


Normal characters (like a space) in the URL path {{/some other}} (the space) 
are escaped without any issues.


was (Author: jorgelbg):
Normal characters (like a space) in the URL path {{/some other}} the space is 
escaped without any issues.

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters Nutch will complain due 
> to an {{IllegalArgumentException}}. This happens because the HTTP client 
> library internally uses {{java.net.URI}}, which does not support these 
> characters unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Commented] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407880#comment-16407880
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2541:
---

Normal characters (like a space) in the URL path {{/some other}} the space is 
escaped without any issues.

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters Nutch will complain due 
> to an {{IllegalArgumentException}}. This happens because the HTTP client 
> library internally uses {{java.net.URI}}, which does not support these 
> characters unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Updated] (NUTCH-2541) Arabic characters in the URL path are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2541:
--
Summary: Arabic characters in the URL path are not properly escaped by the 
protocol-httpclient plugin  (was: Arabic characters in the URL are not properly 
escaped by the protocol-httpclient plugin)

> Arabic characters in the URL path are not properly escaped by the 
> protocol-httpclient plugin
> 
>
> Key: NUTCH-2541
> URL: https://issues.apache.org/jira/browse/NUTCH-2541
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Affects Versions: 2.3.1, 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Major
>
> As reported on [1] 
> When trying to crawl some URLs with Arabic characters Nutch will complain due 
> to an {{IllegalArgumentException}}. This happens because the HTTP client 
> library internally uses {{java.net.URI}}, which does not support these 
> characters unless they're properly escaped.
> [1] 
> https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225





[jira] [Created] (NUTCH-2541) Arabic characters in the URL are not properly escaped by the protocol-httpclient plugin

2018-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2541:
-

 Summary: Arabic characters in the URL are not properly escaped by 
the protocol-httpclient plugin
 Key: NUTCH-2541
 URL: https://issues.apache.org/jira/browse/NUTCH-2541
 Project: Nutch
  Issue Type: Bug
  Components: plugin, protocol
Affects Versions: 1.14, 2.3.1
Reporter: Jorge Luis Betancourt Gonzalez


As reported on [1] 

When trying to crawl some URLs with Arabic characters Nutch will complain due 
to an {{IllegalArgumentException}}. This happens because the HTTP client 
library internally uses {{java.net.URI}}, which does not support these 
characters unless they're properly escaped.

[1] 
https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225
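
For reference, a small Java sketch of what "properly escaped" can look like with {{java.net.URI}} alone. This is not the fix applied in Nutch and not the plugin's code: {{UriEscapeSketch}} and {{escapePath}} are hypothetical names, and the host and paths are placeholders. The multi-argument {{URI}} constructor quotes illegal characters such as spaces, and {{toASCIIString()}} percent-encodes non-ASCII characters as UTF-8.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEscapeSketch {

  // Hypothetical helper: builds a URI from unescaped parts and returns an
  // ASCII-only string. The path must start with '/'.
  static String escapePath(String scheme, String host, String path) {
    try {
      // The constructor quotes illegal characters (e.g. space becomes %20);
      // toASCIIString() additionally percent-encodes non-ASCII characters.
      return new URI(scheme, host, path, null).toASCIIString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Cannot escape path: " + path, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(escapePath("http", "example.org", "/some other"));
    System.out.println(escapePath("http", "example.org", "/ads/\u067e\u06cc\u0686.html"));
  }
}
```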





[jira] [Commented] (NUTCH-2506) host is not available for filtering on the JEXL indexing plugin

2018-01-30 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345135#comment-16345135
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2506:
---

If the {{index-basic}} plugin is enabled, then the host is available under 
{{doc.host}}. The question is whether we should add it as a default option for 
filtering or rely on the combination of {{index-basic}} + 
{{index-jexl-filter}}. In this case, I'm good with either option; I don't think a 
lot of people normally disable the {{index-basic}} plugin. But while using the 
plugin, it took me a second to notice it. We can also update the documentation for 
the plugin to indicate at least what is usually available under the {{doc}} key 
(if {{index-basic}} is enabled). @sebastian nagel

> host is not available for filtering on the JEXL indexing plugin
> ---
>
> Key: NUTCH-2506
> URL: https://issues.apache.org/jira/browse/NUTCH-2506
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.14
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.15
>
>
> The {{host}} attribute is not available for filtering on the 
> {{JexlIndexingFilter}}. Take a look at the documentation on 
> https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1653-L1667.
>  
> This could be quite useful, although the {{url}}.





[jira] [Created] (NUTCH-2506) host is not available for filtering on the JEXL indexing plugin

2018-01-30 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2506:
-

 Summary: host is not available for filtering on the JEXL indexing 
plugin
 Key: NUTCH-2506
 URL: https://issues.apache.org/jira/browse/NUTCH-2506
 Project: Nutch
  Issue Type: Bug
  Components: indexer, plugin
Affects Versions: 1.14
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
 Fix For: 1.15


The {{host}} attribute is not available for filtering on the 
{{JexlIndexingFilter}}. Take a look at the documentation on 
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1653-L1667. 

This could be quite useful, although the {{url}}.





[jira] [Updated] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed

2017-11-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2464:
--
Affects Version/s: (was: 2.3)
   1.13

> Headers That Contain HTML Elements Are Not Parsed
> -
>
> Key: NUTCH-2464
> URL: https://issues.apache.org/jira/browse/NUTCH-2464
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.13
> Environment: Internal development/test environments.
>Reporter: Cass Pallansch
> Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained 
> within header elements (e.g., H1, H2, H3, etc. tags).  Many times there are 
> anchors and/or  tags within these elements that contain the actual text 
> nodes that should be picked up as the header value for indexing purposes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2464) Headers That Contain HTML Elements Are Not Parsed

2017-11-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262795#comment-16262795
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2464:
---

I've tested this on master and I can reproduce the bug, but you reported that 
this was happening on the {{2.3}} version which doesn't ship with the headings 
plugin. 

Using the attached example by [~wastl-nagel] the extracted metadata is:

{code}
Parse Metadata: ... h1=header with
{code}

The problem is that the method is stopping after finding the first 
{{TEXT_NODE}}. So we get a truncated title in this case. One option would be to 
just allow it to continue traversing the DOM tree. Your patch only needed one 
more tweak:

{code}
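for (int i = 0; i < children.getLength(); i++) {
  if (children.item(i).getNodeType() == Node.TEXT_NODE) {
    // Plain text directly under the current node.
    buffer.append(children.item(i).getNodeValue());
  } else {
    // Nested element (e.g. an anchor): recurse to collect its text too.
    buffer.append(getNodeValue(children.item(i)));
  }
}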
{code}

We could also move this to use the {{NodeWalker}}, as suggested by [~wastl-nagel].
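
To try the recursive traversal outside of Nutch, here is a self-contained Java sketch using only the JDK's DOM parser. {{HeadingTextSketch}} and {{headingText}} are hypothetical names, the input is assumed to be well-formed XML (the real plugin works on the DOM produced by the HTML parser), and the sample markup is made up rather than taken from the attached file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class HeadingTextSketch {

  // Collects the text of all descendants instead of stopping at the first
  // TEXT_NODE, mirroring the loop shown in the comment above.
  static String getNodeValue(Node node) {
    StringBuilder buffer = new StringBuilder();
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node child = children.item(i);
      if (child.getNodeType() == Node.TEXT_NODE) {
        buffer.append(child.getNodeValue());
      } else {
        buffer.append(getNodeValue(child));
      }
    }
    return buffer.toString();
  }

  // Hypothetical helper: parses a well-formed fragment and returns the text
  // of its root element.
  static String headingText(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return getNodeValue(doc.getDocumentElement());
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse input", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(headingText("<h1>header with <a href=\"#\">a link</a> inside</h1>"));
  }
}
```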

> Headers That Contain HTML Elements Are Not Parsed
> -
>
> Key: NUTCH-2464
> URL: https://issues.apache.org/jira/browse/NUTCH-2464
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 2.3
> Environment: Internal development/test environments.
>Reporter: Cass Pallansch
> Attachments: NUTCH-2464-complex-header.html
>
>
> Nutch does not appear to traverse the HTML elements that may be contained 
> within header elements (e.g., H1, H2, H3, etc. tags).  Many times there are 
> anchors and/or  tags within these elements that contain the actual text 
> nodes that should be picked up as the header value for indexing purposes.





[jira] [Created] (NUTCH-2462) Cleanup Tika Boilerpipe patch

2017-11-15 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2462:
-

 Summary: Cleanup Tika Boilerpipe patch
 Key: NUTCH-2462
 URL: https://issues.apache.org/jira/browse/NUTCH-2462
 Project: Nutch
  Issue Type: Improvement
  Components: parser, plugin
Affects Versions: 2.4
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
Priority: Trivial


* Remove unused imports and some generic ({{.*}}) imports
* Apply the formatting rules
* Refactor configuration variables into the {{setConf}} method (for 
consistency)





[jira] [Commented] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-19 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211065#comment-16211065
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2443:
---

It's not hard to add more tags, but honestly I'm seeing a lot of those tags 
with URL-valued attributes for the first time. The question is: should we have them 
_all_ in the actual implementation? 

> Extract links from the video tag with the parse-html plugin
> ---
>
> Key: NUTCH-2443
> URL: https://issues.apache.org/jira/browse/NUTCH-2443
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.13
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> At the moment the {{parse-html}} plugin extracts links from the tags {{a, area, 
> form}} (configurable){{, frame, iframe, script, link, img}}. Since we allow 
> extracting links to binary files (images), extracting links from the 
> {{video}} tag should also be supported.





[jira] [Created] (NUTCH-2443) Extract links from the video tag with the parse-html plugin

2017-10-17 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2443:
-

 Summary: Extract links from the video tag with the parse-html 
plugin
 Key: NUTCH-2443
 URL: https://issues.apache.org/jira/browse/NUTCH-2443
 Project: Nutch
  Issue Type: Improvement
  Components: parser, plugin
Affects Versions: 1.13
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
Priority: Minor
 Fix For: 1.14


At the moment the {{parse-html}} plugin extracts links from the tags {{a, area, form}} 
(configurable){{, frame, iframe, script, link, img}}. Since we allow extracting 
links to binary files (images), extracting links from the {{video}} tag 
should also be supported.
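
A rough Java sketch of the idea, using only the JDK's DOM parser rather than the plugin's own code. {{VideoOutlinkSketch}}, {{extractLinks}} and the tag-to-attribute table are hypothetical; handling the nested {{source}} tag is an extra assumption not stated in this issue, and the input is assumed to be well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class VideoOutlinkSketch {

  // Hypothetical table: tag name -> attribute that holds the link.
  static final Map<String, String> LINK_ATTRS = new HashMap<>();
  static {
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("img", "src");
    LINK_ATTRS.put("video", "src");
    LINK_ATTRS.put("source", "src"); // assumption: children of <video>
  }

  // Walks the tree in document order and records every known link attribute.
  static void collect(Node node, List<String> links) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element element = (Element) node;
      String attr = LINK_ATTRS.get(element.getTagName().toLowerCase(Locale.ROOT));
      if (attr != null && element.hasAttribute(attr)) {
        links.add(element.getAttribute(attr));
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), links);
    }
  }

  static List<String> extractLinks(String xhtml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
      List<String> links = new ArrayList<>();
      collect(doc.getDocumentElement(), links);
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not parse input", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(extractLinks(
        "<div><a href=\"/page\">p</a><video src=\"/movie.mp4\"/></div>"));
  }
}
```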





[jira] [Commented] (NUTCH-2424) Mirror git repository to gitlab.com

2017-10-09 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196875#comment-16196875
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2424:
---

I really like GitLab, but for us there is no real value in adding yet another 
mirror on GitLab just to use GitLab CI. We use the infrastructure provided by 
the ASF (based on Jenkins), https://builds.apache.org/job/Nutch-trunk/, and for us 
this is enough at the moment. We use GitHub because of the features/visibility that 
it provides and, honestly, I don't see GitLab offering anything that 
we don't already have at the moment with GitHub + Jenkins.

> Mirror git repository to gitlab.com
> ---
>
> Key: NUTCH-2424
> URL: https://issues.apache.org/jira/browse/NUTCH-2424
> Project: Nutch
>  Issue Type: Task
>Reporter: Karl Richter
>
> GitLab is a free (as in speech) code hosting platform which has a continuous 
> integration service and provides merge requests. An instance is provided at 
> gitlab.com or a project can host its own instance.
> The long term goal is to get a CI service working for Nutch. I already 
> provided a `.gitlab-ci.yml` at 
> https://gitlab.com/krichter/nutch/merge_requests/1 which reveals failure of 
> crucial build commands, so this is definitely useful and necessary (and at 
> https://gitlab.com/krichter/nutch/merge_requests/2 for Maven-based builds 
> based on branch NUTCH-2292 and the corresponding issue).





[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2017-09-04 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152957#comment-16152957
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1480:
---

[~markus17] do you mind taking a look at the linked PR? I think the PR covers
more than the original intent of this issue. Since you've already worked on
something similar, I think your input would be really valuable in this case.

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Attachments: adding-support-for-sharding-indexer-for-solr.patch, 
> NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma-delimited list of URLs using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.
> edit:
> This does not replace NUTCH-1377 but complements it. With NUTCH-1377 this 
> issue allows you to index to multiple SolrCloud clusters at the same time.
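
The behaviour described in the issue can be sketched roughly as follows. This is a Python illustration, not the SolrUtils/SolrWriter code: {{parse_solr_urls}}, {{write_to_all}} and the {{send}} callable are hypothetical names, and the real patch may handle errors and batching differently.

```python
def parse_solr_urls(value):
    """Split a comma-delimited property value into a clean list of URLs."""
    return [url.strip() for url in value.split(",") if url.strip()]


def write_to_all(document, urls, send):
    """Send the same document to every configured server.

    `send` is a hypothetical callable (url, document) -> result standing in
    for whatever client actually talks to Solr.
    """
    return [send(url, document) for url in urls]
```

Every server receives every document, which matches the "no replication available" use case from the description rather than sharding.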





[jira] [Assigned] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-08-31 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez reassigned NUTCH-2415:
-

Assignee: Jorge Luis Betancourt Gonzalez

> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.





[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144357#comment-16144357
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
---

[~yossi] I think that [~markus.jel...@openindex.io] is suggesting implementing
a generic {{IndexingFilter}} that supports JEXL expressions. This way we don't
need to modify every possible {{IndexingFilter}}, which will be easier to
maintain in the long run and provides a better separation.
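
To make the "one generic filter" idea a bit more concrete, here is a small Python sketch. Plain callables stand in for JEXL expressions, and the document fields {{lang}} and {{type}} as well as all function names are assumptions for this example, not Nutch field names or APIs.

```python
def make_language_rule(allowed):
    """Rule that accepts documents whose language is in `allowed`."""
    return lambda doc: doc.get("lang") in allowed


def make_mimetype_rule(blocked):
    """Rule that rejects documents whose MIME type is in `blocked`."""
    return lambda doc: doc.get("type") not in blocked


def filter_documents(docs, rules):
    """Keep only the documents accepted by every configured rule."""
    return [doc for doc in docs if all(rule(doc) for rule in rules)]
```

The point of the design is that the language check and the MIME type check live in the same place as configuration, instead of each plugin growing its own allow/block logic.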

> Allow LanguageIndexingFilter to actually filter documents by language.
> --
>
> Key: NUTCH-2414
> URL: https://issues.apache.org/jira/browse/NUTCH-2414
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those 
> languages that we intend to search in). At first glance it seems that this is 
> done by LanguageIndexingFilter, but currently all the filter does is add the 
> language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow 
> it to only index languages specified in this property.





[jira] [Commented] (NUTCH-2414) Allow LanguageIndexingFilter to actually filter documents by language.

2017-08-28 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144264#comment-16144264
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2414:
---

+1 This would also help to deprecate the {{mimetype-filter}} plugin and avoid
having the responsibility of allowing/blocking documents (from being indexed)
scattered across several plugins.



[jira] [Commented] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-06-07 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16040881#comment-16040881
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2392:
---

In this case, Nutch is detecting a relative URL and doing the work to make it
"fetchable", which means turning it into a full URL. But you'll find the same
issue not only with relative URLs: you can also find totally different URLs
with the same content thanks to the "magic" of some CMS. One case that I've
found quite often is the presence/lack of {{index.php}} in some URLs with
exactly the same content. I've also found this issue with OCS (Open Conference
Systems) https://pkp.sfu.ca/ocs/.

Can you provide the exact URLs that you've found? Are both URLs being indexed
in Solr? Even if both URLs are being fetched, they should be deduplicated later
on. Even if the URLs are totally different, they should have the same
signature/digest, calculated from the extracted text; see
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/TextMD5Signature.java.

The problem is that you need to actually fetch/parse the URLs to be able to
know that they are duplicates; we need to assume that both URLs are different
until proven otherwise :).
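
The two mechanisms discussed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: {{normalize_path}}, {{text_signature}} and {{find_duplicates}} are hypothetical helpers, the MD5-over-text digest merely mirrors the idea behind the linked signature class, and whether a given Nutch setup collapses {{../}} segments depends on the URL normalizers configured, which this thread does not show.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_path(url):
    """Collapse "." and ".." path segments so equivalent URLs compare equal.

    Illustrative only: posixpath.normpath also drops a trailing slash,
    which a real URL normalizer would have to treat more carefully.
    """
    parts = urlsplit(url)
    path = posixpath.normpath(parts.path) if parts.path else "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def text_signature(text):
    """Digest of the extracted text, usable as a duplicate key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """pages maps url -> extracted text; returns groups of URLs sharing a digest."""
    by_signature = {}
    for url, text in pages.items():
        by_signature.setdefault(text_signature(text), []).append(url)
    return [urls for urls in by_signature.values() if len(urls) > 1]
```

Normalization can only catch the cases visible in the URL itself, before fetching. The {{index.php}} case needs the digest, which is only available after the page has been fetched and parsed.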

> Get same pages multiple times if URL contains relative path
> ---
>
> Key: NUTCH-2392
> URL: https://issues.apache.org/jira/browse/NUTCH-2392
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl
>Affects Versions: 1.13
> Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1
>Reporter: Jayesh Shende
>Priority: Critical
>  Labels: features
> Fix For: 1.14
>
>   Original Estimate: 60h
>  Remaining Estimate: 60h
>
> When a website links to the same HTML document through relative URLs on 
> different pages, Nutch fetches it more than once. For example, at the first 
> depth I fetched the contents of the page http://example.com/index.html; 
> after a few depths I got a link (constructed by Nutch from some relative 
> path pattern in some anchor tag) 
> http://example.com/Level1/Level2/../../index.html. In this case Nutch is 
> fetching the same HTML document twice, considering both URLs to be 
> different when they are not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2353) Create seed file with metadata using the REST API

2017-01-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2353:
--
Description: 
At the moment it's not possible to create a seed file and specify any metadata
when using the REST API. The file gets created but there is no option to add
any metadata to the seed URLs.

If we use a payload like this:

{code}
{
    "name": "name-of-seedlist",
    "seedUrls": [
        {
            "url": "http://example.com",
            "metadata": {
                "key1": "value1",
                "key2": "value2",
                "key3": "value3"
            }
        }
    ]
}
{code}

It should be easy to specify the desired metadata. Also, this should keep
backward compatibility with the previous array syntax if we only want to
specify the list of URLs without any metadata at all.

  was:
At the moment it's not possible to create a seed file and specify any metadata
when using the REST API. The file gets created but there is no option to add
any metadata to the seed URLs.

If we use a payload like this:

{code}
{
    "name": "name-of-seedlist",
    "seedUrls": [
        {
            "url": "http://example.com",
            "metadata": {
                "key1": "value1",
                "key2": "value2",
                "key3": "value3"
            }
        }
    ]
}
{code}

It should be easy to specify the desired metadata.


> Create seed file with metadata using the REST API
> -
>
> Key: NUTCH-2353
> URL: https://issues.apache.org/jira/browse/NUTCH-2353
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: rest_api
> Fix For: 1.13
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2353) Create seed file with metadata using the REST API

2017-01-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2353:
-

 Summary: Create seed file with metadata using the REST API
 Key: NUTCH-2353
 URL: https://issues.apache.org/jira/browse/NUTCH-2353
 Project: Nutch
  Issue Type: Improvement
  Components: injector, REST_api
Affects Versions: 1.12
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
Priority: Minor
 Fix For: 1.13


At the moment it's not possible to create a seed file and specify any metadata
when using the REST API. The file gets created but there is no option to add
any metadata to the seed URLs.

If we use a payload like this:

{code}
{
    "name": "name-of-seedlist",
    "seedUrls": [
        {
            "url": "http://example.com",
            "metadata": {
                "key1": "value1",
                "key2": "value2",
                "key3": "value3"
            }
        }
    ]
}
{code}

It should be easy to specify the desired metadata.
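
A minimal Python sketch of how such a payload could be turned into seed file lines is shown below. It assumes the seed file uses one URL per line followed by tab-separated {{key=value}} pairs, and it also accepts plain URL strings so that a metadata-free list keeps working. The function name {{seed_lines}} is hypothetical, and this is not the REST API implementation.

```python
import json


def seed_lines(payload):
    """Convert a seed list payload (JSON text) into seed file lines."""
    data = json.loads(payload)
    lines = []
    for entry in data["seedUrls"]:
        if isinstance(entry, str):
            # Plain URL string: no metadata to append.
            lines.append(entry)
        else:
            metadata = entry.get("metadata", {})
            fields = [entry["url"]] + [f"{key}={value}" for key, value in metadata.items()]
            lines.append("\t".join(fields))
    return lines
```

Accepting both shapes in the same list is one simple way to stay compatible with existing clients; rejecting mixed lists would be an equally valid choice.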





[jira] [Commented] (NUTCH-2171) Upgrade Nutch Trunk to Java 1.8

2016-01-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113192#comment-15113192
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2171:
---

Perhaps an approach using Checkstyle, combined with this recipe
http://www.puppycrawl.com/blog/2015/09/03/checkstyle-force-lambdas.html, could
help us move forward. This could address at least the code analysis part.

> Upgrade Nutch Trunk to Java 1.8
> ---
>
> Key: NUTCH-2171
> URL: https://issues.apache.org/jira/browse/NUTCH-2171
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>
> Lambda expressions are fantastic. I tried to undertake a small exercise which 
> would indicate how many we could implement however this was a fruitless 
> effort. A patch is going to be a better approach. This task involves 
> upgrading various properties in default.properties as well as a systemic 
> source code analysis with the aim of implementing Java 8 goodies throughout.





[jira] [Created] (NUTCH-2146) hashCode on the Outlink class

2015-10-20 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2146:
-

 Summary: hashCode on the Outlink class
 Key: NUTCH-2146
 URL: https://issues.apache.org/jira/browse/NUTCH-2146
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor


The {{Outlink}} class doesn't have a {{hashCode}} method. This doesn't cause
any trouble with the already implemented plugins, but if a developer tries to
use a {{HashSet}} of outlinks in a custom plugin, {{Outlink}} instances with
the same data (toUrl, anchor) get added several times. In contrast, the
{{Inlink}} class does have a {{hashCode}} method:

https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Inlink.java#L75-L77
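
The effect can be reproduced outside of Nutch. The sketch below is a Python analogue, not the Java class: {{OutlinkIdentity}} defines equality on (to_url, anchor) but keeps the identity-based hash, mimicking a Java class that relies on the default {{Object.hashCode()}}, while {{OutlinkValue}} hashes the same fields it compares. Both class names are made up for this example, and it assumes equality is defined on the data as the description implies.

```python
class OutlinkIdentity:
    """Equal by (to_url, anchor), but hashed by object identity."""

    def __init__(self, to_url, anchor):
        self.to_url = to_url
        self.anchor = anchor

    def __eq__(self, other):
        return (
            isinstance(other, OutlinkIdentity)
            and (self.to_url, self.anchor) == (other.to_url, other.anchor)
        )

    # Keep the identity-based hash explicitly, like a class that does not
    # override Java's default Object.hashCode().
    __hash__ = object.__hash__


class OutlinkValue(OutlinkIdentity):
    """Same equality, but the hash is derived from the compared fields."""

    def __hash__(self):
        return hash((self.to_url, self.anchor))


duplicates_kept = {
    OutlinkIdentity("http://example.com", "home"),
    OutlinkIdentity("http://example.com", "home"),
}
duplicates_merged = {
    OutlinkValue("http://example.com", "home"),
    OutlinkValue("http://example.com", "home"),
}
```

With the identity hash the set keeps both equal instances, while the value-based hash collapses them into one, which is the behaviour a {{hashCode}} consistent with {{equals}} would give a {{HashSet}} of outlinks.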
 





[jira] [Commented] (NUTCH-2146) hashCode on the Outlink class

2015-10-20 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965934#comment-14965934
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2146:
---

For instance, this simple JUnit test fails with the current implementation of
the {{Outlink}} class:
https://github.com/jorgelbg/nutch/blob/NUTCH-2146/src/plugin/index-links/src/test/org/apache/nutch/parse/TestOutlinks.java#L40-L53



[jira] [Updated] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2139:
--
External issue ID: https://github.com/apache/nutch/pull/78

> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: link, plugin
> Fix For: 1.11
>
>
> Basic plugin that allows indexing the inlinks and outlinks of web pages. 
> This could be very useful for analytics purposes, including, for instance, 
> neat visualizations using d3.js.





[jira] [Created] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-13 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2139:
-

 Summary: Basic plugin to index inlinks and outlinks
 Key: NUTCH-2139
 URL: https://issues.apache.org/jira/browse/NUTCH-2139
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
 Fix For: 1.11


Basic plugin that allows indexing the inlinks and outlinks of web pages. This
could be very useful for analytics purposes, including, for instance, neat
visualizations using d3.js.





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902600#comment-14902600
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

Committed the updated test

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
> Attachments: NUTCH-2095.patch
>
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are 
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the default 
> JACKSON formatter is used.
> {{-warcSize}}: defines a maximum file size for each WARC file; if not 
> specified, a default of 1GB per file is used, as recommended by the WARC ISO 
> standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes were made to the default {{CommonCrawlDataDumper}}, essentially 
> to the Factory and to the Formats. These changes avoid creating a new 
> instance of a {{CommonCrawlFormat}} for each URL read from the segments.





[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902784#comment-14902784
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

Updated the CHANGES.txt file



[jira] [Commented] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902737#comment-14902737
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2095:
---

[~jnioche] Nice catch! 

Will do! But I didn't get the same behavior locally; I can't even find Guava
v17 on the university Maven mirror.



[jira] [Updated] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2095:
--
Attachment: NUTCH-2095.patch



[jira] [Updated] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2095:
--
Attachment: (was: NUTCH-2095.patch)

> WARC exporter for the CommonCrawlDataDumper
> ---
>
> Key: NUTCH-2095
> URL: https://issues.apache.org/jira/browse/NUTCH-2095
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, tool
>Affects Versions: 1.11
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: tools, warc
>


[jira] [Commented] (NUTCH-2102) WARC Exporter

2015-09-18 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875754#comment-14875754
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2102:
---

+1 It looks good, the nutch entry will definitely make it easier to use :)

> WARC Exporter
> -
>
> Key: NUTCH-2102
> URL: https://issues.apache.org/jira/browse/NUTCH-2102
> Project: Nutch
>  Issue Type: Improvement
>  Components: commoncrawl, dumpers
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Attachments: NUTCH-2102.patch
>
>
> This patch adds a WARC exporter 
> [http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf]. Unlike 
> the code submitted in [https://github.com/apache/nutch/pull/55] which is 
> based on the CommonCrawlDataDumper, this exporter is a MapReduce job and 
> hence should be able to cope with large segments in a timely fashion and also 
> is not limited to the local file system.
> Later on we could have a WARCImporter to generate segments from WARC files, 
> which is outside the scope of the CCDD anyway. Also WARC is not specific to 
> CommonCrawl, which is why the package name does not reflect it.
> I don't think it would be a problem to have both the modified CCDD and this 
> class providing similar functionalities.
> This class is called in the following way 
> ./nutch org.apache.nutch.tools.warc.WARCExporter 
> /data/nutch-dipe/1kcrawl/warc -dir /data/nutch-dipe/1kcrawl/segments/





[jira] [Updated] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-11 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2095:
--
Attachment: NUTCH-2095.patch



[jira] [Created] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper

2015-09-11 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2095:
-

 Summary: WARC exporter for the CommonCrawlDataDumper
 Key: NUTCH-2095
 URL: https://issues.apache.org/jira/browse/NUTCH-2095
 Project: Nutch
  Issue Type: Improvement
  Components: commoncrawl, tool
Affects Versions: 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor


Adds the possibility of exporting the Nutch segments to WARC files.

From the usage point of view, a couple of new command line options are
available:

{{-warc}}: enables exporting into WARC files; if not specified, the default
JACKSON formatter is used.
{{-warcSize}}: defines a maximum file size for each WARC file; if not
specified, a default of 1GB per file is used, as recommended by the WARC ISO
standard.

The usual {{-gzip}} flag can be used to enable compression on the WARC files.

Some changes were made to the default {{CommonCrawlDataDumper}}, essentially
to the Factory and to the Formats. These changes avoid creating a new instance
of a {{CommonCrawlFormat}} for each URL read from the segments.
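
The size limit behind {{-warcSize}} can be pictured with a small Python sketch. This is not the exporter's code: {{split_records}} is a hypothetical helper that groups already serialized records into files, starting a new file whenever the next record would push the current one over the limit. How the patch itself counts bytes (for example before or after gzip) is not stated in this description.

```python
def split_records(records, max_bytes):
    """Group byte records into files of at most max_bytes where possible.

    A single record larger than max_bytes still gets a file of its own,
    since records are never split.
    """
    files, current, size = [], [], 0
    for record in records:
        if current and size + len(record) > max_bytes:
            # Close the current file and start a new one.
            files.append(current)
            current, size = [], 0
        current.append(record)
        size += len(record)
    if current:
        files.append(current)
    return files
```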





[jira] [Commented] (NUTCH-1517) CloudSearch indexer

2015-08-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14707078#comment-14707078
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1517:
---

+1 I haven't been able to run any tests (no access to CloudSearch), but so far
it's looking good! Does anyone else want to comment?

 CloudSearch indexer
 ---

 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.11

 Attachments: 0023883254_1377197869_indexer-cloudsearch.patch, 
 0025666929_1382393138_indexer-cloudsearch.20131021.patch, NUTCH-1517.v2.patch


 Once we have made the indexers pluggable, we should add a plugin for Amazon 
 CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
 JSON based representation Search Data Format (SDF), which we could reuse for 
 a file based indexer.





[jira] [Commented] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-25 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601602#comment-14601602
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-2036:
---

Thanks, I was waiting for someone else to review, but I'm glad to see it
committed.

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Fix For: 1.11

 Attachments: NUTCH-2036-v2.patch, NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box (and yes, 
 this is somewhat doable using cron, or sometimes even irrelevant due to the 
 size of the crawl), it's a nice feature to have. 
 This patch basically just adds a new parameter option to the {{bin/crawl}} 
 script (-w|--wait) which adds a time to wait if the generator returns 0 (when 
 no URLs are scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is 
 provided, the amount of time is assumed to be in seconds. Valid suffixes 
 are: 
 s - seconds
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter, or it's not used at all, the 
 default behaviour of exiting the script is used.





[jira] [Created] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-04 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-2036:
-

 Summary: Adding some continuous crawl goodies to the crawl script
 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor


Although Nutch does not support continuous crawling out of the box (and yes, 
this is doable using cron, or sometimes irrelevant due to the size of the 
crawl), it's a nice feature to have. 

This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that 
sets how long to wait when the generator returns 0 (that is, when no URLs are 
scheduled for fetching). 

The new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is 
provided, the amount of time is assumed to be in seconds. Valid suffixes are: 

s - seconds
m - minutes
h - hours
d - days

If {{-1}} is passed to the parameter, or the parameter is not used at all, the 
default behaviour of exiting the script is kept.
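
To make the {{NUMBER\[SUFFIX\]}} format concrete, here is a minimal sketch of 
how such a value could be converted to seconds. It is written in Python for 
illustration only: the patch itself changes the bash {{bin/crawl}} script, and 
the function name {{parse_wait}} and the table of multipliers are assumptions, 
not code taken from the patch.

```python
import re

# Multipliers for the suffixes listed in the description (assumed mapping).
SUFFIX_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_wait(value):
    """Convert a NUMBER[SUFFIX] string to seconds; -1 means "do not wait"."""
    if value == "-1":
        return -1
    match = re.fullmatch(r"(\d+)([smhd]?)", value)
    if match is None:
        raise ValueError(f"invalid wait value: {value!r}")
    number, suffix = match.groups()
    # A missing suffix is treated as seconds.
    return int(number) * SUFFIX_SECONDS[suffix or "s"]
```

With this sketch, {{parse_wait("5m")}} gives 300 and {{parse_wait("30")}} gives 
30. A caller would sleep for that long before asking the generator again, or 
exit when the result is -1.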





[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-04 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2036:
--
Attachment: NUTCH-2036.patch

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Attachments: NUTCH-2036.patch


 Although Nutch does not support continuous crawling out of the box (and yes, 
 this is doable using cron, or sometimes irrelevant due to the size of the 
 crawl), it's a nice feature to have. 
 This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that 
 sets how long to wait when the generator returns 0 (that is, when no URLs are 
 scheduled for fetching). 
 The new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is 
 provided, the amount of time is assumed to be in seconds. Valid suffixes are: 
 s - seconds
 m - minutes
 h - hours
 d - days
 If {{-1}} is passed to the parameter, or the parameter is not used at all, 
 the default behaviour of exiting the script is kept.





[jira] [Commented] (NUTCH-1994) Upgrade to Apache Tika 1.8

2015-04-25 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512856#comment-14512856
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1994:
---

Is this due to the upgraded Tika version, or due to the change in the MIME type 
detection code in NUTCH-1991? 

 Upgrade to Apache Tika 1.8
 --

 Key: NUTCH-1994
 URL: https://issues.apache.org/jira/browse/NUTCH-1994
 Project: Nutch
  Issue Type: Improvement
  Components: build, parser
Affects Versions: 1.10, 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10, 2.3.1

 Attachments: NUTCH-1994-2.x.patch, NUTCH-1994-trunk.patch


 Tika 1.8 was released this morning.
 Let's upgrade, then release Nutch trunk.





[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509167#comment-14509167
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1985:
---

Should we commit this for the 1.10 release, or wait for 1.11?

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it very easy to test different rules files and check the 
 expressions used to filter content based on the detected MIME type. Until 
 now, the only way to check this was to run test crawls and inspect the stored 
 data in Solr/Elasticsearch. 
 This allows calling the class using the {{bin/nutch plugin}} command, 
 something like:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
 specify the rules file to use. This makes it easy to experiment with 
 different rules files until you get the desired behavior. 
 After invoking the class, enter one valid MIME type per line; the output is 
 the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether 
 the given MIME type is allowed or denied, respectively.





[jira] [Commented] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510341#comment-14510341
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1985:
---

Committed revision 1675743.

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it very easy to test different rules files and check the 
 expressions used to filter content based on the detected MIME type. Until 
 now, the only way to check this was to run test crawls and inspect the stored 
 data in Solr/Elasticsearch. 
 This allows calling the class using the {{bin/nutch plugin}} command, 
 something like:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
 specify the rules file to use. This makes it easy to experiment with 
 different rules files until you get the desired behavior. 
 After invoking the class, enter one valid MIME type per line; the output is 
 the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether 
 the given MIME type is allowed or denied, respectively.





[jira] [Resolved] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez resolved NUTCH-1985.
---
Resolution: Fixed

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it very easy to test different rules files and check the 
 expressions used to filter content based on the detected MIME type. Until 
 now, the only way to check this was to run test crawls and inspect the stored 
 data in Solr/Elasticsearch. 
 This allows calling the class using the {{bin/nutch plugin}} command, 
 something like:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
 specify the rules file to use. This makes it easy to experiment with 
 different rules files until you get the desired behavior. 
 After invoking the class, enter one valid MIME type per line; the output is 
 the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether 
 the given MIME type is allowed or denied, respectively.





[jira] [Commented] (NUTCH-1934) Refactor Fetcher in trunk

2015-04-20 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503904#comment-14503904
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1934:
---

+1 to [~chrismattmann]'s comment. 

If the tests pass without any problem, I think we can commit and then do some 
more testing. The basic test that currently covers the monolithic fetcher is a 
great starting point, and of course we should take it for a spin :) I plan on 
taking some time to prepare a midsize crawl before/after the commit, if it helps.

 Refactor Fetcher in trunk
 -

 Key: NUTCH-1934
 URL: https://issues.apache.org/jira/browse/NUTCH-1934
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.10
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
  Labels: memex
 Fix For: 1.11

 Attachments: NUTCH-1934-trunkv2.patch, NUTCH-1934.patch


 Put simply, 
 [Fetcher|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java]
  is too big.
 This is kinda strange, as the size of this file is (I think) unlike that of 
 any other class within Nutch. The others are reasonably well modularized and 
 split into constituent classes which make sense.





[jira] [Created] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-14 Thread Jorge Luis Betancourt Gonzalez (JIRA)
Jorge Luis Betancourt Gonzalez created NUTCH-1985:
-

 Summary: Adding a main() method to the MimeTypeIndexingFilter
 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
 Fix For: 1.10


This makes it very easy to test different rules files and check the 
expressions used to filter content based on the detected MIME type. Until 
now, the only way to check this was to run test crawls and inspect the stored 
data in Solr/Elasticsearch. 

This allows calling the class using the {{bin/nutch plugin}} command, something 
like:

{{bin/nutch plugin mimetype-filter 
org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}

Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
specify the rules file to use. This makes it easy to experiment with different 
rules files until you get the desired behavior. 

After invoking the class, enter one valid MIME type per line; the output is 
the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether the 
given MIME type is allowed or denied, respectively.
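
The interactive check described above boils down to a small loop, sketched 
here in Python under simplifying assumptions: the rules are reduced to a 
default policy plus a set of exception MIME types, which is not the real 
rules file parser, and {{check_mime_types}} is a hypothetical name that does 
not exist in the plugin.

```python
def check_mime_types(lines, default_allow, exceptions):
    """Prefix each MIME type with '+' (allowed) or '-' (denied)."""
    results = []
    for line in lines:
        mime = line.strip()
        if not mime:
            continue  # skip blank input lines
        # A listed exception flips the default policy for that type.
        allowed = default_allow != (mime in exceptions)
        results.append(("+" if allowed else "-") + mime)
    return results
```

Feeding it {{text/html}} and {{application/rss+xml}} with an allow-by-default 
policy and {{application/rss+xml}} as the only exception would mark the first 
with {{+}} and the second with {{-}}.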





[jira] [Updated] (NUTCH-1985) Adding a main() method to the MimeTypeIndexingFilter

2015-04-14 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1985:
--
Attachment: NUTCH-1985.patch

 Adding a main() method to the MimeTypeIndexingFilter
 

 Key: NUTCH-1985
 URL: https://issues.apache.org/jira/browse/NUTCH-1985
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, metadata, plugin
Affects Versions: 1.10
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: features, patch, test
 Fix For: 1.10

 Attachments: NUTCH-1985.patch


 This makes it very easy to test different rules files and check the 
 expressions used to filter content based on the detected MIME type. Until 
 now, the only way to check this was to run test crawls and inspect the stored 
 data in Solr/Elasticsearch. 
 This allows calling the class using the {{bin/nutch plugin}} command, 
 something like:
 {{bin/nutch plugin mimetype-filter 
 org.apache.nutch.indexer.filter.MimeTypeIndexingFilter -h}}
 Two options are accepted: {{-h, --help}} to show the help, and {{-rules}} to 
 specify the rules file to use. This makes it easy to experiment with 
 different rules files until you get the desired behavior. 
 After invoking the class, enter one valid MIME type per line; the output is 
 the same MIME type prefixed with a {{+}} or {{-}} sign, indicating whether 
 the given MIME type is allowed or denied, respectively.





[jira] [Commented] (NUTCH-1980) Jexl expressions for CrawlDbReader

2015-04-01 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392129#comment-14392129
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1980:
---

+1 this looks awesome, can't wait to test

 Jexl expressions for CrawlDbReader
 --

 Key: NUTCH-1980
 URL: https://issues.apache.org/jira/browse/NUTCH-1980
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.11


 We are already using Jexl expressions to filter records from HostDb dumps and 
 it is really helpful when your CrawlDb is stuffed with metadata generated by 
 parser filters, in our case mostly scores generated by classification plugins 
 that run on text or structure.
 In the case of the HostDb, it operates on hosts only, so it is easy to 
 collect a set of sites that host mostly a specific language, pornographic 
 content, or just host topics that your classifiers are trained for.
 By adding this magic to the CrawlDbReader, you can get lists of actual 
 records that contain the stuff you are looking for.
 Most work is already in the HostDb patch so it is easy to translate to 
 individual records. Patch tomorrow, probably...





[jira] [Commented] (NUTCH-1771) Solrindex fails if a segment is corrupted or incomplete

2015-04-01 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392138#comment-14392138
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1771:
---

+1 for this patch and for [~wastl-nagel]. Moving to a new class would allow us 
to write a small segment checker: if the crawl process is stopped by a hard 
reboot, for instance, such a tool could help locate the problematic segment 
before starting the crawling process again.

 Solrindex fails if a segment is corrupted or incomplete
 ---

 Key: NUTCH-1771
 URL: https://issues.apache.org/jira/browse/NUTCH-1771
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.8, 1.10
Reporter: Diaa
Priority: Minor
 Fix For: 1.11


 When using solrindex to index multiple segments via -dir segment, 
 the indexing fails if one or more segments are corrupted/incomplete 
 (generated but not fetched, for example). 
 The failure is simply a java.io exception.
 Deleting the segment fixes the issue.
 The expected behavior should be one of the following:
 * skipping the segment and proceeding with others (while logging)
 * stopping the indexing and logging the failed segment
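
A rough sketch of the first option (skip and log), written in Python rather 
than as Nutch code. It assumes that a complete segment contains the usual 
Nutch 1.x subdirectories; that list and the name {{split_segments}} are 
assumptions for illustration, and a segment fetched without storing content 
would need a shorter list.

```python
import os

# Subdirectories a fully fetched and parsed Nutch 1.x segment normally has.
# Treat this list as an assumption and adjust it to your setup.
EXPECTED_PARTS = ("content", "crawl_fetch", "crawl_generate",
                  "crawl_parse", "parse_data", "parse_text")


def split_segments(segments_dir):
    """Return (complete, incomplete) lists of segment paths."""
    complete, incomplete = [], []
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        if not os.path.isdir(segment):
            continue  # ignore stray files next to the segments
        missing = [part for part in EXPECTED_PARTS
                   if not os.path.isdir(os.path.join(segment, part))]
        (incomplete if missing else complete).append(segment)
    return complete, incomplete
```

Only the segments in the first list would be passed on to the indexer, and 
each entry in the second list would be logged as skipped.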





[jira] [Comment Edited] (NUTCH-1325) HostDB for Nutch

2015-03-27 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385131#comment-14385131
 ] 

Jorge Luis Betancourt Gonzalez edited comment on NUTCH-1325 at 3/28/15 5:01 AM:


Indeed, this looks good! It could be a great addition to the crawl reports :) 
Thanks [~markus17]


was (Author: jorgelbg):
Indeed this looks good! I would be a great addition to the crawl reports :) 
Thanks [~markus17]

 HostDB for Nutch
 

 Key: NUTCH-1325
 URL: https://issues.apache.org/jira/browse/NUTCH-1325
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
 Fix For: 1.11

 Attachments: NUTCH-1325-1.6-1.patch, 
 NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch, 
 NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, 
 NUTCH-1325.trunk.v2.path, oi-hostdb.patch, oi-hostdb.patch, oi-hostdb.patch


 A HostDB for Nutch and associated tools to create and read a database 
 containing information on hosts.





[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2015-03-27 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385131#comment-14385131
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1325:
---

Indeed this looks good! I would be a great addition to the crawl reports :) 
Thanks [~markus17]

 HostDB for Nutch
 

 Key: NUTCH-1325
 URL: https://issues.apache.org/jira/browse/NUTCH-1325
 Project: Nutch
  Issue Type: New Feature
Reporter: Markus Jelsma
 Fix For: 1.11

 Attachments: NUTCH-1325-1.6-1.patch, 
 NUTCH-1325-removed-from-1.8.patch, NUTCH-1325-trunk-v3.patch, 
 NUTCH-1325-trunk-v4.patch, NUTCH-1325-trunk-v5.patch, NUTCH-1325-v4-v5.patch, 
 NUTCH-1325.trunk.v2.path, oi-hostdb.patch, oi-hostdb.patch, oi-hostdb.patch


 A HostDB for Nutch and associated tools to create and read a database 
 containing information on hosts.





[jira] [Resolved] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez resolved NUTCH-1962.
---
Resolution: Fixed

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625





[jira] [Commented] (NUTCH-1958) Remove scoring-opic from nutch-default.xml

2015-03-21 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372572#comment-14372572
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1958:
---

+1 

 Remove scoring-opic from nutch-default.xml
 --

 Key: NUTCH-1958
 URL: https://issues.apache.org/jira/browse/NUTCH-1958
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3, 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 2.4, 1.10


 I propose we remove scoring-opic from nutch-default. We all know it is flawed 
 for any kind of incremental crawl, which most of us do. It is also useless if 
 you want to perform a single crawl: if you must crawl all records of a 
 domain, using OPIC for prioritizing URLs makes no sense. It also confuses 
 users, as we have seen in the past and recently [1].
 What do you think?
 [1]: 
 http://lucene.472066.n3.nabble.com/Nutch-documents-have-huge-scores-in-Solr-td4192064.html





[jira] [Updated] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-12 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1962:
--
Attachment: NUTCH-1962.patch

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625





[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-12 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359931#comment-14359931
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1962:
---

Committed r1666356.

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10

 Attachments: NUTCH-1962.patch


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625





[jira] [Commented] (NUTCH-1962) Need to have mimetype-filter.txt file available by default

2015-03-11 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357883#comment-14357883
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1962:
---

+1 actually I have an example file prepared, and I'm ready to commit.

 Need to have mimetype-filter.txt file available by default
 --

 Key: NUTCH-1962
 URL: https://issues.apache.org/jira/browse/NUTCH-1962
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Lewis John McGibbney
 Fix For: 1.10


 By default the mimetype-filter.txt file quoted within nutch-default.xml is 
 not available. We need to provide this as it is a PITA to constantly have to 
 add it to new crawler configurations.
 https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L1616-L1625





[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346265#comment-14346265
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1949:
---

+1 

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf


 We are going to develop a {{CommonCrawlDataDumper.java}} class. The 
 {{CommonCrawlDataDumper}} is a tool able to perform the following steps:  
 # deserialize the crawled data from Nutch
 # map the deserialized data onto the proper JSON structure
 # serialize the data into [CBOR|http://cbor.io] format
 # optionally, compress the serialized data using {{gzip}}
 This tool has to be able to work with either a single Nutch segment or a 
 directory of segments as input data.
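
The mapping and compression steps above can be sketched roughly as follows, in 
Python and with only the standard library. CBOR encoding is deliberately left 
out because it needs a third-party library, so the payload here stays plain 
JSON; the field names {{url}} and {{content}} and the function {{dump_record}} 
are placeholders, not the tool's actual schema or API.

```python
import gzip
import json


def dump_record(url, content, compress=False):
    """Map one record onto a JSON document and optionally gzip it."""
    # "url" and "content" are placeholder field names, not the real schema.
    payload = json.dumps({"url": url, "content": content}).encode("utf-8")
    return gzip.compress(payload) if compress else payload
```

In the real tool the JSON structure would be CBOR-encoded before the optional 
gzip step; here the compression is applied to the JSON bytes directly.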
 Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support 
 and code.





[jira] [Commented] (NUTCH-1933) nutch-selenium plugin

2015-02-26 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339080#comment-14339080
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1933:
---

I see a {{target}} folder in 
/nutch/trunk/src/plugin/protocol-selenium/src/target/. Is this supposed to be 
there? I also see that it is possible to use a phantomjs driver with selenium 
to provide headless browsing. Is there any way to configure which selenium 
driver is used?

 nutch-selenium plugin
 -

 Key: NUTCH-1933
 URL: https://issues.apache.org/jira/browse/NUTCH-1933
 Project: Nutch
  Issue Type: New Feature
  Components: protocol
Reporter: Mo Omer
Assignee: Mohammad Al-Mohsin
 Fix For: 1.10

 Attachments: NUTCH-selenium-trunk.patch, 
 NUTCH-selenium-trunk.v2.1.patch, NUTCH-selenium-trunk.v2.patch


 I updated the [nutch-selenium|https://github.com/momer/nutch-selenium] 
 plugin to run against trunk.
 I feel that there is a good bit of work to be done here; however, early 
 testing on my system shows that it works. 





[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1421#comment-1421
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

Committed revision 1661600.

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 NUTCH-1928v6.patch, mimetype-patch-v3.patch


 This allows filtering the indexed documents by the MIME type property of the 
 crawled content. Basically, it lets you restrict the MIME types of the 
 content stored in the Solr/Elasticsearch index without restricting the 
 crawling/parsing process, so there is no need to use the URLFilter plugin 
 family. It also addresses one particular corner case: certain URLs, such as 
 some RSS feeds (http://www.awesomesite.com/feed), have no format to filter 
 on, so they end up in your index mixed with all your HTML content.
 A configuration file can be specified in the {{mimetype.filter.file}} property 
 in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1423#comment-1423
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

Done! Thanks for the encouragement! 

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 NUTCH-1928v6.patch, mimetype-patch-v3.patch


 This allows filtering the indexed documents by the MIME type property of the 
 crawled content. Basically, it lets you restrict the MIME types of the 
 content stored in the Solr/Elasticsearch index without restricting the 
 crawling/parsing process, so there is no need to use the URLFilter plugin 
 family. It also addresses one particular corner case: certain URLs, such as 
 some RSS feeds (http://www.awesomesite.com/feed), have no format to filter 
 on, so they end up in your index mixed with all your HTML content.
 A configuration file can be specified in the {{mimetype.filter.file}} property 
 in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Closed] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-23 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez closed NUTCH-1928.
-
Resolution: Fixed

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 NUTCH-1928v6.patch, mimetype-patch-v3.patch


 This allows filtering the indexed documents by the MIME type property of the 
 crawled content. Basically, it lets you restrict the MIME types of the 
 content stored in the Solr/Elasticsearch index without restricting the 
 crawling/parsing process, so there is no need to use the URLFilter plugin 
 family. It also addresses one particular corner case: certain URLs, such as 
 some RSS feeds (http://www.awesomesite.com/feed), have no format to filter 
 on, so they end up in your index mixed with all your HTML content.
 A configuration file can be specified in the {{mimetype.filter.file}} property 
 in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: NUTCH-1928v6.patch

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 NUTCH-1928v6.patch, mimetype-patch-v3.patch


 This allows filtering the indexed documents by the MIME type property of the 
 crawled content. Basically, it lets you restrict the MIME types of the 
 content stored in the Solr/Elasticsearch index without restricting the 
 crawling/parsing process, so there is no need to use the URLFilter plugin 
 family. It also addresses one particular corner case: certain URLs, such as 
 some RSS feeds (http://www.awesomesite.com/feed), have no format to filter 
 on, so they end up in your index mixed with all your HTML content.
 A configuration file can be specified in the {{mimetype.filter.file}} property 
 in {{nutch-site.xml}}. This file uses the same format as the 
 {{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
 {{allow all}} policy is used instead, so all your crawled documents will be 
 indexed.





[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-22 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332500#comment-14332500
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

[~lewismc] Committed successfully ;) I've also updated the JIRA with the latest 
patch to keep it in sync with the committed version. I fixed a problem that 
appeared when running {{ant test}} for the whole project (which takes ~12 mins 
on my laptop).

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 NUTCH-1928v6.patch, mimetype-patch-v3.patch




[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-13 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: NUTCH-1928v5.patch

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 mimetype-patch-v3.patch




[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-13 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319744#comment-14319744
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

[~lewismc] I've added the configuration key to the {{nutch-default.xml}} file, 
along with example content for the {{mimetype-filter.txt}} file that includes a 
description of the format. By default the configuration is set to block all 
MIME types except {{text/html}}. Would it be wise to allow a couple more MIME 
types, or is this sufficient? I haven't set the plugin to be activated by 
default, so I don't see any problem in allowing only {{text/html}} as a usage 
example, but comments are welcome.
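
For illustration, a minimal Java sketch of how a rules file that blocks 
everything except {{text/html}} could be read. The file layout used here is an 
assumption modelled loosely on the {{urlfilter-suffix}} style (a lone "+" or 
"-" sets the default action, every other non-comment line is an exception to 
it); the real {{mimetype-filter.txt}} shipped with the patch may differ. The 
class and method names are hypothetical and are not taken from the patch.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a MIME type rules parser; not the code from the patch. */
final class MimeRulesSketch {
  private final boolean allowByDefault;
  private final Set<String> exceptions;

  private MimeRulesSketch(boolean allowByDefault, Set<String> exceptions) {
    this.allowByDefault = allowByDefault;
    this.exceptions = exceptions;
  }

  /** Parses the ASSUMED format: '#' comments, a lone '+' or '-', then one MIME type per line. */
  static MimeRulesSketch parse(String text) {
    boolean allowByDefault = true; // assumed fallback when no '+'/'-' line is present
    Set<String> exceptions = new HashSet<>();
    for (String raw : text.split("\\R")) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      if (line.equals("+")) {
        allowByDefault = true;
      } else if (line.equals("-")) {
        allowByDefault = false;
      } else {
        exceptions.add(line.toLowerCase(Locale.ROOT));
      }
    }
    return new MimeRulesSketch(allowByDefault, exceptions);
  }

  /** A listed MIME type flips the default action; unlisted types follow it. */
  boolean accepts(String mimeType) {
    boolean listed = exceptions.contains(mimeType.toLowerCase(Locale.ROOT));
    return allowByDefault != listed;
  }

  public static void main(String[] args) {
    String rules = "# deny everything by default\n-\n# except HTML pages\ntext/html\n";
    MimeRulesSketch parsed = parse(rules);
    System.out.println(parsed.accepts("text/html"));       // prints true
    System.out.println(parsed.accepts("application/pdf")); // prints false
  }
}
```

With these example rules only {{text/html}} documents would reach the index, 
which matches the default described in the comment above.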

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: NUTCH-1928v4.patch, NUTCH-1928v5.patch, 
 mimetype-patch-v3.patch




[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307974#comment-14307974
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

[~lewismc] I've updated the patch:

* I was actually generating the patch from our internal SVN repository, where 
we keep our plugins separated from the rest of the Nutch distribution, so the 
previous patch couldn't be applied from $NUTCH_HOME. I've now generated the 
patch from the 1.9 $NUTCH_HOME (sources).
* As usual you were right: I was using the deprecated syntax for the JUnit 
tests, sorry for that.

As usual, really useful feedback!

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mimetype-patch-v3.patch




[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: mimetype-patch-v3.patch

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mimetype-patch-v3.patch




[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-05 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: (was: mimetype-patch-v2.patch)

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
Assignee: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10




[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-01 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: (was: mime-filter.patch)

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mimetype-patch-v2.patch




[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-01 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: mimetype-patch-v2.patch

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mimetype-patch-v2.patch




[jira] [Commented] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-02-01 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300892#comment-14300892
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1928:
---

[~lewismc] both issues are done; sorry for using the wrong coding standard.

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mimetype-patch-v2.patch




[jira] [Created] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-01-30 Thread Jorge Luis Betancourt Gonzalez (JIRA)

Jorge Luis Betancourt Gonzalez created NUTCH-1928:
-

 Summary: Indexing filter of documents by the MIME type
 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
 Fix For: 1.10


This allows filtering the indexed documents by the MIME type property of the 
crawled content. Basically, it lets you restrict the MIME types of the content 
that will be stored in the Solr/Elasticsearch index without the need to 
restrict the crawling/parsing process, so there is no need to use the 
URLFilter plugin family. It also addresses one particular corner case: certain 
URLs don't have any format to filter on, such as some RSS feeds 
(http://www.awesomesite.com/feed), which then end up in your index mixed with 
all your HTML content.

A configuration file can be specified in the {{mimetype.filter.file}} property 
in {{nutch-site.xml}}. This file uses the same format as the 
{{urlfilter-suffix}} plugin. If no {{mimetype.filter.file}} key is found, an 
{{allow all}} policy is used instead, so all your crawled documents will be 
indexed.
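
The decision described above can be sketched in a few lines of Java. This is 
only an illustration under stated assumptions, not the code from the attached 
patch: the class name, the constructor and the {{shouldIndex}} method are 
hypothetical, and the rule model (a default action plus a set of MIME types 
that are exceptions to it) is a simplification of the {{urlfilter-suffix}} 
style.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of index-time MIME type filtering; not the patch itself. */
final class MimeTypeFilterSketch {
  private final boolean allowByDefault;
  private final Set<String> exceptions; // lower-case MIME types that flip the default

  MimeTypeFilterSketch(boolean allowByDefault, Set<String> exceptions) {
    this.allowByDefault = allowByDefault;
    this.exceptions = exceptions;
  }

  /** The "allow all" policy used when no rules file is configured. */
  static MimeTypeFilterSketch allowAll() {
    return new MimeTypeFilterSketch(true, Set.of());
  }

  /** Returns true if a document with this MIME type should be sent to the index. */
  boolean shouldIndex(String mimeType) {
    if (mimeType == null) {
      return allowByDefault; // assumption: unknown types follow the default action
    }
    // Drop parameters such as "; charset=UTF-8" before comparing.
    String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    return allowByDefault != exceptions.contains(base);
  }

  public static void main(String[] args) {
    MimeTypeFilterSketch htmlOnly = new MimeTypeFilterSketch(false, Set.of("text/html"));
    System.out.println(htmlOnly.shouldIndex("text/html; charset=UTF-8")); // prints true
    System.out.println(htmlOnly.shouldIndex("application/rss+xml"));      // prints false
    System.out.println(allowAll().shouldIndex("application/rss+xml"));    // prints true
  }
}
```

Because the check runs on the detected MIME type rather than on the URL, the 
RSS feed case is covered even though its URL carries no file extension.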





[jira] [Updated] (NUTCH-1928) Indexing filter of documents by the MIME type

2015-01-30 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-1928:
--
Attachment: mime-filter.patch

Adding the first version of the code

 Indexing filter of documents by the MIME type
 -

 Key: NUTCH-1928
 URL: https://issues.apache.org/jira/browse/NUTCH-1928
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, plugin
Reporter: Jorge Luis Betancourt Gonzalez
  Labels: filter, mime-type, plugin
 Fix For: 1.10

 Attachments: mime-filter.patch

