[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: NUTCH-2234.patch

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2234.patch
>
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: NUTCH-1687-2.patch

Here it is:
I updated my initial patch for version 1.11.
I crawl a large number of hosts, so using a circular linked list prevents creating 
a new iterator every time a new host is added, which happens quite frequently.
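For readers following along, here is a minimal sketch of the round-robin idea under 
discussion. It is illustrative only and not the attached patch: the field names 
(queueIds, cursor) are assumptions, and it presumes the surrounding FetchItemQueues 
class with its Map<String, FetchItemQueue> queues and queue-removal handling omitted.
{code}
// Illustrative sketch, not the NUTCH-1687 patch: keep a cursor over the queue ids
// so each call resumes where the previous one stopped, instead of re-creating an
// iterator from the head of the map on every call.
private final List<String> queueIds = new ArrayList<String>(); // ids appended as hosts appear
private int cursor = 0;                                        // round-robin position

public synchronized FetchItem getFetchItem() {
  for (int i = 0; i < queueIds.size(); i++) {
    if (cursor >= queueIds.size()) {
      cursor = 0;                              // wrap around (also covers removed queues)
    }
    String id = queueIds.get(cursor++);
    FetchItemQueue fiq = queues.get(id);
    FetchItem fit = (fiq == null) ? null : fiq.getFetchItem();
    if (fit != null) {
      return fit;                              // found an eligible queue without rescanning from the head
    }
  }
  return null;                                 // no queue currently has an item available
}
{code}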

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from starting at the head of the 
> queues list, so queues at the start of the list have a better chance of being 
> picked first. That can cause a long-tail problem: only a few queues remain 
> available at the end, and those hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time needed 
> to find an available queue, ensures every queue gets picked in round robin, and 
> if we use topN during the generator there is no long-tail queue left at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2222:

Description: 
This problem happens the second time I crawl a page:
{code}
bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all
{code}
second time (re-fetch):
{code}
bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all   --> *** metadata is deleted for all pages already 
crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all
{code}
I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce easily, please add to nutch-site.xml:
{code}
<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 
  minute)</description>
</property>
{code}


  was:
This problem happens the second time I crawl a page:

bin/nutch inject urls/
bin/nutch generate -topN 1000
bin/nutch fetch  -all
bin/nutch parse -force   -all
bin/nutch updatedb  -all

second time:

bin/nutch generate -topN 1000 --> batchid changes for all existing pages
bin/nutch fetch -all   --> *** metadata is deleted for all pages already 
crawled ***
bin/nutch parse -force   -all
bin/nutch updatedb  -all

I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.

It happens only if the page has not changed.

To reproduce easily, please add to nutch-site.xml:

<property>
  <name>db.fetch.interval.default</name>
  <value>60</value>
  <description>The default number of seconds between re-fetches of a page (1 
  minute)</description>
</property>



> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-2222
> URL: https://issues.apache.org/jira/browse/NUTCH-2222
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all   --> *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 
> minute)</description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2222:

Summary: re-fetch deletes all  metadata except _csh_ and _rs_  (was: fetch 
deletes all  metadata except _csh_ and _rs_)

> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-2222
> URL: https://issues.apache.org/jira/browse/NUTCH-2222
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
>
> This problem happens the second time I crawl a page:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all   --> *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 
> minute)</description>
> </property>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-24 Thread Thamme Gowda N (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15166447#comment-15166447
 ] 

Thamme Gowda N commented on NUTCH-2144:
---

Hi [~wastl-nagel],
Were you able to test this plugin?

I agree on both points.
The supplied plugin is just a start, and we can have more sophisticated plugins with 
this extension point.

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: the plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> URLs are decided by a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images, which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.
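As a rough illustration of the kind of rule described above, here is a hedged, 
self-contained sketch; the class name, method signature, and regex are assumptions 
for illustration and are not taken from the attached ignore-exempt.patch or from any 
Nutch extension-point interface.
{code}
import java.util.regex.Pattern;

// Hypothetical sketch (names and signature assumed, not from the patch):
// exempt an external outlink from db.ignore.external.links when the target
// looks like a static image resource.
public class ExemptImageUrlFilter {

  // URLs ending in common image suffixes, optionally followed by a query/fragment
  private static final Pattern IMAGE_SUFFIX =
      Pattern.compile("(?i)\\.(gif|png|jpe?g|svg|webp)([?#].*)?$");

  /** Returns true if the external outlink should be fetched anyway. */
  public boolean filter(String fromUrl, String toUrl) {
    return IMAGE_SUFFIX.matcher(toUrl).find();
  }

  public static void main(String[] args) {
    ExemptImageUrlFilter f = new ExemptImageUrlFilter();
    System.out.println(f.filter("http://example.org/page.html",
        "http://cdn.example-images.net/photo.jpg"));  // true: exempted
    System.out.println(f.filter("http://example.org/page.html",
        "http://tracker.example-ads.net/pixel.js"));  // false: still ignored
  }
}
{code}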



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[Nutch Wiki] Update of "SimilarityScoringFilter" by SujenShah

2016-02-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "SimilarityScoringFilter" page has been changed by SujenShah:
https://wiki.apache.org/nutch/SimilarityScoringFilter?action=diff&rev1=3&rev2=4

  1. Copy the gold-standard file into the conf directory and enter the name of 
this file in nutch-site.xml. 
  {{{
  <property>
- <name>scoring.similarity.model.path</name>
+ <name>cosine.goldstandard.file</name>
  <value>goldstandard.txt</value>
  </property>
  }}} 


[jira] [Assigned] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_

2016-02-24 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney reassigned NUTCH-2222:
---

Assignee: Lewis John McGibbney

> fetch deletes all  metadata except _csh_ and _rs_
> -
>
> Key: NUTCH-2222
> URL: https://issues.apache.org/jira/browse/NUTCH-2222
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
>
> This problem happens the second time I crawl a page:
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> second time:
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all   --> *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> I reproduced it with MongoDB 2.6, MongoDB 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 
> minute)</description>
> </property>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2231) Jexl support in generator job

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2231.
--
Resolution: Fixed

Committed to trunk in revision 1732177. This Jexl stuff is awesome!


> Jexl support in generator job
> -
>
> Key: NUTCH-2231
> URL: https://issues.apache.org/jira/browse/NUTCH-2231
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2231.patch, NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to 
> implement focused crawlers that rely on information stored in the CrawlDB. 
> With the HostDB it is possible to restrict the generator to select only 
> interesting records but it is very cumbersome and involves 
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 
> 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed 
> to underscores
> * string literals must be in quotes; only the surrounding quote needs to be 
> escaped by a backslash



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2231) Jexl support in generator job

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2231:
-
Attachment: NUTCH-2231.patch

Updated patch that transforms hyphens in field identifiers to underscores.

> Jexl support in generator job
> -
>
> Key: NUTCH-2231
> URL: https://issues.apache.org/jira/browse/NUTCH-2231
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2231.patch, NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to 
> implement focused crawlers that rely on information stored in the CrawlDB. 
> With the HostDB it is possible to restrict the generator to select only 
> interesting records but it is very cumbersome and involves 
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 
> 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed 
> to underscores
> * string literals must be in quotes; only the surrounding quote needs to be 
> escaped by a backslash



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: (was: NUTCH-1687-2.patch)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from starting at the head of the 
> queues list, so queues at the start of the list have a better chance of being 
> picked first. That can cause a long-tail problem: only a few queues remain 
> available at the end, and those hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time needed 
> to find an available queue, ensures every queue gets picked in round robin, and 
> if we use topN during the generator there is no long-tail queue left at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Comment: was deleted

(was: I updated my initial patch for version 1.11.
I crawl a large number of hosts, so using a circular linked list prevents creating 
a new iterator every time a new host is added.)

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from starting at the head of the 
> queues list, so queues at the start of the list have a better chance of being 
> picked first. That can cause a long-tail problem: only a few queues remain 
> available at the end, and those hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time needed 
> to find an available queue, ensures every queue gets picked in round robin, and 
> if we use topN during the generator there is no long-tail queue left at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2229:
-
Description: 
CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks 
the opportunity to select on attributes like fetchTime and modifiedTime.

This includes a rudimentary date parser only supporting the 
yyyy-MM-dd'T'HH:mm:ss'Z' format:

Dump everything with a modifiedTime higher than 2016-03-20T00:00:00Z
{code}
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
2016-03-20T00:00:00Z)"
{code}

Dump everything that is an HTML file
{code}
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(Content_Type == 
'text/html' || Content_Type == 'application/xhtml+xml')"
{code}

Keep in mind:
* Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed 
to underscores
* string literals must be in quotes; only the surrounding quote needs to be 
escaped by a backslash


  was:
CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks 
the opportunity to select on attributes like fetchTime and modifiedTime.

This includes a rudimentary date parser only supporting the 
yyyy-MM-dd'T'HH:mm:ss'Z' format:

{code}
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
2016-03-20T00:00:00Z)"
{code}


> Allow Jexl expressions on CrawlDatum's fixed attributes
> ---
>
> Key: NUTCH-2229
> URL: https://issues.apache.org/jira/browse/NUTCH-2229
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2229.patch
>
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser only supporting the 
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> Dump everything with a modifiedTime higher than 2016-03-20T00:00:00Z
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
> 2016-03-20T00:00:00Z)"
> {code}
> Dump everything that is an HTML file
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(Content_Type == 
> 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed 
> to underscores
> * string literals must be in quotes; only the surrounding quote needs to be 
> escaped by a backslash
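For readers who want to see what such a rudimentary parser could look like, here is a 
self-contained sketch. It is only an assumption about the approach (the committed code 
may differ): it rewrites a yyyy-MM-dd'T'HH:mm:ss'Z' literal inside the expression into 
epoch milliseconds so the comparison becomes a plain numeric one.
{code}
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: replace ISO-like date literals with epoch milliseconds before
// handing the expression to Jexl.
public class DateLiteralDemo {
  private static final Pattern ISO =
      Pattern.compile("\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z");

  public static String replaceDates(String expr) throws ParseException {
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
    Matcher m = ISO.matcher(expr);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      // substitute the matched literal with its epoch-millisecond value
      m.appendReplacement(sb, Long.toString(sdf.parse(m.group()).getTime()));
    }
    m.appendTail(sb);
    return sb.toString();
  }

  public static void main(String[] args) throws ParseException {
    System.out.println(replaceDates("(modifiedTime > 2016-03-20T00:00:00Z)"));
  }
}
{code}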



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2231) Jexl support in generator job

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2231:
-
Description: 
Generator should support Jexl expressions. This would make it much easier to 
implement focused crawlers that rely on information stored in the CrawlDB. 
With the HostDB it is possible to restrict the generator to select only 
interesting records but it is very cumbersome and involves 
domainblacklist-urlfiltering.

With Jexl support, it is no hassle!

Crawl only English records:
{code}
bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
{code}

Crawl only HTML records:
{code}
bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 
'text/html' || Content_Type == 'application/xhtml+xml')"
{code}

Keep in mind:
* Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed 
to underscores
* string literals must be in quotes; only the surrounding quote needs to be 
escaped by a backslash


  was:
Generator should support Jexl expressions. This would make it much easier to 
implement focused crawlers that rely on information stored in the CrawlDB. 
With the HostDB it is possible to restrict the generator to select only 
interesting records but it is very cumbersome and involves 
domainblacklist-urlfiltering.

With Jexl support, it is no hassle!


> Jexl support in generator job
> -
>
> Key: NUTCH-2231
> URL: https://issues.apache.org/jira/browse/NUTCH-2231
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to 
> implement focused crawlers that rely on information stored in the CrawlDB. 
> With the HostDB it is possible to restrict the generator to select only 
> interesting records but it is very cumbersome and involves 
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 
> 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed 
> to underscores
> * string literals must be in quotes; only the surrounding quote needs to be 
> escaped by a backslash



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: (was: NUTCH-2234.patch)

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2234:

Attachment: NUTCH-2234.patch

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2234.patch
>
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2234:
---

 Summary: Upgrade to elasticsearch 2.1.1
 Key: NUTCH-2234
 URL: https://issues.apache.org/jira/browse/NUTCH-2234
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.11
Reporter: Tien Nguyen Manh


Currently we use Elasticsearch 1.x; we should upgrade to 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared

2016-02-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163138#comment-15163138
 ] 

Hudson commented on NUTCH-2232:
---

SUCCESS: Integrated in Nutch-trunk #3354 (See 
[https://builds.apache.org/job/Nutch-trunk/3354/])
NUTCH-2232 DeduplicationJob should decode URL's before length is compared 
(markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1732160])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java


> DeduplicationJob should decode URL's before length is compared
> --
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Ron van der Vegt
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2232.patch, NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2231) Jexl support in generator job

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2231:
-
Attachment: NUTCH-2231.patch

Patch for trunk! It adds a JexlUtil where the expression parsing is done. 
CrawlDbReader has been updated accordingly.
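For context, here is a small, self-contained sketch of how an -expr string can be 
parsed once and then evaluated per record, assuming Apache Commons JEXL 2 on the 
classpath. It is not the JexlUtil from this patch; the field name exposed to the 
context is only an example.
{code}
import org.apache.commons.jexl2.Expression;
import org.apache.commons.jexl2.JexlContext;
import org.apache.commons.jexl2.JexlEngine;
import org.apache.commons.jexl2.MapContext;

// Sketch only: parse the expression once, evaluate it against each record's fields.
public class JexlExprDemo {
  public static void main(String[] args) {
    JexlEngine jexl = new JexlEngine();
    Expression expr = jexl.createExpression(
        "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')");

    JexlContext ctx = new MapContext();
    // hyphens in metadata keys would be rewritten to underscores before this point
    ctx.set("Content_Type", "text/html");

    Object result = expr.evaluate(ctx);  // Boolean.TRUE for this record
    System.out.println(result);
  }
}
{code}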

> Jexl support in generator job
> -
>
> Key: NUTCH-2231
> URL: https://issues.apache.org/jira/browse/NUTCH-2231
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to 
> implement focused crawlers that rely on information stored in the CrawlDB. 
> With the HostDB it is possible to restrict the generator to select only 
> interesting records but it is very cumbersome and involves 
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1687:

Attachment: NUTCH-1687-2.patch

I updated my initial patch for version 1.11.
I crawl a large number of hosts, so using a circular linked list prevents creating 
a new iterator every time a new host is added.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, 
> NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from starting at the head of the 
> queues list, so queues at the start of the list have a better chance of being 
> picked first. That can cause a long-tail problem: only a few queues remain 
> available at the end, and those hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>     queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That reduces the time needed 
> to find an available queue, ensures every queue gets picked in round robin, and 
> if we use topN during the generator there is no long-tail queue left at the end.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (NUTCH-1179) Option to restrict generated records by metadata

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-1179.


> Option to restrict generated records by metadata
> 
>
> Key: NUTCH-1179
> URL: https://issues.apache.org/jira/browse/NUTCH-1179
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
>
> The generator should be able to select entries based on a metadata key/value 
> pair.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-2215.


> Generator to restrict crawl to mime type
> 
>
> Key: NUTCH-2215
> URL: https://issues.apache.org/jira/browse/NUTCH-2215
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
> Attachments: NUTCH-2215.patch, NUTCH-2215.patch
>
>
> Large crawls fail to restrict crawling of non-HTML content via the suffix filter 
> alone, because URLs hide their mime types. This issue only passes records whose 
> Content-Type matches a regex.
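As a rough illustration of the kind of Content-Type check described above (the regex, 
field source, and class name are illustrative and not taken from the attached patch), 
a minimal self-contained sketch:
{code}
import java.util.regex.Pattern;

// Sketch only: emit a record during generation only if its stored Content-Type
// metadata matches a configurable regular expression.
public class MimeFilterDemo {
  public static void main(String[] args) {
    Pattern allowed = Pattern.compile("^(text/html|application/xhtml\\+xml)$");
    String contentType = "text/html";  // value read from the record's metadata
    boolean emit = allowed.matcher(contentType).matches();
    System.out.println(emit ? "generate" : "skip");
  }
}
{code}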



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2215.
--
Resolution: Duplicate

> Generator to restrict crawl to mime type
> 
>
> Key: NUTCH-2215
> URL: https://issues.apache.org/jira/browse/NUTCH-2215
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
> Attachments: NUTCH-2215.patch, NUTCH-2215.patch
>
>
> Large crawls fail to restrict crawling of non-HTML content via the suffix filter 
> alone, because URLs hide their mime types. This issue only passes records whose 
> Content-Type matches a regex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2215:
-
Affects Version/s: (was: 1.11)

> Generator to restrict crawl to mime type
> 
>
> Key: NUTCH-2215
> URL: https://issues.apache.org/jira/browse/NUTCH-2215
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
> Attachments: NUTCH-2215.patch, NUTCH-2215.patch
>
>
> Large crawls fail to restrict crawling of non-HTML content via the suffix filter 
> alone, because URLs hide their mime types. This issue only passes records whose 
> Content-Type matches a regex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2215:
-
Fix Version/s: (was: 1.12)

> Generator to restrict crawl to mime type
> 
>
> Key: NUTCH-2215
> URL: https://issues.apache.org/jira/browse/NUTCH-2215
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
> Attachments: NUTCH-2215.patch, NUTCH-2215.patch
>
>
> Large crawls fail to restrict crawling of non-HTML content via the suffix filter 
> alone, because URLs hide their mime types. This issue only passes records whose 
> Content-Type matches a regex.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2232.
--
Resolution: Fixed
  Assignee: Markus Jelsma

Committed to trunk in revision 1732160. Thanks, Ron van der Vegt!


> DeduplicationJob should decode URL's before length is compared
> --
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Ron van der Vegt
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2232.patch, NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2232:
-
Attachment: NUTCH-2232.patch

Updated patch with only the following modification:
* moved imports to their alphabetical location

> DeduplicationJob should decode URL's before length is compared
> --
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2232.patch, NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163025#comment-15163025
 ] 

Hudson commented on NUTCH-2229:
---

SUCCESS: Integrated in Nutch-trunk #3353 (See 
[https://builds.apache.org/job/Nutch-trunk/3353/])
NUTCH-2229 Allow Jexl expressions on CrawlDatum's fixed attributes (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1732140])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
* trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java


> Allow Jexl expressions on CrawlDatum's fixed attributes
> ---
>
> Key: NUTCH-2229
> URL: https://issues.apache.org/jira/browse/NUTCH-2229
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2229.patch
>
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser only supporting the 
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
> 2016-03-20T00:00:00Z)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2233) Index-basic incorrect assignment of next fetch time when using Mongodb as storage backend

2016-02-24 Thread Pablo Torres (JIRA)
Pablo Torres created NUTCH-2233:
---

 Summary: Index-basic incorrect assignment of next fetch time when 
using Mongodb as storage backend
 Key: NUTCH-2233
 URL: https://issues.apache.org/jira/browse/NUTCH-2233
 Project: Nutch
  Issue Type: Bug
  Components: plugin
Affects Versions: 2.3.1
 Environment: Mongodb, Elasticsearch.
Reporter: Pablo Torres


The patch from https://issues.apache.org/jira/browse/NUTCH-2045 does not work when 
using MongoDB as storage, since date properties are stored as Longs in MongoDB 
rather than as objects. The missing date in this case is therefore 0, which is 
accepted as valid by that patch, and the system indexes 01/01/1970 as tstamp.

I found this issue using MongoDB as storage and Elasticsearch as the index.
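A minimal, self-contained sketch of the kind of guard the report implies (illustrative 
only, not a patch against the index-basic plugin):
{code}
// Sketch only: with MongoDB a missing date comes back as the long 0 rather than
// null, so guard on the epoch value before indexing, otherwise 1970-01-01 ends
// up in the tstamp field.
public class TstampGuardDemo {
  public static void main(String[] args) {
    long fetchTime = 0L;  // value read back from the store for a page never fetched
    if (fetchTime > 0) {
      System.out.println("index tstamp: " + new java.util.Date(fetchTime));
    } else {
      System.out.println("skip tstamp field");
    }
  }
}
{code}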



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2232:
-
Summary: DeduplicationJob should decode URL's before length is compared  
(was: DeduplicationJob: Url is not decoded before the url length is compared.)

> DeduplicationJob should decode URL's before length is compared
> --
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.

2016-02-24 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163003#comment-15163003
 ] 

Markus Jelsma commented on NUTCH-2232:
--

Yes, there is clearly a difference in length between 
{{https://zh.wikipedia.org/wiki/馬伯利訴麥迪遜案}} and 
{{https://zh.wikipedia.org/wiki/%E9%A9%AC%E4%BC%AF%E5%88%A9%E8%AF%89%E9%BA%A6%E8%BF%AA%E9%80%8A%E6%A1%88}}.
 This could in some cases result in weird unexpected behaviour.
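To make the length difference concrete, here is a self-contained sketch (illustrative 
only, not the committed patch) that compares the encoded and decoded lengths of the 
second URL above:
{code}
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Sketch only: decode the stored URL before comparing lengths, so percent-encoded
// characters do not make an otherwise equivalent URL look "longer".
public class DecodedLengthDemo {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String encoded = "https://zh.wikipedia.org/wiki/"
        + "%E9%A9%AC%E4%BC%AF%E5%88%A9%E8%AF%89%E9%BA%A6%E8%BF%AA%E9%80%8A%E6%A1%88";
    String decoded = URLDecoder.decode(encoded, "UTF-8");
    // the raw form counts every %XX escape as three characters
    System.out.println(encoded.length() + " (encoded) vs " + decoded.length() + " (decoded)");
  }
}
{code}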

> DeduplicationJob: Url is not decoded before the url length is compared.
> ---
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2232:
-
Affects Version/s: 1.11

> DeduplicationJob: Url is not decoded before the url length is compared.
> ---
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.

2016-02-24 Thread Ron van der Vegt (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron van der Vegt updated NUTCH-2232:

Attachment: NUTCH-2232.patch

> DeduplicationJob: Url is not decoded before the url length is compared.
> ---
>
> Key: NUTCH-2232
> URL: https://issues.apache.org/jira/browse/NUTCH-2232
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Reporter: Ron van der Vegt
> Attachments: NUTCH-2232.patch
>
>
> When certain documents have the same signature, the deduplication job will 
> elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
> two URLs are compared by length, they are not first decoded, which 
> can lead to a misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.

2016-02-24 Thread Ron van der Vegt (JIRA)
Ron van der Vegt created NUTCH-2232:
---

 Summary: DeduplicationJob: Url is not decoded before the url 
length is compared.
 Key: NUTCH-2232
 URL: https://issues.apache.org/jira/browse/NUTCH-2232
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Reporter: Ron van der Vegt


When certain documents have the same signature, the deduplication job will 
elect one as the duplicate. The URLs are stored URL-encoded in the crawldb. When 
two URLs are compared by length, they are not first decoded, which can lead to a 
misleading URL length comparison.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2229.
--
Resolution: Fixed

Committed to trunk in revision 1732140.


> Allow Jexl expressions on CrawlDatum's fixed attributes
> ---
>
> Key: NUTCH-2229
> URL: https://issues.apache.org/jira/browse/NUTCH-2229
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2229.patch
>
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser only supporting the 
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
> 2016-03-20T00:00:00Z)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15162945#comment-15162945
 ] 

Markus Jelsma commented on NUTCH-2229:
--

Ah, this works very nicely! I'll commit shortly!

> Allow Jexl expressions on CrawlDatum's fixed attributes
> ---
>
> Key: NUTCH-2229
> URL: https://issues.apache.org/jira/browse/NUTCH-2229
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2229.patch
>
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser only supporting the 
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
> 2016-03-20T00:00:00Z)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2231) Jexl support in generator job

2016-02-24 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2231:


 Summary: Jexl support in generator job
 Key: NUTCH-2231
 URL: https://issues.apache.org/jira/browse/NUTCH-2231
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.11
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.12


Generator should support Jexl expressions. This would make it much easier to 
implement focused crawlers that rely on information stored in the CrawlDB. 
With the HostDB it is possible to restrict the generator to select only 
interesting records but it is very cumbersome and involves 
domainblacklist-urlfiltering.

With Jexl support, it is no hassle!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2229:
-
Attachment: NUTCH-2229.patch

Patch for trunk!

> Allow Jexl expressions on CrawlDatum's fixed attributes
> ---
>
> Key: NUTCH-2229
> URL: https://issues.apache.org/jira/browse/NUTCH-2229
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2229.patch
>
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser only supporting the 
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
> 2016-03-20T00:00:00Z)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes

2016-02-24 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2229:
-
 Patch Info: Patch Available
Description: 
CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks 
the opportunity to select on attributes like fetchTime and modifiedTime.

This includes a rudimentary date parser only supporting the 
yyyy-MM-dd'T'HH:mm:ss'Z' format:

{code}
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
2016-03-20T00:00:00Z)"
{code}

  was:CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
lacks the opportunity to select on attributes like fetchTime and modifiedTime.


> Allow Jexl expressions on CrawlDatum's fixed attributes
> ---
>
> Key: NUTCH-2229
> URL: https://issues.apache.org/jira/browse/NUTCH-2229
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it 
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser only supporting the 
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 
> 2016-03-20T00:00:00Z)"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


I have one small question that always intrigues me

2016-02-24 Thread Zara Parst
Hi everyone,

I really need your help; please read below.


If we have to run Solr in cloud mode, we are going to use ZooKeeper, and any
ZooKeeper client can connect to the ZooKeeper server. ZooKeeper has a facility
to protect znodes, but anyone can see a znode's ACL, and although the password
is encrypted, decrypting or guessing the password is not a big deal. As we know,
the password is SHA-encrypted, and there is no limit on the number of attempts
to authorize against the ACL. So my point is: how do we safeguard ZooKeeper?

I can guess a few things:

a. Don't reveal the IP of your ZooKeeper (security through obscurity)
b. iptables rules, which are also not a very good idea
c. what else?

My hope was that we could somehow protect the ZooKeeper server itself by asking
clients to authorize themselves before they can make a connection to the ensemble,
even at the root (/) znode.
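One option I can think of, as a sketch only (placeholder host and credentials, and not
a complete answer, since ZooKeeper ACLs protect znodes rather than the connection
itself), is to put digest ACLs on the znodes so that a client which has not called
addAuthInfo() cannot read or write them:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SecureZnodeDemo {
  public static void main(String[] args) throws Exception {
    // "zk-host:2181" and the credentials below are placeholders
    ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, null);
    zk.addAuthInfo("digest", "solr:strong-password".getBytes("UTF-8"));
    // CREATOR_ALL_ACL limits the znode to the identity that created it
    zk.create("/solr-secure", new byte[0],
        ZooDefs.Ids.CREATOR_ALL_ACL, CreateMode.PERSISTENT);
    zk.close();
  }
}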

Please, at least comment on this; I really need your help.