[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: NUTCH-2234.patch > Upgrade to elasticsearch 2.1.1 > -- > > Key: NUTCH-2234 > URL: https://issues.apache.org/jira/browse/NUTCH-2234 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh > Attachments: NUTCH-2234.patch > > > Currently we use elasticsearch 1.x; we should upgrade to 2.x -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687-2.patch Here it is: I updated my initial patch for version 1.11. I crawl a large number of hosts, so using a circular linked list prevents creating a new iterator every time a new host is added, which happens quite frequently. > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, > NUTCH-1687.tejasp.v1.patch > > > Currently we choose the queue to pick a URL from starting at the head of the queues list, so queues at the start of the list have a better chance of being picked first. That can cause a long-tail problem, where near the end of a fetch cycle only a few queues, each holding many URLs, are still available. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin; that reduces the time to find an available queue, ensures every queue gets picked in turn, and if we use topN during generation there is no long-tail queue at the end.
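The round-robin idea the patch describes can be sketched independently of Nutch's Fetcher. The class below is a minimal illustration (its names are not the patch's actual code): instead of restarting iteration from the head of the queue map on every call, which systematically favors hosts at the front, it keeps a rotating deque of host keys, and a newly added host joins the rotation once rather than forcing a new iterator.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of round-robin queue selection across per-host queues.
class RoundRobinQueues {
  private final Map<String, Deque<String>> queues = new HashMap<>();
  private final Deque<String> rotation = new ArrayDeque<>(); // round-robin state

  synchronized void addUrl(String host, String url) {
    queues.computeIfAbsent(host, h -> {
      rotation.addLast(h); // a new host joins the rotation exactly once
      return new ArrayDeque<>();
    }).addLast(url);
  }

  /** Picks the next URL in round-robin order across hosts, or null if all drained. */
  synchronized String nextUrl() {
    for (int i = 0, n = rotation.size(); i < n; i++) {
      String host = rotation.pollFirst();
      rotation.addLast(host); // rotate: this host moves to the back
      Deque<String> q = queues.get(host);
      if (!q.isEmpty()) {
        return q.pollFirst();
      }
    }
    return null; // every queue is currently empty
  }
}
```

Because each call rotates the polled host to the back, hosts near the front of the map no longer get a systematic head start, which is the long-tail effect the issue complains about.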
[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2222: Description: This problem happens the second time I crawl a page {code} bin/nutch inject urls/ bin/nutch generate -topN 1000 bin/nutch fetch -all bin/nutch parse -force -all bin/nutch updatedb -all {code} second time (re-fetch): {code} bin/nutch generate -topN 1000 --> batchid changes for all existing pages bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled *** bin/nutch parse -force -all bin/nutch updatedb -all {code} I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2. It happens only if the page has not changed. To reproduce easily, please add to nutch-site.xml: {code} <property> <name>db.fetch.interval.default</name> <value>60</value> <description>The default number of seconds between re-fetches of a page (1 minute)</description> </property> {code} was: This problem happens the second time I crawl a page bin/nutch inject urls/ bin/nutch generate -topN 1000 bin/nutch fetch -all bin/nutch parse -force -all bin/nutch updatedb -all second time: bin/nutch generate -topN 1000 --> batchid changes for all existing pages bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled *** bin/nutch parse -force -all bin/nutch updatedb -all I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2. It happens only if the page has not changed. To reproduce easily, please add to nutch-site.xml: <property> <name>db.fetch.interval.default</name> <value>60</value> <description>The default number of seconds between re-fetches of a page (1 minute)</description> </property> > re-fetch deletes all metadata except _csh_ and _rs_ > > > Key: NUTCH-2222 > URL: https://issues.apache.org/jira/browse/NUTCH-2222 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.3.1 > Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and > hbase-0.98.8-hadoop2 >Reporter: Adnane B. 
>Assignee: Lewis John McGibbney > > This problem happens the second time I crawl a page > {code} > bin/nutch inject urls/ > bin/nutch generate -topN 1000 > bin/nutch fetch -all > bin/nutch parse -force -all > bin/nutch updatedb -all > {code} > second time (re-fetch): > {code} > bin/nutch generate -topN 1000 --> batchid changes for all existing pages > bin/nutch fetch -all --> *** metadata is deleted for all pages already > crawled *** > bin/nutch parse -force -all > bin/nutch updatedb -all > {code} > I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2. > It happens only if the page has not changed. > To reproduce easily, please add to nutch-site.xml: > {code} > <property> > <name>db.fetch.interval.default</name> > <value>60</value> > <description>The default number of seconds between re-fetches of a page (1 > minute)</description> > </property> > {code}
[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-2222: Summary: re-fetch deletes all metadata except _csh_ and _rs_ (was: fetch deletes all metadata except _csh_ and _rs_) > re-fetch deletes all metadata except _csh_ and _rs_ > > > Key: NUTCH-2222 > URL: https://issues.apache.org/jira/browse/NUTCH-2222 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.3.1 > Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and > hbase-0.98.8-hadoop2 >Reporter: Adnane B. >Assignee: Lewis John McGibbney > > This problem happens the second time I crawl a page > bin/nutch inject urls/ > bin/nutch generate -topN 1000 > bin/nutch fetch -all > bin/nutch parse -force -all > bin/nutch updatedb -all > second time: > bin/nutch generate -topN 1000 --> batchid changes for all existing pages > bin/nutch fetch -all --> *** metadata is deleted for all pages already > crawled *** > bin/nutch parse -force -all > bin/nutch updatedb -all > I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2. > It happens only if the page has not changed. > To reproduce easily, please add to nutch-site.xml: > <property> > <name>db.fetch.interval.default</name> > <value>60</value> > <description>The default number of seconds between re-fetches of a page (1 > minute)</description> > </property>
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15166447#comment-15166447 ] Thamme Gowda N commented on NUTCH-2144: --- Hi [~wastl-nagel], Were you able to test this plugin? I agree on both points. The supplied plugin is just a start, and we can have more sophisticated plugins with this extension point. > Plugin to override db.ignore.external to exempt interesting external domain > URLs > > > Key: NUTCH-2144 > URL: https://issues.apache.org/jira/browse/NUTCH-2144 > Project: Nutch > Issue Type: New Feature > Components: crawldb, fetcher >Reporter: Thamme Gowda N >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.12 > > Attachments: ignore-exempt.patch, ignore-exempt.patch > > > Create a rule-based urlfilter plugin that allows a focused crawler > (db.ignore.external.links=true) to fetch static resources from external > domains. > The generalized version of this: the plugin should permit interesting URLs > from external domains (by overriding db.ignore.external). The interesting > URLs are decided by a combination of regex and mime-type rules. > Concrete use case: > When using Nutch to crawl images from a set of domains, the crawler needs > to fetch all images, which may be linked from CDNs and other domains. In this > scenario, allowing all external links and then writing hundreds of regular > expressions is not feasible for a large number of domains.
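The exemption idea behind this issue can be sketched in a few lines. This is a hypothetical illustration, not the attached plugin's actual code: even with db.ignore.external.links=true, an outlink to an external domain is let through when it looks like a static resource we care about, here judged only by an image-suffix regex.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a rule that exempts "interesting" external URLs.
class ExemptionRule {
  // Regex rule: URLs ending in common image suffixes count as interesting.
  private static final Pattern IMAGE_SUFFIX =
      Pattern.compile("(?i).*\\.(gif|png|jpe?g|webp)([?#].*)?$");

  /** True if this external URL should be fetched despite db.ignore.external.links. */
  static boolean isExempted(String url) {
    return IMAGE_SUFFIX.matcher(url).matches();
  }
}
```

A real plugin would combine such regex rules with the mime-type checks the description mentions, since a URL suffix alone can lie about the content type.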
[Nutch Wiki] Update of "SimilarityScoringFilter" by SujenShah
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "SimilarityScoringFilter" page has been changed by SujenShah: https://wiki.apache.org/nutch/SimilarityScoringFilter?action=diff=3=4 1. Copy the gold-standard file into the conf directory and enter the name of this file in nutch-site.xml. {{{ - <name>scoring.similarity.model.path</name> + <name>cosine.goldstandard.file</name> <value>goldstandard.txt</value> }}}
[jira] [Assigned] (NUTCH-2222) fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-2222: --- Assignee: Lewis John McGibbney > fetch deletes all metadata except _csh_ and _rs_ > - > > Key: NUTCH-2222 > URL: https://issues.apache.org/jira/browse/NUTCH-2222 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 2.3.1 > Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and > hbase-0.98.8-hadoop2 >Reporter: Adnane B. >Assignee: Lewis John McGibbney > > This problem happens the second time I crawl a page > bin/nutch inject urls/ > bin/nutch generate -topN 1000 > bin/nutch fetch -all > bin/nutch parse -force -all > bin/nutch updatedb -all > second time: > bin/nutch generate -topN 1000 --> batchid changes for all existing pages > bin/nutch fetch -all --> *** metadata is deleted for all pages already > crawled *** > bin/nutch parse -force -all > bin/nutch updatedb -all > I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2. > It happens only if the page has not changed. > To reproduce easily, please add to nutch-site.xml: > <property> > <name>db.fetch.interval.default</name> > <value>60</value> > <description>The default number of seconds between re-fetches of a page (1 > minute)</description> > </property>
[jira] [Resolved] (NUTCH-2231) Jexl support in generator job
[ https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2231. -- Resolution: Fixed Committed to trunk in revision 1732177. This Jexl stuff is awesome! > Jexl support in generator job > - > > Key: NUTCH-2231 > URL: https://issues.apache.org/jira/browse/NUTCH-2231 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2231.patch, NUTCH-2231.patch > > > The generator should support Jexl expressions. This would make it much easier to > implement focused crawlers that rely on information stored in the CrawlDB. > With the HostDB it is possible to restrict the generator to select only > interesting records, but it is very cumbersome and involves > domainblacklist-urlfiltering. > With Jexl support, it is no hassle! > Crawl only English records: > {code} > bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')" > {code} > Crawl only HTML records: > {code} > bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == > 'text/html' || Content_Type == 'application/xhtml+xml')" > {code} > Keep in mind: > * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed > to underscores > * string literals must be in quotes; only a surrounding quote needs to be > escaped by a backslash
[jira] [Updated] (NUTCH-2231) Jexl support in generator job
[ https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2231: - Attachment: NUTCH-2231.patch Updated patch that transforms hyphens in field identifiers to underscores! > Jexl support in generator job > - > > Key: NUTCH-2231 > URL: https://issues.apache.org/jira/browse/NUTCH-2231 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2231.patch, NUTCH-2231.patch > > > The generator should support Jexl expressions. This would make it much easier to > implement focused crawlers that rely on information stored in the CrawlDB. > With the HostDB it is possible to restrict the generator to select only > interesting records, but it is very cumbersome and involves > domainblacklist-urlfiltering. > With Jexl support, it is no hassle! > Crawl only English records: > {code} > bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')" > {code} > Crawl only HTML records: > {code} > bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == > 'text/html' || Content_Type == 'application/xhtml+xml')" > {code} > Keep in mind: > * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed > to underscores > * string literals must be in quotes; only a surrounding quote needs to be > escaped by a backslash
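The hyphen-to-underscore rule mentioned above is easy to illustrate. The sketch below is illustrative only (the names are not Nutch's actual code); it shows why a metadata key such as Content-Type must be written as Content_Type in an -expr argument:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: Jexl does not allow a hyphen/minus in an identifier,
// so metadata keys are rewritten before being exposed to an expression.
class JexlFieldNames {
  /** Replaces hyphens with underscores so the key is a legal Jexl identifier. */
  static String toJexlIdentifier(String metadataKey) {
    return metadataKey.replace('-', '_');
  }

  /** Re-keys a metadata map so every key is usable in a Jexl expression. */
  static Map<String, Object> sanitize(Map<String, Object> metadata) {
    Map<String, Object> out = new HashMap<>();
    for (Map.Entry<String, Object> e : metadata.entrySet()) {
      out.put(toJexlIdentifier(e.getKey()), e.getValue());
    }
    return out;
  }
}
```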
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: (was: NUTCH-1687-2.patch) > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we choose the queue to pick a URL from starting at the head of the queues list, so queues at the start of the list have a better chance of being picked first. That can cause a long-tail problem, where near the end of a fetch cycle only a few queues, each holding many URLs, are still available. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin; that reduces the time to find an available queue, ensures every queue gets picked in turn, and if we use topN during generation there is no long-tail queue at the end.
[jira] [Issue Comment Deleted] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Comment: was deleted (was: I updated my initial patch for version 1.11. I crawl a large number of hosts, so using a circular linked list prevents creating a new iterator every time a new host is added.) > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch > > > Currently we choose the queue to pick a URL from starting at the head of the queues list, so queues at the start of the list have a better chance of being picked first. That can cause a long-tail problem, where near the end of a fetch cycle only a few queues, each holding many URLs, are still available. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin; that reduces the time to find an available queue, ensures every queue gets picked in turn, and if we use topN during generation there is no long-tail queue at the end.
[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2229: - Description: CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks the opportunity to select on attributes like fetchTime and modifiedTime. This includes a rudimentary date parser only supporting the yyyy-MM-dd'T'HH:mm:ss'Z' format: Dump everything with a modifiedTime higher than 2016-03-20T00:00:00Z {code} bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)" {code} Dump everything that is an HTML file {code} bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')" {code} Keep in mind: * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed to underscores * string literals must be in quotes; only a surrounding quote needs to be escaped by a backslash was: CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks the opportunity to select on attributes like fetchTime and modifiedTime. This includes a rudimentary date parser only supporting the yyyy-MM-dd'T'HH:mm:ss'Z' format: {code} bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)" {code} > Allow Jexl expressions on CrawlDatum's fixed attributes > --- > > Key: NUTCH-2229 > URL: https://issues.apache.org/jira/browse/NUTCH-2229 > Project: Nutch > Issue Type: Improvement > Components: crawldb >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2229.patch > > > CrawlDatum allows Jexl expressions on its metadata fields nicely, but it > lacks the opportunity to select on attributes like fetchTime and modifiedTime. 
> This includes a rudimentary date parser only supporting the > yyyy-MM-dd'T'HH:mm:ss'Z' format: > Dump everything with a modifiedTime higher than 2016-03-20T00:00:00Z > {code} > bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > > 2016-03-20T00:00:00Z)" > {code} > Dump everything that is an HTML file > {code} > bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(Content_Type == > 'text/html' || Content_Type == 'application/xhtml+xml')" > {code} > Keep in mind: > * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed > to underscores > * string literals must be in quotes; only a surrounding quote needs to be > escaped by a backslash
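The "rudimentary date parser" described above can be sketched as follows. This is an illustrative sketch, not Nutch's actual code (class and method names are assumptions): a literal such as 2016-03-20T00:00:00Z from an -expr argument is converted to epoch milliseconds so it can be compared against attributes like modifiedTime.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Illustrative sketch of parsing the single supported date-literal format.
class ExprDates {
  /** Parses a yyyy-MM-dd'T'HH:mm:ss'Z' literal into epoch milliseconds (UTC). */
  static long parseUtc(String literal) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // 'Z' is quoted in the pattern, so force UTC
    try {
      return fmt.parse(literal).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Not a supported date literal: " + literal, e);
    }
  }
}
```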
[jira] [Updated] (NUTCH-2231) Jexl support in generator job
[ https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2231: - Description: The generator should support Jexl expressions. This would make it much easier to implement focused crawlers that rely on information stored in the CrawlDB. With the HostDB it is possible to restrict the generator to select only interesting records, but it is very cumbersome and involves domainblacklist-urlfiltering. With Jexl support, it is no hassle! Crawl only English records: {code} bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')" {code} Crawl only HTML records: {code} bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')" {code} Keep in mind: * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed to underscores * string literals must be in quotes; only a surrounding quote needs to be escaped by a backslash was: The generator should support Jexl expressions. This would make it much easier to implement focused crawlers that rely on information stored in the CrawlDB. With the HostDB it is possible to restrict the generator to select only interesting records, but it is very cumbersome and involves domainblacklist-urlfiltering. With Jexl support, it is no hassle! > Jexl support in generator job > - > > Key: NUTCH-2231 > URL: https://issues.apache.org/jira/browse/NUTCH-2231 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2231.patch > > > The generator should support Jexl expressions. This would make it much easier to > implement focused crawlers that rely on information stored in the CrawlDB. > With the HostDB it is possible to restrict the generator to select only > interesting records, but it is very cumbersome and involves > domainblacklist-urlfiltering. 
> With Jexl support, it is no hassle! > Crawl only English records: > {code} > bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')" > {code} > Crawl only HTML records: > {code} > bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == > 'text/html' || Content_Type == 'application/xhtml+xml')" > {code} > Keep in mind: > * Jexl doesn't allow a hyphen/minus in field identifiers; they are transformed > to underscores > * string literals must be in quotes; only a surrounding quote needs to be > escaped by a backslash
[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: (was: NUTCH-2234.patch) > Upgrade to elasticsearch 2.1.1 > -- > > Key: NUTCH-2234 > URL: https://issues.apache.org/jira/browse/NUTCH-2234 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh > > Currently we use elasticsearch 1.x; we should upgrade to 2.x
[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: NUTCH-2234.patch > Upgrade to elasticsearch 2.1.1 > -- > > Key: NUTCH-2234 > URL: https://issues.apache.org/jira/browse/NUTCH-2234 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.11 >Reporter: Tien Nguyen Manh > Attachments: NUTCH-2234.patch > > > Currently we use elasticsearch 1.x; we should upgrade to 2.x
[jira] [Created] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
Tien Nguyen Manh created NUTCH-2234: --- Summary: Upgrade to elasticsearch 2.1.1 Key: NUTCH-2234 URL: https://issues.apache.org/jira/browse/NUTCH-2234 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.11 Reporter: Tien Nguyen Manh Currently we use elasticsearch 1.x; we should upgrade to 2.x
[jira] [Commented] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163138#comment-15163138 ] Hudson commented on NUTCH-2232: --- SUCCESS: Integrated in Nutch-trunk #3354 (See [https://builds.apache.org/job/Nutch-trunk/3354/]) NUTCH-2232 DeduplicationJob should decode URL's before length is compared (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1732160]) * trunk/CHANGES.txt * trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java > DeduplicationJob should decode URL's before length is compared > -- > > Key: NUTCH-2232 > URL: https://issues.apache.org/jira/browse/NUTCH-2232 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Ron van der Vegt >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2232.patch, NUTCH-2232.patch > > > When certain documents have the same signature, the deduplication script will > elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When > two URLs are compared by length, they are not first decoded. This > can lead to a misleading URL length.
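The core of the fix is easy to demonstrate. The sketch below is illustrative, not the committed patch: because URLs are stored URL-encoded in the CrawlDb, comparing raw string lengths when electing a duplicate is misleading ("a%20b" is longer than "a b"), so the length should be measured on the decoded form.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative sketch: measure URL length on the decoded form.
class DedupLength {
  /** Length of the decoded form of a URL; falls back to the raw length. */
  static int decodedLength(String url) {
    try {
      return URLDecoder.decode(url, "UTF-8").length();
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      return url.length(); // keep the raw length if decoding fails
    }
  }
}
```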
[jira] [Updated] (NUTCH-2231) Jexl support in generator job
[ https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2231: - Attachment: NUTCH-2231.patch Patch for trunk! It adds a JexlUtil where the expression parsing is done. CrawlDbReader has been updated accordingly. > Jexl support in generator job > - > > Key: NUTCH-2231 > URL: https://issues.apache.org/jira/browse/NUTCH-2231 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.11 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2231.patch > > > Generator should support Jexl expressions. This would make it much easier to > implement focussing crawlers that rely on information stored in the CrawlDB. > With the HostDB it is possible to restrict the generator to select only > interesting records but it is very cumbersome and involves > domainblacklist-urlfiltering. > With Jexl support, it is no hassle!
[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687-2.patch I updated my initial patch for version 1.11. I crawl a large number of hosts, so using a circular linked list prevents creating a new iterator every time a new host is added. > Pick queue in Round Robin > - > > Key: NUTCH-1687 > URL: https://issues.apache.org/jira/browse/NUTCH-1687 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Reporter: Tien Nguyen Manh >Priority: Minor > Attachments: NUTCH-1687-2.patch, NUTCH-1687.patch, > NUTCH-1687.tejasp.v1.patch > > > Currently we choose the queue to pick a URL from starting at the head of the queues list, so queues at the start of the list have a better chance of being picked first. That can cause a long-tail problem, where near the end of a fetch cycle only a few queues, each holding many URLs, are still available. > public synchronized FetchItem getFetchItem() { > final Iterator<Map.Entry<String, FetchItemQueue>> it = > queues.entrySet().iterator(); ==> always resets to search for a queue from the > start > while (it.hasNext()) { > > I think it is better to pick queues in round robin; that reduces the time to find an available queue, ensures every queue gets picked in turn, and if we use topN during generation there is no long-tail queue at the end.
[jira] [Closed] (NUTCH-1179) Option to restrict generated records by metadata
[ https://issues.apache.org/jira/browse/NUTCH-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-1179. > Option to restrict generated records by metadata > > > Key: NUTCH-1179 > URL: https://issues.apache.org/jira/browse/NUTCH-1179 > Project: Nutch > Issue Type: New Feature > Components: generator >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > > The generator should be able to select entries based on a metadata key/value > pair.
[jira] [Closed] (NUTCH-2215) Generator to restrict crawl to mime type
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma closed NUTCH-2215. > Generator to restrict crawl to mime type > > > Key: NUTCH-2215 > URL: https://issues.apache.org/jira/browse/NUTCH-2215 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma > Attachments: NUTCH-2215.patch, NUTCH-2215.patch > > > Large crawls fail to restrict crawling of non-HTML via the suffix filter alone, > because URLs hide their mime-types. This issue only passes records with a > Content-Type that matches a regex.
[jira] [Resolved] (NUTCH-2215) Generator to restrict crawl to mime type
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2215. -- Resolution: Duplicate > Generator to restrict crawl to mime type > > > Key: NUTCH-2215 > URL: https://issues.apache.org/jira/browse/NUTCH-2215 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma > Attachments: NUTCH-2215.patch, NUTCH-2215.patch > > > Large crawls fail to restrict crawling of non-HTML via the suffix filter alone, > because URLs hide their mime-types. This issue only passes records with a > Content-Type that matches a regex.
[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2215: - Affects Version/s: (was: 1.11) > Generator to restrict crawl to mime type > > > Key: NUTCH-2215 > URL: https://issues.apache.org/jira/browse/NUTCH-2215 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma > Attachments: NUTCH-2215.patch, NUTCH-2215.patch > > > Large crawls fail to restrict crawling of non-HTML via the suffix filter alone, > because URLs hide their mime-types. This issue only passes records with a > Content-Type that matches a regex.
[jira] [Updated] (NUTCH-2215) Generator to restrict crawl to mime type
[ https://issues.apache.org/jira/browse/NUTCH-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2215: - Fix Version/s: (was: 1.12) > Generator to restrict crawl to mime type > > > Key: NUTCH-2215 > URL: https://issues.apache.org/jira/browse/NUTCH-2215 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.11 >Reporter: Markus Jelsma > Attachments: NUTCH-2215.patch, NUTCH-2215.patch > > > Large crawls fail to restrict crawling of non-HTML via the suffix filter alone, > because URLs hide their mime-types. This issue only passes records with a > Content-Type that matches a regex.
[jira] [Resolved] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-2232. -- Resolution: Fixed Assignee: Markus Jelsma Committed to trunk in revision 1732160. Thanks Ron van der Vegt > DeduplicationJob should decode URL's before length is compared > -- > > Key: NUTCH-2232 > URL: https://issues.apache.org/jira/browse/NUTCH-2232 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Ron van der Vegt >Assignee: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-2232.patch, NUTCH-2232.patch > > > When certain documents have the same signature, the deduplication script will > elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When > two URLs are compared by length, they are not first decoded. This > can lead to a misleading URL length.
[jira] [Updated] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-2232: - Attachment: NUTCH-2232.patch Updated patch with only the following modification: * moved imports to their alphabetic location > DeduplicationJob should decode URL's before length is compared > -- > > Key: NUTCH-2232 > URL: https://issues.apache.org/jira/browse/NUTCH-2232 > Project: Nutch > Issue Type: Bug > Components: crawldb >Affects Versions: 1.11 >Reporter: Ron van der Vegt > Fix For: 1.12 > > Attachments: NUTCH-2232.patch, NUTCH-2232.patch > > > When certain documents have the same signature, the deduplication script will > elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When > two URLs are compared by length, they are not first decoded. This > can lead to a misleading URL length.
[jira] [Commented] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163025#comment-15163025 ]

Hudson commented on NUTCH-2229:
-------------------------------

SUCCESS: Integrated in Nutch-trunk #3353 (See [https://builds.apache.org/job/Nutch-trunk/3353/])
NUTCH-2229 Allow Jexl expressions on CrawlDatum's fixed attributes (markus: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1732140])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
* trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java

> Allow Jexl expressions on CrawlDatum's fixed attributes
> -------------------------------------------------------
>
>                 Key: NUTCH-2229
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2229
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2229.patch
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser supporting only the
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
> {code}
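The "rudimentary date parser" the issue describes can be sketched in plain Java. The class and method names below are hypothetical illustrations, not Nutch's actual code; only the `yyyy-MM-dd'T'HH:mm:ss'Z'` format string comes from the issue:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

/** Sketch: normalize a date literal like 2016-03-20T00:00:00Z to epoch
 *  milliseconds so a Jexl expression can compare it against fetchTime
 *  or modifiedTime. Hypothetical helper, not Nutch's implementation. */
public class DateLiteral {
  public static long toEpochMillis(String literal) throws ParseException {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
    // The trailing 'Z' is a quoted literal, not a zone pattern, so force UTC.
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    return fmt.parse(literal).getTime();
  }

  public static void main(String[] args) throws ParseException {
    System.out.println(toEpochMillis("2016-03-20T00:00:00Z"));
  }
}
```

With the literal normalized to a long, the expression `(modifiedTime > 2016-03-20T00:00:00Z)` reduces to a plain numeric comparison against the CrawlDatum field.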
[jira] [Created] (NUTCH-2233) Index-basic incorrect assignment of next fetch time when using Mongodb as storage backend
Pablo Torres created NUTCH-2233:
-----------------------------------

             Summary: Index-basic incorrect assignment of next fetch time when using Mongodb as storage backend
                 Key: NUTCH-2233
                 URL: https://issues.apache.org/jira/browse/NUTCH-2233
             Project: Nutch
          Issue Type: Bug
          Components: plugin
    Affects Versions: 2.3.1
         Environment: MongoDB, Elasticsearch
            Reporter: Pablo Torres

The patch from https://issues.apache.org/jira/browse/NUTCH-2045 does not work when MongoDB is used as storage: date properties are stored as Longs in MongoDB rather than as objects, so the null date in this case is 0, which the patch accepts as valid. The system then indexes 01/01/1970 as tstamp. I found this issue using MongoDB as storage and Elasticsearch as index.
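The failure mode is easy to reproduce in plain Java: a Long 0 read back from the store is the Unix epoch, so a null check alone does not catch it. The guard below is a hypothetical sketch of the extra check the report implies, not the actual patch:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Sketch of the NUTCH-2233 failure mode: a "null" date stored as Long 0
 *  comes back as the Unix epoch and passes a plain null check. */
public class EpochCheck {
  /** Hypothetical guard: treat both null and epoch 0 as "no fetch time". */
  public static boolean hasValidFetchTime(Long millis) {
    return millis != null && millis != 0L;
  }

  public static void main(String[] args) {
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
    // A zero timestamp formats as the epoch date the reporter saw indexed.
    System.out.println(fmt.format(new Date(0L)));
    System.out.println(hasValidFetchTime(0L));
  }
}
```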
[jira] [Updated] (NUTCH-2232) DeduplicationJob should decode URL's before length is compared
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2232:
---------------------------------
    Summary: DeduplicationJob should decode URL's before length is compared
       (was: DeduplicationJob: Url is not decoded before the url length is compared.)

> DeduplicationJob should decode URL's before length is compared
> --------------------------------------------------------------
>
>                 Key: NUTCH-2232
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2232.patch
>
> When certain documents have the same signature, the deduplication job will
> elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When
> two URLs are compared by length, they are not first decoded. This could lead
> to a misleading URL length comparison.
[jira] [Commented] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163003#comment-15163003 ]

Markus Jelsma commented on NUTCH-2232:
--------------------------------------

Yes, there is clearly a difference in length between {{https://zh.wikipedia.org/wiki/馬伯利訴麥迪遜案}} and {{https://zh.wikipedia.org/wiki/%E9%A9%AC%E4%BC%AF%E5%88%A9%E8%AF%89%E9%BA%A6%E8%BF%AA%E9%80%8A%E6%A1%88}}. This could in some cases result in weird, unexpected behaviour.

> DeduplicationJob: Url is not decoded before the url length is compared.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2232
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2232.patch
>
> When certain documents have the same signature, the deduplication job will
> elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When
> two URLs are compared by length, they are not first decoded. This could lead
> to a misleading URL length comparison.
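A minimal sketch of the fix the issue asks for, decoding before measuring length (`DecodedLength` is a hypothetical illustration, not the attached patch): each percent-encoded CJK character occupies nine characters (`%XX%XX%XX`) in the raw string but only one after decoding, which is what skews a shortest-URL dedup heuristic.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Sketch: measure URL length after percent-decoding, so the encoded and
 *  decoded forms of the same URL compare equally. Hypothetical helper. */
public class DecodedLength {
  public static int decodedLength(String url) throws UnsupportedEncodingException {
    return URLDecoder.decode(url, "UTF-8").length();
  }

  public static void main(String[] args) throws Exception {
    String encoded = "https://zh.wikipedia.org/wiki/%E9%A9%AC%E4%BC%AF%E5%88%A9%E8%AF%89%E9%BA%A6%E8%BF%AA%E9%80%8A%E6%A1%88";
    // Raw length counts every %XX triple; decoded length counts characters.
    System.out.println(encoded.length());
    System.out.println(decodedLength(encoded));
  }
}
```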
[jira] [Updated] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2232:
---------------------------------
    Affects Version/s: 1.11

> DeduplicationJob: Url is not decoded before the url length is compared.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2232
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Ron van der Vegt
>             Fix For: 1.12
>
>         Attachments: NUTCH-2232.patch
>
> When certain documents have the same signature, the deduplication job will
> elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When
> two URLs are compared by length, they are not first decoded. This could lead
> to a misleading URL length comparison.
[jira] [Updated] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.
[ https://issues.apache.org/jira/browse/NUTCH-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ron van der Vegt updated NUTCH-2232:
------------------------------------
    Attachment: NUTCH-2232.patch

> DeduplicationJob: Url is not decoded before the url length is compared.
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2232
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>            Reporter: Ron van der Vegt
>
>         Attachments: NUTCH-2232.patch
>
> When certain documents have the same signature, the deduplication job will
> elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When
> two URLs are compared by length, they are not first decoded. This could lead
> to a misleading URL length comparison.
[jira] [Created] (NUTCH-2232) DeduplicationJob: Url is not decoded before the url length is compared.
Ron van der Vegt created NUTCH-2232:
---------------------------------------

             Summary: DeduplicationJob: Url is not decoded before the url length is compared.
                 Key: NUTCH-2232
                 URL: https://issues.apache.org/jira/browse/NUTCH-2232
             Project: Nutch
          Issue Type: Bug
          Components: crawldb
            Reporter: Ron van der Vegt

When certain documents have the same signature, the deduplication job will elect one as a duplicate. The URLs are stored URL-encoded in the crawldb. When two URLs are compared by length, they are not first decoded. This could lead to a misleading URL length comparison.
[jira] [Resolved] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-2229.
----------------------------------
    Resolution: Fixed

Committed to trunk in revision 1732140.

> Allow Jexl expressions on CrawlDatum's fixed attributes
> -------------------------------------------------------
>
>                 Key: NUTCH-2229
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2229
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2229.patch
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser supporting only the
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
> {code}
[jira] [Commented] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162945#comment-15162945 ]

Markus Jelsma commented on NUTCH-2229:
--------------------------------------

Ah, this works very nicely! I'll commit shortly!

> Allow Jexl expressions on CrawlDatum's fixed attributes
> -------------------------------------------------------
>
>                 Key: NUTCH-2229
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2229
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2229.patch
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser supporting only the
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
> {code}
[jira] [Created] (NUTCH-2231) Jexl support in generator job
Markus Jelsma created NUTCH-2231:
------------------------------------

             Summary: Jexl support in generator job
                 Key: NUTCH-2231
                 URL: https://issues.apache.org/jira/browse/NUTCH-2231
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.11
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.12

The generator should support Jexl expressions. This would make it much easier to implement focused crawlers that rely on information stored in the CrawlDB. With the HostDB it is possible to restrict the generator to select only interesting records, but that is very cumbersome and involves domain-blacklist URL filtering. With Jexl support, it is no hassle!
[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2229:
---------------------------------
    Attachment: NUTCH-2229.patch

Patch for trunk!

> Allow Jexl expressions on CrawlDatum's fixed attributes
> -------------------------------------------------------
>
>                 Key: NUTCH-2229
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2229
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2229.patch
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser supporting only the
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
> {code}
[jira] [Updated] (NUTCH-2229) Allow Jexl expressions on CrawlDatum's fixed attributes
[ https://issues.apache.org/jira/browse/NUTCH-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2229:
---------------------------------
    Patch Info: Patch Available
   Description:
CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks the opportunity to select on attributes like fetchTime and modifiedTime.
This includes a rudimentary date parser supporting only the yyyy-MM-dd'T'HH:mm:ss'Z' format:
{code}
bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
{code}
  (was: CrawlDatum allows Jexl expressions on its metadata fields nicely, but it lacks the opportunity to select on attributes like fetchTime and modifiedTime.)

> Allow Jexl expressions on CrawlDatum's fixed attributes
> -------------------------------------------------------
>
>                 Key: NUTCH-2229
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2229
>             Project: Nutch
>          Issue Type: Improvement
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
> CrawlDatum allows Jexl expressions on its metadata fields nicely, but it
> lacks the opportunity to select on attributes like fetchTime and modifiedTime.
> This includes a rudimentary date parser supporting only the
> yyyy-MM-dd'T'HH:mm:ss'Z' format:
> {code}
> bin/nutch readdb crawl/crawldb/ -dump out -format csv -expr "(modifiedTime > 2016-03-20T00:00:00Z)"
> {code}
I have one small question that always intrigues me
Hi everyone, I really need your help, please read below.

If we run Solr in cloud mode, we use ZooKeeper, and any ZooKeeper client can connect to the ZooKeeper server. ZooKeeper does have a facility to protect znodes, but anyone can see a znode's ACL; the password may be encrypted, but decrypting or guessing it is not a big deal. As we know, the password is a SHA hash, and there is also no limit on the number of attempts to authorize against the ACL. So my point is: how do we safeguard ZooKeeper? I can guess a few things:
a. Don't reveal the IP of your ZooKeeper (security through obscurity)
b. iptables, which is also not a very good idea
c. what else?

My thought was that we could somehow protect the ZooKeeper server itself by requiring clients to authenticate themselves before they can make a connection to the ensemble, even at the root (/) znode. Please at least comment on this, I really need your help.
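One established option along the lines of point (c) is ZooKeeper's SASL authentication, configured through JAAS. The fragments below are a sketch only; the user name, password, and file path are placeholders, and the exact property names should be verified against the ZooKeeper version you run:

```
# zoo.cfg — register the SASL authentication provider on each server
authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
```

```
// server.jaas — credentials the ensemble accepts (DigestLoginModule);
// pass the file to each server with
//   -Djava.security.auth.login.config=/path/to/server.jaas
Server {
    org.apache.zookeeper.server.auth.DigestLoginModule required
    user_solr="CHANGE_ME";
};
```

Clients then authenticate with a matching `Client` JAAS section, and znodes created with `sasl`-scheme ACLs are readable only by authenticated sessions. Note this protects znode data, not the TCP connection itself; combine it with network-level controls rather than relying on either alone.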