Re: use hbase as distributed crawl's scheduler
Please take a look at our Apache incubator proposal, as I think that may answer your questions: https://wiki.apache.org/incubator/PhoenixProposal

On Fri, Jan 3, 2014 at 11:47 PM, Li Li fancye...@gmail.com wrote:
> so what's the relationship of Phoenix and HBase? something like hadoop
> and hive?
Re: use hbase as distributed crawl's scheduler
Sure, no problem.

One addition: depending on the cardinality of your priority column, you may want to salt your table to prevent hotspotting, since you'll have a monotonically increasing date in the key. To do that, just add SALT_BUCKETS=n to your CREATE TABLE statement, where n is the number of machines in your cluster. You can read more about salting here: http://phoenix.incubator.apache.org/salted.html

On Thu, Jan 2, 2014 at 11:36 PM, Li Li fancye...@gmail.com wrote:
> thank you. it's great.

On Fri, Jan 3, 2014 at 3:15 PM, James Taylor jtay...@salesforce.com wrote:
> Hi LiLi,
> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> skin on top of HBase. You can model your schema and issue your queries
> just like you would with MySQL. Something like this:
>
>     -- Create a table that optimizes for your most common query
>     -- (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>     -- your rows ordered)
>     CREATE TABLE url_db (
>         status TINYINT,
>         priority INTEGER NOT NULL,
>         added_time DATE,
>         url VARCHAR NOT NULL
>         CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>
>     int lastStatus = 0;
>     int lastPriority = 0;
>     Date lastAddedTime = new Date(0);
>     String lastUrl = "";
>     while (true) {
>         // Use a row value constructor to page through results in
>         // batches of 1000
>         String query =
>             "SELECT * FROM url_db WHERE status = 0 " +
>             "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>             "ORDER BY status, priority, added_time, url LIMIT 1000";
>         PreparedStatement stmt = connection.prepareStatement(query);
>         // Bind parameters to the last row processed
>         stmt.setInt(1, lastStatus);
>         stmt.setInt(2, lastPriority);
>         stmt.setDate(3, lastAddedTime);
>         stmt.setString(4, lastUrl);
>         ResultSet resultSet = stmt.executeQuery();
>         while (resultSet.next()) {
>             // Remember the last row processed so that you can start
>             // after it for the next batch
>             lastStatus = resultSet.getInt(1);
>             lastPriority = resultSet.getInt(2);
>             lastAddedTime = resultSet.getDate(3);
>             lastUrl = resultSet.getString(4);
>             doSomethingWithUrls();
>             // Then mark the url as crawled:
>             //   UPSERT INTO url_db(status, priority, added_time, url)
>             //   VALUES (1, ?, CURRENT_DATE(), ?);
>         }
>     }
>
> If you need to efficiently query on url, add a secondary index like this:
>
>     CREATE INDEX url_index ON url_db (url);
>
> Please let me know if you have questions.
> Thanks,
> James

On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com wrote:
> thank you. But I can't use nutch. could you tell me how hbase is used in
> nutch? or is hbase only used to store webpages?

On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> Hi,
> Have a look at http://nutch.apache.org . Version 2.x uses HBase under the
> hood.
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:
> hi all,
> I want to use hbase to store all urls (crawled or not crawled). Each url
> will have a column named priority which represents the priority of the
> url. I want to get the top N urls ordered by priority (if priority is the
> same, the url whose timestamp is earlier is preferred). Using something
> like mysql, my client application would look like:
>
>     while true:
>         select url from url_db where status = 'not_crawled'
>             order by priority, addedTime limit 1000;
>         do something with these urls;
>         extract more urls and insert them into url_db;
>
> How should I design an hbase schema for this application? Is hbase
> suitable for me? I found in this article
> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> that they use redis to store urls. I think hbase originated from
> bigtable, and google uses bigtable to store webpages, so for a huge
> number of urls I prefer a distributed system like hbase.
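The row value constructor in James's query compares the whole (status, priority, added_time, url) tuple lexicographically against the last row seen, which is what makes the paging resumable without re-reading earlier rows. Here is a plain-Java sketch of that ordering (the class, field, and comparator names are mine for illustration, not Phoenix's):

```java
import java.util.Comparator;

public class RowKeyOrder {
    // Hypothetical holder for the four primary-key columns of url_db.
    static final class Row {
        final int status, priority;
        final long addedTime; // epoch millis stands in for DATE
        final String url;
        Row(int s, int p, long t, String u) {
            status = s; priority = p; addedTime = t; url = u;
        }
    }

    // Lexicographic order mirroring (status, priority, added_time, url) > (?, ?, ?, ?)
    static final Comparator<Row> PK_ORDER = Comparator
            .comparingInt((Row r) -> r.status)
            .thenComparingInt(r -> r.priority)
            .thenComparingLong(r -> r.addedTime)
            .thenComparing(r -> r.url);

    public static void main(String[] args) {
        Row last = new Row(0, 5, 1000L, "http://a.example");
        Row next = new Row(0, 5, 1000L, "http://b.example");   // same prefix, later url
        Row earlier = new Row(0, 4, 2000L, "http://z.example"); // lower priority sorts first
        System.out.println(PK_ORDER.compare(next, last) > 0);    // true: next batch starts here
        System.out.println(PK_ORDER.compare(earlier, last) < 0); // true: already paged past
    }
}
```

Because the comparison is on the full tuple rather than on added_time alone, rows that tie on priority and date are still paged through deterministically by url.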
Re: use hbase as distributed crawl's scheduler
Interesting. This is exactly what I'm doing ;)

I'm using 3 tables to achieve this: one table with the URLs already crawled (80 millions), one URL with the URLs to crawl (2 billions), and one URL with the URLs being processed. I'm not running any SQL requests against my dataset, but I have MR jobs doing many different things, and I have many other tables to help with the work on the URLs.

I'm salting the keys using the URL hash so I can find them back very quickly. There can be some collisions, so I also store the URL itself in the key. So very small scans returning 1 or sometimes 2 rows allow me to quickly find a row knowing the URL. I also have secondary index tables that store the CRCs of the pages to identify duplicate pages based on this value. And so on ;)

I've been working on that for 2 years now. I might have been able to use Nutch and others, but my goal was to learn and to do that with a distributed client on a single dataset...

Enjoy.

JM

2014/1/3 James Taylor jtay...@salesforce.com
> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. ...
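JM's key scheme, prefixing the row key with a hash of the URL and keeping the full URL in the key to disambiguate collisions, can be sketched like this. The bucket count and the choice of hash function are my assumptions for illustration; his actual implementation isn't shown in the thread:

```java
import java.nio.charset.StandardCharsets;

public class SaltedUrlKey {
    static final int BUCKETS = 16; // assumed bucket count; the thread doesn't state one

    // Build a row key: [1-byte hash prefix][url bytes]. The prefix spreads
    // writes across regions; keeping the full URL in the key keeps it unique
    // even when two URLs hash to the same bucket.
    static byte[] rowKey(String url) {
        byte prefix = (byte) Math.floorMod(url.hashCode(), BUCKETS);
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[1 + urlBytes.length];
        key[0] = prefix;
        System.arraycopy(urlBytes, 0, key, 1, urlBytes.length);
        return key;
    }

    public static void main(String[] args) {
        // A point lookup recomputes the same prefix, so a Get (or a tiny scan
        // returning 1, sometimes 2 rows) finds the row directly.
        byte[] key = rowKey("http://example.com/page");
        System.out.println(key.length);
    }
}
```

The lookup path is the point of the design: given only a URL, the client can rebuild the exact key prefix and avoid any full-table scan.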
Re: use hbase as distributed crawl's scheduler
bq. One URL ...

I guess you mean one table ...

Cheers

On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> Interesting. This is exactly what I'm doing ;) I'm using 3 tables to
> achieve this. One table with the URL already crawled (80 millions), one
> URL with the URL to crawle (2 billions) and one URL with the URLs been
> processed. ...
Re: use hbase as distributed crawl's scheduler
Yes, sorry ;) Thanks for the correction. Should have been: one table with the URLs already crawled (80 millions), one table with the URLs to crawl (2 billions), and one table with the URLs being processed. I'm not running any SQL requests against my dataset, but I have MR jobs doing many different things, and I have many other tables to help with the work on the URLs.

2014/1/3 Ted Yu yuzhih...@gmail.com
> bq. One URL ...
>
> I guess you mean one table ...
>
> Cheers
Re: use hbase as distributed crawl's scheduler
Couple of notes:

1. When updating the status you essentially add a new rowkey into HBase, so I would give it up altogether. The essential requirement seems to point at retrieving a list of urls in a certain order.

2. Wouldn't salting ruin the required sort order (priority, date added)?

On Friday, January 3, 2014, James Taylor wrote:
> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. ...
Re: use hbase as distributed crawl's scheduler
On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com wrote:
> Couple of notes:
> 1. When updating to status you essentially add a new rowkey into HBase, I
> would give it up all together. The essential requirement seems to point
> at retrieving a list of urls in a certain order.

Not sure on this, but it seemed to me that setting the status field is forcing the urls that have already been processed to the end of the sort order.

> 2. Wouldn't salting ruin the sort order required? Priority, date added?

No, as Phoenix maintains returning rows in row key order even when they're salted. We do parallel scans for each bucket and a merge sort on the client, so the cost is pretty low for this (we also provide a way of turning this off if your use case doesn't need it).

Two years, JM? Now you're really going to have to start using Phoenix :-)
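The client-side merge James describes, one sorted scan per salt bucket merged back into overall row-key order, can be sketched with a priority queue. This illustrates the idea only; it is not Phoenix's actual implementation:

```java
import java.util.*;

public class BucketMergeSort {
    // Merge pre-sorted per-bucket iterators into one globally sorted stream,
    // the way a client can recombine parallel scans of salted buckets.
    static <T extends Comparable<T>> List<T> merge(List<Iterator<T>> buckets) {
        // Each queue entry pairs the next value with the iterator it came from.
        PriorityQueue<Map.Entry<T, Iterator<T>>> pq =
            new PriorityQueue<>(Comparator.comparing((Map.Entry<T, Iterator<T>> e) -> e.getKey()));
        for (Iterator<T> it : buckets) {
            if (it.hasNext()) pq.add(new AbstractMap.SimpleEntry<>(it.next(), it));
        }
        List<T> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            Map.Entry<T, Iterator<T>> e = pq.poll();
            out.add(e.getKey());
            // Refill from the bucket we just consumed, if it has more rows.
            if (e.getValue().hasNext())
                pq.add(new AbstractMap.SimpleEntry<>(e.getValue().next(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Each "bucket" is already sorted, as a per-region scan would be.
        List<Iterator<String>> scans = Arrays.asList(
            Arrays.asList("a", "d", "g").iterator(),
            Arrays.asList("b", "e").iterator(),
            Arrays.asList("c", "f").iterator());
        System.out.println(merge(scans)); // [a, b, c, d, e, f, g]
    }
}
```

The queue only ever holds one row per bucket, which is why the per-row cost of restoring global order over n buckets stays low (O(log n) per row).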
Re: use hbase as distributed crawl's scheduler
hi James,
phoenix seems great, but it's now only an experimental project. I want to use only hbase. Could you tell me the difference between Phoenix and hbase? If I use hbase only, how should I design the schema and some extra things for my goal? thank you

On Sat, Jan 4, 2014 at 3:41 AM, James Taylor jtay...@salesforce.com wrote:
> No, as Phoenix maintains returning rows in row key order even when
> they're salted. We do parallel scans for each bucket and a merge sort on
> the client, so the cost is pretty low for this (we also provide a way of
> turning this off if your use case doesn't need it). ...
Re: use hbase as distributed crawl's scheduler
Hi LiLi,
Phoenix isn't an experimental project. We're on our 2.2 release, and many companies (including the company for which I'm employed, Salesforce.com) use it in production today.
Thanks,
James

On Fri, Jan 3, 2014 at 11:39 PM, Li Li fancye...@gmail.com wrote:
> hi James, phoenix seems great but it's now only an experimental project.
> I want to use only hbase. could you tell me the difference between
> Phoenix and hbase? If I use hbase only, how should I design the schema
> and some extra things for my goal? thank you
Re: use hbase as distributed crawl's scheduler
So what's the relationship between Phoenix and HBase? Something like Hadoop and Hive?
Re: use hbase as distributed crawl's scheduler
Hi,
Have a look at http://nutch.apache.org . Version 2.x uses HBase under the hood.
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:
hi all,
I want to use HBase to store all urls (crawled or not crawled). Each url will have a column named priority, representing the priority of the url. I want to get the top N urls ordered by priority (if priorities are equal, the url whose timestamp is earlier is preferred). With something like MySQL, my client application might look like:

while true:
    select url from url_db where status='not_crawled' order by priority, addedTime limit 1000;
    do something with these urls;
    extract more urls and insert them into url_db;

How should I design an HBase schema for this application? Is HBase suitable for me? I found that in this article http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ they use redis to store urls. But HBase originated from Bigtable, and Google uses Bigtable to store webpages, so for a huge number of urls I prefer a distributed system like HBase.
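As an editorial aside, the polling loop Li Li describes above boils down to ordering by a composite key. A minimal Java sketch of that ordering (names here are illustrative, not from Nutch, Phoenix, or HBase):

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch: the composite ordering Li Li's MySQL-style
// query asks for, modeled in plain Java. A lower priority value is
// fetched first, matching the ascending ORDER BY.
class Frontier {
    static final int NOT_CRAWLED = 0;

    static class Entry {
        final int status;
        final int priority;
        final long addedTime;
        final String url;
        Entry(int status, int priority, long addedTime, String url) {
            this.status = status;
            this.priority = priority;
            this.addedTime = addedTime;
            this.url = url;
        }
    }

    // Mirrors ORDER BY status, priority, addedTime, url
    static final Comparator<Entry> ORDER = Comparator
            .comparingInt((Entry e) -> e.status)
            .thenComparingInt(e -> e.priority)
            .thenComparingLong(e -> e.addedTime)
            .thenComparing(e -> e.url);

    // SELECT url FROM url_db WHERE status = 'not_crawled'
    // ORDER BY priority, addedTime LIMIT n
    static List<String> topN(List<Entry> db, int n) {
        return db.stream()
                .filter(e -> e.status == NOT_CRAWLED)
                .sorted(ORDER)
                .limit(n)
                .map(e -> e.url)
                .collect(Collectors.toList());
    }
}
```

The same comparator order is what an HBase row key (or a Phoenix PRIMARY KEY) would encode physically, which is why the later schema suggestions in this thread put status, priority, and time first in the key.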
Re: use hbase as distributed crawl's scheduler
thank you. But I can't use nutch. Could you tell me how HBase is used in nutch? Or is HBase only used to store webpages?
Re: use hbase as distributed crawl's scheduler
Otis, I didn't realize Nutch uses HBase underneath. It might be interesting if you serialized the data in a Phoenix-compliant manner, as you could then run SQL queries directly on top of it.
Thanks,
James
Re: use hbase as distributed crawl's scheduler
Hi,
Yes. I'm sure that would be a welcome addition. Topic for user@nutch.a.o...
Otis
Re: use hbase as distributed crawl's scheduler
Hi LiLi,
Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL skin on top of HBase. You can model your schema and issue your queries just like you would with MySQL. Something like this:

-- Create a table that optimizes for your most common query
-- (i.e. the PRIMARY KEY constraint should be ordered as you'd want your rows ordered)
CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));

int lastStatus = 0;
int lastPriority = 0;
Date lastAddedTime = new Date(0);
String lastUrl = "";
while (true) {
    // Use a row value constructor to page through results in batches of 1000
    String query = "SELECT * FROM url_db " +
        "WHERE status = 0 " +
        "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
        "ORDER BY status, priority, added_time, url " +
        "LIMIT 1000";
    PreparedStatement stmt = connection.prepareStatement(query);
    // Bind parameters
    stmt.setInt(1, lastStatus);
    stmt.setInt(2, lastPriority);
    stmt.setDate(3, lastAddedTime);
    stmt.setString(4, lastUrl);
    ResultSet resultSet = stmt.executeQuery();
    while (resultSet.next()) {
        // Remember the last row processed so that you can start
        // after it for the next batch
        lastStatus = resultSet.getInt(1);
        lastPriority = resultSet.getInt(2);
        lastAddedTime = resultSet.getDate(3);
        lastUrl = resultSet.getString(4);
        doSomethingWithUrls();
        // Then mark the url as crawled:
        // UPSERT INTO url_db(status, priority, added_time, url)
        //     VALUES (1, ?, CURRENT_DATE(), ?);
    }
}

If you need to efficiently query on url, add a secondary index like this:

CREATE INDEX url_index ON url_db (url);

Please let me know if you have questions.
Thanks,
James
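Editorial aside: the row value constructor in the query above implements keyset pagination; each batch resumes strictly after the last tuple already processed. A self-contained Java sketch of that resume logic over an in-memory sorted list (illustrative names, not Phoenix API):

```java
import java.util.*;

// Hypothetical sketch of the resume logic behind
// "(status, priority, added_time, url) > (?, ?, ?, ?)":
// each batch starts strictly after the last tuple seen,
// so no row is skipped or returned twice.
class Pager {
    static class Row {
        final int priority;
        final long addedTime;
        final String url;
        Row(int priority, long addedTime, String url) {
            this.priority = priority;
            this.addedTime = addedTime;
            this.url = url;
        }
    }

    // Mirrors ORDER BY priority, added_time, url
    static final Comparator<Row> ORDER = Comparator
            .comparingInt((Row r) -> r.priority)
            .thenComparingLong(r -> r.addedTime)
            .thenComparing(r -> r.url);

    // 'sorted' must already be ordered by ORDER; 'last' is the final
    // row of the previous batch (null to start from the beginning).
    static List<Row> nextBatch(List<Row> sorted, Row last, int limit) {
        List<Row> batch = new ArrayList<>();
        for (Row r : sorted) {
            if (last != null && ORDER.compare(r, last) <= 0) continue;
            batch.add(r);
            if (batch.size() == limit) break;
        }
        return batch;
    }
}
```

Unlike OFFSET-based paging, this stays cheap as the table grows, because the database (here, Phoenix over HBase) can seek directly to the resume key instead of re-scanning skipped rows.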
Re: use hbase as distributed crawl's scheduler
thank you. it's great.