Re: use hbase as distributed crawl's scheduler
Please take a look at our Apache incubator proposal, as I think that may answer your questions: https://wiki.apache.org/incubator/PhoenixProposal

On Fri, Jan 3, 2014 at 11:47 PM, Li Li fancye...@gmail.com wrote:
> so what's the relationship of Phoenix and HBase? something like hadoop
> and hive?
Re: use hbase as distributed crawl's scheduler
Sure, no problem.

One addition: depending on the cardinality of your priority column, you may want to salt your table to prevent hotspotting, since you'll have a monotonically increasing date in the key. To do that, just add SALT_BUCKETS=n to your CREATE TABLE statement, where n is the number of machines in your cluster. You can read more about salting here: http://phoenix.incubator.apache.org/salted.html

On Thu, Jan 2, 2014 at 11:36 PM, Li Li fancye...@gmail.com wrote:
> thank you. it's great.

On Fri, Jan 3, 2014 at 3:15 PM, James Taylor jtay...@salesforce.com wrote:
> Hi LiLi,
> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> skin on top of HBase. You can model your schema and issue your queries
> just like you would with MySQL. Something like this:
>
>     -- Create a table that optimizes for your most common query
>     -- (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>     -- your rows ordered)
>     CREATE TABLE url_db (
>         status TINYINT,
>         priority INTEGER NOT NULL,
>         added_time DATE,
>         url VARCHAR NOT NULL
>         CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>
>     int lastStatus = 0;
>     int lastPriority = 0;
>     Date lastAddedTime = new Date(0);
>     String lastUrl = "";
>     while (true) {
>         // Use a row value constructor to page through results in
>         // batches of 1000
>         String query =
>             "SELECT * FROM url_db WHERE status = 0 " +
>             "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>             "ORDER BY status, priority, added_time, url LIMIT 1000";
>         PreparedStatement stmt = connection.prepareStatement(query);
>         // Bind parameters to the last row processed
>         stmt.setInt(1, lastStatus);
>         stmt.setInt(2, lastPriority);
>         stmt.setDate(3, lastAddedTime);
>         stmt.setString(4, lastUrl);
>         ResultSet resultSet = stmt.executeQuery();
>         while (resultSet.next()) {
>             // Remember the last row processed so that you can start
>             // after it for the next batch
>             lastStatus = resultSet.getInt(1);
>             lastPriority = resultSet.getInt(2);
>             lastAddedTime = resultSet.getDate(3);
>             lastUrl = resultSet.getString(4);
>             doSomethingWithUrls();
>             // Then mark the url as crawled:
>             //   UPSERT INTO url_db(status, priority, added_time, url)
>             //   VALUES (1, ?, CURRENT_DATE(), ?);
>         }
>     }
>
> If you need to efficiently query on url, add a secondary index like this:
>
>     CREATE INDEX url_index ON url_db (url);
>
> Please let me know if you have questions.
> Thanks,
> James

On Thu, Jan 2, 2014 at 10:22 PM, Li Li fancye...@gmail.com wrote:
> thank you. But I can't use nutch. could you tell me how hbase is used in
> nutch? or is hbase only used to store webpages?

On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
> Hi,
> Have a look at http://nutch.apache.org . Version 2.x uses HBase under the
> hood.
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:
> hi all,
> I want to use hbase to store all urls (crawled or not crawled). Each url
> will have a column named priority which represents the priority of the
> url. I want to get the top N urls ordered by priority (if priority is the
> same, the url whose timestamp is earlier is preferred). Using something
> like mysql, my client application would look like:
>
>     while true:
>         select url from url_db where status = 'not_crawled'
>             order by priority, addedTime limit 1000;
>         do something with these urls;
>         extract more urls and insert them into url_db;
>
> How should I design an hbase schema for this application? Is hbase
> suitable for me? I found in this article
> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
> that they use redis to store urls. I think hbase originated from
> bigtable, and google uses bigtable to store webpages, so for a huge
> number of urls I prefer a distributed system like hbase.
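The row value constructor in James's query compares the whole (status, priority, added_time, url) tuple lexicographically against the last row seen, which is what makes the paging resumable without re-reading earlier rows. Here is a plain-Java sketch of that ordering (the class, field, and comparator names are mine for illustration, not Phoenix's):

```java
import java.util.Comparator;

public class RowKeyOrder {
    // Hypothetical holder for the four primary-key columns of url_db.
    static final class Row {
        final int status, priority;
        final long addedTime; // epoch millis stands in for DATE
        final String url;
        Row(int s, int p, long t, String u) {
            status = s; priority = p; addedTime = t; url = u;
        }
    }

    // Lexicographic order mirroring (status, priority, added_time, url) > (?, ?, ?, ?)
    static final Comparator<Row> PK_ORDER = Comparator
            .comparingInt((Row r) -> r.status)
            .thenComparingInt(r -> r.priority)
            .thenComparingLong(r -> r.addedTime)
            .thenComparing(r -> r.url);

    public static void main(String[] args) {
        Row last = new Row(0, 5, 1000L, "http://a.example");
        Row next = new Row(0, 5, 1000L, "http://b.example");   // same prefix, later url
        Row earlier = new Row(0, 4, 2000L, "http://z.example"); // lower priority sorts first
        System.out.println(PK_ORDER.compare(next, last) > 0);    // true: next batch starts here
        System.out.println(PK_ORDER.compare(earlier, last) < 0); // true: already paged past
    }
}
```

Because the comparison is on the full tuple rather than on added_time alone, rows that tie on priority and date are still paged through deterministically by url.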
Re: use hbase as distributed crawl's scheduler
Interesting. This is exactly what I'm doing ;)

I'm using 3 tables to achieve this: one table with the URLs already crawled (80 millions), one URL with the URLs to crawl (2 billions), and one URL with the URLs being processed. I'm not running any SQL requests against my dataset, but I have MR jobs doing many different things, and I have many other tables to help with the work on the URLs.

I'm salting the keys using the URL hash so I can find them back very quickly. There can be some collisions, so I also store the URL itself in the key. So very small scans returning 1 or sometimes 2 rows allow me to quickly find a row knowing the URL. I also have secondary index tables that store the CRCs of the pages to identify duplicate pages based on this value. And so on ;)

I've been working on that for 2 years now. I might have been able to use Nutch and others, but my goal was to learn and to do that with a distributed client on a single dataset...

Enjoy.

JM

2014/1/3 James Taylor jtay...@salesforce.com
> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. ...
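JM's key scheme, prefixing the row key with a hash of the URL and keeping the full URL in the key to disambiguate collisions, can be sketched like this. The bucket count and the choice of hash function are my assumptions for illustration; his actual implementation isn't shown in the thread:

```java
import java.nio.charset.StandardCharsets;

public class SaltedUrlKey {
    static final int BUCKETS = 16; // assumed bucket count; the thread doesn't state one

    // Build a row key: [1-byte hash prefix][url bytes]. The prefix spreads
    // writes across regions; keeping the full URL in the key keeps it unique
    // even when two URLs hash to the same bucket.
    static byte[] rowKey(String url) {
        byte prefix = (byte) Math.floorMod(url.hashCode(), BUCKETS);
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[1 + urlBytes.length];
        key[0] = prefix;
        System.arraycopy(urlBytes, 0, key, 1, urlBytes.length);
        return key;
    }

    public static void main(String[] args) {
        // A point lookup recomputes the same prefix, so a Get (or a tiny scan
        // returning 1, sometimes 2 rows) finds the row directly.
        byte[] key = rowKey("http://example.com/page");
        System.out.println(key.length);
    }
}
```

The lookup path is the point of the design: given only a URL, the client can rebuild the exact key prefix and avoid any full-table scan.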
Re: use hbase as distributed crawl's scheduler
bq. One URL ...

I guess you mean one table ...

Cheers

On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
> Interesting. This is exactly what I'm doing ;) I'm using 3 tables to
> achieve this. One table with the URL already crawled (80 millions), one
> URL with the URL to crawle (2 billions) and one URL with the URLs been
> processed. ...
Re: use hbase as distributed crawl's scheduler
Yes, sorry ;) Thanks for the correction. Should have been: one table with the URLs already crawled (80 millions), one table with the URLs to crawl (2 billions), and one table with the URLs being processed. I'm not running any SQL requests against my dataset, but I have MR jobs doing many different things, and I have many other tables to help with the work on the URLs.

2014/1/3 Ted Yu yuzhih...@gmail.com
> bq. One URL ...
>
> I guess you mean one table ...
>
> Cheers
Re: use hbase as distributed crawl's scheduler
Couple of notes:

1. When updating the status you essentially add a new rowkey into HBase, so I would give it up altogether. The essential requirement seems to point at retrieving a list of urls in a certain order.

2. Wouldn't salting ruin the required sort order (priority, date added)?

On Friday, January 3, 2014, James Taylor wrote:
> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. ...
Re: use hbase as distributed crawl's scheduler
On Fri, Jan 3, 2014 at 10:50 AM, Asaf Mesika asaf.mes...@gmail.com wrote:
> Couple of notes:
> 1. When updating to status you essentially add a new rowkey into HBase, I
> would give it up all together. The essential requirement seems to point
> at retrieving a list of urls in a certain order.

Not sure on this, but it seemed to me that setting the status field is forcing the urls that have already been processed to the end of the sort order.

> 2. Wouldn't salting ruin the sort order required? Priority, date added?

No, as Phoenix maintains returning rows in row key order even when they're salted. We do parallel scans for each bucket and a merge sort on the client, so the cost is pretty low for this (we also provide a way of turning this off if your use case doesn't need it).

Two years, JM? Now you're really going to have to start using Phoenix :-)
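The client-side merge James describes, one sorted scan per salt bucket merged back into overall row-key order, can be sketched with a priority queue. This illustrates the idea only; it is not Phoenix's actual implementation:

```java
import java.util.*;

public class BucketMergeSort {
    // Merge pre-sorted per-bucket iterators into one globally sorted stream,
    // the way a client can recombine parallel scans of salted buckets.
    static <T extends Comparable<T>> List<T> merge(List<Iterator<T>> buckets) {
        // Each queue entry pairs the next value with the iterator it came from.
        PriorityQueue<Map.Entry<T, Iterator<T>>> pq =
            new PriorityQueue<>(Comparator.comparing((Map.Entry<T, Iterator<T>> e) -> e.getKey()));
        for (Iterator<T> it : buckets) {
            if (it.hasNext()) pq.add(new AbstractMap.SimpleEntry<>(it.next(), it));
        }
        List<T> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            Map.Entry<T, Iterator<T>> e = pq.poll();
            out.add(e.getKey());
            // Refill from the bucket we just consumed, if it has more rows.
            if (e.getValue().hasNext())
                pq.add(new AbstractMap.SimpleEntry<>(e.getValue().next(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Each "bucket" is already sorted, as a per-region scan would be.
        List<Iterator<String>> scans = Arrays.asList(
            Arrays.asList("a", "d", "g").iterator(),
            Arrays.asList("b", "e").iterator(),
            Arrays.asList("c", "f").iterator());
        System.out.println(merge(scans)); // [a, b, c, d, e, f, g]
    }
}
```

The queue only ever holds one row per bucket, which is why the per-row cost of restoring global order over n buckets stays low (O(log n) per row).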
Re: use hbase as distributed crawl's scheduler
hi James,
phoenix seems great, but it's now only an experimental project. I want to use only hbase. Could you tell me the difference between Phoenix and hbase? If I use hbase only, how should I design the schema and some extra things for my goal? thank you

On Sat, Jan 4, 2014 at 3:41 AM, James Taylor jtay...@salesforce.com wrote:
> No, as Phoenix maintains returning rows in row key order even when
> they're salted. We do parallel scans for each bucket and a merge sort on
> the client, so the cost is pretty low for this (we also provide a way of
> turning this off if your use case doesn't need it). ...
Re: use hbase as distributed crawl's scheduler
Hi LiLi,
Phoenix isn't an experimental project. We're on our 2.2 release, and many companies (including the company for which I'm employed, Salesforce.com) use it in production today.
Thanks,
James

On Fri, Jan 3, 2014 at 11:39 PM, Li Li fancye...@gmail.com wrote:
> hi James, phoenix seems great but it's now only an experimental project.
> I want to use only hbase. could you tell me the difference between
> Phoenix and hbase? If I use hbase only, how should I design the schema
> and some extra things for my goal? thank you
Re: use hbase as distributed crawl's scheduler
So what's the relationship between Phoenix and HBase? Something like Hadoop and Hive?
Re: use hbase as distributed crawl's scheduler
Hi,
Have a look at http://nutch.apache.org . Version 2.x uses HBase under the hood.
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Fri, Jan 3, 2014 at 1:12 AM, Li Li fancye...@gmail.com wrote:
hi all,
I want to use HBase to store all urls (crawled or not crawled). Each url will have a column named priority, representing the priority of the url. I want to get the top N urls ordered by priority (if priorities are equal, the url whose timestamp is earlier is preferred). With something like MySQL, my client application might look like:

while true:
    select url from url_db where status='not_crawled' order by priority, addedTime limit 1000;
    do something with these urls;
    extract more urls and insert them into url_db;

How should I design an HBase schema for this application? Is HBase suitable for me? I found that in this article http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ they use redis to store urls. But HBase originated from Bigtable, and Google uses Bigtable to store webpages, so for a huge number of urls I prefer a distributed system like HBase.
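As an editorial aside, the polling loop Li Li describes above boils down to ordering by a composite key. A minimal Java sketch of that ordering (names here are illustrative, not from Nutch, Phoenix, or HBase):

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch: the composite ordering Li Li's MySQL-style
// query asks for, modeled in plain Java. A lower priority value is
// fetched first, matching the ascending ORDER BY.
class Frontier {
    static final int NOT_CRAWLED = 0;

    static class Entry {
        final int status;
        final int priority;
        final long addedTime;
        final String url;
        Entry(int status, int priority, long addedTime, String url) {
            this.status = status;
            this.priority = priority;
            this.addedTime = addedTime;
            this.url = url;
        }
    }

    // Mirrors ORDER BY status, priority, addedTime, url
    static final Comparator<Entry> ORDER = Comparator
            .comparingInt((Entry e) -> e.status)
            .thenComparingInt(e -> e.priority)
            .thenComparingLong(e -> e.addedTime)
            .thenComparing(e -> e.url);

    // SELECT url FROM url_db WHERE status = 'not_crawled'
    // ORDER BY priority, addedTime LIMIT n
    static List<String> topN(List<Entry> db, int n) {
        return db.stream()
                .filter(e -> e.status == NOT_CRAWLED)
                .sorted(ORDER)
                .limit(n)
                .map(e -> e.url)
                .collect(Collectors.toList());
    }
}
```

The same comparator order is what an HBase row key (or a Phoenix PRIMARY KEY) would encode physically, which is why the later schema suggestions in this thread put status, priority, and time first in the key.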
Re: use hbase as distributed crawl's scheduler
thank you. But I can't use nutch. Could you tell me how HBase is used in nutch? Or is HBase only used to store webpages?
Re: use hbase as distributed crawl's scheduler
Otis, I didn't realize Nutch uses HBase underneath. It might be interesting if you serialized the data in a Phoenix-compliant manner, as you could then run SQL queries directly on top of it.
Thanks,
James
Re: use hbase as distributed crawl's scheduler
Hi,
Yes. I'm sure that would be a welcome addition. Topic for user@nutch.a.o...
Otis
Re: use hbase as distributed crawl's scheduler
Hi LiLi,
Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL skin on top of HBase. You can model your schema and issue your queries just like you would with MySQL. Something like this:

-- Create a table that optimizes for your most common query
-- (i.e. the PRIMARY KEY constraint should be ordered as you'd want your rows ordered)
CREATE TABLE url_db (
    status TINYINT,
    priority INTEGER NOT NULL,
    added_time DATE,
    url VARCHAR NOT NULL
    CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));

int lastStatus = 0;
int lastPriority = 0;
Date lastAddedTime = new Date(0);
String lastUrl = "";
while (true) {
    // Use a row value constructor to page through results in batches of 1000
    String query = "SELECT * FROM url_db " +
        "WHERE status = 0 " +
        "AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
        "ORDER BY status, priority, added_time, url " +
        "LIMIT 1000";
    PreparedStatement stmt = connection.prepareStatement(query);
    // Bind parameters
    stmt.setInt(1, lastStatus);
    stmt.setInt(2, lastPriority);
    stmt.setDate(3, lastAddedTime);
    stmt.setString(4, lastUrl);
    ResultSet resultSet = stmt.executeQuery();
    while (resultSet.next()) {
        // Remember the last row processed so that you can start
        // after it for the next batch
        lastStatus = resultSet.getInt(1);
        lastPriority = resultSet.getInt(2);
        lastAddedTime = resultSet.getDate(3);
        lastUrl = resultSet.getString(4);
        doSomethingWithUrls();
        // Then mark the url as crawled:
        // UPSERT INTO url_db(status, priority, added_time, url)
        //     VALUES (1, ?, CURRENT_DATE(), ?);
    }
}

If you need to efficiently query on url, add a secondary index like this:

CREATE INDEX url_index ON url_db (url);

Please let me know if you have questions.
Thanks,
James
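Editorial aside: the row value constructor in the query above implements keyset pagination; each batch resumes strictly after the last tuple already processed. A self-contained Java sketch of that resume logic over an in-memory sorted list (illustrative names, not Phoenix API):

```java
import java.util.*;

// Hypothetical sketch of the resume logic behind
// "(status, priority, added_time, url) > (?, ?, ?, ?)":
// each batch starts strictly after the last tuple seen,
// so no row is skipped or returned twice.
class Pager {
    static class Row {
        final int priority;
        final long addedTime;
        final String url;
        Row(int priority, long addedTime, String url) {
            this.priority = priority;
            this.addedTime = addedTime;
            this.url = url;
        }
    }

    // Mirrors ORDER BY priority, added_time, url
    static final Comparator<Row> ORDER = Comparator
            .comparingInt((Row r) -> r.priority)
            .thenComparingLong(r -> r.addedTime)
            .thenComparing(r -> r.url);

    // 'sorted' must already be ordered by ORDER; 'last' is the final
    // row of the previous batch (null to start from the beginning).
    static List<Row> nextBatch(List<Row> sorted, Row last, int limit) {
        List<Row> batch = new ArrayList<>();
        for (Row r : sorted) {
            if (last != null && ORDER.compare(r, last) <= 0) continue;
            batch.add(r);
            if (batch.size() == limit) break;
        }
        return batch;
    }
}
```

Unlike OFFSET-based paging, this stays cheap as the table grows, because the database (here, Phoenix over HBase) can seek directly to the resume key instead of re-scanning skipped rows.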
Re: use hbase as distributed crawl's scheduler
thank you. it's great.