bq. One URL ... I guess you mean one table ...
Cheers

On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Interesting. This is exactly what I'm doing ;)
>
> I'm using 3 tables to achieve this.
>
> One table with the URL already crawled (80 million), one URL with the URL
> to crawl (2 billion) and one URL with the URLs being processed. I'm not
> running any SQL requests against my dataset but I have MR jobs doing many
> different things. I have many other tables to help with the work on the
> URLs.
>
> I'm "salting" the keys using the URL hash so I can find them back very
> quickly. There can be some collisions so I also store the URL itself in the
> key. So very small scans returning 1 or sometimes 2 rows allow me to
> quickly find a row knowing the URL.
>
> I also have secondary index tables to store the CRCs of the pages to
> identify duplicate pages based on this value.
>
> And so on ;) Working on that for 2 years now. I might have been able to use
> Nutch and others, but my goal was to learn and do that with a distributed
> client on a single dataset...
>
> Enjoy.
>
> JM
>
>
> 2014/1/3 James Taylor <jtay...@salesforce.com>
>
>> Sure, no problem. One addition: depending on the cardinality of your
>> priority column, you may want to salt your table to prevent hotspotting,
>> since you'll have a monotonically increasing date in the key. To do that,
>> just add " SALT_BUCKETS=<n>" on to your CREATE TABLE statement, where <n>
>> is the number of machines in your cluster. You can read more about salting
>> here: http://phoenix.incubator.apache.org/salted.html
>>
>>
>> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fancye...@gmail.com> wrote:
>>
>>> thank you. it's great.
>>>
>>> On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jtay...@salesforce.com>
>>> wrote:
>>>> Hi LiLi,
>>>> Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
>>>> skin on top of HBase. You can model your schema and issue your queries just
>>>> like you would with MySQL.
>>>> Something like this:
>>>>
>>>> // Create table that optimizes for your most common query
>>>> // (i.e. the PRIMARY KEY constraint should be ordered as you'd want your
>>>> // rows ordered)
>>>> CREATE TABLE url_db (
>>>>     status TINYINT,
>>>>     priority INTEGER NOT NULL,
>>>>     added_time DATE,
>>>>     url VARCHAR NOT NULL
>>>>     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>>>>
>>>> int lastStatus = 0;
>>>> int lastPriority = 0;
>>>> Date lastAddedTime = new Date(0);  // java.sql.Date
>>>> String lastUrl = "";
>>>>
>>>> while (true) {
>>>>     // Use row value constructor to page through results in batches of 1000
>>>>     String query =
>>>>         "SELECT * FROM url_db " +
>>>>         "WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?) " +
>>>>         "ORDER BY status, priority, added_time, url " +
>>>>         "LIMIT 1000";
>>>>     PreparedStatement stmt = connection.prepareStatement(query);
>>>>
>>>>     // Bind parameters
>>>>     stmt.setInt(1, lastStatus);
>>>>     stmt.setInt(2, lastPriority);
>>>>     stmt.setDate(3, lastAddedTime);
>>>>     stmt.setString(4, lastUrl);
>>>>     ResultSet resultSet = stmt.executeQuery();
>>>>
>>>>     while (resultSet.next()) {
>>>>         // Remember last row processed so that you can start after that for
>>>>         // next batch
>>>>         lastStatus = resultSet.getInt(1);
>>>>         lastPriority = resultSet.getInt(2);
>>>>         lastAddedTime = resultSet.getDate(3);
>>>>         lastUrl = resultSet.getString(4);
>>>>
>>>>         doSomethingWithUrls();
>>>>
>>>>         // Then run an upsert such as:
>>>>         //   UPSERT INTO url_db(status, priority, added_time, url)
>>>>         //   VALUES (1, ?, CURRENT_DATE(), ?);
>>>>     }
>>>> }
>>>>
>>>> If you need to efficiently query on url, add a secondary index like this:
>>>>
>>>> CREATE INDEX url_index ON url_db (url);
>>>>
>>>> Please let me know if you have questions.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>>
>>>> On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fancye...@gmail.com> wrote:
>>>>
>>>>> thank you. But I can't use nutch. could you tell me how hbase is used
>>>>> in nutch? or is hbase only used to store webpages?
>>>>>
>>>>> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>>>>> <otis.gospodne...@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Have a look at http://nutch.apache.org . Version 2.x uses HBase under the
>>>>>> hood.
>>>>>>
>>>>>> Otis
>>>>>> --
>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>>
>>>>>>> hi all,
>>>>>>> I want to use hbase to store all urls (crawled or not crawled).
>>>>>>> Each url will have a column named priority which represents the
>>>>>>> priority of the url. I want to get the top N urls ordered by priority (if
>>>>>>> the priority is the same then the url whose timestamp is earlier is
>>>>>>> preferred). Using something like mysql, my client application may look like:
>>>>>>>     while true:
>>>>>>>         select url from url_db where status='not_crawled'
>>>>>>>             order by priority, addedTime limit 1000;
>>>>>>>         do something with these urls;
>>>>>>>         extract more urls and insert them into url_db;
>>>>>>> How should I design the hbase schema for this application? Is hbase
>>>>>>> suitable for me?
>>>>>>> I found in this article
>>>>>>> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>>>>>>> that they use redis to store urls. I think hbase originated from
>>>>>>> bigtable and google uses bigtable to store webpages, so for a huge number
>>>>>>> of urls, I prefer a distributed system like hbase.
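
For readers following the thread, the key layout Jean-Marc describes above (a prefix taken from the URL hash, with the URL itself kept in the key so that hash collisions still yield distinct rows) could be sketched roughly as follows. This is only an illustration under stated assumptions: the thread names neither the hash function nor the prefix width, so MD5 and a 4-byte prefix are guesses, and the class and method names (SaltedUrlKey, hashPrefix, rowKey) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not code from the thread.
final class SaltedUrlKey {
    // Assumption: 4 hash bytes; the thread does not state a prefix width.
    static final int PREFIX_LEN = 4;

    // First PREFIX_LEN bytes of the MD5 digest of the URL (MD5 is an assumption).
    static byte[] hashPrefix(String url) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes(StandardCharsets.UTF_8));
            return Arrays.copyOf(digest, PREFIX_LEN);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Row key = hash prefix followed by the URL bytes, so two URLs that
    // share a prefix still map to different keys.
    static byte[] rowKey(String url) {
        byte[] prefix = hashPrefix(url);
        byte[] urlBytes = url.getBytes(StandardCharsets.UTF_8);
        byte[] key = Arrays.copyOf(prefix, prefix.length + urlBytes.length);
        System.arraycopy(urlBytes, 0, key, prefix.length, urlBytes.length);
        return key;
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "http://nutch.apache.org";
        StringBuilder hex = new StringBuilder();
        for (byte b : hashPrefix(url)) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("prefix=" + hex + " keyLength=" + rowKey(url).length);
    }
}
```

With keys built this way, a lookup by URL can be a short scan starting at hashPrefix(url), which matches the "very small scans returning 1 or sometimes 2 rows" described above, and the hash prefix spreads otherwise similar URLs across the key space.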