Hi Jigar,

Take a look at Apache Phoenix: http://phoenix.incubator.apache.org/

It allows you to use SQL to query over your HBase data and supports composite primary keys, so you could create a schema like this:

    create table news_message (
        guid varchar not null,
        version bigint not null,
        constraint pk primary key (guid, version desc)
    );

The rows will then sort by guid, then by version descending. You can then issue SQL queries directly against your HBase data without writing map/reduce jobs. Note that we don't yet support all the SQL constructs that Postgres does.

HTH,
James

On Sun, Mar 2, 2014 at 11:23 PM, Jigar Shah <jigar.s...@infodesk.com> wrote:
> I am working in the news processing industry. The current system processes
> more than a million articles per week and provides this data in real time
> to users; additionally, it provides search capabilities via Lucene.
>
> We convert all news to the standard IPTC NewsML-G2 format
> <http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/>
> before providing it to users (in real time or via search).
>
> We have a requirement for a component that provides analytical queries on
> news data. I plan to load all this data into HBase and then have Map-Reduce
> jobs compute the analytical queries. Moreover, the current system is built
> on PostgreSQL and stores only 3 months of data; anything more than this is
> big data, as it doesn't fit on one server.
>
> But I am a bit confused about developing a schema for it.
>
> Every news article has
>
> *"messageID" as guid*, a unique id for the news message.
> *"version" as int*, incremented if a newer version of the same news message
> is published.
> There are other fields like location, channels, title, content, source,
> etc.
>
> The current database primary key is a composite of (messageID & version).
>
> I thought that I should use "messageID" as "rowKey" in HBase, with
> "version" as "columnFamily", and all columns would be fields of the news
> (like location, channels, title, body, sentTimstamp, ...).
>
> Is keeping "version" as "columnFamily" a good idea?
>
> In reality, a "single message may have thousands of versions".
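
For illustration, the ordering described in the reply (rows sorted by guid, then by version descending) can be mimicked in a few lines of Python. This is only a rough sketch of the general idea of getting a descending sort out of a store that compares keys byte by byte, as HBase does. The `row_key` helper, the zero-byte separator, and the sample guids are all hypothetical, and this is not Phoenix's actual on-disk encoding.

```python
import struct

def row_key(guid: str, version: int) -> bytes:
    """Hypothetical composite key: guid, a zero-byte separator, then the
    version as an inverted 8-byte big-endian integer, so that higher
    versions compare as smaller byte strings.

    Assumes 0 <= version < 2**64 and that guid contains no zero byte.
    """
    inverted = 0xFFFFFFFFFFFFFFFF - version
    return guid.encode("utf-8") + b"\x00" + struct.pack(">Q", inverted)

# Made-up (guid, version) pairs, deliberately out of order.
rows = [("msg-a", 1), ("msg-b", 2), ("msg-a", 3), ("msg-a", 2), ("msg-b", 1)]

# Sorting by the raw key bytes mimics a byte-ordered scan.
ordered = sorted(rows, key=lambda r: row_key(*r))

for guid, version in ordered:
    print(guid, version)
```

With keys laid out like this, all versions of one message sit next to each other and the newest version of each guid comes first, so reading the latest version would only need the first row for that guid.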