Thanks for all the suggestions ...

From my understanding so far, as a primitive first cut I can use HBase for
indexing a client id <version as timestamp> -> log location.

If I have future requirements for more complex queries, I can extend Hive or
Pig over HBase ...

ishwar

On Sat, Oct 3, 2009 at 5:43 AM, Omer Trajman <o...@vertica.com> wrote:
> You might consider loading logs to a parallel database for the ad-hoc queries
> (full disclosure, I work for a database company).
>
> For repeated ad-hoc queries, a distributed database will give you the
> scalability of hdfs and also structure the data to handle fast predicates and
> relational aggregates.
>
> -Omer
>
> -----Original Message-----
> From: Amandeep Khurana <ama...@gmail.com>
> Sent: Saturday, October 03, 2009 04:07
> To: common-user@hadoop.apache.org <common-user@hadoop.apache.org>
> Subject: Re: indexing log files for adhoc queries - suggestions?
>
> Hbase is built on hdfs but just to read records from it, you don't
> need map reduce. So, it's possible to access it in real time. The .20
> release compares to mysql as far as random reads go...
>
> I haven't heard of hive talking to hbase yet. But that'll be a good
> feature to have for sure.
>
> On 10/2/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>> My understanding is that *no* tools built on top of MapReduce (Hive, Pig,
>> Cascading, CloudBase...) can be real-time, where real-time is something that
>> processes the data and produces output in under 5 seconds or so.
>>
>> I believe Hive can read HBase now, too.
>>
>> Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>> ----- Original Message ----
>>> From: Amandeep Khurana <ama...@gmail.com>
>>> To: common-user@hadoop.apache.org
>>> Sent: Saturday, October 3, 2009 1:18:57 AM
>>> Subject: Re: indexing log files for adhoc queries - suggestions?
>>>
>>> There's another option - cascading.
>>>
>>> With pig and cascading you can use hbase as a backend. So that might
>>> be something you can explore too... The choice will depend on what
>>> kind of querying you want to do - real time or batch processed.
>>>
>>> On 10/2/09, Otis Gospodnetic wrote:
>>> > Use Pig or Hive. Lots of overlap, some differences, but it looks like
>>> > both projects' future plans mean even more overlap, though I didn't
>>> > hear any mentions of convergence and merging.
>>> >
>>> > Otis
>>> > --
>>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>> >
>>> > ----- Original Message ----
>>> >> From: Amandeep Khurana
>>> >> To: common-user@hadoop.apache.org
>>> >> Sent: Friday, October 2, 2009 6:28:51 PM
>>> >> Subject: Re: indexing log files for adhoc queries - suggestions?
>>> >>
>>> >> Hive is an sql-like abstraction over map reduce. It just enables you
>>> >> to execute sql-like queries over data without actually having to write
>>> >> the MR job. However, it converts the query into a job at the back.
>>> >>
>>> >> Hbase might be what you are looking for. You can put your logs into
>>> >> hbase and query them as well as run MR jobs over them...
>>> >>
>>> >> On 10/1/09, Mayuran Yogarajah wrote:
>>> >> > ishwar ramani wrote:
>>> >> >> Hi,
>>> >> >>
>>> >> >> I have a setup where logs are periodically bundled up and dumped
>>> >> >> into hadoop dfs as a large sequence file.
>>> >> >>
>>> >> >> It works fine for all my map reduce jobs.
>>> >> >>
>>> >> >> Now I need to handle adhoc queries for pulling out logs based on
>>> >> >> user and time range.
>>> >> >>
>>> >> >> I really don't need a full indexer (like lucene) for this purpose.
>>> >> >>
>>> >> >> My first thought is to run a periodic mapreduce to generate a large
>>> >> >> text file sorted by user id.
>>> >> >>
>>> >> >> The text file will have (sequence file name, offset) to retrieve the
>>> >> >> logs ....
>>> >> >>
>>> >> >> I am guessing many of you ran into similar requirements... Any
>>> >> >> suggestions on doing this better?
>>> >> >>
>>> >> >> ishwar
>>> >> >>
>>> >> > Have you looked into Hive? It's perfect for ad hoc queries..
>>> >> >
>>> >> > M
>>> >> >
>>> >>
>>> >> --
>>> >>
>>> >> Amandeep Khurana
>>> >> Computer Science Graduate Student
>>> >> University of California, Santa Cruz
>>> >
>>>
>>> --
>>>
>>> Amandeep Khurana
>>> Computer Science Graduate Student
>>> University of California, Santa Cruz
>>
>
> --
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
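
For reference, here is a rough sketch of the index idea from the top of this message (client id plus timestamp -> log location). It is plain Python that uses a sorted in-memory list as a stand-in for a sorted key-value table. It does not use the HBase API, and the class name, method names, and sequence file names are all made up for illustration.

```python
import bisect


class LogIndex:
    """In-memory stand-in for a sorted key-value table (hypothetical, not HBase)."""

    def __init__(self):
        self._keys = []    # sorted list of (client_id, timestamp) keys
        self._values = {}  # key -> (sequence_file, offset)

    def put(self, client_id, timestamp, sequence_file, offset):
        """Record where the log entry for (client_id, timestamp) is stored."""
        key = (client_id, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = (sequence_file, offset)

    def scan(self, client_id, start_ts, end_ts):
        """Return log locations for client_id with start_ts <= timestamp < end_ts."""
        lo = bisect.bisect_left(self._keys, (client_id, start_ts))
        hi = bisect.bisect_left(self._keys, (client_id, end_ts))
        return [self._values[k] for k in self._keys[lo:hi]]


# Hypothetical entries: each one points at an offset inside a sequence file.
index = LogIndex()
index.put("user42", 100, "logs-0001.seq", 0)
index.put("user42", 200, "logs-0001.seq", 4096)
index.put("user42", 300, "logs-0002.seq", 128)
index.put("user7", 150, "logs-0001.seq", 2048)

# Pull the locations for one user over a time range.
hits = index.scan("user42", 100, 300)
```

Because the keys sort by client id first and timestamp second, a query by user and time range is one contiguous slice of the sorted keys, so nothing outside that range has to be read. The returned (sequence file name, offset) pairs would then be used to seek into the files on dfs.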