My understanding is that *no* tools built on top of MapReduce (Hive, Pig,
Cascading, CloudBase...) can be real-time, where "real-time" means processing
the data and producing output in under 5 seconds or so.

I believe Hive can read HBase now, too.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



----- Original Message ----
> From: Amandeep Khurana <ama...@gmail.com>
> To: common-user@hadoop.apache.org
> Sent: Saturday, October 3, 2009 1:18:57 AM
> Subject: Re: indexing log files for adhoc queries - suggestions?
> 
> There's another option - cascading.
> 
> With Pig and Cascading you can use HBase as a backend. So that might
> be something you can explore too... The choice will depend on what
> kind of querying you want to do - real-time or batch-processed.
> 
> On 10/2/09, Otis Gospodnetic wrote:
> > Use Pig or Hive.  Lots of overlap, some differences, but it looks like both
> > projects' future plans mean even more overlap, though I didn't hear any
> > mentions of convergence and merging.
> >
> > Otis
> > --
> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >
> >
> >
> > ----- Original Message ----
> >> From: Amandeep Khurana 
> >> To: common-user@hadoop.apache.org
> >> Sent: Friday, October 2, 2009 6:28:51 PM
> >> Subject: Re: indexing log files for adhoc queries - suggestions?
> >>
> >> Hive is an SQL-like abstraction over MapReduce. It lets you
> >> execute SQL-like queries over data without actually having to write
> >> the MR job yourself; under the hood, it converts the query into a job.
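[As a rough illustration of the point above: a toy Python sketch of the map/shuffle/reduce steps behind a "SELECT user, COUNT(*) ... GROUP BY user" style query. This is nothing like Hive's actual planner, and the records are invented.]

```python
from collections import defaultdict

# Invented toy log records, purely for illustration.
records = [
    {"user": "alice", "action": "login"},
    {"user": "bob",   "action": "login"},
    {"user": "alice", "action": "logout"},
]

# Map phase: emit (user, 1) for every record.
mapped = [(r["user"], 1) for r in records]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: sum the counts per key.
counts = {key: sum(vals) for key, vals in groups.items()}
```

Conceptually, Hive compiles the SQL-like query into jobs shaped like this and runs them on the cluster.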
> >>
> >> HBase might be what you are looking for. You can put your logs into
> >> HBase and query them as well as run MR jobs over them...
> >>
> >> On 10/1/09, Mayuran Yogarajah wrote:
> >> > ishwar ramani wrote:
> >> >> Hi,
> >> >>
> >> >> I have a setup where logs are periodically bundled up and dumped into
> >> >> Hadoop DFS as a large sequence file.
> >> >>
> >> >> It works fine for all my MapReduce jobs.
> >> >>
> >> >> Now I need to handle ad hoc queries for pulling out logs based on user
> >> >> and time range.
> >> >>
> >> >> I really don't need a full indexer (like Lucene) for this purpose.
> >> >>
> >> >> My first thought is to run a periodic MapReduce job to generate a large
> >> >> text file sorted by user id.
> >> >>
> >> >> The text file will have (sequence file name, offset) pairs to retrieve
> >> >> the logs
> >> >> ....
> >> >>
> >> >>
> >> >> I am guessing many of you ran into similar requirements... Any
> >> >> suggestions on doing this better?
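[For what it's worth, the sorted-index idea above can be sketched in miniature like this. Python, purely illustrative: the file names and offsets are made up, and a real index would be a sorted file in HDFS rather than an in-memory list.]

```python
import bisect

# Toy stand-in for the periodic MapReduce output: a list of
# (user_id, sequence_file, offset) tuples sorted by user_id.
index = sorted([
    ("alice", "logs-2009-10-01.seq", 1024),
    ("bob",   "logs-2009-10-01.seq", 4096),
    ("alice", "logs-2009-10-02.seq", 512),
    ("carol", "logs-2009-10-02.seq", 2048),
])

def lookup(user):
    """Return all (sequence_file, offset) pointers for one user,
    found by binary search over the sorted index."""
    lo = bisect.bisect_left(index, (user,))
    # "\x00" appended to the user id gives the smallest key that
    # sorts strictly after every entry for this user.
    hi = bisect.bisect_left(index, (user + "\x00",))
    return [(f, off) for _, f, off in index[lo:hi]]
```

The returned (file, offset) pairs would then be used to seek directly into the sequence files and pull out the matching log entries.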
> >> >>
> >> >> ishwar
> >> >>
> >> > Have you looked into Hive? It's perfect for ad hoc queries.
> >> >
> >> > M
> >> >
> >>
> >>
> >> --
> >>
> >>
> >> Amandeep Khurana
> >> Computer Science Graduate Student
> >> University of California, Santa Cruz
> >
> >
> 
> 
> -- 
> 
> 
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
