Re: indexing log files for adhoc queries - suggestions?

ishwar ramani Mon, 05 Oct 2009 14:32:55 -0700

Thanks for all the suggestions.

From my understanding so far:

As a primitive first cut, I can use HBase to index
client id <version as timestamp> -> log location

If I later need more complex queries, I can layer
Hive or Pig over HBase ...
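
To make the index layout concrete, here is a minimal sketch of the idea, not actual HBase client code. It assumes an in-memory sorted list stands in for the table, with (client id, timestamp) as the key and (sequence file name, offset) as the log location. The class name LogIndex, its methods, and the file names are all hypothetical.

```python
import bisect
from typing import List, Tuple

# Hypothetical in-memory stand-in for the index table (not the HBase API).
# Keys stay sorted by (client_id, timestamp), so one client's entries are
# contiguous and ordered by time.
IndexKey = Tuple[str, int]
LogLocation = Tuple[str, int]  # (sequence file name, byte offset)


class LogIndex:
    def __init__(self) -> None:
        self._keys: List[IndexKey] = []
        self._locations: List[LogLocation] = []

    def put(self, client_id: str, timestamp: int, location: LogLocation) -> None:
        """Insert or overwrite the log location for (client_id, timestamp)."""
        key = (client_id, timestamp)
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            self._locations[pos] = location
        else:
            self._keys.insert(pos, key)
            self._locations.insert(pos, location)

    def scan(self, client_id: str, start_ts: int, end_ts: int) -> List[LogLocation]:
        """Return locations for client_id with start_ts <= timestamp < end_ts."""
        lo = bisect.bisect_left(self._keys, (client_id, start_ts))
        hi = bisect.bisect_left(self._keys, (client_id, end_ts))
        return self._locations[lo:hi]


if __name__ == "__main__":
    index = LogIndex()
    # Made-up file names and offsets, for illustration only.
    index.put("client42", 1254400000, ("logs-a.seq", 0))
    index.put("client42", 1254403600, ("logs-a.seq", 8192))
    index.put("client7", 1254401000, ("logs-b.seq", 0))
    print(index.scan("client42", 1254400000, 1254410000))
```

A lookup by client and time range then reduces to two binary searches over the sorted keys, which is roughly what a range scan over sorted keys would provide.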

ishwar

On Sat, Oct 3, 2009 at 5:43 AM, Omer Trajman <o...@vertica.com> wrote:
> You might consider loading logs to a parallel database for the ad-hoc queries 
> (full disclosure, I work for a database company).
>
> For repeated ad-hoc queries, a distributed database will give you the 
> scalability of hdfs and also structure the data to handle fast predicates and 
> relational aggregates.
>
> -Omer
>
>
> -----Original Message-----
> From: Amandeep Khurana <ama...@gmail.com>
> Sent: Saturday, October 03, 2009 04:07
> To: common-user@hadoop.apache.org <common-user@hadoop.apache.org>
> Subject: Re: indexing log files for adhoc queries - suggestions?
>
> HBase is built on HDFS, but you don't need MapReduce just to read
> records from it. So it's possible to access it in real time. The 0.20
> release is comparable to MySQL as far as random reads go...
>
> I haven't heard of hive talking to hbase yet. But that'll be a good
> feature to have for sure.
>
> On 10/2/09, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
>> My understanding is that *no* tools built on top of MapReduce (Hive, Pig,
>> Cascading, CloudBase...) can be real-time where real-time is something that
>> processes the data and produces output in under 5 seconds or so.
>>
>> I believe Hive can read HBase now, too.
>>
>> Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> ----- Original Message ----
>>> From: Amandeep Khurana <ama...@gmail.com>
>>> To: common-user@hadoop.apache.org
>>> Sent: Saturday, October 3, 2009 1:18:57 AM
>>> Subject: Re: indexing log files for adhoc queries - suggestions?
>>>
>>> There's another option - cascading.
>>>
>>> With pig and cascading you can use hbase as a backend. So that might
>>> be something you can explore too... The choice will depend on what
>>> kind of querying you want to do - real time or batch processed.
>>>
>>> On 10/2/09, Otis Gospodnetic wrote:
>>> > Use Pig or Hive.  Lots of overlap, some differences, but it looks like
>>> > both projects' future plans mean even more overlap, though I didn't
>>> > hear any mentions of convergence and merging.
>>> >
>>> > Otis
>>> > --
>>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>> >
>>> >
>>> >
>>> > ----- Original Message ----
>>> >> From: Amandeep Khurana
>>> >> To: common-user@hadoop.apache.org
>>> >> Sent: Friday, October 2, 2009 6:28:51 PM
>>> >> Subject: Re: indexing log files for adhoc queries - suggestions?
>>> >>
>>> >> Hive is an SQL-like abstraction over MapReduce. It lets you execute
>>> >> SQL-like queries over data without having to write the MR job
>>> >> yourself. However, it still converts the query into a job behind the scenes.
>>> >>
>>> >> Hbase might be what you are looking for. You can put your logs into
>>> >> hbase and query them as well as run MR jobs over them...
>>> >>
>>> >> On 10/1/09, Mayuran Yogarajah wrote:
>>> >> > ishwar ramani wrote:
>>> >> >> Hi,
>>> >> >>
>>> >> >> I have a setup where logs are periodically bundled up and dumped
>>> >> >> into hadoop dfs as a large sequence file.
>>> >> >>
>>> >> >> It works fine for all my map reduce jobs.
>>> >> >>
>>> >> >> Now I need to handle ad hoc queries for pulling out logs based on
>>> >> >> user and time range.
>>> >> >>
>>> >> >> I really don't need a full indexer (like lucene) for this purpose.
>>> >> >>
>>> >> >> My first thought is to run a periodic mapreduce to generate a large
>>> >> >> text file sorted by user id.
>>> >> >>
>>> >> >> The text file will have (sequence file name, offset) to retrieve the
>>> >> >> logs ...
>>> >> >>
>>> >> >>
>>> >> >> I am guessing many of you ran into similar requirements... Any
>>> >> >> suggestions on doing this better?
>>> >> >>
>>> >> >> ishwar
>>> >> >>
>>> >> > Have you looked into Hive? It's perfect for ad hoc queries.
>>> >> >
>>> >> > M
>>> >> >
>>> >>
>>> >>
>>> >> --
>>> >>
>>> >>
>>> >> Amandeep Khurana
>>> >> Computer Science Graduate Student
>>> >> University of California, Santa Cruz
>>> >
>>> >
>>>
>>>
>>> --
>>>
>>>
>>> Amandeep Khurana
>>> Computer Science Graduate Student
>>> University of California, Santa Cruz
>>
>>
>
>
> --
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
