This is an interesting problem because native Hive does not support
window functions (or any of the other so-called analytic functions),
and yet this seems like a very valuable use case for the
platform.

If I understand the OP's problem correctly, the central question is
how to do this at scale. I believe it should be possible to write a
Map/Reduce job for it, but it could be tricky to get all the interval
arithmetic exactly right.  I think it would be worthwhile if someone in
the community developed a library, or at least a collection of M/R
patterns, that handles this sort of thing.
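
To make the interval arithmetic concrete, here is a minimal sketch in
plain Python (not actual Hadoop code) of the two-phase approach Robert
describes in the quoted message below. It rests on several assumptions
of mine: each partition is a non-empty, time-sorted list of
(timestamp in seconds, value) samples, partitions are numbered in time
order, and the length of a run is the time between its first and last
above-threshold sample. The names map_partition, reduce_edges and
find_long_runs are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]   # (timestamp in seconds, value)
Run = Tuple[float, float]      # (first_ts, last_ts) of an above-threshold run
# (partition index, first_ts, last_ts, touches_start, touches_end)
EdgeRun = Tuple[int, float, float, bool, bool]


def map_partition(index: int, samples: List[Sample], threshold: float,
                  min_secs: float) -> Tuple[List[Run], List[EdgeRun]]:
    """Mapper side: decide interior runs locally, pass edge runs on."""
    complete: List[Run] = []
    edges: List[EdgeRun] = []
    start = last = None
    start_i = 0
    n = len(samples)
    for i, (ts, value) in enumerate(samples):
        if value > threshold:
            if start is None:
                start, start_i = ts, i
            last = ts
        # A run closes when the value drops or the partition ends.
        if start is not None and (value <= threshold or i == n - 1):
            touches_start = start_i == 0
            touches_end = value > threshold  # true only on the last sample
            if touches_start or touches_end:
                edges.append((index, start, last, touches_start, touches_end))
            elif last - start >= min_secs:
                complete.append((start, last))
            start = None
    return complete, edges


def reduce_edges(edges: List[EdgeRun], min_secs: float) -> List[Run]:
    """Single reducer: stitch edge runs of adjacent partitions together."""
    merged: List[Run] = []
    open_run = None  # (first_ts, last_ts, partition whose end it touches)
    for index, first, last, at_start, at_end in sorted(edges):
        if open_run is not None and at_start and open_run[2] == index - 1:
            first = open_run[0]            # continues the previous partition's run
        elif open_run is not None:
            merged.append(open_run[:2])    # previous run did not continue
        open_run = (first, last, index)
        if not at_end:
            merged.append((first, last))   # run ended inside this partition
            open_run = None
    if open_run is not None:
        merged.append(open_run[:2])
    return [(s, e) for s, e in merged if e - s >= min_secs]


def find_long_runs(partitions: List[List[Sample]], threshold: float,
                   min_secs: float) -> List[Run]:
    """Local stand-in for the whole job: map every partition, reduce once."""
    complete: List[Run] = []
    edges: List[EdgeRun] = []
    for index, samples in enumerate(partitions):
        done, pending = map_partition(index, samples, threshold, min_secs)
        complete.extend(done)
        edges.extend(pending)
    return sorted(complete + reduce_edges(edges, min_secs))
```

Only the edge runs have to meet in one reducer; the runs that a mapper
can decide on its own could be written out under any key. If the real
data has gaps, or a different notion of "duration" (say, up to the
first sample back below the threshold), the comparisons above would
need adjusting.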



On 03/29/2012 10:09 AM, Robert Evans wrote:
> I am not aware of anyone that does this for you directly, but it should not 
> be too difficult for you to write what you want using pig or hive.  I am not 
> as familiar with Jaql but I assume that you can do it there too.  Although it 
> might be simpler to write it using Map/Reduce because we can abuse Map/Reduce 
> in ways that the higher level languages disallow so that they can do 
> optimizations.
>
> What I would do is in the mapper scan through each entry and look for 
> transitions of $value around $threshold, and the time that they occurred.  
> You can then look for 30+ second windows where $value > $threshold within 
> that partition and output them to the reducer.  The trick with this is that 
> you need to pay special attention to the beginning and end of the partition.  
> You need to also send to the reducer the state at the beginning and end of 
> each partition and how long it was in that state.  The reducer can then 
> combine these pieces together and see if they meet the 30+ second criteria. 
> If so output them with the rest, otherwise don't.  The known times when it
> is > 30 seconds can be sent to any reducer, so they can have any key, but for 
> the transitions to work correctly you need to send them to a single reducer, 
> so they should have a very specific key.  You could also try to divide them 
> up if you have to scale very very large, but that would be rather difficult 
> to get right.
>
> --Bobby Evans
>
>
> On 3/29/12 4:02 AM, "banermatt" <banerm...@hotmail.fr> wrote:
>
>
>
> Hello,
>
> I'm developing a log file anomaly detection system on a Hadoop cluster.
> I'm looking for a way to process a query like: "select all values when
> value>threshold for a duration>30 seconds". Do you know a tool which could
> help me to process such a query?
> I read up on the scripting languages Pig, Hive and Jaql, which seem to have
> very similar applications. I tried them but was not able to do what I
> want.
>
> Thank you in advance,
>
> Matthieu
>
> --
> View this message in context: 
> http://old.nabble.com/Temporal-query-tp33544869p33544869.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>
>

