This is an interesting problem because native Hive does not support window functions (or any of the other so-called analytic functions), and yet this seems like a very valuable use case for the platform.
If I understand the OP's problem correctly, the central question is how to do this at scale. I believe it should be possible to write a Map/Reduce job for it, but it could be tricky to get all the interval arithmetic exactly right. I think it would be worthwhile for someone in the community to develop a library, or at least a collection of M/R patterns, that handles this sort of thing.

On 03/29/2012 10:09 AM, Robert Evans wrote:
> I am not aware of anyone that does this for you directly, but it should not
> be too difficult for you to write what you want using pig or hive. I am not
> as familiar with Jaql but I assume that you can do it there too. Although it
> might be simpler to write it using Map/Reduce because we can abuse Map/Reduce
> in ways that the higher level languages disallow so that they can do
> optimizations.
>
> What I would do is in the mapper scan through each entry and look for
> transitions of $value around $threshold, and the time that they occurred.
> You can then look for 30+ second windows where $value > $threshold within
> that partition and output them to the reducer. The trick with this is that
> you need to pay special attention to the beginning and end of the partition.
> You need to also send to the reducer the state at the beginning and end of
> each partition and how long it was in that state. The reducer can then
> combine these pieces together and see if they meet the 30+ second criteria.
> If so output them with the rest, otherwise don't. The known times when it
> is > 30 seconds can be sent to any reducer, so they can have any key, but
> for the transitions to work correctly you need to send them to a single
> reducer, so they should have a very specific key. You could also try to
> divide them up if you have to scale very very large, but that would be
> rather difficult to get right.
>
> --Bobby Evans
>
> On 3/29/12 4:02 AM, "banermatt" <banerm...@hotmail.fr> wrote:
>
> Hello,
>
> I'm developping a log file anomaly detection system on an hadoop cluster.
> I'm looking for a way to process query like: "select all values when
> value>threshold for a duration>30 secondes". Do you know a tool which could
> help me to process such a query?
> I documented on the script langages pig, hive and jaql which seem to have
> very similar application. I tried it but I was not be able to do what I
> want.
>
> Thank you in advance,
>
> Matthieu
>
> --
> View this message in context:
> http://old.nabble.com/Temporal-query-tp33544869p33544869.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
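
To make the interval arithmetic in Bobby's description concrete, here is a rough sketch in plain Python. It is not Hadoop code and not anyone's actual implementation; it only simulates the map and reduce steps in memory. Everything in it is my own assumption: the function names (map_partition, reduce_fragments, find_windows), the record layout (timestamp in seconds, value), that partitions are contiguous, non-empty, time-ordered slices of one log, and that a window's duration is the time between its first and last sample above the threshold.

```python
def map_partition(index, records, threshold, min_duration):
    """Scan one time-ordered partition (hypothetical mapper).

    Returns (complete, fragments):
      complete  - runs strictly inside the partition that are long enough
      fragments - runs touching the partition's first or last record, as
                  (index, start, end, touches_start, touches_end)
    """
    complete, fragments = [], []
    run_start = run_end = None
    run_from_first = False
    for pos, (ts, value) in enumerate(records):
        if value > threshold:
            if run_start is None:            # transition: below -> above
                run_start = ts
                run_from_first = (pos == 0)
            run_end = ts
        elif run_start is not None:          # transition: above -> below
            if run_from_first:
                # May continue a run from the previous partition.
                fragments.append((index, run_start, run_end, True, False))
            elif run_end - run_start >= min_duration:
                complete.append((run_start, run_end))
            run_start = None
    if run_start is not None:                # still above at partition end
        fragments.append((index, run_start, run_end, run_from_first, True))
    return complete, fragments


def reduce_fragments(fragments, min_duration):
    """Stitch boundary fragments together (hypothetical single reducer)."""
    results = []
    pending = None  # (last partition index, start, end, open on the right)

    def flush(run):
        if run is not None and run[2] - run[1] >= min_duration:
            results.append((run[1], run[2]))

    for index, start, end, at_start, at_end in sorted(fragments):
        joins = (pending is not None and pending[3]
                 and at_start and index == pending[0] + 1)
        if joins:
            pending = (index, pending[1], end, at_end)
        else:
            flush(pending)
            pending = (index, start, end, at_end)
    flush(pending)
    return results


def find_windows(partitions, threshold, min_duration=30.0):
    """Run the simulated map and reduce steps over all partitions."""
    complete, fragments = [], []
    for index, records in enumerate(partitions):
        done, edge = map_partition(index, records, threshold, min_duration)
        complete.extend(done)
        fragments.extend(edge)
    return sorted(complete + reduce_fragments(fragments, min_duration))


# Made-up sample data: (timestamp in seconds, value), one list per partition.
partitions = [
    [(0, 1), (10, 9), (20, 9)],
    [(30, 9), (40, 9), (50, 1)],
    [(60, 9), (70, 1), (80, 9)],
    [(90, 9), (100, 9), (110, 9), (120, 1)],
]
print(find_windows(partitions, threshold=5))  # [(10, 40), (80, 110)]
```

Both windows in the sample cross a partition boundary, so neither mapper could have reported them alone. In a real job the complete runs could go to any reducer, while the fragments would all need the same key so that a single reducer sees them in partition order, which is the part that limits scaling.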