I guess another way to do this is to:
1. Set up https://github.com/elasticsearch/elasticsearch-hadoop
2. Run Hive/Pig export queries directly against the Graylog data (Elasticsearch)
3. Save the query output to HDFS for long-term archiving
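For step 2, a rough sketch of what the Hive side might look like: an external table mapped onto the Graylog index via elasticsearch-hadoop's storage handler, plus an INSERT that copies it into an HDFS-backed archive table. The table/column names, index name, and ES host are all assumptions on my part — check the elasticsearch-hadoop docs for the exact properties. I've kept the statements as Python strings so they could be fed to the Hive CLI:

```python
# Hedged sketch: HiveQL statements for the elasticsearch-hadoop route, held as
# strings to run via `hive -e` or beeline. All names/hosts are assumptions.

CREATE_ES_TABLE = """
CREATE EXTERNAL TABLE graylog_live (
  message     STRING,
  source      STRING,
  `timestamp` STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'graylog2_0/message',
  'es.nodes'    = 'localhost:9200'
);
"""

ARCHIVE_QUERY = """
INSERT INTO TABLE graylog_archive
SELECT message, source, `timestamp` FROM graylog_live;
"""

def hive_command(statement):
    """Build the shell command that would run one statement via the Hive CLI."""
    return ["hive", "-e", statement.strip()]
```

`graylog_archive` would be an ordinary Hive table stored on HDFS, so the INSERT effectively does the export; the `es.resource` value (index/type) and `es.nodes` host would need to match your cluster.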
On Sunday, March 9, 2014 6:58:13 PM UTC+2, ChrisDK wrote:
>
> Thanks Kay,
>
> Have you implemented something similar before?
> My apologies, I have so many questions and assumptions - I will be
> extremely grateful if you can help me out.
>
> Could you please elaborate on the following:
> - Where should the scroll search logic run? (A Java app? A Unix script?)
> - How often does the scroll search run?
> - How do we handle incremental exports and keep track of what's already
> been exported, or do we just run the export job on
> non-deflector ("read-only") indices when the deflector is cycled?
> - What should the output of the scroll search be? (File? JSON? CSV? Network?)
> - How is the output from the scroll search written/imported into Hive?
> - Do we just dump JSON files on HDFS and then use a JSON SerDe? (...a wild
> assumption based on "speed-reading" style research :) )
> - Can HCatalog pick up the schema dynamically from JSON, or do we need to
> manually create the tables?
>
> Many thanks!
>
> Chris
>
>
>
> On Thursday, February 27, 2014 2:47:52 PM UTC+2, Kay Röpke wrote:
>>
>> Hi!
>>
>> The easiest way is to use the scroll feature of elasticsearch:
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/search-request-scroll.html
>>
>> That way you can iterate over all the documents in your indices and write
>> them to Hive.
>> We don't have a built-in way to perform archiving yet, but this should
>> solve your immediate problem with minimal effort and impact.
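[The scroll loop described above could be sketched roughly like this — standard-library Python only, with the Graylog ES URL, index name, and page size as assumptions. Note the 0.90 scroll API accepts the scroll id as a raw request body; newer versions expect JSON instead.]

```python
import json
import urllib.request

def scroll_request_body(page_size=500):
    """Build the initial match_all query for a scroll search."""
    return {"query": {"match_all": {}}, "size": page_size}

def parse_scroll_page(response):
    """Pull the scroll id and the hit documents out of one scroll response."""
    hits = [h["_source"] for h in response["hits"]["hits"]]
    return response["_scroll_id"], hits

def export_index(es_url, index, out_path, page_size=500):
    """Iterate over every document in `index`, one JSON object per output line.

    Hypothetical wiring (es_url like "http://localhost:9200"); this makes
    network calls, so it only works against a live cluster.
    """
    body = json.dumps(scroll_request_body(page_size)).encode()
    req = urllib.request.Request(
        es_url + "/" + index + "/_search?scroll=5m", data=body,
        headers={"Content-Type": "application/json"})
    scroll_id, hits = parse_scroll_page(json.load(urllib.request.urlopen(req)))
    with open(out_path, "w") as out:
        while hits:
            for doc in hits:
                out.write(json.dumps(doc) + "\n")
            # 0.90-style continuation: scroll id as the raw body.
            req = urllib.request.Request(
                es_url + "/_search/scroll?scroll=5m", data=scroll_id.encode())
            scroll_id, hits = parse_scroll_page(
                json.load(urllib.request.urlopen(req)))
```

[The resulting newline-delimited JSON file can be pushed to HDFS as-is, which lines up with the JSON SerDe idea discussed above.]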
>>
>> Best,
>> Kay
>>
>> On Thursday, February 27, 2014 9:36:59 AM UTC+1, ChrisDK wrote:
>>>
>>> Hi Guys,
>>>
>>> We have a requirement to archive our Graylog2 (v0.20.1) data into Hive.
>>> With a 400 million message cap we currently keep only a couple of weeks'
>>> worth of data, whereas the requirement is 36 months.
>>>
>>> Ideally these exports should run near real-time, not batched as nightly
>>> exports.
>>> It should also have a minimal impact on our live ElasticSearch cluster.
>>>
>>> What would be the best way to do this?
>>>
>>> Thanks!
>>>
>>
--
You received this message because you are subscribed to the Google Groups
"graylog2" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.