I guess another way to do this is to:
1. Set up https://github.com/elasticsearch/elasticsearch-hadoop
2. Run Hive/Pig export queries directly against the Graylog data (Elasticsearch)
3. Save the query output to HDFS for long-term archiving
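For step 2, a rough sketch of what the Hive side might look like: an external table mapped onto the Graylog index via elasticsearch-hadoop's storage handler, plus an INSERT that copies it into an HDFS-backed archive table. The table/column names, index name, and ES host are all assumptions on my part — check the elasticsearch-hadoop docs for the exact properties. I've kept the statements as Python strings so they could be fed to the Hive CLI:

```python
# Hedged sketch: HiveQL statements for the elasticsearch-hadoop route, held as
# strings to run via `hive -e` or beeline. All names/hosts are assumptions.

CREATE_ES_TABLE = """
CREATE EXTERNAL TABLE graylog_live (
  message     STRING,
  source      STRING,
  `timestamp` STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'graylog2_0/message',
  'es.nodes'    = 'localhost:9200'
);
"""

ARCHIVE_QUERY = """
INSERT INTO TABLE graylog_archive
SELECT message, source, `timestamp` FROM graylog_live;
"""

def hive_command(statement):
    """Build the shell command that would run one statement via the Hive CLI."""
    return ["hive", "-e", statement.strip()]
```

`graylog_archive` would be an ordinary Hive table stored on HDFS, so the INSERT effectively does the export; the `es.resource` value (index/type) and `es.nodes` host would need to match your cluster.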
On Sunday, March 9, 2014 6:58:13 PM UTC+2, ChrisDK wrote:
>
> Thanks Kay,
>
> Have you implemented something similar before?
> My apologies, I have so many questions and assumptions - I will be
> extremely grateful if you can help me out.
>
> Could you please elaborate on the following:
> - Where should the scroll search logic run? (A Java app? A Unix script?)
> - How often does the scroll search run?
> - How do we handle incremental exports and keep track of what's already
> been exported, or do we just run the export job on
> non-deflector ("read-only") indices when the deflector is cycled?
> - What should the output of the scroll search be? (File? JSON? CSV? Network?)
> - How is the output from the scroll search written/imported into Hive?
> - Do we just dump JSON files on HDFS and then use a JSON SerDe? (...a wild
> assumption based on "speed-reading" style research :) )
> - Can HCatalog pick up the schema dynamically from JSON, or do we need to
> manually create the tables?
>
> Many thanks!
>
> Chris
>
>
>
> On Thursday, February 27, 2014 2:47:52 PM UTC+2, Kay Röpke wrote:
>>
>> Hi!
>>
>> The easiest way is to use the scroll feature of elasticsearch:
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/search-request-scroll.html
>>
>> That way you can iterate over all the documents in your indices and write
>> them to Hive.
>> We don't have a built-in way to perform archiving yet, but this should
>> solve your immediate problem with minimal effort and impact.
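[The scroll loop described above could be sketched roughly like this — standard-library Python only, with the Graylog ES URL, index name, and page size as assumptions. Note the 0.90 scroll API accepts the scroll id as a raw request body; newer versions expect JSON instead.]

```python
import json
import urllib.request

def scroll_request_body(page_size=500):
    """Build the initial match_all query for a scroll search."""
    return {"query": {"match_all": {}}, "size": page_size}

def parse_scroll_page(response):
    """Pull the scroll id and the hit documents out of one scroll response."""
    hits = [h["_source"] for h in response["hits"]["hits"]]
    return response["_scroll_id"], hits

def export_index(es_url, index, out_path, page_size=500):
    """Iterate over every document in `index`, one JSON object per output line.

    Hypothetical wiring (es_url like "http://localhost:9200"); this makes
    network calls, so it only works against a live cluster.
    """
    body = json.dumps(scroll_request_body(page_size)).encode()
    req = urllib.request.Request(
        es_url + "/" + index + "/_search?scroll=5m", data=body,
        headers={"Content-Type": "application/json"})
    scroll_id, hits = parse_scroll_page(json.load(urllib.request.urlopen(req)))
    with open(out_path, "w") as out:
        while hits:
            for doc in hits:
                out.write(json.dumps(doc) + "\n")
            # 0.90-style continuation: scroll id as the raw body.
            req = urllib.request.Request(
                es_url + "/_search/scroll?scroll=5m", data=scroll_id.encode())
            scroll_id, hits = parse_scroll_page(
                json.load(urllib.request.urlopen(req)))
```

[The resulting newline-delimited JSON file can be pushed to HDFS as-is, which lines up with the JSON SerDe idea discussed above.]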
>>
>> Best,
>> Kay
>>
>> On Thursday, February 27, 2014 9:36:59 AM UTC+1, ChrisDK wrote:
>>>
>>> Hi Guys,
>>>
>>> We have a requirement to archive our Graylog2 (v0.20.1) data into Hive.
>>> With a 400 million message cap we currently keep only a couple of weeks'
>>> worth of data, whereas the requirement is 36 months.
>>>
>>> Ideally these exports should run near real-time, not batched as nightly
>>> exports.
>>> It should also have a minimal impact on our live ElasticSearch cluster.
>>>
>>> What would be the best way to do this?
>>>
>>> Thanks!
>>>
>>
--
You received this message because you are subscribed to the Google Groups
"graylog2" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.