To give a clearer picture of the scenario:
We have a separate column family for each tenant, server and day.
E.g.:

log_0_esbserver_2012_07_23
log_1_esbserver_2012_07_23
log_2_esbserver_2012_07_23
log_0_esbserver_2012_07_24
log_2_appserver_2012_07_24
log_3_appserver_2012_07_24   (0, 1, 2, ... denote the tenant ID)


With the task/summarizer running at the end of the day, we need to create a
compressed file containing the data in each of the above column families.
E.g.:

....../0/esbserver/2012_07_23/logs.gz
....../1/esbserver/2012_07_23/logs.gz
....../2/esbserver/2012_07_23/logs.gz
....../0/esbserver/2012_07_24/logs.gz
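
To make the mapping above concrete, here is a minimal sketch of how a custom
task could derive the archive path from a column family name and write the
compressed file. It assumes the naming scheme log_<tenantID>_<server>_<date>
seen in the examples; base_dir, archive_path and write_archive are
hypothetical names, and reading the rows out of Cassandra is not shown.

```python
import gzip
import os
import re

# Assumed naming scheme, taken from the examples in this mail:
# log_<tenantID>_<server>_<yyyy>_<mm>_<dd>
CF_PATTERN = re.compile(r"log_(\d+)_(\w+)_(\d{4}_\d{2}_\d{2})")


def archive_path(column_family, base_dir):
    """Map log_0_esbserver_2012_07_23 to <base_dir>/0/esbserver/2012_07_23/logs.gz."""
    match = CF_PATTERN.fullmatch(column_family)
    if match is None:
        raise ValueError("unexpected column family name: " + column_family)
    tenant_id, server, day = match.groups()
    return os.path.join(base_dir, tenant_id, server, day, "logs.gz")


def write_archive(column_family, log_lines, base_dir):
    """Write the given log lines, gzip-compressed, to the derived path."""
    path = archive_path(column_family, base_dir)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in log_lines:
            out.write(line + "\n")
    return path
```

The base directory is left as a parameter because the leading part of the
paths above is elided.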


If we are doing this with Hive, we need to handle the following:

   1. Dynamically pick ALL the column families that are related to the
   particular date.
   2. Dynamically generate the file URL for each of the log files.
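
Outside Hive, the first point could be sketched roughly as below. This is only
an illustration: it assumes the full list of column family names has already
been fetched from Cassandra by some other means (not shown), and
column_families_for is a hypothetical helper name.

```python
import re
from datetime import date

# Assumed naming scheme: log_<tenantID>_<server>_<yyyy>_<mm>_<dd>
CF_PATTERN = re.compile(r"log_(\d+)_(\w+)_(\d{4}_\d{2}_\d{2})")


def column_families_for(day, all_column_families):
    """Return the column families whose date suffix matches the given day."""
    suffix = day.strftime("%Y_%m_%d")
    selected = []
    for name in all_column_families:
        match = CF_PATTERN.fullmatch(name)
        if match is not None and match.group(3) == suffix:
            selected.append(name)
    return selected


# Usage with the column family names listed earlier in this mail.
names = [
    "log_0_esbserver_2012_07_23",
    "log_1_esbserver_2012_07_23",
    "log_2_esbserver_2012_07_23",
    "log_0_esbserver_2012_07_24",
    "log_2_appserver_2012_07_24",
    "log_3_appserver_2012_07_24",
]
todays = column_families_for(date(2012, 7, 23), names)
```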

I tried various options to achieve the above, but with no luck. In [1], which
you suggested, we need to give the file URL. But how can I dynamically
generate the URL for each day, each server and each tenant? (None of the
operations like concat work for providing this file URL.)
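
One possible workaround, sketched under assumptions, is to build the directory
name outside Hive and substitute it into the script text before the script is
submitted, since the directory in INSERT OVERWRITE LOCAL DIRECTORY has to be a
literal. The sketch assumes each column family is already mapped to a Hive
table of the same name, which this thread does not confirm; export_statement,
export_script and base_dir are hypothetical names. Note that Hive writes its
output files into the target directory, so producing a single logs.gz would
still need a separate step.

```python
import re

# Assumed naming scheme: log_<tenantID>_<server>_<yyyy>_<mm>_<dd>.
# Restricting names to word characters also keeps quotes out of the script.
CF_PATTERN = re.compile(r"log_(\d+)_(\w+)_(\d{4}_\d{2}_\d{2})")


def export_statement(table, base_dir):
    """Build one INSERT OVERWRITE LOCAL DIRECTORY statement for a table."""
    match = CF_PATTERN.fullmatch(table)
    if match is None:
        raise ValueError("unexpected column family name: " + table)
    tenant_id, server, day = match.groups()
    target = "/".join([base_dir.rstrip("/"), tenant_id, server, day])
    return "INSERT OVERWRITE LOCAL DIRECTORY '%s' SELECT * FROM %s;" % (target, table)


def export_script(tables, base_dir):
    """Concatenate one statement per column family into a single script."""
    return "\n".join(export_statement(table, base_dir) for table in tables)
```

The same loop could instead prepend SET lines and keep ${hiveconf:...}
placeholders in the statement; either way the values are computed by the
caller, not by a UDF inside Hive.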

I would appreciate it if you could suggest a concrete plan to implement this.

[1].
https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Writingdataintofilesystemfromqueries

Thanks
Rgds
Manisha

On Mon, Jul 23, 2012 at 7:55 PM, Buddhika Chamith <[email protected]> wrote:

> So if I understand right, the data is stored in separate column families
> per tenant, server and day, and the requirement is to transfer this column
> family data directly to a flat file which corresponds to the logs from a
> tenant for a server on a given day, with no analytics involved. If that is
> the case, may I suggest using what Tharindu suggested (insert select * from
> foo) in combination with [1], in a loop for each column family. In order to
> dynamically provide the directory name and the column family name, we can
> use the Hive SET command and append it to the script before passing it in to
> the Hive execution service, as also suggested at [2].
>
> Regards
> Buddhika
>
> [1]
> https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Writingdataintofilesystemfromqueries
>
> [2] http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/browser
>
>
> On Mon, Jul 23, 2012 at 7:21 PM, Tharindu Mathew <[email protected]> wrote:
>
>> insert select * from foo
>>
>>
>> On Mon, Jul 23, 2012 at 7:15 PM, Afkham Azeez <[email protected]> wrote:
>>
>>>
>>>
>>> On Mon, Jul 23, 2012 at 6:41 PM, Tharindu Mathew <[email protected]> wrote:
>>>
>>>> If you are planning for a few MB per log file, the total size of logs
>>>> will be (size of logs * no. of tenants); so roughly, for 200 active
>>>> tenants and 2 MB of logs each, it would come to around 400 MB. This is
>>>> still manageable in a custom task if your data processing is low.
>>>>
>>>> On Mon, Jul 23, 2012 at 6:24 PM, Afkham Azeez <[email protected]> wrote:
>>>>
>>>>> Like you said, the task may not be the best way to do this. Like we
>>>>> discussed the other day, we can publish logs to unique column families
>>>>> which contain the <Service>_<Tenant>_<Date> as the unique identifier. We
>>>>> need to generate logs in a file format & allow tenant users to download
>>>>> those. What is the best approach to generate these log files from the data
>>>>> collected? Typically, such a log file can run into a few MB.
>>>>
>>>> I'm a bit confused as we did not need to use Hive as per our earlier
>>>> conversation. This is because as the data is published it is already
>>>> grouped by server/ tenant and date.
>>>>
>>>
>>> Yeah, there is no analytics to be done. It is a problem of converting
>>> data stored in Cassandra into a flat file.
>>>
>>>
>>>>
>>>>> Azeez
>>>>>
>>>>>
>>>>> On Mon, Jul 23, 2012 at 6:18 PM, Tharindu Mathew <[email protected]> wrote:
>>>>>
>>>>>> I'm no expert, but I immediately question the scale of this approach.
>>>>>>
>>>>>> Do you have an idea of how much log data you plan to process per task?
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 23, 2012 at 6:13 PM, Afkham Azeez <[email protected]> wrote:
>>>>>>
>>>>>>> The requirement is simple. We need to generate log files on a per
>>>>>>> tenant, per date, per Service basis. Now as a big data & analytics
>>>>>>> expert, please advise us on what is the best solution for this.
>>>>>>>
>>>>>>> Azeez
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 23, 2012 at 6:05 PM, Tharindu Mathew <[email protected]> wrote:
>>>>>>>
>>>>>>>> So through this custom java task, what is the scale of log
>>>>>>>> processing you will support? 100MB, 1 GB, 100 GB, 1 TB?
>>>>>>>>
>>>>>>>> On Mon, Jul 23, 2012 at 5:14 PM, Manisha Gayathri <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I contacted the Hive user group as well on this matter.
>>>>>>>>> They also mentioned that this approach is not possible.
>>>>>>>>> Also, as per the chat I had with Buddhika, right now this kind of
>>>>>>>>> dynamic variable creation is not possible in the Hive that comes
>>>>>>>>> with BAM2.
>>>>>>>>>
>>>>>>>>> Therefore IMO, rather than going ahead with this cumbersome process,
>>>>>>>>> the best way will be to run a scheduled Java task to pick data from
>>>>>>>>> the relevant Cassandra column families and dynamically generate the
>>>>>>>>> relevant log files (according to the tenantID and current date),
>>>>>>>>> which will be stored in Apache Directory.
>>>>>>>>>
>>>>>>>> You are going to store the results in LDAP?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> As per the offline chat I had with Azeez, I will start to work on a
>>>>>>>>> custom Java task that can handle the above scenario.
>>>>>>>>>
>>>>>>>>> On Mon, Jul 23, 2012 at 2:27 PM, Manisha Gayathri <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> For a log file storing scenario using BAM2, I have a requirement
>>>>>>>>>> to generate separate log files for each date. For that I have
>>>>>>>>>> created a Hive analytic query along with a Hive UDF.
>>>>>>>>>>
>>>>>>>>>> I have the getFilePath function which should return a URL like
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>>> home/user/Desktop/logDir/logs/log_0_testServer_2012_07_22
>>>>>>>>>>
>>>>>>>>>> The defined function works perfectly if I put
>>>>>>>>>> getFilePath( "0","testServer" ) into the select statement.
>>>>>>>>>>
>>>>>>>>>> But I want to use that particular URL as the local directory
>>>>>>>>>> name. (The requirement is that this should not be hard-coded in
>>>>>>>>>> the Hive query; rather, it should be generated in the custom UDF.)
>>>>>>>>>>
>>>>>>>>>> So can I do something like what I have shown below?
>>>>>>>>>>
>>>>>>>>>> set file_name= getFilePath( "0","testServer" );    //Define a parameter.
>>>>>>>>>> .................
>>>>>>>>>> ..............
>>>>>>>>>> INSERT OVERWRITE LOCAL DIRECTORY 'file:///${hiveconf:file_name}'    //Assign the above parameter as the file URL
>>>>>>>>>>
>>>>>>>>>> I tried this way. But the directory name is returned as
>>>>>>>>>>
>>>>>>>>>> file:/getFilePath( "0" , "testServer" )
>>>>>>>>>>
>>>>>>>>>> Does that mean I cannot use a UDF to define the local directory
>>>>>>>>>> name?
>>>>>>>>>> Or am I doing something wrong here?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> ~Regards
>>>>>>>>>> *Manisha Eleperuma*
>>>>>>>>>> Software Engineer
>>>>>>>>>> WSO2, Inc.: http://wso2.com
>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Dev mailing list
>>>>>>>>> [email protected]
>>>>>>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Tharindu
>>>>>>>>
>>>>>>>> blog: http://mackiemathew.com/
>>>>>>>> M: +94777759908
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Afkham Azeez
>>>>>>> Director of Architecture; WSO2, Inc.; http://wso2.com
>>>>>>> Member; Apache Software Foundation; http://www.apache.org/
>>>>>>> email: [email protected] cell: +94 77 3320919
>>>>>>> blog: http://blog.afkham.org
>>>>>>> twitter: http://twitter.com/afkham_azeez
>>>>>>> linked-in: http://lk.linkedin.com/in/afkhamazeez
>>>>>>>
>>>>>>> Lean . Enterprise . Middleware
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>


-- 
~Regards
*Manisha Eleperuma*
Software Engineer
WSO2, Inc.: http://wso2.com
lean.enterprise.middleware

