[ 
https://issues.apache.org/jira/browse/PIG-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175358#comment-15175358
 ] 

Niels Basjes commented on PIG-4796:
-----------------------------------

I extended the test script so that it forces multiple MapReduce jobs.
The first MR job took more than 10 minutes and the second one was really fast.
Everything completed successfully.

The script I used:
{code}
REGISTER ./contrib/piggybank/java/piggybank.jar ;
REGISTER ./lib/*.jar ;

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

UserAgents =
  LOAD '$LOGFILE'
  USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
            '$LOGFORMAT',
            'HTTP.USERAGENT:request.user-agent',
            'IP:connection.client.host'
        ) AS (
            useragent:chararray,
            ip:chararray
        );

UserAgentsCount =
    FOREACH  UserAgents
    GENERATE useragent AS useragent:chararray,
             ip        AS ip:chararray,
             1L        AS clicks:long;

GroupedByVisitor =
    GROUP UserAgentsCount
    BY    (useragent, ip);

SumsPerVisitor =
    FOREACH  GroupedByVisitor
    GENERATE SUM(UserAgentsCount.clicks) AS clicks,
             group.useragent             AS useragent,
             group.ip                    AS ip,
             1L                          AS visitors;

GroupedByUseragent =
    GROUP SumsPerVisitor
    BY    (useragent);

SumsPerBrowser =
    FOREACH  GroupedByUseragent
    GENERATE SUM(SumsPerVisitor.clicks)   AS clicks,
             SUM(SumsPerVisitor.visitors) AS visitors,
             group                        AS useragent;

STORE SumsPerBrowser
    INTO  'TopUseragentsV'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 'UNIX');

GroupedByIp =
    GROUP SumsPerVisitor
    BY    (ip);

SumsPerIp =
    FOREACH  GroupedByIp
    GENERATE SUM(SumsPerVisitor.clicks)   AS clicks,
             SUM(SumsPerVisitor.visitors) AS visitors,
             group                        AS ip;

STORE SumsPerIp
    INTO  'TopUseragentsIp'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('\t','NO_MULTILINE', 'UNIX');
{code}

> Authenticate with Kerberos using a keytab file
> ----------------------------------------------
>
>                 Key: PIG-4796
>                 URL: https://issues.apache.org/jira/browse/PIG-4796
>             Project: Pig
>          Issue Type: New Feature
>    Affects Versions: 0.15.0
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>              Labels: feature, kerberos, security
>         Attachments: 2016-02-18-1510-PIG-4796.patch, 
> 2016-02-18-PIG-4796-rough-proof-of-concept.patch, PIG-4796-2016-02-23.patch, 
> PIG-4796-4.patch
>
>
> When running in a Kerberos secured environment users are faced with the 
> limitation that their jobs cannot run longer than the (remaining) ticket 
> lifetime of their Kerberos tickets. In the environment I work in, these 
> tickets expire after 10 hours, thus limiting the maximum job duration to at 
> most 10 hours (which is a problem).
> In the Hadoop tooling there is a feature where you can authenticate using a 
> Kerberos keytab file (essentially a file that contains the encrypted form of 
> the Kerberos principal and password). Using this, the running application can 
> request new tickets from the Kerberos server when the initial tickets expire.
> In my Java/Hadoop applications I commonly include these two lines:
> {code}
> System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
> UserGroupInformation.loginUserFromKeytab("[email protected]", 
> "/home/nbasjes/.krb/nbasjes.keytab");
> {code}
> This way I have run an Apache Flink based application for more than 170 hours 
> (about a week) on the Kerberos secured YARN cluster.
> What I propose is a feature that lets me set the relevant Kerberos 
> values in my Pig script and from there run a Pig job for many days 
> on the secured cluster.
> Proposal for how this can look in a Pig script:
> {code}
> SET java.security.krb5.conf '/etc/krb5.conf'
> SET job.security.krb5.principal '[email protected]'
> SET job.security.krb5.keytab '/home/nbasjes/.krb/nbasjes.keytab'
> {code}
> So iff all of these are set (or at least the last two) then the 
> aforementioned UserGroupInformation.loginUserFromKeytab method is called 
> before submitting the job to the cluster.
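
The rule in the last paragraph of the description (log in only when the principal and keytab are both set) could be sketched roughly as below. This is an illustration, not the code from the attached patches. The class name KeytabLoginSketch, the method loginIfConfigured, and the callback standing in for the real login call are all hypothetical. Only the three property names and UserGroupInformation.loginUserFromKeytab come from the description. The callback keeps the sketch runnable without Hadoop on the classpath.

```java
import java.util.Properties;
import java.util.function.BiConsumer;

// Hypothetical sketch of the "log in only if the keytab settings are present" rule.
public class KeytabLoginSketch {
    // Property names as proposed in the issue description.
    static final String KRB5_CONF_KEY = "java.security.krb5.conf";
    static final String PRINCIPAL_KEY = "job.security.krb5.principal";
    static final String KEYTAB_KEY = "job.security.krb5.keytab";

    /**
     * Calls the login callback with (principal, keytab) when both are set.
     * Returns true if a login was attempted, false if the settings were incomplete.
     * In a real job the callback would wrap
     * UserGroupInformation.loginUserFromKeytab(principal, keytab)
     * and handle its checked IOException.
     */
    static boolean loginIfConfigured(Properties props, BiConsumer<String, String> login) {
        String principal = props.getProperty(PRINCIPAL_KEY);
        String keytab = props.getProperty(KEYTAB_KEY);
        if (principal == null || principal.isEmpty()
                || keytab == null || keytab.isEmpty()) {
            return false; // not configured: fall back to the existing ticket cache
        }
        // The krb5.conf location is optional ("or at least the last two").
        String krb5Conf = props.getProperty(KRB5_CONF_KEY);
        if (krb5Conf != null && !krb5Conf.isEmpty()) {
            System.setProperty(KRB5_CONF_KEY, krb5Conf);
        }
        login.accept(principal, keytab);
        return true;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder values, not taken from the issue.
        props.setProperty(PRINCIPAL_KEY, "user@EXAMPLE.COM");
        props.setProperty(KEYTAB_KEY, "/path/to/user.keytab");
        boolean attempted = loginIfConfigured(props,
                (p, k) -> System.out.println("would log in " + p + " using " + k));
        System.out.println("login attempted: " + attempted);
    }
}
```

Passing the login step in as a callback is only a convenience for the sketch. An implementation inside Pig would presumably call the Hadoop API directly before job submission.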



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
