[ 
https://issues.apache.org/jira/browse/PIG-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13619470#comment-13619470
 ] 

Feng Peng commented on PIG-3222:
--------------------------------

Hi [~daijy], thanks for looking into this!

1. The behavior definitely changed from pig_9 to pig_11. Running the same 
program as above under pig_9 produces the following trace:
{noformat}
[1]setStoreFuncUDFContextSignature(samples_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[1]relToAbsPathForStoreLocation(testdb.samples,hdfs://...)
[2]setStoreFuncUDFContextSignature(1-0_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[4]setStoreFuncUDFContextSignature(samples_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[5]setStoreFuncUDFContextSignature(1-1_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[5]setStoreFuncUDFContextSignature(samples_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[5]checkSchema(number:int)
[5]setStoreLocation(testdb.samples,Job@1023736867)
[5]getOutputFormat
[6]setStoreFuncUDFContextSignature(samples_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[6]setStoreLocation(testdb.samples,Job@1713380717)
[7]setStoreFuncUDFContextSignature(samples_testdb.samples_com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z'))
[7]setStoreLocation(testdb.samples,Job@687736006)
[7]getOutputFormat
{noformat}

Pig_9 instantiates the same StoreFunc 8 times with three different signatures, 
compared with 3 times and 2 signatures in pig_11 (which is good). However, ALL the 
setStoreLocation functions are called with a DETERMINISTIC signature in pig_9, 
which is the concatenation of:
{noformat}
* relation name: samples
* location string: testdb.samples
* storer statement: com.twitter.twadoop.dal.pig.StoreFuncTracer('org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z')
{noformat}
This signature can be safely regenerated in the backend, so we can 
always retrieve the information stored at the frontend.
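
A minimal sketch of that concatenation, assuming the three parts are joined 
with '_' as the pig_9 trace above suggests. SignatureSketch and buildSignature 
are hypothetical names for illustration, not Pig APIs:

```java
final class SignatureSketch {
    // Hypothetical helper: joins relation name, location string and storer
    // statement with '_', mirroring the signatures seen in the pig_9 trace.
    static String buildSignature(String relationName, String location, String storerStatement) {
        return relationName + "_" + location + "_" + storerStatement;
    }

    public static void main(String[] args) {
        String storer = "com.twitter.twadoop.dal.pig.StoreFuncTracer("
                + "'org.apache.hcatalog.pig.HCatStorer','part_dt=20130204T000000Z')";
        // The same inputs are available on both sides, so both sides can
        // rebuild the same key independently.
        String frontend = buildSignature("samples", "testdb.samples", storer);
        String backend = buildSignature("samples", "testdb.samples", storer);
        System.out.println(frontend);
        System.out.println(frontend.equals(backend)); // prints "true"
    }
}
```

Because the key depends only on the statement itself, it does not matter how 
many times the StoreFunc is instantiated.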

2. In HCatStorer.setStoreLocation (and similarly in HCatLoader.setLocation), we 
have the following code, which caches information in the UDFContext using 
the signature (plus the class) as the key:

{noformat}
        Properties udfProps = UDFContext.getUDFContext().getUDFProperties(
                this.getClass(), new String[]{sign});
        if (udfProps.containsKey(HCatConstants.HCAT_PIG_STORER_LOCATION_SET)) {
           readConfigFromUDFProps();
        } else {
           getConfigFromHiveMetaStore();
           udfProps.put(HCatConstants.HCAT_PIG_STORER_LOCATION_SET, true);
        }
{noformat}

For HCatStorer, the problem is that the front-end and back-end are not 
guaranteed to use the same signature, so the back-end may not find the 
information the front-end stored, and the HCatStorer may break there.
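
To illustrate the failure mode, here is a simplified sketch in which a plain 
map stands in for the UDFContext. SignatureCacheSketch, LOCATION_SET and the 
signature strings "sig-A" / "sig-B" are placeholders I made up, not Pig or 
HCatalog code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class SignatureCacheSketch {
    // Hypothetical marker key, standing in for HCAT_PIG_STORER_LOCATION_SET.
    static final String LOCATION_SET = "location.set";

    // Stand-in for the UDFContext: one Properties object per signature.
    private final Map<String, Properties> store = new HashMap<>();

    // Returns true if the cached config was found under this signature,
    // false if it was missing and had to be fetched and stored.
    boolean setStoreLocation(String signature) {
        Properties props = store.computeIfAbsent(signature, k -> new Properties());
        if (props.containsKey(LOCATION_SET)) {
            return true;
        }
        props.setProperty(LOCATION_SET, "true");
        return false;
    }

    public static void main(String[] args) {
        SignatureCacheSketch ctx = new SignatureCacheSketch();
        ctx.setStoreLocation("sig-A");                     // front end fills the cache
        System.out.println(ctx.setStoreLocation("sig-A")); // same signature: true
        System.out.println(ctx.setStoreLocation("sig-B")); // different signature: false
    }
}
```

The lookup only succeeds when the later call uses exactly the signature that 
was used when the information was stored.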

For HCatLoader, the problem is that each instantiation uses a different 
signature, and thus the above caching mechanism does not work. The result is 
that the frontend reads from HiveMetaStore once per unique signature. It is 
annoying and inefficient, but doesn't break the program. That's why the 
current ticket specifies the HCatStorer.

Can we use the same signature naming scheme as pig_9? As long as the relation 
names are unique (if not, we can enforce it in the normalization process), the 
signature for every LoadFunc and StoreFunc will be unique and can be 
constructed in a consistent way in both the frontend and backend.
                
> New UDFContextSignature assignments in Pig 0.11 breaks HCatalog.HCatStorer 
> ---------------------------------------------------------------------------
>
>                 Key: PIG-3222
>                 URL: https://issues.apache.org/jira/browse/PIG-3222
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Feng Peng
>              Labels: hcatalog
>         Attachments: PigStorerDemo.java
>
>
> Pig 0.11 assigns different UDFContextSignature for different invocations of 
> the same load/store statement. This change breaks the HCatStorer which 
> assumes all front-end and back-end invocations of the same store statement 
> has the same UDFContextSignature so that it can read the previously stored 
> information correctly.
> The related HCatalog code is in 
> https://svn.apache.org/repos/asf/incubator/hcatalog/branches/branch-0.5/hcatalog-pig-adapter/src/main/java/org/apache/hcatalog/pig/HCatStorer.java
>  (the setStoreLocation() function).
