[ 
https://issues.apache.org/jira/browse/PIG-4295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated PIG-4295:
----------------------------------
    Attachment: PIG-4295_2.patch

PIG-4295_1.patch causes an OOM error and introduces three unit test 
regressions (see https://builds.apache.org/job/Pig-spark/186/):
org.apache.pig.test.TestAccumulator.testAccumWithRegexp
org.apache.pig.test.TestBestFitCast.testByteArrayCast9
org.apache.pig.test.TestEvalPipeline.testCogroupWithInputFromGroup

The OOM problem is caused by the following code in PIG-4295_1.patch:
{code}
private void saveUdfImportList(PigContext pigContext) {
    String udfImportList = Joiner.on(",").join(PigContext.getPackageImportList());
    pigContext.getProperties().setProperty("udf.import.list", udfImportList);
}
{code}

Take TestBestFitCast.testByteArrayCast9 as an example of why the tests fail.
TestBestFitCast contains 26 unit tests. When the first unit test runs 
TestBestFitCast.setUp->PigServer.<init>->PigContext.init,
the value of properties.get("udf.import.list") is null, so 
PigContext.initializeImportList((String)properties.get("udf.import.list")) is 
not executed.

PigContext#init
{code}
private void init() {
    if (properties.get("udf.import.list") != null)
        PigContext.initializeImportList((String) properties.get("udf.import.list"));
}
{code}

When SparkLauncher#saveUdfImportList is then executed, it calls 
pigContext.getProperties().setProperty("udf.import.list", udfImportList), so the 
value of PigContext.getProperties().get("udf.import.list") becomes 
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin.".

When the second unit test runs 
TestBestFitCast.setUp->PigServer.<init>->PigContext.init,
the value of PigContext.getProperties().get("udf.import.list") is 
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin." (not null), 
so PigContext.initializeImportList((String)properties.get("udf.import.list")) 
is executed.

PigContext#initializeImportList
{code}
public static void initializeImportList(String importListCommandLineProperties) {
    StringTokenizer tokenizer = new StringTokenizer(importListCommandLineProperties, ":");
    int pos = 1; // Leave "" as the first import
    ArrayList<String> importList = getPackageImportList();
    while (tokenizer.hasMoreTokens()) {
        String importItem = tokenizer.nextToken();
        if (!importItem.endsWith("."))
            importItem += ".";
        importList.add(pos, importItem);
        pos++;
    }
}
{code}

The tokenizer splits on ":" rather than ",", so the whole comma-joined string 
is treated as a single import item and inserted at position 1. After that, the 
value of PigContext#packageImportList is 
["",
",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin.",
"java.lang.",
"org.apache.pig.builtin.",
"org.apache.pig.impl.builtin."].
PigContext#packageImportList should have 4 entries but now has 5: the bogus 
entry ",java.lang.,org.apache.pig.builtin.,org.apache.pig.impl.builtin." has 
been added.


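If a test class contains many unit tests, PigContext#packageImportList gains 
another entry each time setUp runs, and the property value built from it grows 
as well, which eventually leads to the OOM error.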
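
To make the growth concrete, here is a minimal standalone sketch of the cycle 
described above. It is not Pig code: the class ImportListGrowthDemo and its 
methods are hypothetical stand-ins, String.join replaces Guava's Joiner, and it 
assumes (as the analysis above implies) that the properties and the import list 
survive from one test to the next.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.StringTokenizer;

public class ImportListGrowthDemo {
    // Stand-in for PigContext#init + initializeImportList as quoted above.
    static void init(List<String> importList, Properties properties) {
        String value = properties.getProperty("udf.import.list");
        if (value == null) {
            return; // first test: nothing saved yet
        }
        // Splitting on ":" leaves the comma-joined value as one single token.
        StringTokenizer tokenizer = new StringTokenizer(value, ":");
        int pos = 1; // leave "" as the first import
        while (tokenizer.hasMoreTokens()) {
            String item = tokenizer.nextToken();
            if (!item.endsWith(".")) {
                item += ".";
            }
            importList.add(pos++, item);
        }
    }

    // Simulates `tests` consecutive test cases sharing the same state.
    static List<String> simulate(int tests, Properties properties) {
        List<String> importList = new ArrayList<>(Arrays.asList(
                "", "java.lang.", "org.apache.pig.builtin.", "org.apache.pig.impl.builtin."));
        for (int i = 0; i < tests; i++) {
            init(importList, properties);
            // Stand-in for saveUdfImportList from PIG-4295_1.patch.
            properties.setProperty("udf.import.list", String.join(",", importList));
        }
        return importList;
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        List<String> importList = simulate(5, properties);
        System.out.println("entries: " + importList.size());
        System.out.println("property length: "
                + properties.getProperty("udf.import.list").length());
    }
}
```

In this model the list gains one entry per test, and because each new entry 
embeds the previous property value, the stored string roughly doubles in length 
on every run. The demo is deliberately limited to 5 iterations.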

How to avoid the OOM problem:
PIG-4295_2.patch changes saveUdfImportList from 
{code}
private void saveUdfImportList(PigContext pigContext) {
    String udfImportList = Joiner.on(",").join(PigContext.getPackageImportList());
    pigContext.getProperties().setProperty("udf.import.list", udfImportList);
}
{code}

to 
{code}
private void saveUdfImportList(PigContext pigContext) {
    String udfImportList = Joiner.on(",").join(PigContext.getPackageImportList());
    pigContext.getProperties().setProperty("spark.udf.import.list", udfImportList);
}
{code}

If we store the import list under a separate key, i.e. 
pigContext.getProperties().setProperty("spark.udf.import.list", udfImportList), 
PigContext#init no longer picks it up, so the error described above does not 
occur.
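
The effect of the key change can be checked with a small standalone sketch. 
Again this is not Pig code: ImportListKeyDemo and finalSize are hypothetical 
names, the init step is simplified to treat the saved value as a single import 
item (which is what splitting on ":" yields for a comma-joined value), and 
String.join stands in for Guava's Joiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class ImportListKeyDemo {
    // Runs `tests` simulated test cases that save the import list under
    // saveKey, and returns the final size of the simulated import list.
    static int finalSize(String saveKey, int tests) {
        List<String> importList = new ArrayList<>(Arrays.asList(
                "", "java.lang.", "org.apache.pig.builtin.", "org.apache.pig.impl.builtin."));
        Properties properties = new Properties();
        for (int i = 0; i < tests; i++) {
            // Simplified PigContext#init: only "udf.import.list" is consulted.
            String value = properties.getProperty("udf.import.list");
            if (value != null) {
                importList.add(1, value.endsWith(".") ? value : value + ".");
            }
            // Stand-in for saveUdfImportList with a configurable key.
            properties.setProperty(saveKey, String.join(",", importList));
        }
        return importList.size();
    }

    public static void main(String[] args) {
        System.out.println("udf.import.list:       " + finalSize("udf.import.list", 4));
        System.out.println("spark.udf.import.list: " + finalSize("spark.udf.import.list", 4));
    }
}
```

With the original key the simulated list keeps growing, while with 
"spark.udf.import.list" it stays at its 4 default entries because init never 
reads that key.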




> Enable unit test "TestPigContext" for spark
> -------------------------------------------
>
>                 Key: PIG-4295
>                 URL: https://issues.apache.org/jira/browse/PIG-4295
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: spark-branch
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4295.patch, PIG-4295_1.patch, PIG-4295_2.patch, 
> TEST-org.apache.pig.test.TestPigContext.txt
>
>
> error log is attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
