[
https://issues.apache.org/jira/browse/PIG-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717037#comment-14717037
]
li xiang commented on PIG-3294:
-------------------------------
Hi Daniel,
Sorry for the slow response. I am debugging a Parquet unit test failure that
turns out to be related to the change this JIRA made to
ExpToPhyTranslationVisitor.java.
The test case is testPigScript() of
https://github.com/apache/parquet-mr/blob/master/parquet-pig/src/test/java/org/apache/parquet/pig/summary/TestSummary.java.
It failed with a NullPointerException (please see the first comment in
PARQUET-334).
The Summary class
(https://github.com/apache/parquet-mr/blob/master/parquet-pig/src/main/java/org/apache/parquet/pig/summary/Summary.java)
extends Pig's EvalFunc. EvalFunc has a private field, inputSchemaInternal, and
provides both setInputSchema() and getInputSchema() to set and return it.
Summary, however, keeps its own field called inputSchema (as opposed to
inputSchemaInternal) and provides only the setter, setInputSchema(), with no
getter. That did not seem reasonable to me, so I opened PARQUET-365 and added a
getter that returns inputSchema as the fix.
In Summary's setInputSchema(), do you think it is reasonable to obtain the
tuple schema with the following?
{code}
this.inputSchema = input.getField(0).schema.getField(0).schema;
{code}
Further, the added line "((EvalFunc)
f).setInputSchema(((POUserFunc)p).getFunc().getInputSchema())" (shown below)
causes Summary's setInputSchema() to be called twice. In
ExpToPhyTranslationVisitor:
{code}
if (((POUserFunc)p).getFunc().getInputSchema() == null) {
    // calls setInputSchema()
    ((POUserFunc)p).setFuncInputSchema(op.getSignature());
    // line added by this JIRA; calls setInputSchema() again
    ((EvalFunc) f).setInputSchema(((POUserFunc)p).getFunc().getInputSchema());
}
{code}
I printed the result of each step of "this.inputSchema =
input.getField(0).schema.getField(0).schema".
Here is the first call of setInputSchema(), made by setFuncInputSchema() of
POUserFunc:
======================
In Summary - SetInputSchema() - input = {A: {(a: chararray,a1: chararray,b:
int,c: {t: (a2: chararray,b2: map[])})}}
In Summary - SetInputSchema() - input.getField(0) = A: bag({(a: chararray,a1:
chararray,b: int,c: {t: (a2: chararray,b2: map[])})})
In Summary - SetInputSchema() - input.getField(0).schema = {(a: chararray,a1:
chararray,b: int,c: {t: (a2: chararray,b2: map[])})}
In Summary - SetInputSchema() - input.getField(0).schema.getField(0) =
tuple({a: chararray,a1: chararray,b: int,c: {t: (a2: chararray,b2: map[])}})
In Summary - SetInputSchema() - input.getField(0).schema.getField(0).schema =
{a: chararray,a1: chararray,b: int,c: {t: (a2: chararray,b2: map[])}}
======================
Here is the second call of setInputSchema(), made by
{code}
((EvalFunc) f).setInputSchema(((POUserFunc)p).getFunc().getInputSchema())
{code}
======================
In Summary - SetInputSchema() - input = {a: chararray,a1: chararray,b: int,c:
{t: (a2: chararray,b2: map[])}}
In Summary - SetInputSchema() - input.getField(0) = a: chararray
In Summary - SetInputSchema() - input.getField(0).schema = null <--- So the
null pointer exception is here.
======================
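The two traces can be reproduced without Pig using a minimal model of nested
schemas. This is only a sketch: the nested Schema and FieldSchema classes are
hypothetical, simplified stand-ins for Pig's schema classes (they mimic just
getField(int) and the public "schema" member), and the schema is shortened to
two primitive fields.

```java
import java.util.Arrays;
import java.util.List;

final class DoubleUnwrapDemo {
    // Hypothetical stand-in for a Pig field schema.
    static final class FieldSchema {
        final String alias;
        final Schema schema; // null for primitive fields such as chararray

        FieldSchema(String alias, Schema schema) {
            this.alias = alias;
            this.schema = schema;
        }
    }

    // Hypothetical stand-in for a Pig schema: an ordered list of fields.
    static final class Schema {
        private final List<FieldSchema> fields;

        Schema(FieldSchema... fields) {
            this.fields = Arrays.asList(fields);
        }

        FieldSchema getField(int i) {
            return fields.get(i);
        }
    }

    // Same expression as in Summary.setInputSchema().
    static Schema unwrap(Schema input) {
        return input.getField(0).schema.getField(0).schema;
    }

    // Builds {A: {(a: chararray, b: int)}}, a shortened form of the traced schema.
    static Schema bagOfTuples() {
        Schema tuple = new Schema(new FieldSchema("a", null), new FieldSchema("b", null));
        Schema bag = new Schema(new FieldSchema(null, tuple));
        return new Schema(new FieldSchema("A", bag));
    }

    public static void main(String[] args) {
        // First call: bag -> tuple, so the result is the tuple schema.
        Schema first = unwrap(bagOfTuples());
        System.out.println("first field after first call: " + first.getField(0).alias);
        try {
            // Second call: field 0 is now a chararray, so its .schema is null.
            unwrap(first);
        } catch (NullPointerException e) {
            System.out.println("second call failed with NullPointerException");
        }
    }
}
```

The second call fails because its argument is already the unwrapped tuple
schema, whose first field is a primitive with no nested schema.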
So, to fix this error:
(1) Do you think it is unreasonable for Summary to obtain the tuple schema like
this?
{code}
this.inputSchema = input.getField(0).schema.getField(0).schema;
{code}
(2) Or, on the Pig side, does it make sense to check whether the schema has
already been set before calling setInputSchema() again, for example with the
following change to ExpToPhyTranslationVisitor?
{code}
if (((POUserFunc)p).getFunc().getInputSchema() == null) {
    System.out.println("In visit, if == null");
    ((POUserFunc)p).setFuncInputSchema(op.getSignature());
    if (((POUserFunc)p).getFunc().getInputSchema() == null) { // Check before calling again
        ((EvalFunc) f).setInputSchema(((POUserFunc)p).getFunc().getInputSchema());
    }
}
{code}
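To make options (1) and (2) concrete, here is a rough, Pig-free sketch of
both. Everything in it is an assumption for illustration: Schema, FieldSchema,
SummaryLike and setOnce() are hypothetical stand-ins, not the real Pig or
Parquet classes. The shape check in setInputSchema() is also only a heuristic:
it would wrongly unwrap a tuple schema whose first field happens to be a bag.

```java
import java.util.Arrays;
import java.util.List;

final class SchemaFixSketch {
    // Hypothetical stand-in for a Pig field schema.
    static final class FieldSchema {
        final String alias;
        final Schema schema; // null for primitive fields

        FieldSchema(String alias, Schema schema) {
            this.alias = alias;
            this.schema = schema;
        }
    }

    // Hypothetical stand-in for a Pig schema.
    static final class Schema {
        private final List<FieldSchema> fields;

        Schema(FieldSchema... fields) {
            this.fields = Arrays.asList(fields);
        }

        FieldSchema getField(int i) {
            return fields.get(i);
        }
    }

    // Hypothetical stand-in for a UDF such as Summary.
    static final class SummaryLike {
        private Schema inputSchema;
        int setterCalls = 0; // only here to make the demo observable

        // Option (1): unwrap only when the input looks like a bag of tuples;
        // otherwise assume it is already the tuple schema and keep it.
        void setInputSchema(Schema input) {
            setterCalls++;
            Schema bag = input.getField(0).schema;
            if (bag != null && bag.getField(0).schema != null) {
                inputSchema = bag.getField(0).schema;
            } else {
                inputSchema = input;
            }
        }

        Schema getInputSchema() {
            return inputSchema;
        }
    }

    // Option (2): caller-side guard that sets the schema only if it is unset.
    static void setOnce(SummaryLike func, Schema schema) {
        if (func.getInputSchema() == null) {
            func.setInputSchema(schema);
        }
    }

    public static void main(String[] args) {
        Schema tuple = new Schema(new FieldSchema("a", null), new FieldSchema("b", null));
        Schema wrapped = new Schema(new FieldSchema("A", new Schema(new FieldSchema(null, tuple))));

        SummaryLike udf = new SummaryLike();
        udf.setInputSchema(wrapped);              // first call unwraps bag -> tuple
        udf.setInputSchema(udf.getInputSchema()); // second call no longer throws
        setOnce(udf, wrapped);                    // skipped: schema is already set

        System.out.println("schema is the tuple schema: " + (udf.getInputSchema() == tuple));
        System.out.println("setter calls: " + udf.setterCalls);
    }
}
```

Option (1) makes the setter tolerant of a repeated call, while option (2)
avoids the repeated call in the first place; the two are independent and could
be applied separately.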
Thanks for your time!
> Allow Pig use Hive UDFs
> -----------------------
>
> Key: PIG-3294
> URL: https://issues.apache.org/jira/browse/PIG-3294
> Project: Pig
> Issue Type: New Feature
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Labels: gsoc2013, java
> Fix For: 0.15.0
>
> Attachments: PIG-3294-1.patch, PIG-3294-2.patch, PIG-3294-3.patch,
> PIG-3294-4.patch, PIG-3294-5.patch, PIG-3294-before-refactory.patch
>
>
> It would be nice if Pig provided some interoperability with Hive. We can wrap
> Hive UDFs in Pig so that they can be used from Pig.
> This is a candidate project for Google summer of code 2013. More information
> about the program can be found at
> https://cwiki.apache.org/confluence/display/PIG/GSoc2013