[jira] Commented: (PIG-1718) Cannot directly cast output of UDF

Mike Dillon (JIRA) Thu, 11 Nov 2010 17:21:36 -0800

    [ https://issues.apache.org/jira/browse/PIG-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931294#action_12931294 ]


Mike Dillon commented on PIG-1718:
----------------------------------

Yes, that syntax does work, but it's very hard to correlate output type 
annotations with field names for stuff like this:

{code}
RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
IN = foreach RAW generate id,
        (tuple(int,double))ExtractStringTuple(json, 'count', 'mean') as info (count, mean);
{code}

It seems like enhancing Pig to allow the type annotations to sit right next to 
the field names for this case would be a big win. It would also remove the 
duplicated information about the type shape, which is currently implicit in 
both the cast "tuple(int,double)" and the schema specification "info(count, mean)".

> Cannot directly cast output of UDF
> ----------------------------------
>
>                 Key: PIG-1718
>                 URL: https://issues.apache.org/jira/browse/PIG-1718
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.7.0
>         Environment: Macbook Pro 6.2, Ubuntu 10.04 AMD64, CDH3 beta 3
>            Reporter: Mike Dillon
>            Priority: Minor
>
> I'm in the process of writing a suite of UDFs to deal with nested JSON data 
> inside of Pig. In one case, I created a UDF of type EvalFunc<String> and 
> wanted to use it like so:
> {code}
> RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
> IN = foreach RAW generate id, ExtractString(json, 'count') as count:int;
> {code}
> When I do this, I get the following error:
> {quote}
> ERROR 1022: Type mismatch merging schema prefix. Field Schema: chararray. 
> Other Field Schema: count: int
> {quote}
> I can work around it by adding another projection with just a cast (as 
> below), but I'd prefer if the first form just worked.
> {code}
> RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
> MID = foreach RAW generate id, ExtractString(json, 'count') as count;
> IN = foreach MID generate id, (int)count;
> {code}
> I'd prefer not to have to write ExtractInteger extends EvalFunc<Integer> if I 
> can avoid it. In our case, it gets even more cumbersome because we want to 
> have something like ExtractStringTuple extends EvalFunc<Tuple> that returns a 
> tuple of strings without parsing the JSON over and over again:
> {code}
> RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
> IN = foreach RAW generate id,
>         ExtractStringTuple(json, 'name', 'count', 'mean') as (name, count:int, mean:double);
> {code}
> As indicated, I have tested this with Pig 0.7.0. My apologies if this is 
> already fixed in 0.8 since I was not able to test with a newer version.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
