[jira] [Updated] (PIG-4583) Allow UDFs to return multiple values (not inside a Bag or a Tuple)

Carlos Balduz (JIRA) Wed, 03 Jun 2015 03:38:08 -0700

     [ https://issues.apache.org/jira/browse/PIG-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Carlos Balduz updated PIG-4583:
-------------------------------
    Description: 
When writing a UDF, you are forced to return a single value, or multiple values 
wrapped in a Tuple or a Bag. This may not seem like a big problem, but when you 
use Pig in production and chain multiple JOINs, GROUP BYs and other operations 
that make your schema more and more complicated, it would be nice to have a UDF 
that reduces the size of your schema. For example:
{code}
    rel1  = load 'a' using PigStorage(';', '-schema');
    rel2  = load 'b' using PigStorage(';', '-schema');

    joined = join rel1 by id_whatever, rel2 by id_whatever;

    ... perform operations

    another_rel = load 'c' using PigStorage('.', '-schema');
    final_rel = join another_rel by id_whatever, joined by id_whatever;
{code}
This will have a schema like:
{code}
    describe final_rel;
    joined::rel1::id_whatever, joined::rel1::field_1, ......
{code}

When you have scripts with hundreds or thousands of lines of code, you end up 
with more foreach statements that rename fields than actual code. Therefore, I 
wrote a UDF to handle this so I wouldn't have to write a foreach to rename 100 
fields one by one.
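
For illustration, the renaming boilerplate described above looks roughly like 
the sketch below. It works on the `joined` relation from the example; 
`field_2` is a made-up field of `rel2`, used only to show the pattern.
{code}
    -- After the first join, every field carries its source alias as a prefix.
    -- field_2 is a hypothetical field of rel2, used only for illustration.
    renamed = foreach joined generate
        rel1::id_whatever as id_whatever,
        rel1::field_1     as field_1,
        rel2::field_2     as field_2;
{code}
With 100 fields, this becomes 100 `as` clauses after each join.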

However, due to Pig's limitation of returning only one value, I must place my 
return values inside a Tuple or a Bag, flatten it, and end up with another 
`something::` prefix on each of the fields.
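
A minimal sketch of that workaround follows. The class name 
`com.example.StripPrefixes` is a placeholder for a UDF that returns all cleaned 
fields in one Tuple; it is not the UDF mentioned in this issue.
{code}
    -- com.example.StripPrefixes is a hypothetical UDF that returns one Tuple.
    define StripPrefixes com.example.StripPrefixes();

    -- The UDF can only hand back a single Tuple, so it has to be flattened.
    cleaned = foreach joined generate flatten(StripPrefixes(*));

    -- Per the description above, the flattened fields come back prefixed
    -- again, e.g. something::id_whatever.
    describe cleaned;
{code}
Adding an explicit `as (...)` list after `flatten` would name the fields, but 
that means listing every field by hand again.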

Can we remove this limitation? If so, I could also upload the UDF I wrote. I 
think it is a very useful function for production environments and large 
scripts.

  was:
When writing a UDF, you are forced to return a single value, or multiple values 
wrapped in a Tuple or a Bag. This may not seem like a big problem, but when you 
use Pig in production and chain multiple JOINs, GROUP BYs and other operations 
that make your schema more and more complicated, it would be nice to have a UDF 
that reduces the size of your schema. For example:
```
    rel1  = load 'a' using PigStorage(';', '-schema');
    rel2  = load 'b' using PigStorage(';', '-schema');

    joined = join rel1 by id_whatever, rel2 by id_whatever;

    ... perform operations

    another_rel = load 'c' using PigStorage('.', '-schema');
    final_rel = join another_rel by id_whatever, joined by id_whatever;
```
This will have a schema like:
```
    describe final_rel;
    joined::rel1::id_whatever, joined::rel1::field_1, ......
```

When you have scripts with hundreds or thousands of lines of code, you end up 
with more foreach statements that rename fields than actual code. Therefore, I 
wrote a UDF to handle this so I wouldn't have to write a foreach to rename 100 
fields one by one.

However, due to Pig's limitation of returning only one value, I must place my 
return values inside a Tuple or a Bag, flatten it, and end up with another 
`something::` prefix on each of the fields.

Can we remove this limitation? If so, I could also upload the UDF I wrote. I 
think it is a very useful function for production environments and large 
scripts.


> Allow UDFs to return multiple values (not inside a Bag or a Tuple)
> ------------------------------------------------------------------
>
>                 Key: PIG-4583
>                 URL: https://issues.apache.org/jira/browse/PIG-4583
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.14.0
>            Reporter: Carlos Balduz



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
