[ https://issues.apache.org/jira/browse/PIG-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carlos Balduz updated PIG-4583: ------------------------------- Description: When writing a UDF, you are forced to return only one value, or multiple values but inside a Tuple or a Bag. Although this may not seem a big problem, when using Pig in production and performing multiple JOINs, GROUP BYs and any operations that make your schema more and more complicated, it would be nice to create a UDF to reduce the size of your schema, for example: {code} rel1 = load 'a' using PigStorage(';', '-schema'); rel2 = load 'b' using PigStorage(';', '-schema'); joined = join rel1 by id_whatever, rel2 by id_whatever; ... perform operations another_rel = load 'c' using PigStorage('.'.'-schema'); final_rel = join another_rel by id_whatever, joined by id_whatever; {code} Will have an schema like: {code} describe final_rel; rel1::joined::id_whatever, rel1:joined::field_1, ...... {code} When you have scripts with hundreds or thousands of lines of code, you end up having more foreachs to rename fields than with actual code. Therefore, I wrote a UDF to handle this so I wouldn't have to write a foreach to rename 100 fields one by one. However, due to Pig's limitation of returning only one value, I must place my return values inside a Tupe or a Bag, flatten it, and have another `something::` for each of the fields. Can we remove this limitation? And if it is done, perhaps upload the UDF I wrote... I think it is a VERY useful function for production environments and large scripts. was: When writing a UDF, you are forced to return only one value, or multiple values but inside a Tuple or a Bag. Although this may not seem a big problem, when using Pig in production and performing multiple JOINs, GROUP BYs and any operations that make your schema more and more complicated, it would be nice to create a UDF to reduce the size of your schema, for example: ``` rel1 = load 'a' using PigStorage(';', '-schema'); rel2 = load 'b' using PigStorage(';', '-schema'); joined = join rel1 by id_whatever, rel2 by id_whatever; ... perform operations another_rel = load 'c' using PigStorage('.'.'-schema'); final_rel = join another_rel by id_whatever, joined by id_whatever; ``` Will have an schema like: ``` describe final_rel; rel1::joined::id_whatever, rel1:joined::field_1, ...... ``` When you have scripts with hundreds or thousands of lines of code, you end up having more foreachs to rename fields than with actual code. Therefore, I wrote a UDF to handle this so I wouldn't have to write a foreach to rename 100 fields one by one. However, due to Pig's limitation of returning only one value, I must place my return values inside a Tupe or a Bag, flatten it, and have another `something::` for each of the fields. Can we remove this limitation? And if it is done, perhaps upload the UDF I wrote... I think it is a VERY useful function for production environments and large scripts. > Allow UDFs to return multiple values (not inside a Bag or a Tuple) > ------------------------------------------------------------------ > > Key: PIG-4583 > URL: https://issues.apache.org/jira/browse/PIG-4583 > Project: Pig > Issue Type: Improvement > Affects Versions: 0.14.0 > Reporter: Carlos Balduz > > When writing a UDF, you are forced to return only one value, or multiple > values but inside a Tuple or a Bag. Although this may not seem a big problem, > when using Pig in production and performing multiple JOINs, GROUP BYs and any > operations that make your schema more and more complicated, it would be nice > to create a UDF to reduce the size of your schema, for example: > {code} > rel1 = load 'a' using PigStorage(';', '-schema'); > rel2 = load 'b' using PigStorage(';', '-schema'); > joined = join rel1 by id_whatever, rel2 by id_whatever; > ... perform operations > another_rel = load 'c' using PigStorage('.'.'-schema'); > final_rel = join another_rel by id_whatever, joined by id_whatever; > {code} > Will have an schema like: > {code} > describe final_rel; > rel1::joined::id_whatever, rel1:joined::field_1, ...... > {code} > When you have scripts with hundreds or thousands of lines of code, you end up > having more foreachs to rename fields than with actual code. Therefore, I > wrote a UDF to handle this so I wouldn't have to write a foreach to rename > 100 fields one by one. > However, due to Pig's limitation of returning only one value, I must place my > return values inside a Tupe or a Bag, flatten it, and have another > `something::` for each of the fields. > Can we remove this limitation? And if it is done, perhaps upload the UDF I > wrote... I think it is a VERY useful function for production environments and > large scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)