[ 
https://issues.apache.org/jira/browse/HIVE-4943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13767894#comment-13767894
 ] 

Michael Haeusler commented on HIVE-4943:
----------------------------------------

An awesome feature of Hive is the rich type system with excellent support for 
complex data-structures. To me, this ticket seems like a very useful extension 
to the hive built-ins. It is especially helpful for those users, that use 
complex data-structures. Right now, queries are often cumbersome when you 
access denormalized or nested data.

E.g., let's consider a table that contains products together with their most 
popular accessories (cross-sellings). The order of the cross-selling products 
matter:
{code:javascript}
{
  "productId": 42,
  "name": "most awesome mp3 player",
  "manufacturer": "acme corp",

  "accessories": [
    { "productId" : 23, "name": "batteries", "manufacturer": "acme corp" },
    { "productId" : 25, "name": "extra load earphones", "manufacturer": 
"noisemakers inc" }
  ]
}
{code}

Let's assume we want to know the average position in cross-sellings of the 
manufacturer "noisemakers inc". Surprisingly, this is not possible with hive 
built-ins. You could try to come up with a custom UDFSequence and a query like 
this:
{code:sql}
SELECT
  AVG(SEQUENCE(p.productId)) AS wrongAverage
FROM
  products p
LATERAL VIEW
  EXPLODE(p.accessories) pa AS accessory
WHERE
  pa.accessory.manufacturer = 'noisemakers inc';
{code}
Unfortunately, the above query will give us wrong results, because Hive 
executes the predicate in the where clause first. Therefore, any UDF in the 
select clause has no chance to see and count all values.

Using the UDTF from this ticket seems to be the best solution:
{code:sql}
SELECT
  AVG(pa.pos) AS correctAverage
FROM
  products p
LATERAL VIEW
  POSEXPLODE(p.accessories) pa AS pos, accessory
WHERE
  pa.accessory.manufacturer = 'noisemakers inc';
{code}

                
> An explode function that includes the item's position in the array
> ------------------------------------------------------------------
>
>                 Key: HIVE-4943
>                 URL: https://issues.apache.org/jira/browse/HIVE-4943
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.11.0
>            Reporter: Niko Stahl
>              Labels: patch
>             Fix For: 0.11.0
>
>         Attachments: HIVE-4943.1.patch, HIVE-4943.2.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> A function that explodes an array and includes an output column with the 
> position of each item in the original array.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to