[Hadoop Wiki] Update of "Hive/LanguageManual/Transform" by AdamKramer

Apache Wiki Thu, 11 Nov 2010 11:41:44 -0800

The "Hive/LanguageManual/Transform" page has been changed by AdamKramer.
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform?action=diff&rev1=22&rev2=23

--------------------------------------------------

  
  Users can also plug in their own custom mappers and reducers in the data 
stream by using features natively supported in the Hive language. For example, 
to run a custom mapper script (map_script) and a custom reducer script 
(reduce_script), the user can issue the following command, which uses the 
TRANSFORM clause to embed the mapper and the reducer scripts.
  
- By default, columns will be transformed to ''STRING'' and delimited by TAB 
before feeding to the user script; similarly, all NULL values will be converted 
to the literal string '''\N''' in order to differentiate NULL values from empty 
strings. The standard output of the user script will be treated as 
TAB-separated ''STRING'' columns, any cell containing only '''\N''' will be 
re-interpreted as a NULL, and then the resulting STRING column will be cast to 
the data type specified in the table declaration in the usual way. User scripts 
can output debug information to standard error which will be shown on the task 
detail page on hadoop. These defaults can be overridden with ''ROW FORMAT''...
+ By default, columns will be transformed to ''STRING'' and delimited by TAB 
before feeding to the user script; similarly, all NULL values will be converted 
to the literal string '''\N''' in order to differentiate NULL values from empty 
strings. The standard output of the user script will be treated as 
TAB-separated ''STRING'' columns, any cell containing only '''\N''' will be 
re-interpreted as a NULL, and then the resulting STRING column will be cast to 
the data type specified in the table declaration in the usual way. User scripts 
can output debug information to standard error which will be shown on the task 
detail page on hadoop. These defaults can be overridden with ''ROW FORMAT ...''.
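
As a rough illustration of the default wire format described above, here is a minimal Python sketch of how a user script might decode and re-encode rows. The names `parse_row`, `format_row`, and `run` are hypothetical helpers invented for this example, not part of Hive; the sketch assumes the default TAB delimiter and the default '''\N''' NULL marker (i.e. no ''ROW FORMAT'' override).

```python
import sys

# Literal backslash followed by N: the default text encoding of NULL.
NULL_MARKER = r"\N"


def parse_row(line):
    """Split one TAB-delimited input line into columns; NULL marker -> None."""
    cols = line.rstrip("\n").split("\t")
    return [None if c == NULL_MARKER else c for c in cols]


def format_row(cols):
    """Join columns with TABs again, writing None back as the NULL marker."""
    return "\t".join(NULL_MARKER if c is None else str(c) for c in cols)


def run(stream_in, stream_out, stream_err):
    """Identity transform: echo every row, logging debug info to stderr."""
    for line in stream_in:
        row = parse_row(line)
        # Anything written to standard error is debug output, not row data.
        print("columns in row: %d" % len(row), file=stream_err)
        stream_out.write(format_row(row) + "\n")


# A real script would end with:
#     run(sys.stdin, sys.stdout, sys.stderr)
```

Keeping the NULL marker distinct from the empty string is the point of the encoding: in this sketch an empty column stays `""` while '''\N''' becomes `None`, so the two survive a round trip separately.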
  
- In the syntax, both ''MAP ...'' and ''REDUCE ...'' can be also written as 
''SELECT TRANSFORM ( ... )''.  There are actually no difference between these 
three.
- Hive runs the reduce script in the reduce task (instead of the map task) 
because of the ''clusterBy''/''distributeBy''/''sortBy'' clause in the inner 
query.
+ '''NOTE:''' It is your responsibility to sanitize any STRING columns prior to 
transformation. If your STRING column contains tabs, an identity transformer 
will not give you back what you started with! To help with this, see 
[[Hive/LanguageManual/UDF#String_Functions|REGEXP_REPLACE]] and replace the 
tabs with some other character on their way into the TRANSFORM() call.
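
The problem described in this note can be reproduced outside Hive with a small Python sketch. All names here (`to_transform_input`, `identity_transform`, `sanitize`) are hypothetical and only simulate the default TAB-delimited hand-off; in an actual query the replacement would be done with REGEXP_REPLACE before the TRANSFORM() call, and the choice of a space as the replacement character is an assumption made for illustration.

```python
def to_transform_input(cols):
    """Simulate the default serialization: columns joined by TAB."""
    return "\t".join(cols)


def identity_transform(line):
    """Simulate reading an identity script's output back as TAB-separated columns."""
    return line.split("\t")


def sanitize(value, replacement=" "):
    """Replace embedded tabs so they cannot be mistaken for column delimiters."""
    return value.replace("\t", replacement)


original = ["id-1", "free text\twith a tab"]

# Without sanitizing, the embedded tab splits one column into two.
round_tripped = identity_transform(to_transform_input(original))

# With sanitizing, the column count is preserved.
cleaned = [sanitize(c) for c in original]
safe_round_trip = identity_transform(to_transform_input(cleaned))
```

Here `round_tripped` has three columns although `original` had two, while `safe_round_trip` keeps two columns at the cost of changing the tab to a space.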
+ 
+ Formally, ''MAP ...'' and ''REDUCE ...'' are syntactic transformations of 
''SELECT TRANSFORM ( ... )''. In other words, they serve only as comments or 
notes to the reader of the query. BEWARE: Use of these keywords may be 
'''dangerous''' because, for example, typing "REDUCE" does not force a reduce 
phase to occur and typing "MAP" does not force a new map phase!
  
  Please also see [[Hive/LanguageManual/SortBy|Sort By / Cluster By / 
Distribute By]] and Larry Ogrodnek's 
[[http://dev.bizo.com/2009/10/hive-map-reduce-in-java.html|blog post]].
