[ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619722#action_12619722 ]
Alan Gates commented on PIG-354:
--------------------------------
I don't think we want to be converting data to chararray by default for input
to UDFs, for several reasons:
1. It's expensive.
2. It mangles any data that isn't UTF-8.
3. It is a fair amount of work for users to provide type-specific
   implementations of their UDFs, and so I suspect most won't.
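To illustrate the second point, here is a minimal sketch in Python (not Pig
code) of what a blanket bytes-to-string conversion does to input that is not
valid UTF-8. The function names are hypothetical, and the assumption is that
the conversion substitutes U+FFFD for malformed sequences, as Java's
String(byte[], charset) constructor does.

```python
def to_chararray(raw: bytes) -> str:
    """Decode as UTF-8, replacing malformed sequences with U+FFFD (assumed behaviour)."""
    return raw.decode("utf-8", errors="replace")


def round_trips(raw: bytes) -> bool:
    """True if the original bytes can be recovered after the conversion."""
    return to_chararray(raw).encode("utf-8") == raw


utf8_input = "café".encode("utf-8")       # valid UTF-8
latin1_input = "café".encode("latin-1")   # b'caf\xe9', not valid UTF-8

print(round_trips(utf8_input))    # UTF-8 data survives the conversion
print(round_trips(latin1_input))  # Latin-1 data does not: the 0xE9 byte is lost
```

Once the byte has been replaced there is no way for the UDF to get the
original value back, which is why the conversion should not happen by default.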
By contrast, on the outbound side I agree that chararray is the right default,
for two reasons:
1. It's very easy to determine what type the UDF is returning, either by
   declaring a schema or by Pig reflecting the return type. Only in the case
   where they do not give a schema and their return type is tuple or bag (so
   we have no idea what is inside that tuple or bag) will we be forcing data
   to strings.
2. In general Pig does not assume any particular representation of data in
   byte arrays. That's why we make the load function provide casts. So if we
   treated this unknown data from UDFs as byte arrays we'd have no idea how
   to convert it to anything else. Conversions from strings, on the other
   hand, are well understood.
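The decision rule in the first point can be sketched roughly as follows. This
is Python for illustration only, not the Pig implementation: the function
resolve_output_type, the type-name strings, and the "tuple(chararray)"
notation are all hypothetical, and a Python list stands in for a bag.

```python
from typing import Optional, get_type_hints

CHARARRAY = "chararray"

# Hypothetical mapping from reflected return types to Pig-style type names.
SIMPLE_TYPES = {int: "int", float: "double", str: CHARARRAY, bytes: "bytearray"}
COMPLEX_TYPES = {tuple: "tuple", list: "bag"}  # assumption: list stands in for a bag


def resolve_output_type(udf, declared_schema: Optional[str] = None) -> str:
    """Pick an output type: declared schema first, then reflection, then chararray."""
    if declared_schema is not None:
        return declared_schema                       # the UDF writer told us
    returned = get_type_hints(udf).get("return")     # reflect the return type
    if returned in SIMPLE_TYPES:
        return SIMPLE_TYPES[returned]
    if returned in COMPLEX_TYPES:
        # Container is known, contents are not: force the contents to strings.
        return f"{COMPLEX_TYPES[returned]}({CHARARRAY})"
    return CHARARRAY                                 # nothing known at all


def length(s: str) -> int:
    return len(s)


def split(s: str) -> tuple:
    return tuple(s.split(","))


def opaque(s):
    return s


print(resolve_output_type(length))                     # reflected simple type
print(resolve_output_type(split))                      # tuple with unknown contents
print(resolve_output_type(split, "tuple(int, int)"))   # declared schema wins
print(resolve_output_type(opaque))                     # falls back to chararray
```

Only the second and fourth calls end up forcing data to strings, which matches
the claim that the chararray fallback is needed in few cases.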
> Change to default outputSchema for UDFs
> ---------------------------------------
>
> Key: PIG-354
> URL: https://issues.apache.org/jira/browse/PIG-354
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Priority: Critical
> Fix For: types_branch
>
>
> Currently, if a UDF writer does not specify an outputSchema, the default is
> bytearray, which is not what you would want most of the time. Making
> chararray the default would make things backward compatible.