[
https://issues.apache.org/jira/browse/CASSANDRA-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034201#comment-13034201
]
Brandon Williams edited comment on CASSANDRA-2658 at 5/16/11 7:04 PM:
----------------------------------------------------------------------
The source of this problem is that when you remove the validation, the cli
packs the data with the default validator, in this case UTF8. Pig's SUM
function doesn't understand how to sum UTF8, so you would need to explicitly
cast to an integer first.
When you have the validation class set to LongType, the cli is packing this
into an 8 byte binary form for you. When the LoadFunc pulls it out, it
converts this back into a long and pig's SUM works with those.
This isn't a problem with the load/store function; it's a problem with how
the data is being inserted. When a validator is set, the cli is aware of it
and packs the data accordingly, which makes things easy on you but masks the
problem. If you were inserting the data programmatically in the wrong form
(i.e., a string) you would have the same problem, but if you pack()ed it
yourself, it would work.
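To illustrate the difference between the two forms, here is a rough Python
sketch. It is not the cli's or the LoadFunc's actual code; the helper names
(pack_long, unpack_long, pack_utf8) are made up for illustration, and it
assumes LongType values are stored as 8-byte big-endian signed integers.

```python
import struct


def pack_long(value: int) -> bytes:
    """Encode an integer as 8 big-endian bytes (assumed LongType layout)."""
    return struct.pack(">q", value)


def unpack_long(raw: bytes) -> int:
    """Decode 8 big-endian bytes back into an integer."""
    return struct.unpack(">q", raw)[0]


def pack_utf8(value: int) -> bytes:
    """Encode an integer as its decimal text, as a UTF8 validator would."""
    return str(value).encode("utf-8")


values = [3, 1, 2]

# Packed as longs: every value is exactly 8 bytes and sums after decoding.
packed = [pack_long(v) for v in values]
total = sum(unpack_long(raw) for raw in packed)

# Packed as text: the bytes are digit characters, so they have to be
# cast to integers explicitly before they can be summed.
as_text = [pack_utf8(v) for v in values]
total_from_text = sum(int(raw.decode("utf-8")) for raw in as_text)
```

Trying to read the text form as a long fails outright, since b"3" is one
byte rather than eight; that is roughly the mismatch SUM runs into.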
> Pig + CassandraStorage should work when trying to cast data after it's loaded
> -----------------------------------------------------------------------------
>
> Key: CASSANDRA-2658
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2658
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 0.7.5
> Reporter: Jeremy Hanna
> Assignee: Brandon Williams
> Priority: Minor
> Labels: pig
>
> We currently do a lot with pig + cassandra, but one thing I've found is that
> it's very touchy with data that comes from Cassandra for some
> reason. For example, if I try to do a SUM of data that has not been validated
> as a LongType in Cassandra, it borks. See this schema script for Cassandra
> -
> https://github.com/jeromatron/pygmalion/blob/master/cassandra/example_data.txt
> - and remove the validation on the num_heads data type and try to SUM that
> over the data and it gives data type errors. (It breaks with the num_heads
> validation removed and with or without the default_validation class being
> set.)
> We currently do analysis over data that is either just String (UTF8) data or
> that we have validated, so it works for us. However, I've seen a couple of
> people trying to use Cassandra with Pig that have had issues because of this.
> One of the tenets of pig is that it will eat anything and it kind of goes
> against this if the load/store somehow interferes with that. So in essence,
> I think this is a big deal for those wanting to use pig with cassandra in the
> ways that pig is normally used.