[ https://issues.apache.org/jira/browse/CASSANDRA-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197202#comment-13197202 ]
Brandon Williams edited comment on CASSANDRA-3371 at 1/31/12 9:25 PM:
----------------------------------------------------------------------
I resolved PIG-2485 as invalid. You can read the explanation there, but I'll
go ahead and summarize: a bag's schema can only contain one tuple because it is
assumed that all tuples in the bag have the same schema. Obviously this won't
be true in Cassandra, since we allow any column to have any schema you
like. However, after talking with Dmitriy Ryaboy, I have a plan. We got good
results out of tuple-of-tuples, but this won't work with wide rows. It also
won't work with small rows where some columns have metadata and some do not,
because a tuple-of-tuples definition is a hard constraint: you can't define 4
and then return 20. So what I propose is that we change the output format to
be a tuple-of-tuples for all columns that have metadata, followed by a bag
holding the rest of the columns with a single schema (the default
comparator/validator). This will work for both static and wide rows, unless
you manage to define metadata on so many columns in a wide row that they
themselves qualify as wide.
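As a rough illustration of the proposed row shape (a sketch only, not the patch): the function and argument names below are hypothetical, and both the sorted ordering and the use of None for a defined column that is missing from the row are assumptions on my part.

```python
# Hypothetical sketch of the proposed output shape; not CassandraStorage code.
def shape_row(key, columns, metadata_names):
    """Return (key, one (name, value) tuple per metadata column..., bag)."""
    by_name = dict(columns)
    # Fixed part: exactly one slot per column with metadata, in sorted order.
    # Assumption: a defined column absent from the row yields a None value.
    fixed = tuple((name, by_name.get(name)) for name in sorted(metadata_names))
    # Variable part: every column without metadata goes into the bag, so the
    # number of extra columns never has to match a declared tuple size.
    bag = [(n, v) for n, v in columns if n not in metadata_names]
    return (key,) + fixed + (bag,)


row = shape_row(
    "691831038_1317937188.48955",
    [("pid", "6855155124568798560"),
     ("vote_type", "album_dislike"),
     ("extra_1", b"\x01")],
    {"pid", "vote_type", "voter"},
)
```

The fixed part always has the same length for a given column family, which is what lets the tuple-of-tuples schema stay a hard constraint while the bag absorbs everything else.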
To give an example, let's continue with what Pete started with a slight
modification:
{noformat}
create column family PhotoVotes with
comparator = UTF8Type and
column_metadata =
[
{column_name: voter, validation_class: UTF8Type, index_type: KEYS},
{column_name: vote_type, validation_class: UTF8Type},
{column_name: photo_owner, validation_class: UTF8Type, index_type: KEYS},
{column_name: src_big, validation_class: UTF8Type},
{column_name: pid, validation_class: UTF8Type, index_type: KEYS},
{column_name: matched_string, validation_class: UTF8Type},
{column_name: time, validation_class: LongType},
];
{noformat}
Loading this from pig produces a schema like:
(key: bytearray,
 matched_string: (name: chararray,value: chararray),
 photo_owner: (name: chararray,value: chararray),
 pid: (name: chararray,value: chararray),
 src_big: (name: chararray,value: chararray),
 time: (name: chararray,value: long),
 vote_type: (name: chararray,value: chararray),
 voter: (name: chararray,value: chararray),
 columns: {(name: chararray,value: bytearray)})
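To make the mapping from column metadata to that schema concrete, here is a minimal sketch. The PIG_TYPES table, the function name, and the BytesType default validator are assumptions for illustration; only the UTF8Type and LongType mappings are taken from the example above.

```python
# Assumed validator-to-Pig type mapping; anything unknown becomes bytearray.
PIG_TYPES = {"UTF8Type": "chararray", "LongType": "long"}

# Column metadata from the PhotoVotes definition (name -> validation_class).
PHOTO_VOTES = {
    "voter": "UTF8Type",
    "vote_type": "UTF8Type",
    "photo_owner": "UTF8Type",
    "src_big": "UTF8Type",
    "pid": "UTF8Type",
    "matched_string": "UTF8Type",
    "time": "LongType",
}


def pig_schema(column_metadata, comparator="UTF8Type",
               default_validator="BytesType"):
    """Build a schema string: named tuples first, then the catch-all bag."""
    name_type = PIG_TYPES.get(comparator, "bytearray")
    fields = ["key: bytearray"]
    # One named tuple per column with metadata, typed by its validator.
    for name in sorted(column_metadata):
        value_type = PIG_TYPES.get(column_metadata[name], "bytearray")
        fields.append(f"{name}: (name: {name_type},value: {value_type})")
    # Everything else shares the default comparator/validator in one bag.
    default_type = PIG_TYPES.get(default_validator, "bytearray")
    fields.append(f"columns: {{(name: {name_type},value: {default_type})}}")
    return "(" + ",".join(fields) + ")"


schema = pig_schema(PHOTO_VOTES)
```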
This should allow you to do things like:
filtered = FILTER rows BY vote_type.value == 'album_like';
Note that the *tuple* is named after the column, and inside the tuple we still
have 'name' and 'value'. This is because if we don't have the name accessible,
this is going to be hard to store later (and schema introspection is a bit more
magic than I'd care to use).
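A small sketch of why keeping 'name' inside each tuple helps on the store side: the row can be flattened back into name/value pairs without consulting the schema. The function name and the convention that None marks an absent column are hypothetical.

```python
# Hypothetical sketch; not the actual store path in CassandraStorage.
def columns_to_store(row):
    """Flatten (key, (name, value)..., bag) into (key, [(name, value), ...])."""
    key, *fixed, bag = row
    # Each fixed slot already carries its own column name, so no schema
    # lookup is needed. Assumption: a None value means "column not present".
    pairs = [(name, value) for name, value in fixed if value is not None]
    # Bag entries are already (name, value) pairs.
    pairs.extend(bag)
    return key, pairs


key, pairs = columns_to_store(
    ("k1", ("pid", "1"), ("voter", None), [("extra_1", b"\x01")])
)
```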
> Cassandra inferred schema and actual data don't match
> -----------------------------------------------------
>
> Key: CASSANDRA-3371
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3371
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Affects Versions: 0.8.7
> Reporter: Pete Warden
> Assignee: Brandon Williams
> Attachments: 3371-v2.txt, 3371-v3.txt, 3371-v4.txt, 3371-v5.txt,
> pig.diff
>
>
> It's looking like there may be a mismatch between the schema that's being
> reported by the latest CassandraStorage.java, and the data that's actually
> returned. Here's an example:
> rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
> DESCRIBE rows;
> rows: {key: chararray,columns: {(name: chararray,value:
> bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid:
> chararray,value_pid: bytearray,matched_string:
> chararray,value_matched_string: bytearray,src_big: chararray,value_src_big:
> bytearray,time: chararray,value_time: bytearray,vote_type:
> chararray,value_vote_type: bytearray,voter: chararray,value_voter:
> bytearray)}}
> DUMP rows;
> (691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu
> Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})
> getSchema() is reporting the columns as an inner bag of tuples, each of which
> contains 16 values. In fact, getNext() seems to return an inner bag
> containing 7 tuples, each of which contains two values.
> It appears that things got out of sync with this change:
> http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083
> See more discussion at:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/pig-cassandra-problem-quot-Incompatible-field-schema-quot-error-tc6882703.html