[jira] [Issue Comment Edited] (CASSANDRA-3371) Cassandra inferred schema and actual data don't match

Brandon Williams (Issue Comment Edited) (JIRA) Tue, 31 Jan 2012 13:26:28 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197202#comment-13197202 ]


Brandon Williams edited comment on CASSANDRA-3371 at 1/31/12 9:25 PM:
----------------------------------------------------------------------

I resolved PIG-2485 as invalid.  You can read the explanation there, but I'll 
go ahead and summarize: a bag's schema can contain only one tuple, because all 
tuples in a bag are assumed to share the same schema.  Obviously this won't 
be true in Cassandra, since we allow any column to have any schema that you 
like.  However, after talking with Dmitriy Ryaboy, I have a plan.  We got good 
results out of tuple-of-tuples, but that won't work with wide rows.  It also 
won't work with small rows where some columns have metadata and some do not, 
because a tuple-of-tuples schema is a hard constraint: you can't define 4 
columns and then return 20.  So what I propose is that we change the output 
format to a tuple-of-tuples for all columns that have metadata, followed by a 
bag holding the rest of the columns under a single schema (the default 
comparator/validator).  This will work for both static and wide rows, unless 
you manage to define metadata on so many columns in a wide row that they 
themselves qualify as wide.
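
As a rough illustration of the proposed layout (not the CassandraStorage 
code itself), the split between fixed named tuples and a trailing bag could be 
sketched in Python as follows.  The function name to_pig_row, the 
alphabetical ordering of the metadata columns, and the None placeholder for a 
metadata column that is absent from a row are all assumptions made for the 
sketch.

```python
from typing import Any, Dict, List, Set, Tuple

Column = Tuple[str, Any]  # a (name, value) pair


def to_pig_row(key: bytes, columns: Dict[str, Any], metadata_names: Set[str]) -> tuple:
    """Build (key, <one tuple per metadata column>, <bag of the remaining columns>)."""
    # Fixed-arity part: exactly one (name, value) tuple per column with metadata.
    # Assumption: sorted order, and None when the row lacks that column, so the
    # number of leading tuples always matches the declared schema.
    fixed: List[Column] = [(name, columns.get(name)) for name in sorted(metadata_names)]
    # Variable-length part: every other column goes into one bag (a list here),
    # where all entries share the single default (name, value) schema.
    bag: List[Column] = [(n, v) for n, v in columns.items() if n not in metadata_names]
    return (key, *fixed, bag)
```

With this shape the number of leading tuples depends only on the column 
family definition, while the bag can grow with the row.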

To give an example, let's continue with what Pete started with a slight 
modification:
{noformat}
create column family PhotoVotes with
comparator = UTF8Type and
column_metadata =
[
{column_name: voter, validation_class: UTF8Type, index_type: KEYS},
{column_name: vote_type, validation_class: UTF8Type},
{column_name: photo_owner, validation_class: UTF8Type, index_type: KEYS},
{column_name: src_big, validation_class: UTF8Type},
{column_name: pid, validation_class: UTF8Type, index_type: KEYS},
{column_name: matched_string, validation_class: UTF8Type},
{column_name: time, validation_class: LongType},
];
{noformat}

Loading this from pig produces a schema like:
(key: bytearray,
 matched_string: (name: chararray,value: chararray),
 photo_owner: (name: chararray,value: chararray),
 pid: (name: chararray,value: chararray),
 src_big: (name: chararray,value: chararray),
 time: (name: chararray,value: long),
 vote_type: (name: chararray,value: chararray),
 voter: (name: chararray,value: chararray),
 columns: {(name: chararray,value: bytearray)})

This should allow you to do things like:

FILTER rows by vote_type.value eq 'album_like'

Note that each *tuple* is named after its column, and inside the tuple we 
still have 'name' and 'value'.  This is because if the name isn't accessible 
in the data itself, storing it later is going to be hard (and schema 
introspection is a bit more magic than I'd care to use.)
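
To make the storage argument concrete, here is a small Python sketch 
(hypothetical, not the storage code) of turning a loaded row back into 
(name, value) pairs.  The helper name columns_to_store and the use of a 
Python list to stand in for the Pig bag are assumptions.  Because every 
tuple still carries its own name, no schema lookup is needed.

```python
from typing import Any, List, Tuple

Column = Tuple[str, Any]  # a (name, value) pair


def columns_to_store(row: tuple) -> Tuple[Any, List[Column]]:
    """Recover (key, [(name, value), ...]) from a loaded row without a schema."""
    key, *fields = row
    out: List[Column] = []
    for field in fields:
        if isinstance(field, list):
            # The trailing bag of columns that have no metadata.
            out.extend(field)
        else:
            # A named tuple that still carries its own (name, value).
            out.append(field)
    return key, out
```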
                
      was (Author: brandon.williams):
    I resolved PIG-2485 as invalid.  You can read the explanation there, but 
I'll go ahead and summarize: a bag's schema can only contain one tuple because 
it is assumed that all tuples in the bag have the same schema.  Obviously this 
won't be true in Cassandra since we allow any column to have any schema that 
you like.  However, after talking with Dmitriy Ryaboy, I have a plan.  We got 
good results out of tuple-of-tuples, but this won't work with wide rows.  
Another thing it won't work with is small rows where some columns have 
metadata, and some do not, because when you define a tuple-of-tuples that is a 
hard constraint; you can't define 4 and then return 20.  So what I propose is 
that we change the output format to be a tuple-of-tuples for all columns that 
have metadata, and then a bag with the rest of the columns with a single schema 
(the default comparator/validator.)  This will work for both static and wide 
rows, unless you manage to define metadata on so many columns in a wide row 
that they themselves qualify as wide.

To give an example, let's continue with what Pete started with a slight 
modification:
{noformat}
create column family PhotoVotes with
comparator = UTF8Type and
column_metadata =
[
{column_name: voter, validation_class: UTF8Type, index_type: KEYS},
{column_name: vote_type, validation_class: UTF8Type},
{column_name: photo_owner, validation_class: UTF8Type, index_type: KEYS},
{column_name: src_big, validation_class: UTF8Type},
{column_name: pid, validation_class: UTF8Type, index_type: KEYS},
{column_name: matched_string, validation_class: UTF8Type},
{column_name: time, validation_class: LongType},
];
{noformat}

Loading this from pig produces a schema like:
(key: bytearray,
 matched_string: (name: chararray,value: chararray),
 photo_owner: (name: chararray,value: chararray),
 pid: (name: chararray,value: chararray),
 src_big: (name: chararray,value: chararray),
 time: (name: chararray,value: chararray),
 vote_type: (name: chararray,value: chararray),
 voter: (name: chararray,value: chararray),
 columns: {(name: chararray,value: bytearray)})

This should allow you to do things like:

FILTER rows by vote_type.value eq 'album_like'

Note that the *tuple* is named after the index, and inside the tuple we still 
have 'name' and 'value'.  This is because if we don't have the name accessible, 
this is going to be hard to store later (and schema introspection is a bit more 
magic than I'd care to use.)
                  
> Cassandra inferred schema and actual data don't match
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.8.7
>            Reporter: Pete Warden
>            Assignee: Brandon Williams
>         Attachments: 3371-v2.txt, 3371-v3.txt, 3371-v4.txt, 3371-v5.txt, 
> pig.diff
>
>
> It's looking like there may be a mismatch between the schema that's being 
> reported by the latest CassandraStorage.java, and the data that's actually 
> returned. Here's an example:
> rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
> DESCRIBE rows;
> rows: {key: chararray,columns: {(name: chararray,value: 
> bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid: 
> chararray,value_pid: bytearray,matched_string: 
> chararray,value_matched_string: bytearray,src_big: chararray,value_src_big: 
> bytearray,time: chararray,value_time: bytearray,vote_type: 
> chararray,value_vote_type: bytearray,voter: chararray,value_voter: 
> bytearray)}}
> DUMP rows;
> (691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})
> getSchema() is reporting the columns as an inner bag of tuples, each of which 
> contains 16 values. In fact, getNext() seems to return an inner bag 
> containing 7 tuples, each of which contains two values. 
> It appears that things got out of sync with this change:
> http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083
> See more discussion at:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/pig-cassandra-problem-quot-Incompatible-field-schema-quot-error-tc6882703.html
