[ https://issues.apache.org/jira/browse/CASSANDRA-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197202#comment-13197202 ]
Brandon Williams commented on CASSANDRA-3371:
---------------------------------------------

I resolved PIG-2485 as invalid. You can read the explanation there, but I'll go ahead and summarize: a bag's schema can only contain one tuple because it is assumed that all tuples in the bag have the same schema. Obviously this won't be true in Cassandra since we allow any column to have any schema that you like.

However, after talking with Dmitriy Ryaboy, I have a plan. We got good results out of tuple-of-tuples, but this won't work with wide rows. Another thing it won't work with is small rows where some columns have metadata, and some do not, because when you define a tuple-of-tuples that is a hard constraint; you can't define 4 and then return 20.

So what I propose is that we change the output format to be a tuple-of-tuples for all columns that have metadata, and then a bag with the rest of the columns with a single schema (the default comparator/validator.) This will work for both static and wide rows, unless you manage to define metadata on so many columns in a wide row that they themselves qualify as wide.

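To make the proposed row shape concrete, here is a minimal sketch in Python. It is not the CassandraStorage code (which is Java); the function name shape_row and its arguments are hypothetical. It also assumes two things the proposal does not spell out: that the fixed slots are emitted in sorted column-name order, and that a metadata column missing from a row is returned with a None value so the tuple part keeps a fixed arity.

```python
from typing import Any, List, Tuple

Column = Tuple[str, Any]  # one (name, value) pair, as in the comment above


def shape_row(key: Any, columns: List[Column], metadata_names: List[str]) -> tuple:
    """Hypothetical helper: build (key, <one tuple per metadata column>..., bag)."""
    by_name = dict(columns)
    # One fixed (name, value) slot per column that has metadata. The arity is
    # set by the column family definition, not by what the row contains.
    # Assumption: sorted order, and None for a column absent from this row.
    fixed = [(name, by_name.get(name)) for name in sorted(metadata_names)]
    # Every remaining column goes into a single bag; all of its tuples share
    # one schema (the default comparator/validator in the proposal).
    known = set(metadata_names)
    bag = [(name, value) for name, value in columns if name not in known]
    return (key, *fixed, bag)


example = shape_row(
    "k1",
    [("voter", "691831038"), ("vote_type", "album_dislike"), ("extra", b"\x01")],
    ["voter", "vote_type", "pid"],
)
```

Because the bag can hold any number of tuples, a wide row only grows the last field; the fixed slots stay the same size no matter how many columns come back.
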
To give an example, let's continue with what Pete started with a slight modification:

{noformat}
create column family PhotoVotes with
  comparator = UTF8Type and
  column_metadata = [
    {column_name: voter, validation_class: UTF8Type, index_type: KEYS},
    {column_name: vote_type, validation_class: UTF8Type},
    {column_name: photo_owner, validation_class: UTF8Type, index_type: KEYS},
    {column_name: src_big, validation_class: UTF8Type},
    {column_name: pid, validation_class: UTF8Type, index_type: KEYS},
    {column_name: matched_string, validation_class: UTF8Type},
    {column_name: time, validation_class: LongType},
  ];
{noformat}

Loading this from pig produces a schema like:

(key: bytearray,matched_string: (name: chararray,value: chararray),photo_owner: (name: chararray,value: chararray),pid: (name: chararray,value: chararray),src_big: (name: chararray,value: chararray),time: (name: chararray,value: chararray),vote_type: (name: chararray,value: chararray),voter: (name: chararray,value: chararray),columns: {(name: chararray,value: bytearray)})

This should allow you to do things like:

FILTER rows by vote_type.value eq 'album_like'

Note that the *tuple* is named after the index, and inside the tuple we still have 'name' and 'value'. This is because if we don't have the name accessible, this is going to be hard to store later (and schema introspection is a bit more magic than I'd care to use.)

> Cassandra inferred schema and actual data don't match
> -----------------------------------------------------
>
>                 Key: CASSANDRA-3371
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3371
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Hadoop
>    Affects Versions: 0.8.7
>            Reporter: Pete Warden
>            Assignee: Brandon Williams
>         Attachments: 3371-v2.txt, 3371-v3.txt, 3371-v4.txt, pig.diff
>
>
> It's looking like there may be a mismatch between the schema that's being
> reported by the latest CassandraStorage.java, and the data that's actually
> returned.
> Here's an example:
> rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
> DESCRIBE rows;
> rows: {key: chararray,columns: {(name: chararray,value:
> bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid:
> chararray,value_pid: bytearray,matched_string:
> chararray,value_matched_string: bytearray,src_big: chararray,value_src_big:
> bytearray,time: chararray,value_time: bytearray,vote_type:
> chararray,value_vote_type: bytearray,voter: chararray,value_voter:
> bytearray)}}
> DUMP rows;
> (691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu
> Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})
> getSchema() is reporting the columns as an inner bag of tuples, each of which
> contains 16 values. In fact, getNext() seems to return an inner bag
> containing 7 tuples, each of which contains two values.
> It appears that things got out of sync with this change:
> http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083
> See more discussion at:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/pig-cassandra-problem-quot-Incompatible-field-schema-quot-error-tc6882703.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira