[jira] [Commented] (CASSANDRA-2474) CQL support for compound columns

2011-09-04 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096842#comment-13096842
 ] 

Pavel Yaskevich commented on CASSANDRA-2474:


bq. Remember that the ideal for CQL is to have SELECT x, y, z and get back 
exactly columns x, y, and z.

composite/super columns don't naturally play nice with SQL syntax, because SQL 
wasn't designed to query hierarchical data.

A few problems I have with the componentX syntax:

 - if we have 10 subcolumns, do I need to list them all using the component 
syntax (which would be totally unreadable)?
 - it lacks scoping, so big queries will be hard to read, e.g.
{noformat}
SELECT component1 AS tweet_id, component2 AS username, body, location, age, value AS body
{noformat}
 - it will potentially be hard to fit into the grammar because, again for lack 
of scoping, the rules can be ambiguous
 - why should we force users to actually give each component a number? 

And I don't get why you think that (..,..,..) is rocket-science syntax:

If we presume that users are familiar with composite type columns before they 
start using this syntax, then they will know what each section (separated by 
,) means:

{noformat}
SELECT name AS (tweet_id, username, location), value AS body
{noformat}

means that the column name has three sections, which we alias to tweet_id, 
username, and location

{noformat}
SELECT name AS (tweet_id, username | body | location | age), value AS body
{noformat}

means that the name has two components: the first one is tweet_id, and the 
second component can hold multiple values, of which we only want username, 
body, and location

{noformat}
SELECT name AS (tweet_id, *), value AS body
{noformat}

means that we still have two components in the column name, but we don't care 
what component #2 holds, and we expect the result set to return all possible 
values.
  

 CQL support for compound columns
 

 Key: CASSANDRA-2474
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2474
 Project: Cassandra
  Issue Type: Sub-task
  Components: API, Core
Reporter: Eric Evans
Assignee: Pavel Yaskevich
  Labels: cql
 Fix For: 1.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg


 For the most part, this boils down to supporting the specification of 
 compound column names (the CQL syntax is colon-delimited terms), and then 
 teaching the decoders (drivers) to create structures from the results.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3031) Add 4 byte integer type

2011-09-04 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096845#comment-13096845
 ] 

Radim Kolar commented on CASSANDRA-3031:


A typical Hector program looks like this:

import me.prettyprint.cassandra.serializers.IntegerSerializer;

ColumnFamilyTemplate<Integer, String> template =
    new ThriftColumnFamilyTemplate<Integer, String>(
        HFactory.createKeyspace(keyspace, c, new WeakWritePolicy()),
        "sipdb", IntegerSerializer.get(), StringSerializer.get());

Using this will always send 4-byte integers to Cassandra, because 
IntegerSerializer uses a fixed size instead of a variable size. If you 
serialize 0, it will send 00 00 00 00. The same goes for reading: it can't 
read variable-sized integers back. The current cassandra-cli and some other 
client libraries have the opposite problem: they can't work with fixed-size 
int4 integers, so writing to a column using a different client library 
corrupts data.

There is a Hector serializer able to do variable-size serialization 
(BigInteger), but it is way slower. The conclusion in the Hector community is 
that it is not worth saving one or two bytes, because you would need to 
declare your variables as variable-sized integers in Java, which is too slow 
and impractical.

I also like fixed-size integers more, and they should be the default for the 
CQL INT type, because then you know what value size you will read back from 
the database. An application can have subtle bugs if somebody inserts a large 
value into a column and it silently overflows in the client while reading from 
the database. Defaulting to a variable-sized type would only work if the 
majority of client applications could process variable-sized integers by 
default. Only Python does automatic conversion from a fixed-size int to a 
variable-sized long on overflow.
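
To make the difference concrete, here is a minimal standalone sketch (mine, 
not from any patch; it assumes Cassandra's IntegerType uses the minimal 
two's-complement encoding that BigInteger.toByteArray() produces, which is my 
understanding):

{code}
import java.math.BigInteger;
import java.nio.ByteBuffer;

public class IntEncodingDemo
{
    // Fixed-size encoding: always 4 bytes in network (big-endian) order,
    // as Hector's IntegerSerializer sends it.
    static byte[] fixed4(int value)
    {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Variable-size encoding: minimal two's-complement bytes, which is
    // (to my knowledge) what Cassandra's IntegerType validator expects.
    static byte[] variable(int value)
    {
        return BigInteger.valueOf(value).toByteArray();
    }

    public static void main(String[] args)
    {
        System.out.println(fixed4(0).length);   // 4 -> 00 00 00 00
        System.out.println(variable(0).length); // 1 -> 00
    }
}
{code}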

Existing CQL CF creation scripts are not a big issue; most people create 
schemas via cassandra-cli scripts. CQL is not widely used, and it will not be 
used much until cassandra-cli can do CQL; the only tool for working with CQL 
is the not-very-user-friendly cqlsh. Cassandra administrators need to read the 
version upgrade document anyway, so it is enough to document this change as 
part of the 0.8 to 1.0 upgrade procedure.

People coming from SQL land (MySQL, MSSQL, DB2) expect INT/INTEGER types to be 
4 bytes long and BIGINT to be 8 bytes long. A variable-size integer type 
should be named in CQL like DECIMAL or NUMERIC (MySQL, DB2, MSSQL) or NUMBER 
(Oracle).

 Add 4 byte integer type
 ---

 Key: CASSANDRA-3031
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3031
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.8.4
 Environment: any
Reporter: Radim Kolar
Priority: Minor
  Labels: hector, lhf
 Fix For: 1.0

 Attachments: apache-cassandra-0.8.4-SNAPSHOT.jar, src.diff, test.diff


 Cassandra currently lacks support for a 4-byte fixed-size integer data type. 
 The Java API Hector and the C++ libcassandra like to serialize integers as 4 
 bytes in network order. The problem is that you can't use cassandra-cli to 
 manipulate the stored rows. Compatibility with other applications using an 
 API that follows Cassandra's integer encoding standard is problematic too.
 Because adding a new datatype/validator is fairly simple, I recommend adding 
 an int4 data type. Compatibility with Hector is important because it is the 
 most used Java Cassandra API and a lot of applications are using it.
 This problem was discussed several times already:
 http://comments.gmane.org/gmane.comp.db.hector.user/2125
 https://issues.apache.org/jira/browse/CASSANDRA-2585
 It would be nice to have compatibility with cassandra-cli and other 
 applications without rewriting Hector apps.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3122) SSTableSimpleUnsortedWriter take long time when inserting big rows

2011-09-04 Thread Benoit Perroud (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096867#comment-13096867
 ] 

Benoit Perroud commented on CASSANDRA-3122:
---

Digging further into SSTableSimpleUnsortedWriter, I found another point:

every time newRow() is called, serializedSize() iterates through all the 
columns to compute the size.

In my use case, I have lines with hourly values (data:h0|h1|h2|...|h23). For 
every line I use the date of the day concatenated with the hour as the key 
(dateoftheday|hour), and the value composed (using a composite) with the data 
as the column name ([value,data]=null). More concretely, my data looks like:
abc:1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2|1|2
bcd:3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4|3|4

and then for every line I call:

writer.newRow(20110804|0), writer.addColumn(Composite(1, abc), empty_array), 
writer.newRow(20110804|1), writer.addColumn(Composite(2, abc), empty_array), 
writer.newRow(20110804|2), writer.addColumn(Composite(1, abc), empty_array), 
writer.newRow(20110804|3), writer.addColumn(Composite(2, abc), empty_array), 
...

So writer.newRow() is called 24 times for every line.

So one solution could be to have a local class CachedSizeColumnFamily 
extending ColumnFamily that increases the serialized size on every addColumn() 
and returns it directly when serializedSize() is called, as sketched below.
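
A minimal sketch of that idea (hypothetical; the real ColumnFamily API in 0.8 
differs in its details, so treat the method names as illustrative only):

{code}
// Hypothetical sketch: keep a running serialized size so that
// serializedSize() no longer iterates over all columns on every newRow().
class CachedSizeColumnFamily extends ColumnFamily
{
    private int cachedSize = 0;

    @Override
    public void addColumn(IColumn column)
    {
        super.addColumn(column);
        cachedSize += column.serializedSize(); // maintain the total incrementally
    }

    @Override
    public int serializedSize()
    {
        return cachedSize; // O(1) instead of O(columns)
    }
}
{code}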

On the same topic: even though ConcurrentSkipListMap claims to have good 
performance (which is the case in multi-threaded environments), I got much 
better results using a TreeMap in ColumnFamily (which also avoids the 
putIfAbsent call on the ConcurrentSkipListMap). SSTableSimpleUnsortedWriter is 
single-threaded anyway, so for bulk loading there is no need for a more 
complex but slower data structure like ConcurrentSkipListMap. An improvement 
would be to use a single-threaded ColumnFamily for bulk loading. This could be 
part of another JIRA.



 SSTableSimpleUnsortedWriter take long time when inserting big rows
 --

 Key: CASSANDRA-3122
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3122
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.8.3
Reporter: Benoit Perroud
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.8.5

 Attachments: 3122.patch, SSTableSimpleUnsortedWriter-v2.patch, 
 SSTableSimpleUnsortedWriter.patch


 In SSTableSimpleUnsortedWriter, when dealing with rows that have a lot of 
 columns, if we call newRow several times (to flush data as soon as possible), 
 the time taken by the newRow() call increases non-linearly. This is because 
 when newRow is called, we merge the ever-growing existing CF with the new 
 one.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-3135) Tighten class accessibility in JDBC Suite

2011-09-04 Thread Rick Shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Shaw updated CASSANDRA-3135:
-

Attachment: (was: tighten-accessability.txt)

 Tighten class accessibility in JDBC Suite
 -

 Key: CASSANDRA-3135
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3135
 Project: Cassandra
  Issue Type: Improvement
  Components: Drivers
Affects Versions: 0.8.4
Reporter: Rick Shaw
Assignee: Rick Shaw
Priority: Trivial
  Labels: JDBC
 Fix For: 0.8.5

 Attachments: tighten-accessability.txt


 Tighten up class accessibility by removing the {{public}} modifier from 
 classes in the suite that are not intended to be instantiated directly by a 
 client. In addition, give abstract-named classes the {{abstract}} modifier. 
 And finally, mark methods that are not part of public interfaces but are 
 shared within the package as {{protected}}.
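
 For example (an illustrative before/after; the class and method names here 
 are made up, not taken from the actual patch):

{code}
// Before: helper is publicly instantiable and its shared method is public.
public class StatementHelper
{
    public void checkNotClosed() { /* ... */ }
}

// After: package-private, abstract where never instantiated directly,
// and the package-shared method narrowed to protected.
abstract class AbstractStatementHelper
{
    protected void checkNotClosed() { /* ... */ }
}
{code}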

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-3135) Tighten class accessibility in JDBC Suite

2011-09-04 Thread Rick Shaw (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Shaw updated CASSANDRA-3135:
-

Attachment: tighten-accessability.txt

 Tighten class accessibility in JDBC Suite
 -

 Key: CASSANDRA-3135
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3135
 Project: Cassandra
  Issue Type: Improvement
  Components: Drivers
Affects Versions: 0.8.4
Reporter: Rick Shaw
Assignee: Rick Shaw
Priority: Trivial
  Labels: JDBC
 Fix For: 0.8.5

 Attachments: tighten-accessability.txt


 Tighten up class accessibility by removing the {{public}} modifier from 
 classes in the suite that are not intended to be instantiated directly by a 
 client. In addition, give abstract-named classes the {{abstract}} modifier. 
 And finally, mark methods that are not part of public interfaces but are 
 shared within the package as {{protected}}.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-3003) Trunk single-pass streaming doesn't handle large row correctly

2011-09-04 Thread Yuki Morishita (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuki Morishita updated CASSANDRA-3003:
--

Attachment: 3003-v5.txt

 Trunk single-pass streaming doesn't handle large row correctly
 --

 Key: CASSANDRA-3003
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0
Reporter: Sylvain Lebresne
Assignee: Yuki Morishita
Priority: Critical
  Labels: streaming
 Fix For: 1.0

 Attachments: 3003-v1.txt, 3003-v2.txt, 3003-v3.txt, 3003-v5.txt, 
 v3003-v4.txt


 For normal column families, trunk streaming always buffers the whole row into 
 memory. It uses
 {noformat}
   ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
 {noformat}
 on the input bytes.
 We must avoid this for rows that don't fit within the inMemoryLimit.
 Note that for regular column families, for a given row, there is actually no 
 need to even recreate the bloom filter or the column index, nor to 
 deserialize the columns. It is enough to read the key and row size to feed 
 the index writer, and then simply dump the rest to disk directly. This would 
 make streaming more efficient, avoid a lot of object creation, and avoid the 
 pitfall of big rows.
 Counter column families are unfortunately trickier, because each column needs 
 to be deserialized (to mark it as 'fromRemote'). However, we don't need the 
 double pass of LazilyCompactedRow for that. We can simply use an 
 SSTableIdentityIterator and deserialize/reserialize the input as it comes.
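 Roughly, the idea for regular CFs looks like this (a sketch only, with a 
 hypothetical RowIndexer callback; the actual patch differs): read just the 
 key and row size, then copy the serialized row straight to disk without 
 deserializing columns:

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.cassandra.utils.ByteBufferUtil;

public class RawRowCopy
{
    interface RowIndexer { void add(ByteBuffer key, long rowSize); } // hypothetical

    // Sketch: feed the index writer with key + size, then stream the raw
    // row bytes through without creating any column objects.
    static void copyRow(DataInput in, DataOutput out, RowIndexer indexer) throws IOException
    {
        ByteBuffer key = ByteBufferUtil.readWithShortLength(in); // row key
        long rowSize = in.readLong();                            // serialized row size
        indexer.add(key, rowSize);                               // enough for the index

        byte[] buf = new byte[64 * 1024];
        long remaining = rowSize;
        while (remaining > 0)
        {
            int n = (int) Math.min(buf.length, remaining);
            in.readFully(buf, 0, n); // raw bytes in...
            out.write(buf, 0, n);    // ...raw bytes out, no object creation
            remaining -= n;
        }
    }
}
{code}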

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3003) Trunk single-pass streaming doesn't handle large row correctly

2011-09-04 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096885#comment-13096885
 ] 

Yuki Morishita commented on CASSANDRA-3003:
---

Sylvain,

Thank you for the review.
For now, I will leave the max timestamp calculation as it is done during 
streaming.

bq. we need to use Integer.MIN_VALUE as the value for expireBefore when 
deserializing the columns, otherwise the expired columns will be converted to 
DeletedColumns, which will change their serialized size (and thus screw up the 
data size and column index)

Fixed.
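
For context, the mechanism in a simplified, hypothetical sketch (the names are 
illustrative, not the actual code): deserialization turns a column whose 
expiration time is before expireBefore into a tombstone with a different 
serialized size, so Integer.MIN_VALUE disables that conversion:

{code}
// With expireBefore = Integer.MIN_VALUE the comparison is never true, so
// streamed columns keep their original serialized size and the precomputed
// data size / column index stay valid.
static Column maybeExpire(ExpiringColumn column, int expireBefore)
{
    if (column.getLocalExpirationTime() < expireBefore)
        return column.asDeletedColumn(); // tombstone: different serialized size!
    return column;
}
{code}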

bq. for markDeltaAsDeleted, we must check if the length is already negative and 
leave it so if it is; otherwise, if a streamed sstable gets re-streamed to 
another node before it is compacted, we could end up not cleaning the delta 
correctly.

bq. it would be nice in SSTW.appendFromStream() to assert the sanity of our 
little deserialize-reserialize dance and assert that we did write the number 
of bytes declared in the header.

Nice point. I added the same assertion as other append() does.

bq. the patch changes a clearAllDelta to a markDeltaAsDeleted in 
CounterColumnTest, which is bogus (and the test does fail with that change).

I forgot to revert this one. I should have run the tests before submitting...

bq. I would rename markDeltaAsDeleted to markForClearingDelta, as this 
describes what the function does better

Fixed.

bq. nitpick: there are a few spaces at the end of lines in some comments (I 
know, I know, I'm picky).

Fixed this one too, I guess.

 Trunk single-pass streaming doesn't handle large row correctly
 --

 Key: CASSANDRA-3003
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 1.0
Reporter: Sylvain Lebresne
Assignee: Yuki Morishita
Priority: Critical
  Labels: streaming
 Fix For: 1.0

 Attachments: 3003-v1.txt, 3003-v2.txt, 3003-v3.txt, 3003-v5.txt, 
 v3003-v4.txt


 For normal column families, trunk streaming always buffers the whole row into 
 memory. It uses
 {noformat}
   ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
 {noformat}
 on the input bytes.
 We must avoid this for rows that don't fit within the inMemoryLimit.
 Note that for regular column families, for a given row, there is actually no 
 need to even recreate the bloom filter or the column index, nor to 
 deserialize the columns. It is enough to read the key and row size to feed 
 the index writer, and then simply dump the rest to disk directly. This would 
 make streaming more efficient, avoid a lot of object creation, and avoid the 
 pitfall of big rows.
 Counter column families are unfortunately trickier, because each column needs 
 to be deserialized (to mark it as 'fromRemote'). However, we don't need the 
 double pass of LazilyCompactedRow for that. We can simply use an 
 SSTableIdentityIterator and deserialize/reserialize the input as it comes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3134) Patch Hadoop Streaming Source to Support Cassandra IO

2011-09-04 Thread Brandyn White (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096908#comment-13096908
 ] 

Brandyn White commented on CASSANDRA-3134:
--

{quote}
Could HADOOP-1722 be backported to 0.20.203 by itself? That would allow it to 
be seamlessly integrated into Brisk as well.
{quote}
Yes, it could be. Basically, that is the only thing that Hadoopy/Dumbo require 
to communicate with streaming.

{quote}
Brandyn, to Jonathan's point though - is it possible instead of using 
TypedBytes, to use Cassandra's serialization methods via AbstractBytes to do 
this.
{quote}
So I'll give a brief overview of how I think this whole thing would come 
together (please correct me if something isn't right). The data is always in a 
binary native representation in Cassandra (the data doesn't have a type 
encoded with it), but the comparator (column names) and validator (row key and 
column value) classes define how to interpret the data. As long as they can do 
the SerDe of the raw data, they don't mind. The types themselves are encoded 
in the column family metadata, and all rows, column names, and column values 
must have a single type per column family (3 Java classes uniquely define a 
column family's types, 4 for super CFs).

In the Hadoop Streaming code, the InputFormat would give us back values as 
ByteBuffer types (see 
[example|https://github.com/apache/cassandra/blob/trunk/examples/hadoop_word_count/src/WordCount.java#L78]).
Is there a way to get the type classes inside of Hadoop? Are they passed in 
through a jobconf or a method call? If so, then the simplest thing is to 
deserialize the values to standard Writables, then use the Java TypedBytes 
serialization code in HADOOP-1722 to send them to the client streaming program.

The reason we need TypedBytes comes up at this point: if we just send the raw 
data, the streaming program has no way to know how to convert it. So, assuming 
there is a way to get the type classes for the column family in Hadoop, the 
conversion can happen either in the Hadoop Streaming code or in the client's 
code. The problem is that the data stream is just a concatenation of the 
output bytes. TypedBytes are self-delimiting, which makes it possible to 
figure out key/value pair boundaries; if you just output binary data for the 
key/value pairs and somehow communicate the types, the client streaming 
program would be unable to parse variable-length data (strings). Moreover, the 
backwards compatibility you get by using TypedBytes is a big win compared to 
making the AbstractTypes self-delimiting. If this becomes an issue (speed), we 
could make a direct binary-data-to-TypedBytes conversion that would basically 
just slap on a typecode and (for variable-length data) a size (see the 
TypedBytes format below).
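
For reference, a minimal sketch of that "slap on a typecode and size" framing 
(my own code, following the TypedBytes format linked below; typecode 0 is the 
raw-bytes type there):

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TypedBytesFraming
{
    private static final byte TYPE_BYTES = 0; // TypedBytes typecode for a raw byte sequence

    // Wrap a raw Cassandra value as TypedBytes: typecode + 4-byte length + bytes.
    // The frame is self-delimiting, so a streaming client can find boundaries.
    static byte[] wrap(byte[] raw) throws IOException
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(TYPE_BYTES); // 1-byte typecode
        out.writeInt(raw.length);  // 4-byte big-endian length
        out.write(raw);            // payload
        out.flush();
        return bos.toByteArray();
    }
}
{code}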

One minor issue is that while there is support for Maps/Dictionaries in 
TypedBytes, it isn't necessarily an OrderedDict. We have a few options here:
1.) Just use a dict.
2.) Since the client will have to know somewhat that it is using Cassandra (to 
provide Keyspace, CF name, etc.), we can easily use an OrderedDict instead of 
a dict.
3.) Use a list of tuples to encode the data.
4.) Add a new custom typecode for OrderedDict.
Of these I think #2 is the best client side, and it is simple to do in the 
Hadoop Streaming code (#1 and #2 only differ client side, so unaware clients 
would simply interpret it as a dictionary).

So that we're all on the same page I put together a few links.

AbstractType Resources
CASSANDRA-2530
[AbstractType Class 
Def|http://javasourcecode.org/html/open-source/cassandra/cassandra-0.8.1/org/apache/cassandra/db/marshal/AbstractType.html]
[Pycassa Type Conversion 
Code|https://github.com/pycassa/pycassa/blob/0ce77bbda2917039ef82561d70b4a063c1f66224/pycassa/util.py#L137]
[Python Struct Syntax (to interpret the above 
code)|http://docs.python.org/library/struct.html#format-characters]

TypedBytes resources
[Data 
Format|http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/typedbytes/package-summary.html]
[Hadoopy Cython code to read TypedBytes (uses C 
IO)|https://github.com/bwhite/hadoopy/blob/master/hadoopy/_typedbytes.pyx#L53]


 Patch Hadoop Streaming Source to Support Cassandra IO
 -

 Key: CASSANDRA-3134
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3134
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Reporter: Brandyn White
Priority: Minor
  Labels: hadoop, hadoop_examples_streaming
   Original Estimate: 504h
  Remaining Estimate: 504h

 (text is a repost from 
 [CASSANDRA-1497|https://issues.apache.org/jira/browse/CASSANDRA-1497])
 I'm the author of the Hadoopy http://bwhite.github.com/hadoopy/ python 
 library and I'm interested 

[jira] [Commented] (CASSANDRA-3031) Add 4 byte integer type

2011-09-04 Thread Eric Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096909#comment-13096909
 ] 

Eric Evans commented on CASSANDRA-3031:
---

{quote}
I suppose we could provide a conversion tool that converts int - long in cql 
scripts.
{quote}

That was merely one example. When changing documented behavior that people 
have come to rely on, there is no end to the ways it can cause disruption.

We're going to run into cases where we wish something had been implemented 
differently; that's unavoidable. I think we just have to weigh each one for 
the cost:benefit of changing it.

If we decide it is worth it to make a change, we should stick to our guns and 
bump the major.  This shouldn't be done lightly, or often, or used as a license 
to go nuts, and plenty of time should be given to let concerned parties know 
exactly what is coming down the pipeline.

{quote}
I don't understand the claim that we need this for Hector compatibility, 
though. My understanding is that our varint would be just fine with 32bit ints 
– since the length is part of the byte[] encoding, we don't have to do clever 
things like hadoop's vint does 
(http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableUtils.html).
{quote}

If I understand correctly, Hector is Doing It Wrong (wrong in the sense that it 
does not match Cassandra's implementation).  Pycassa had a similar bug (assumed 
4 bytes for IntegerType) up until relatively recently.

 Add 4 byte integer type
 ---

 Key: CASSANDRA-3031
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3031
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.8.4
 Environment: any
Reporter: Radim Kolar
Priority: Minor
  Labels: hector, lhf
 Fix For: 1.0

 Attachments: apache-cassandra-0.8.4-SNAPSHOT.jar, src.diff, test.diff


 Cassandra currently lacks support for a 4-byte fixed-size integer data type. 
 The Java API Hector and the C++ libcassandra like to serialize integers as 4 
 bytes in network order. The problem is that you can't use cassandra-cli to 
 manipulate the stored rows. Compatibility with other applications using an 
 API that follows Cassandra's integer encoding standard is problematic too.
 Because adding a new datatype/validator is fairly simple, I recommend adding 
 an int4 data type. Compatibility with Hector is important because it is the 
 most used Java Cassandra API and a lot of applications are using it.
 This problem was discussed several times already:
 http://comments.gmane.org/gmane.comp.db.hector.user/2125
 https://issues.apache.org/jira/browse/CASSANDRA-2585
 It would be nice to have compatibility with cassandra-cli and other 
 applications without rewriting Hector apps.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3031) Add 4 byte integer type

2011-09-04 Thread Eric Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096910#comment-13096910
 ] 

Eric Evans commented on CASSANDRA-3031:
---

bq. Perhaps we should make int default to varint, and add int4 and int8 types 
for when you really want fixed width? Defaulting to fixed seems archaic.

How about deprecating int and introducing int4 and int8?

 Add 4 byte integer type
 ---

 Key: CASSANDRA-3031
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3031
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.8.4
 Environment: any
Reporter: Radim Kolar
Priority: Minor
  Labels: hector, lhf
 Fix For: 1.0

 Attachments: apache-cassandra-0.8.4-SNAPSHOT.jar, src.diff, test.diff


 Cassandra currently lacks support for a 4-byte fixed-size integer data type. 
 The Java API Hector and the C++ libcassandra like to serialize integers as 4 
 bytes in network order. The problem is that you can't use cassandra-cli to 
 manipulate the stored rows. Compatibility with other applications using an 
 API that follows Cassandra's integer encoding standard is problematic too.
 Because adding a new datatype/validator is fairly simple, I recommend adding 
 an int4 data type. Compatibility with Hector is important because it is the 
 most used Java Cassandra API and a lot of applications are using it.
 This problem was discussed several times already:
 http://comments.gmane.org/gmane.comp.db.hector.user/2125
 https://issues.apache.org/jira/browse/CASSANDRA-2585
 It would be nice to have compatibility with cassandra-cli and other 
 applications without rewriting Hector apps.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2474) CQL support for compound columns

2011-09-04 Thread Rick Shaw (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096915#comment-13096915
 ] 

Rick Shaw commented on CASSANDRA-2474:
--

The keyword AS may be a bit confusing to SQL users. It is an optional keyword 
that signifies that the following word is to be taken as an alias for the 
first word. In a CQL context it would more appropriately be used for giving a 
text label to a numeric (or whatever) column name. I readily admit I do not 
really understand how this feature works. Even if it is used, it would seem 
you would declare it in the reverse order? Like:


{code}
SELECT column_name AS alias FROM CF;
{code}

Could WITH be a substitute for AS in this compound context?

 CQL support for compound columns
 

 Key: CASSANDRA-2474
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2474
 Project: Cassandra
  Issue Type: Sub-task
  Components: API, Core
Reporter: Eric Evans
Assignee: Pavel Yaskevich
  Labels: cql
 Fix For: 1.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg


 For the most part, this boils down to supporting the specification of 
 compound column names (the CQL syntax is colon-delimited terms), and then 
 teaching the decoders (drivers) to create structures from the results.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2474) CQL support for compound columns

2011-09-04 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096916#comment-13096916
 ] 

Pavel Yaskevich commented on CASSANDRA-2474:


I think that AS is more appropriate than WITH in the compound column context, 
because we use the name as a collection of components (aliases).

 CQL support for compound columns
 

 Key: CASSANDRA-2474
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2474
 Project: Cassandra
  Issue Type: Sub-task
  Components: API, Core
Reporter: Eric Evans
Assignee: Pavel Yaskevich
  Labels: cql
 Fix For: 1.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg


 For the most part, this boils down to supporting the specification of 
 compound column names (the CQL syntax is colon-delimited terms), and then 
 teaching the decoders (drivers) to create structures from the results.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2474) CQL support for compound columns

2011-09-04 Thread Rick Shaw (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096921#comment-13096921
 ] 

Rick Shaw commented on CASSANDRA-2474:
--

My point is that in the familiar usage, the alias is a manufactured name that 
does not actually exist in the row, and so it should come after the AS, not 
before it. Note too that in SQL the AS is optional; the alias is recognized 
positionally, as following the column identifier, if present.

 CQL support for compound columns
 

 Key: CASSANDRA-2474
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2474
 Project: Cassandra
  Issue Type: Sub-task
  Components: API, Core
Reporter: Eric Evans
Assignee: Pavel Yaskevich
  Labels: cql
 Fix For: 1.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg


 For the most part, this boils down to supporting the specification of 
 compound column names (the CQL syntax is colon-delimited terms), and then 
 teaching the decoders (drivers) to create structures from the results.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2474) CQL support for compound columns

2011-09-04 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096926#comment-13096926
 ] 

Pavel Yaskevich commented on CASSANDRA-2474:


I understand your point, but semantically WITH is not appropriate here.

 CQL support for compound columns
 

 Key: CASSANDRA-2474
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2474
 Project: Cassandra
  Issue Type: Sub-task
  Components: API, Core
Reporter: Eric Evans
Assignee: Pavel Yaskevich
  Labels: cql
 Fix For: 1.0

 Attachments: screenshot-1.jpg, screenshot-2.jpg


 For the most part, this boils down to supporting the specification of 
 compound column names (the CQL syntax is colon-delimited terms), and then 
 teaching the decoders (drivers) to create structures from the results.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2434) node bootstrapping can violate consistency

2011-09-04 Thread Zhu Han (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096981#comment-13096981
 ] 

Zhu Han commented on CASSANDRA-2434:


Is it possible to make the node not reply to any request before bootstrap and 
anti-entropy repair are finished?

This could fix the consistency problem brought by bootstrap.

 node bootstrapping can violate consistency
 --

 Key: CASSANDRA-2434
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2434
 Project: Cassandra
  Issue Type: Bug
Reporter: Peter Schuller
Assignee: paul cannon
 Fix For: 1.1

 Attachments: 2434.patch.txt


 My reading (a while ago) of the code indicates that there is no logic 
 involved during bootstrapping that avoids consistency level violations. If I 
 recall correctly it just grabs neighbors that are currently up.
 There are at least two issues I have with this behavior:
 * If I have a cluster where I have applications relying on QUORUM with RF=3, 
 and bootstrapping completes based on only one node, I have just violated the 
 supposedly guaranteed consistency semantics of the cluster.
 * Nodes can flap up and down at any time, so even if a human takes care to 
 look at which nodes are up and thinks about it carefully before 
 bootstrapping, there's no guarantee.
 A complication is that whether this is an issue depends on the use-case (if 
 all you ever do is at CL.ONE, it's fine); even in a cluster which is 
 otherwise used for QUORUM operations, you may wish to accept 
 less-than-quorum nodes during bootstrap in various emergency situations.
 A potential easy fix is to have bootstrap take an argument which is the 
 number of hosts to bootstrap from, or to assume QUORUM if none is given.
 (A related concern is bootstrapping across data centers. You may *want* to 
 bootstrap to a local node and then do a repair to avoid sending loads of data 
 across DCs while still achieving consistency. Or even if you don't care 
 about the consistency issues, I don't think there is currently a way to 
 bootstrap from local nodes only.)
 Thoughts?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2434) node bootstrapping can violate consistency

2011-09-04 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096990#comment-13096990
 ] 

Jonathan Ellis commented on CASSANDRA-2434:
---

Repair is a much, much more heavyweight solution to the problem than just 
streaming from the node that is 'displaced.'

 node bootstrapping can violate consistency
 --

 Key: CASSANDRA-2434
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2434
 Project: Cassandra
  Issue Type: Bug
Reporter: Peter Schuller
Assignee: paul cannon
 Fix For: 1.1

 Attachments: 2434.patch.txt


 My reading (a while ago) of the code indicates that there is no logic 
 involved during bootstrapping that avoids consistency level violations. If I 
 recall correctly it just grabs neighbors that are currently up.
 There are at least two issues I have with this behavior:
 * If I have a cluster where I have applications relying on QUORUM with RF=3, 
 and bootstrapping completes based on only one node, I have just violated the 
 supposedly guaranteed consistency semantics of the cluster.
 * Nodes can flap up and down at any time, so even if a human takes care to 
 look at which nodes are up and thinks about it carefully before 
 bootstrapping, there's no guarantee.
 A complication is that whether this is an issue depends on the use-case (if 
 all you ever do is at CL.ONE, it's fine); even in a cluster which is 
 otherwise used for QUORUM operations, you may wish to accept 
 less-than-quorum nodes during bootstrap in various emergency situations.
 A potential easy fix is to have bootstrap take an argument which is the 
 number of hosts to bootstrap from, or to assume QUORUM if none is given.
 (A related concern is bootstrapping across data centers. You may *want* to 
 bootstrap to a local node and then do a repair to avoid sending loads of data 
 across DCs while still achieving consistency. Or even if you don't care 
 about the consistency issues, I don't think there is currently a way to 
 bootstrap from local nodes only.)
 Thoughts?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3118) nodetool can not decommission a node

2011-09-04 Thread deng (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13096995#comment-13096995
 ] 

deng commented on CASSANDRA-3118:
-

I changed the code and added some debug information; the code is:
public static void calculatePendingRanges(AbstractReplicationStrategy strategy, String table)
{
    logger_.debug("Calculating pending ranges for {} with {}", table, strategy);

    TokenMetadata tm = StorageService.instance.getTokenMetadata();
    Multimap<Range, InetAddress> pendingRanges = HashMultimap.create();
    Map<Token, InetAddress> bootstrapTokens = tm.getBootstrapTokens();
    Set<InetAddress> leavingEndpoints = tm.getLeavingEndpoints();
    for (InetAddress leave : leavingEndpoints)
    {
        System.out.println("leavingEndpoints" + leave.getHostAddress());
    }

    if (bootstrapTokens.isEmpty() && leavingEndpoints.isEmpty() && tm.getMovingEndpoints().isEmpty())
    {
        if (logger_.isDebugEnabled())
            logger_.debug("No bootstrapping, leaving or moving nodes - empty pending ranges for {}", table);
        tm.setPendingRanges(table, pendingRanges);
        return;
    }

    Multimap<InetAddress, Range> addressRanges = strategy.getAddressRanges();

    // Copy of metadata reflecting the situation after all leave operations
    // are finished.
    TokenMetadata allLeftMetadata = tm.cloneAfterAllLeft();

    // get all ranges that will be affected by leaving nodes
    Set<Range> affectedRanges = new HashSet<Range>();
    for (InetAddress endpoint : leavingEndpoints)
        affectedRanges.addAll(addressRanges.get(endpoint));

    // for each of those ranges, find what new nodes will be responsible for
    // the range when all leaving nodes are gone.
    for (Range range : affectedRanges)
    {
        Collection<InetAddress> currentEndpoints = strategy.calculateNaturalEndpoints(range.right, tm);
        Collection<InetAddress> newEndpoints = strategy.calculateNaturalEndpoints(range.right, allLeftMetadata);

        System.out.println("olddAddressSize" + currentEndpoints.size());
        for (InetAddress olddAddress : currentEndpoints)
        {
            System.out.println("OLDAddress" + olddAddress.getHostAddress() + olddAddress.getHostName());
        }

        System.out.println("newEndpointsSize" + newEndpoints.size());
        for (InetAddress newAddress : newEndpoints)
        {
            System.out.println("NEWAddress" + newAddress.getHostAddress() + newAddress.getHostName());
        }
        newEndpoints.removeAll(currentEndpoints);
        pendingRanges.putAll(range, newEndpoints);
    }

    // At this stage pendingRanges has been updated according to leave
    // operations. We can now continue the calculation by checking
    // bootstrapping nodes.

    // For each of the bootstrapping nodes, simply add and remove them one
    // by one to allLeftMetadata and check in between what their ranges
    // would be.
    for (Map.Entry<Token, InetAddress> entry : bootstrapTokens.entrySet())
    {
        InetAddress endpoint = entry.getValue();

        allLeftMetadata.updateNormalToken(entry.getKey(), endpoint);
        for (Range range : strategy.getAddressRanges(allLeftMetadata).get(endpoint))
            pendingRanges.put(range, endpoint);
        allLeftMetadata.removeEndpoint(endpoint);
    }

    // At this stage pendingRanges has been updated according to leaving and
    // bootstrapping nodes.
    // We can now finish the calculation by checking moving nodes.

    // For each of the moving nodes, we do the same thing we did for
    // bootstrapping:
    // 

[jira] [Commented] (CASSANDRA-3118) nodetool can not decommission a node

2011-09-04 Thread deng (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13097014#comment-13097014
 ] 

deng commented on CASSANDRA-3118:
-

I found that there were some keyspaces using LocalStrategy; after I dropped 
those keyspaces, the decommission worked OK.

But there is another problem. After nodeA was decommissioned, I changed the 
seeds from 100.86.12.224 to 127.0.0.1 in the cassandra.yaml file; the 
listen_address and rpc_address are still 100.86.17.9. I restarted the nodeA 
server, but nodeA automatically joined the cluster, even if I install a new 
cassandra 0.8.4. Why? nodeA has already been decommissioned from the cluster.

 nodetool can not decommission a node
 --

 Key: CASSANDRA-3118
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3118
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.8.4
 Environment: Cassandra0.84
Reporter: deng
 Fix For: 0.8.5

 Attachments: 3118-debug.txt


 When I use nodetool ring I get the result below, and then I want to 
 decommission the 100.86.17.90 node, but I get this error:
 [root@ip bin]# ./nodetool -h10.86.12.225 ring
 Address        DC          Rack   Status  State    Load       Owns    Token
                                                                       154562542458917734942660802527609328132
 100.86.17.90   datacenter1 rack1  Up      Leaving  1.08 MB    11.21%  3493450320433654773610109291263389161
 100.86.12.225  datacenter1 rack1  Up      Normal   558.25 MB  14.25%  27742979166206700793970535921354744095
 100.86.12.224  datacenter1 rack1  Up      Normal   5.01 GB    6.58%   38945137636148605752956920077679425910
 ERROR:
 root@ip bin]# ./nodetool -h100.86.17.90 decommission
 Exception in thread main java.lang.UnsupportedOperationException
 at java.util.AbstractList.remove(AbstractList.java:144)
 at java.util.AbstractList$Itr.remove(AbstractList.java:360)
 at java.util.AbstractCollection.removeAll(AbstractCollection.java:337)
 at 
 org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:1041)
 at 
 org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:1006)
 at 
 org.apache.cassandra.service.StorageService.handleStateLeaving(StorageService.java:877)
 at 
 org.apache.cassandra.service.StorageService.onChange(StorageService.java:732)
 at 
 org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:839)
 at 
 org.apache.cassandra.gms.Gossiper.addLocalApplicationState(Gossiper.java:986)
 at 
 org.apache.cassandra.service.StorageService.startLeaving(StorageService.java:1836)
 at 
 org.apache.cassandra.service.StorageService.decommission(StorageService.java:1855)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
 at 
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
 at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
 at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
 at 
 com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1426)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
 at 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1264)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1359)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at