Hadoop Core Upgrade?

2014-03-15 Thread Joe Stein
Hi, what would folks say to a Hadoop core upgrade to 1.2.1?

Is this something that can go into 2.1?

/***
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
/


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-11 Thread Joe Stein
ah! cool, thanks!

On Tue, Mar 11, 2014 at 7:55 PM, Brandon Williams  wrote:

> On Tue, Mar 11, 2014 at 6:53 PM, Joe Stein  wrote:
>
> > Is there a wiki page for the protocol spec? I googled a little but my
> > google fu is off today :(
> >
> >
> We keep that in-tree:
> https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v2.spec
>
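
For anyone else hunting for the spec: it describes a framed binary protocol. As a quick orientation, here is a minimal sketch of packing and unpacking the v2-style frame header (1-byte version, flags, stream id, and opcode, then a 4-byte big-endian body length; the opcode value used below is just an illustrative assumption, not a real request):

```python
import struct

# Native protocol v2-style frame header: version, flags, stream, opcode
# (1 byte each) followed by a 4-byte big-endian body length.
HEADER = struct.Struct(">BBBBI")

def pack_header(version, flags, stream, opcode, body_length):
    """Pack a v2-style frame header into its 8-byte wire form."""
    return HEADER.pack(version, flags, stream, opcode, body_length)

def unpack_header(data):
    """Unpack the 8-byte header back into its five fields."""
    return HEADER.unpack(data[:HEADER.size])

# Example: a request frame with a hypothetical opcode 0x05 and a 42-byte body.
raw = pack_header(0x02, 0x00, 0x01, 0x05, 42)
assert len(raw) == 8
assert unpack_header(raw) == (0x02, 0x00, 0x01, 0x05, 42)
```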


Re: Proposal: freeze Thrift starting with 2.1.0

2014-03-11 Thread Joe Stein
Is there a wiki page for the protocol spec? I googled a little but my
google fu is off today :(

One nice thing about Thrift is that the interface is self-describing for
humans and also serializes into a format the computer handles efficiently.

Apache Kafka exposes a wire protocol, and a lot of developers have built
against it. I think those developers were able to succeed, and didn't feel
left out, because of the documentation that was contributed to the community:
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
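
That documented wire-protocol approach is easy to build clients against. As a rough, hypothetical sketch of the general pattern (size-prefixed binary frames, not the actual Kafka message layout):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a payload with its 4-byte big-endian length, as many
    documented wire protocols (Kafka included) do for requests."""
    return struct.pack(">I", len(payload)) + payload

def deframe(buf: bytes):
    """Split one length-prefixed frame off the front of a buffer,
    returning (payload, remaining_bytes)."""
    (size,) = struct.unpack(">I", buf[:4])
    return buf[4:4 + size], buf[4 + size:]

# Two frames back to back, as they would arrive on a socket.
stream = frame(b"hello") + frame(b"kafka")
first, rest = deframe(stream)
second, rest = deframe(rest)
assert (first, second, rest) == (b"hello", b"kafka", b"")
```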

I have heard of folks who built their platforms to support Cassandra on the
Thrift interface because they felt it gave a tighter integration.  I think
the most recent example was last week at the Titan graph DB talk at the NYC
C* meetup.

I have been recommending CQL3 for nine months now, so if people have enough
heads-up time it should be alright, but I don't know if that expected lead
time is less than when 2.1 is coming out.

Lastly, would 2.2 be released as 3.0?  I ask because everything new would
not be backwards compatible for anyone using the old interface.

/*******
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
/


On Tue, Mar 11, 2014 at 7:26 PM, Edward Capriolo wrote:

> If you are using thrift there probably isn't a reason to upgrade to 2.1
>
> What? Upgrading gets you performance regardless of your api.
>
> We have already gone from "no new feature" talk to "less emphasis on
> testing".
>
> How comforting.
> On Tuesday, March 11, 2014, Dave Brosius  wrote:
> >
> > +1,
> >
> > although supporting thrift in 2.1 seems overly conservative.
> >
> > If you are using thrift there probably isn't a reason to upgrade to 2.1;
> > in fact doing so will become an increasingly dumb idea as less and less
> > emphasis will be placed on testing with 2.1+. This would allow us to
> > greatly simplify the code footprint in 2.1
> >
> >
> >
> >
> > On 03/11/2014 01:00 PM, Jonathan Ellis wrote:
> >>
> >> CQL3 is almost two years old now and has proved to be the better API
> >> that Cassandra needed.  CQL drivers have caught up with and passed the
> >> Thrift ones in terms of features, performance, and usability.  CQL is
> >> easier to learn and more productive than Thrift.
> >>
> >> With static columns and LWT batch support [1] landing in 2.0.6, and
> >> UDT in 2.1 [2], I don't know of any use cases for Thrift that can't be
> >> done in CQL.  Contrariwise, CQL makes many things easy that are
> >> difficult to impossible in Thrift.  New development is overwhelmingly
> >> done using CQL.
> >>
> >> To date we have had an unofficial and poorly defined policy of "add
> >> support for new features to Thrift when that is 'easy.'"  However,
> >> even relatively simple Thrift changes can create subtle complications
> >> for the rest of the server; for instance, allowing Thrift range
> >> tombstones would make filter conversion for CASSANDRA-6506 more
> >> difficult.
> >>
> >> Thus, I think it's time to officially close the book on Thrift.  We
> >> will retain it for backwards compatibility, but we will commit to
> >> adding no new features or changes to the Thrift API after 2.1.0.  This
> >> will help send an unambiguous message to users and eliminate any
> >> remaining confusion from supporting two APIs.  If any new use cases
> >> come to light that can be done with Thrift but not CQL, we will commit
> >> to supporting those in CQL.
> >>
> >> (To a large degree, this merely formalizes what is already de facto
> >> reality.  Most thrift clients have not even added support for
> >> atomic_batch_mutate and cas from 2.0, and popular clients like
> >> Astyanax are migrating to the native protocol.)
> >>
> >> Reasonable?
> >>
> >> [1] https://issues.apache.org/jira/browse/CASSANDRA-6561
> >> [2] https://issues.apache.org/jira/browse/CASSANDRA-5590
> >>
> >
> >
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Re: inserting a row with a map column when using if not exists results in null column for the row

2013-09-25 Thread Joe Stein
yup, thats it :) cool, thanks!

On Wed, Sep 25, 2013 at 9:42 PM, Yuki Morishita  wrote:

> Sounds like https://issues.apache.org/jira/browse/CASSANDRA-6069
>
> On Wed, Sep 25, 2013 at 8:29 PM, Joe Stein  wrote:
> > Hi, I was not sure if there is a reason for this, if I am doing something
> > wrong, or if it is a known issue, but when trying to insert a row with a
> > map collection column using IF NOT EXISTS, the map is coming out as null
> > :( See below, let me know, thanks!
> >
> > cqlsh:rvag> CREATE TABLE users (
> > ... id text PRIMARY KEY,
> > ... given text,
> > ... surname text,
> > ... favs map<text, text>   // A map of text keys, and text values
> > ... );
> > cqlsh:rvag> INSERT INTO users (id, given, surname, favs)
> > ...VALUES ('jsmith', 'John', 'Smith', { 'fruit' :
> > 'apple', 'band' : 'Beatles' });
> > cqlsh:rvag> select * from users;
> >
> >  id | favs  | given | surname
> > +---+---+-
> >  jsmith | {'band': 'Beatles', 'fruit': 'apple'} |  John |   Smith
> >
> > (1 rows)
> >
> > cqlsh:rvag> truncate users;
> > cqlsh:rvag> select * from users;
> >
> > (0 rows)
> >
> > cqlsh:rvag> INSERT INTO users (id, given, surname, favs)
> > ...VALUES ('jsmith', 'John', 'Smith', { 'fruit' :
> > 'apple', 'band' : 'Beatles' }) IF NOT EXISTS;
> > cqlsh:rvag> select * from users;
> >
> >  id | favs | given | surname
> > +--+---+-
> >  jsmith | null |  John |   Smith
> >
> > (1 rows)
> >
> > /***
> >  Joe Stein
> >  Founder, Principal Consultant
> >  Big Data Open Source Security LLC
> >  http://www.stealth.ly
> >  Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
> > /
>
>
>
> --
> Yuki Morishita
>  t:yukim (http://twitter.com/yukim)
>


inserting a row with a map column when using if not exists results in null column for the row

2013-09-25 Thread Joe Stein
Hi, I was not sure if there is a reason for this, if I am doing something
wrong, or if it is a known issue, but when trying to insert a row with a map
collection column using IF NOT EXISTS, the map is coming out as null :(
See below, let me know, thanks!

cqlsh:rvag> CREATE TABLE users (
... id text PRIMARY KEY,
... given text,
... surname text,
... favs map<text, text>   // A map of text keys, and text values
... );
cqlsh:rvag> INSERT INTO users (id, given, surname, favs)
...VALUES ('jsmith', 'John', 'Smith', { 'fruit' :
'apple', 'band' : 'Beatles' });
cqlsh:rvag> select * from users;

 id | favs  | given | surname
+---+---+-
 jsmith | {'band': 'Beatles', 'fruit': 'apple'} |  John |   Smith

(1 rows)

cqlsh:rvag> truncate users;
cqlsh:rvag> select * from users;

(0 rows)

cqlsh:rvag> INSERT INTO users (id, given, surname, favs)
...VALUES ('jsmith', 'John', 'Smith', { 'fruit' :
'apple', 'band' : 'Beatles' }) IF NOT EXISTS;
cqlsh:rvag> select * from users;

 id | favs | given | surname
+--+---+-
 jsmith | null |  John |   Smith

(1 rows)

/***
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
/


Re: Discussion: release quality

2011-11-29 Thread Joe Stein
I need at least a week, maybe two, to promote anything to staging, mainly
because we do weekly releases.  I could introduce a 2-day turnaround but only
with a more fixed schedule.  I am running 0.8.6 in production and REALLY want
to upgrade for nothing more than getting compression (the cost of petabytes
of uncompressed data is just stupid).  So whether it's changing my process or
better understanding the PMC, I am game to help however I can.

One thing I use C* for is holding days' worth of data and re-running those
days for regression testing of our software... simulating production... It
might not take much to turn that around for release testing.
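
(As a rough sketch of what I mean by re-running stored days against the software, with every name here hypothetical, just the shape of a replay-based regression check:)

```python
# Hypothetical shape of a replay-based regression check: store a day's
# worth of events, run them through the current and candidate versions
# of the processing code, and diff the results.
def replay(events, process):
    """Run every stored event through a processing function."""
    return [process(e) for e in events]

def regression_diff(events, old_process, new_process):
    """Return the events whose output changed between versions."""
    return [e for e in events if old_process(e) != new_process(e)]

# Stand-in processing functions for illustration.
old_version = lambda e: e["value"] * 2
new_version = lambda e: e["value"] * 2  # candidate build under test

day_of_events = [{"value": v} for v in range(5)]
assert regression_diff(day_of_events, old_version, new_version) == []
```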

/*
Joe Stein
http://www.medialets.com
Twitter: @allthingshadoop
*/

On Nov 29, 2011, at 10:04 PM, Edward Capriolo  wrote:

> On Tue, Nov 29, 2011 at 6:16 PM, Jeremy Hanna 
> wrote:
> 
>> I'd like to start a discussion about ideas to improve release quality for
>> Cassandra.  Specifically I wonder if the community can do more to help the
>> project as a whole become more solid.  Cassandra has an active and vibrant
>> community using Cassandra for a variety of things.  If we all pitch in a
>> little bit, it seems like we can make a difference here.
>> 
>> Release quality is difficult, especially for a distributed system like
>> Cassandra.  The core devs have done an amazing job with this considering
>> how complicated it is.  Currently, there are several things in place to
>> make sure that a release is generally usable:
>> - review-then-commit
>> - 72 hour voting period
>> - at least 3 binding +1 votes
>> - unit tests
>> - integration tests
>> Then there is the personal responsibility aspect - testing a release in a
>> staging environment before pushing it to production.
>> 
>> I wonder if more could be done here to give more confidence in releases.
>> I wanted to see if there might be ways that the community could help out
>> without being too burdensome on either the core devs or the community.
>> 
>> Some ideas:
>> More automation: run YCSB and stress with various setups.  Maybe people
>> can rotate donating cloud instances (or simply money for them) but have a
>> common set of scripts to do this in the source.
>> 
>> Dedicated distributed test suite: I know there has been work done on
>> various distributed test suites (which is great!) but none have really
>> caught on so far.
>> 
>> I know what the apache guidelines say, but what if the community could
>> help out with the testing effort in a more formal way.  For example, for
>> each release to be finalized, what if there needed to be 3 community
>> members that needed to try it out in their own environment?
>> 
>> What if there was a post release +1 vote for the community to sign off on
>> - sort of a "works for me" kind of thing to reassure others that it's safe
>> to try.  So when the release email gets posted to the user list, start a
>> tradition of people saying +1 in reply if they've tested it out and it
>> works for them.  That's happening informally now when there are problems,
>> but it might be nice to see a vote of confidence.  Just another idea.
>> 
>> Any other ideas or variations?
> 
> 
> I am no software engineering guru, but whenever I +1 a Hive release I
> actually do check out the code and run a couple of queries. Mostly that is
> because there are just so many things that aren't unit testable, like those
> gosh darn bash scripts that launch Java applications. There have been times
> when even after multiple patch revisions and passing unit tests something
> just does not work in the real world. So I never +1 a binary release I
> don't spend an hour with, and if possible I try twisting the knobs on any
> new feature or at least try the basics. Hive is aiming for something like
> quarterly releases.
> 
> So it is possibly better to have Cassandra do time-based releases. It does
> not have to be quarterly, but if people want bleeding-edge features
> (something committed 2 days ago), really they should go out and build
> something from trunk.
> 
> It seems like Cassandra devs have the voting and releasing down to a
> science, but from my world the types of bugs I worry about are data file
> corruption and any weird bug that would result in data faults, like
> read_repair not working, writes not going to the right nodes, or bloom
> filters giving a faulty result. New features are great and I love seeing
> them, but I can wait for those.
> 
> Updates, even trivial ones, now get political; you just never want to be
> the guy that champions an update and then not have it go well :)
> 
> Most users of Cassandra are going to have larg

Re: How is Cassandra being used?

2011-11-16 Thread Joe Stein
This brings up a nice possibility for businesses as well: outsourced
Cassandra monitoring services/solutions from a ring-infrastructure
perspective.

Any chance the URL Cassandra posts information to can be made configurable
too?  Maybe even an abstract class so others can extend it, with the first
implementation being for this stuff?  That makes it easy for your support
contract to let people pull up a dashboard of your Cassandra cluster without
having to give them access to your production network (I hate giving anyone
access to my production network, but so many consulting/support companies
require it) before you even have to do anything... the service can even be
proactive when things start to get naughty.
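
(Sketching the abstract-class idea; every name here is hypothetical, just to show the pluggable-reporter shape:)

```python
from abc import ABC, abstractmethod

class ClusterReporter(ABC):
    """Hypothetical pluggable reporter: the server builds an anonymous
    stats payload and hands it to whatever implementation is configured."""

    @abstractmethod
    def send(self, payload: dict) -> None:
        ...

class LoggingReporter(ClusterReporter):
    """Trivial implementation that just records payloads locally."""
    def __init__(self):
        self.sent = []

    def send(self, payload: dict) -> None:
        self.sent.append(payload)

# The URL (or any destination) lives in the implementation, so a support
# vendor could ship their own subclass without touching the server.
reporter = LoggingReporter()
reporter.send({"keyspaces": 3, "compression_enabled": True})
assert reporter.sent[0]["keyspaces"] == 3
```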

On Wed, Nov 16, 2011 at 11:35 AM, Jake Luciani  wrote:

> Having worked at places where you get fired if software *attempts* to
> contact outside world I understand the concerns.
>
> However, if it's opt-in via config file and requires a restart then there
> is no reason why it should be a concern.
>
>
> On Wed, Nov 16, 2011 at 3:29 AM, Zhu Han  wrote:
>
> > On Wed, Nov 16, 2011 at 3:03 PM, Norman Maurer 
> wrote:
> >
> > > 2011/11/16 Jonathan Ellis :
> > > > I started a "users survey" thread over on the users list (replies are
> > > > still trickling in), but as useful as that is, I'd like to get
> > > > feedback that is more quantitative and with a broader base.  This will
> > > > let us prioritize our development efforts to better address what
> > > > people are actually using it for, with less guesswork.  For instance:
> > > > we put a lot of effort into compression for 1.0.0; if it turned out
> > > > that only 1% of 1.0.x users actually enable compression, then it means
> > > > that we should spend less effort fine-tuning that moving forward, and
> > > > use the energy elsewhere.
> > > >
> > > > (Of course it could also mean that we did a terrible job getting the
> > > > word out about new features and explaining how to use them, but either
> > > > way, it would be good to know!)
> > > >
> > > > I propose adding a basic cluster reporting feature to cassandra.yaml,
> > > > enabled by default.  It would send anonymous information about your
> > > > cluster to an apache.org VM.  Information like: number (but not names)
> > > > of keyspaces and columnfamilies, ks-level options like compression, cf
> > > > options like compaction strategy, data types (again, not names) of
> > > > columns, average row size (or better: the histogram data), and average
> > > > sstables per read.
> > > >
> > > > Thoughts?
> > >
> >
> > -1.
> >
> > It may scare some admins who store sensitive data in Cassandra. Even if
> > it can be disabled, we cannot sleep well at night when we know the door
> > can be opened unintentionally...
> >
> >
> > > Hi there,
> > >
> > > I'm not a Cassandra dev but a user of it. I would really "hate" to
> > > see such code in the cassandra code-base. I understand that it would
> > > be kind of useful to get a better feeling about usage etc, but its
> > > really something that scares the shit out of many managers (and even
> > > devs ;) ).
> > >
> > > So -1 to add this code (*non-binding)
> > >
> > > Bye,
> > > Norman
> > >
> >
>
>
>
> --
> http://twitter.com/tjake
>



-- 

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
*/