map reduce for Cassandra

2014-07-21 Thread Marcelo Elias Del Valle
Hi,

I need to execute a map/reduce job to identify data stored in
Cassandra before indexing this data into Elasticsearch.

I have already used ColumnFamilyInputFormat (before I started using CQL) to
write Hadoop jobs to do that, but I used to have a lot of trouble with
tuning, as Hadoop depends on how map tasks are split in order to
successfully execute things in parallel for IO-bound processes.

First question is: Am I the only one having problems with that? Is anyone
else using hadoop jobs that reads from Cassandra in production?

Second question is about the alternatives. I saw the new version of Spark
will have Cassandra support, but using CqlPagingInputFormat, from Hadoop. I
tried to use Hive with Cassandra Community, but it seems it only works with
Cassandra Enterprise and doesn't do more than Facebook's Presto
(http://prestodb.io/), which we have been using to read from Cassandra; so
far it has been great for SQL-like queries. For custom map/reduce jobs,
however, it is not enough.

Does anyone know some other tool that performs M/R on Cassandra? My
impression is most tools were created to work on top of HDFS, and reading
from a NoSQL database is some kind of workaround.

Third question is about how these tools work. Most of them write mapped
data to intermediate storage, then the data is shuffled and sorted, then it
is reduced. Even when using CqlPagingInputFormat, if you are using Hadoop
it will write files to HDFS after the mapping phase, shuffle and sort this
data, and then reduce it.
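As a toy illustration of those phases (plain Python, not tied to Hadoop or
any real framework), a word count goes through map, shuffle/sort, and
reduce like this:

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs -- here, (word, 1) for a word count."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle/sort: group values by key, keys in sorted order.

    In Hadoop this step is backed by intermediate files on local disk/HDFS.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reduce: fold each key's list of values into a single result."""
    return {key: sum(values) for key, values in grouped}

lines = ["a b a", "b c"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
# counts == {"a": 2, "b": 2, "c": 1}
```

The point of the sketch is that the shuffle/sort in the middle is where
the intermediate storage question arises.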

I wonder if a tool supporting Cassandra out of the box wouldn't be smarter.
Is it faster to write all your data to a file and then sort it, or to batch
insert the data and have it already indexed, as happens when you store data
in a Cassandra CF? I didn't do the calculations to check the complexity of
each approach, which would have to consider that no single index in
Cassandra can be really large, as the maximum index size will always depend
on the maximum capacity of a single host, but my guess is that a map/reduce
tool written specifically for Cassandra, from the beginning, could perform
much better than a tool written for HDFS and adapted. I hear people saying
map/reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS. Does
that really make sense? Should we expect a result like this?

Final question: Do you think writing a new M/R tool as described would be
reinventing the wheel? Or does it make sense?

Thanks in advance. Any opinions on this subject will be much appreciated.

Best regards,
Marcelo Valle.


Re: map reduce for Cassandra

2014-07-21 Thread Jonathan Haddad
Hey Marcelo,

You should check out Spark.  It intelligently deals with a lot of the
issues you're mentioning.  Al Tobey did a walkthrough of how to set up
the OSS side of things here:
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html

It'll be less work than writing an M/R framework from scratch :)
Jon


On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


Re: map reduce for Cassandra

2014-07-21 Thread Marcelo Elias Del Valle
Hi Jonathan,

Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra
will only be supported in the next version, but I would like to be wrong...

Best regards,
Marcelo Valle.



2014-07-21 13:06 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:



Re: map reduce for Cassandra

2014-07-21 Thread Jonathan Haddad
I haven't tried PySpark yet, but it's part of the distribution.  My
main language is Python too, so I intend to get deep into it.

On Mon, Jul 21, 2014 at 9:38 AM, Marcelo Elias Del Valle
marc...@s1mbi0se.com.br wrote:


Re: map reduce for Cassandra

2014-07-21 Thread Marcelo Elias Del Valle
Jonathan,

From what I have read in the docs, the Python API still has some
limitations; it is not yet possible to use arbitrary Hadoop binary input
formats.

The python example for Cassandra is only in the master branch:
https://github.com/apache/spark/blob/master/examples/src/main/python/cassandra_inputformat.py

I may be lacking knowledge of Spark, but if I understood it correctly, the
access to Cassandra data is still made through CqlPagingInputFormat, from
the Hadoop integration.

Here is where I ask: even if Spark supports Cassandra, will it be fast
enough?

My understanding (please someone correct me if I am wrong) is that when you
insert N items into a Cassandra CF, you are executing N binary searches to
insert each item already indexed by a key. When you read the data, it's
already sorted, so you pay O(N * log(N)) (binary search complexity) to
insert all the data already sorted.

However, by using a fast sort algorithm, you also pay O(N * log(N)) to
sort the data after it was inserted, but with more IO.
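A rough sanity check of the two costs (a pure-Python toy, not a benchmark;
the real difference between the systems is mostly IO, which this does not
capture): both approaches do O(N log N) comparisons, though keeping a list
sorted on every insert also pays an O(N) element shift per insert.

```python
import bisect
import random

def insert_sorted(items):
    """Insert each item into an already-sorted list, one at a time --
    roughly how a sorted index is maintained as data arrives."""
    out = []
    for x in items:
        bisect.insort(out, x)  # O(log n) binary search + O(n) shift
    return out

def sort_after(items):
    """Collect everything first, then sort once at the end
    (the Hadoop-style approach)."""
    return sorted(items)

data = [random.randrange(10_000) for _ in range(1_000)]
# Both yield the same sorted result; only the cost profile differs.
assert insert_sorted(data) == sort_after(data)
```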

If I write a job in Spark / Java with Cassandra, how will the mapped data
be stored and sorted? Will it be stored in Cassandra too? Will Spark run a
sort after the mapping?

Best regards,
Marcelo.



2014-07-21 14:06 GMT-03:00 Jonathan Haddad j...@jonhaddad.com:


Re: map reduce for Cassandra

2014-07-21 Thread Robert Coli
On Mon, Jul 21, 2014 at 10:54 AM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 My understanding (please someone correct me if I am wrong) is that when you
 insert N items into a Cassandra CF, you are executing N binary searches to
 insert each item already indexed by a key. When you read the data, it's
 already sorted, so you pay O(N * log(N)) (binary search complexity) to
 insert all the data already sorted.


You're wrong, unless you're talking about insertion into a memtable, which
you probably aren't and which probably doesn't actually work that way
enough to be meaningful.

On disk, Cassandra has immutable datafiles, from which row fragments are
merged into a row at read time. I'm pretty sure the rest of the stuff you
said doesn't make any sense in light of this?
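That read-time merge can be sketched in a few lines (a toy model with
dict-based "SSTables" and invented field names, nothing like Cassandra's
actual on-disk structures; the newest timestamp wins per column, matching
Cassandra's last-write-wins reconciliation):

```python
def read_row(sstables, row_key):
    """Merge fragments of one row from several immutable 'SSTables'.

    Each sstable maps row_key -> {column: (timestamp, value)};
    for each column, the fragment with the newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for column, (ts, value) in sstable.get(row_key, {}).items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {col: value for col, (ts, value) in merged.items()}

older = {"user:1": {"name": (1, "marcelo"), "city": (1, "sp")}}
newer = {"user:1": {"city": (2, "rio")}}
row = read_row([older, newer], "user:1")
# row == {"name": "marcelo", "city": "rio"}
```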

=Rob


Re: map reduce for Cassandra

2014-07-21 Thread Marcelo Elias Del Valle
Hi Robert,

First of all, thanks for answering.


2014-07-21 20:18 GMT-03:00 Robert Coli rc...@eventbrite.com:

 You're wrong, unless you're talking about insertion into a memtable, which
 you probably aren't and which probably doesn't actually work that way
 enough to be meaningful.

 On disk, Cassandra has immutable datafiles, from which row fragments are
 merged into a row at read time. I'm pretty sure the rest of the stuff you
 said doesn't make any sense in light of this?


Although several SSTables (disk fragments) may have the same row key,
inside a single SSTable row keys and column keys are indexed, right?
Otherwise, doing a GET in Cassandra would take some time.
From the M/R perspective, I was referring to the memtable, as I am trying
to compare the time to insert into Cassandra against the time of sorting in
Hadoop.

To make it more clear: Hadoop has its own partitioner, which is used after
the map phase. The map output is written locally on each Hadoop node, then
it's shuffled from one node to the other (see slide 17 in this
presentation: http://pt.slideshare.net/j_singh/nosql-and-mapreduce). In
other words, you may read Cassandra data in Hadoop, but the intermediate
results are still stored in HDFS.

Instead of using the Hadoop partitioner, I would like to store the
intermediate results in a Cassandra CF, so the map output would go directly
to an intermediate column family via batch inserts, instead of being
written to a local disk first, then shuffled to the right node.

Therefore, the mapper would write its output the same way all data enters
Cassandra: first into a memtable, then flushed to an SSTable, then read
during the reduce phase.

Shouldn't it be faster than storing intermediate results in HDFS?
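The idea can be sketched as follows (hypothetical names throughout; the
per-reducer bucket kept sorted on insert stands in for the memtable-backed
intermediate column family -- this illustrates the shape of the proposal
only, not an implementation):

```python
import bisect

def cassandra_style_shuffle(mapped_pairs, num_reducers):
    """Toy version of the proposal: instead of spilling map output to
    local disk and shuffling afterwards, 'batch insert' each (key, value)
    pair into a per-reducer bucket that stays sorted as data arrives,
    the way a memtable keeps its index sorted on insert."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        bucket = hash(key) % num_reducers          # the partitioner
        bisect.insort(buckets[bucket], (key, value))  # sorted on arrival
    # Each bucket is already sorted, ready for its reducer to stream.
    return buckets

pairs = [("b", 1), ("a", 1), ("b", 1)]
buckets = cassandra_style_shuffle(pairs, num_reducers=1)
assert buckets[0] == [("a", 1), ("b", 1), ("b", 1)]
```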

Best regards,
Marcelo.


Re: map reduce for Cassandra

2014-07-21 Thread Robert Coli
On Mon, Jul 21, 2014 at 5:45 PM, Marcelo Elias Del Valle 
marc...@s1mbi0se.com.br wrote:

 Although several sstables (disk fragments) may have the same row key,
 inside a single sstable row keys and column keys are indexed, right?
 Otherwise, doing a GET in Cassandra would take some time.
 From the M/R perspective, I was reffering to the mem table, as I am trying
 to compare the time to insert in Cassandra against the time of sorting in
 hadoop.


I was confused, because unless you are using the new in-memory
column families, which I believe are only available in DSE, there is no way
to ensure that any given row stays in a memtable. Very rarely is there a
view of the function of a memtable that only cares about its properties and
not the closely related properties of SSTables. However, yours is one of
them; I see now why your question makes sense: you only care about the
memtable for how quickly it sorts.

But if you are only relying on memtables to sort writes, that seems like a
pretty heavyweight reason to use Cassandra?

I'm certainly not an expert in this area of Cassandra... but Cassandra, as
a datastore with immutable data files, is not typically a good choice for
short lived intermediate result sets... are you planning to use DSE?

=Rob


Re: map reduce for Cassandra

2014-07-21 Thread Marcelo Elias Del Valle
Hi,


 But if you are only relying on memtables to sort writes, that seems like a
 pretty heavyweight reason to use Cassandra?


Actually, it's not a reason to use Cassandra. I already use Cassandra and I
need to map reduce data from it. I am trying to see a reason to use the
conventional M/R tools or to build a tool specific to Cassandra.

but Cassandra, as a datastore with immutable data files, is not typically a
 good choice for short lived intermediate result sets...


Indeed, but so far I am seeing it as the best option. If storing these
intermediate files in HDFS is better, then I agree there is no reason to
consider Cassandra for it.

are you planning to use DSE?


Our company will probably hire DSE support when it reaches some size, but
DSE as a product doesn't seem interesting for our case so far. The only
tool that would help me at this moment would be Hive, but honestly I didn't
like the way DSE supports Hive, and I don't want to use a solution not
available in DSC (see
http://stackoverflow.com/questions/23959169/problems-using-hive-cassandra-community
for details).

[]s






Re: Pig / Map Reduce on Cassandra

2013-03-18 Thread cscetbon.ext
 astyanax library, which seems
to fail horribly on 1.2, so we're still on 1.1.8.  But we're just
starting out with this and I'm still debating between Cassandra and
HBase.  So I just want to know if there is a limitation here or not,
as I have no idea when 1.2 support will exist in Astyanax.

That said, are there other Java (Scala) libraries that people use to
connect to Cassandra that support 1.2?

-James-

On Thu, Jan 17, 2013 at 8:30 AM, cscetbon@orange.com wrote:
Ok, I understand that I need to manage both cassandra and hadoop components
and that pig will use hadoop components to launch its tasks which will use
Cassandra as the Storage engine.

Thanks
--
Cyril SCETBON

On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote:

This really depends on how you design your Hadoop Cluster.  The testing I
have done, had Hadoop and Cassandra Nodes collocated on the same hosts.
Remember that Pig code runs inside of your hadoop cluster, and connects to
Cassandra as the Database engine.


I have not done any testing with Hive, so someone else will have to answer
that question.


From: cscetbon@orange.com
Reply-To: user@cassandra.apache.org
Date: Thursday, January 17, 2013 8:58 AM
To: user@cassandra.apache.org
Subject: Re: Pig / Map Reduce on Cassandra
Subject: Re: Pig / Map Reduce on Cassandra

Jimmy,

I understand that CFS can replace HDFS for those who use Hadoop. I just
want to use Pig and Hive on Cassandra. I know that Pig samples are provided
and now work with Cassandra natively (they are part of the core). However,
does it mean that the process will be spread over nodes with
number_of_mappers = number_of_nodes, or something like that?
Can Hive connect to Cassandra 1.2 easily too?

--
Cyril Scetbon

On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com wrote:

CFS is Cassandra File System:
http://www.datastax.com/dev/blog/cassandra-file-system-design


But you don't need CFS to connect from Pig to Cassandra.  The latest
versions of the Cassandra source ship with examples of connecting from Pig
to Cassandra.


apache-cassandra-1.2.0-src/examples/pig   --
http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz

--Jimmy


From: cscetbon@orange.com
Reply-To: user@cassandra.apache.org
Date: Thursday, January 17, 2013 6:35 AM
To: user@cassandra.apache.org
Subject: Re: Pig / Map Reduce on Cassandra
Subject: Re: Pig / Map Reduce on Cassandra

What do you mean? It's not needed by Pig or Hive to access Cassandra data.

Regards

On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote:

You won't get CFS,
but it's not a hard requirement, either.


_

Ce message et ses pieces jointes peuvent contenir des informations
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu
ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and
delete this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages
that have been modified, changed or falsified.
Thank you.


_

Ce message et ses pieces jointes peuvent contenir des informations
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu
ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and
delete this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages
that have been modified

Re: Pig / Map Reduce on Cassandra

2013-03-14 Thread aaron morton
 that I need to manage both cassandra and hadoop 
 components
 and that pig will use hadoop components to launch its tasks which 
 will use
 Cassandra as the Storage engine.
 
 Thanks
 --
 Cyril SCETBON
 
 On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com 
 wrote:
 
 This really depends on how you design your Hadoop Cluster.  The 
 testing I
 have done, had Hadoop and Cassandra Nodes collocated on the same 
 hosts.
 Remember that Pig code runs inside of your hadoop cluster, and 
 connects to
 Cassandra as the Database engine.
 
 
 I have not done any testing with Hive, so someone else will have to 
 answer
 that question.
 
 
 From: cscetbon@orange.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, January 17, 2013 8:58 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Re: Pig / Map Reduce on Cassandra
 
 Jimmy,
 
 I understand that CFS can replace HDFS for those who use Hadoop. I 
 just want
 to use pig and hive on cassandra. I know that pig samples are 
 provided and
 work now with cassandra natively (they are part of the core). 
 However, does
 it mean that the process will be spread over nodes with
 number_of_mapper=number_of_nodes or something like that ?
 Can Hive connect to Cassandra 1.2 easily too ?
 
 --
 Cyril Scetbon
 
 On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com 
 wrote:
 
 CFS is Cassandra File System:
 http://www.datastax.com/dev/blog/cassandra-file-system-design
 
 
 But you don't need CFS to connect from PIG to Cassandra.  The latest
 versions of Cassandra Source ship with examples of connecting from 
 pig to
 cassandra.
 
 
 apache-cassandra-1.2.0-src/examples/pig   --
 http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz
 
 --Jimmy
 
 
 From: cscetbon@orange.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, January 17, 2013 6:35 AM
 To: user@cassandra.apache.org user@cassandra.apache.org
 Subject: Re: Pig / Map Reduce on Cassandra
 
 what do you mean ? it's not needed by Pig or Hive to access Cassandra 
 data.
 
 Regards
 
 On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com 
 wrote:
 
 You won't get CFS,
 but it's not a hard requirement, either.
 
 
 _

 Ce message et ses pieces jointes peuvent contenir des informations
 confidentielles ou privilegiees et ne doivent donc pas etre diffuses,
 exploites ou copies sans autorisation. Si vous avez recu ce message par
 erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les
 pieces jointes. Les messages electroniques etant susceptibles
 d'alteration, France Telecom - Orange decline toute responsabilite si ce
 message a ete altere, deforme ou falsifie. Merci.

 This message and its attachments may contain confidential or privileged
 information that may be protected by law; they should not be
 distributed, used or copied without authorisation. If you have received
 this email in error, please notify the sender and delete this message
 and its attachments. As emails may be altered, France Telecom - Orange
 is not liable for messages that have been modified, changed or
 falsified. Thank you.
 
 
Re: Pig / Map Reduce on Cassandra

2013-03-13 Thread cscetbon.ext
Ok forget it. It was a mix of mistakes like environment variables not set, 
package name not added in the script and libraries not found.

Regards
--
Cyril SCETBON

On Mar 12, 2013, at 10:43 AM, cscetbon@orange.com wrote:

I'm already using Cassandra 1.2.2 with only one line to test the cassandra 
access :

rows = LOAD 'cassandra://twissandra/users' USING 
org.apache.cassandra.hadoop.pig.CassandraStorage();

extracted from the sample script provided in the sources
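The one-liner above can be grown into a small end-to-end smoke test. A hedged sketch (the keyspace and column family names come from the twissandra sample used in examples/pig; the wrapper path is from the 1.2.0 source tree, and the script file name is made up here):

```shell
# Illustrative only: write the smoke test out as a Pig script and run it
# through the pig_cassandra wrapper shipped in the Cassandra source tree.
cat > count_users.pig <<'EOF'
rows = LOAD 'cassandra://twissandra/users'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();
grouped = GROUP rows ALL;
counted = FOREACH grouped GENERATE COUNT(rows);
DUMP counted;
EOF

# Running it needs a live cluster and the example environment set up:
# apache-cassandra-1.2.0-src/examples/pig/bin/pig_cassandra -x local count_users.pig
```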
--
Cyril SCETBON

On Mar 12, 2013, at 6:57 AM, aaron morton aa...@thelastpickle.com wrote:

any idea why the function loadFunc does not work correctly ?
No sorry.
Not sure why you are linking to the CQL info or what Pig script / config you 
are running.
Did you follow the example in the examples/pig in the source distribution ?

Also please use at least cassandra 1.1.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/03/2013, at 9:39 AM, cscetbon@orange.com wrote:

You said all versions. However, when I try to access 
cassandra://twissandra/users based on 
http://www.datastax.com/docs/1.0/dml/using_cql I get :

2013-03-11 17:35:48,444 [main] INFO  org.apache.pig.Main - Apache Pig version 
0.11.0 (r1446324) compiled Feb 14 2013, 16:40:57
2013-03-11 17:35:48,445 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /Users/cyril/pig_1363019748442.log
2013-03-11 17:35:48.583 java[13809:1203] Unable to load realm info from 
SCDynamicStore
2013-03-11 17:35:48,750 [main] INFO  org.apache.pig.impl.util.Utils - Default 
bootup file /Users/cyril/.pigbootup not found
2013-03-11 17:35:48,831 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2013-03-11 17:35:49,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2245: Cannot get schema from loadFunc 
org.apache.cassandra.hadoop.pig.CassandraStorage

with pig 0.11.0

any idea why the function loadFunc does not work correctly ?

thanks
--
Cyril SCETBON

On Jan 18, 2013, at 7:00 PM, aaron morton aa...@thelastpickle.com wrote:

Silly question -- but does hive/pig hadoop etc work with cassandra
1.1.8?  Or only with 1.2?
all versions.

We are using astyanax library, which seems
to fail horribly on 1.2,
How does it fail ?
If you think you have a bug post it at https://github.com/Netflix/astyanax

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/01/2013, at 7:48 AM, James Lyons james.ly...@gmail.com wrote:

Silly question -- but does hive/pig hadoop etc work with cassandra
1.1.8?  Or only with 1.2?  We are using astyanax library, which seems
to fail horribly on 1.2, so we're still on 1.1.8.  But we're just
starting out with this and i'm still debating between cassandra and
hbase.  So I just want to know if there is a limitation here or not,
as I have no idea when 1.2 support will exist in astyanax.

That said, are there other java (scala) libraries that people use to
connect to cassandra that support 1.2?

-James-

On Thu, Jan 17, 2013 at 8:30 AM, cscetbon@orange.com wrote:
Ok, I understand that I need to manage both cassandra and hadoop components
and that pig will use hadoop components to launch its tasks which will use
Cassandra as the Storage engine.

Thanks
--
Cyril SCETBON

On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote:

This really depends on how you design your Hadoop Cluster.  The testing I
have done, had Hadoop and Cassandra Nodes collocated on the same hosts.
Remember that Pig code runs inside of your hadoop cluster, and connects to
Cassandra as the Database engine.


I have not done any testing with Hive, so someone else will have to answer
that question.


From: cscetbon@orange.com
Reply-To: user@cassandra.apache.org
Date: Thursday, January 17, 2013 8:58 AM
To: user@cassandra.apache.org
Subject: Re: Pig / Map Reduce on Cassandra

Jimmy,

I understand that CFS can replace HDFS for those who use Hadoop. I just want
to use pig and hive on cassandra. I know that pig samples are provided and
work now with cassandra natively (they are part of the core). However, does
it mean that the process will be spread over nodes with
number_of_mapper=number_of_nodes or something like that ?
Can Hive connect to Cassandra 1.2 easily too ?

--
Cyril Scetbon
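On the number_of_mapper=number_of_nodes question above: with the Hadoop input formats, map-task count is driven by the number of input splits, not by node count; each split covers a slice of the token range (by default on the order of 64k rows, via cassandra.input.split.size, if memory serves). A purely illustrative back-of-envelope, with made-up numbers:

```shell
# Illustrative arithmetic only: Hadoop runs one map task per input split,
# and there are roughly ceil(total_rows / rows_per_split) splits overall.
total_rows=10000000      # hypothetical column family size
rows_per_split=65536     # default-ish split size, in rows
map_tasks=$(( (total_rows + rows_per_split - 1) / rows_per_split ))
echo "$map_tasks"        # ~153 map tasks, regardless of how many nodes you have
```

So a 4-node cluster does not mean 4 mappers; tuning the split size is how you change the parallelism.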

On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com wrote:

CFS is Cassandra File System:
http://www.datastax.com


Re: Pig / Map Reduce on Cassandra

2013-03-11 Thread cscetbon.ext
You said all versions. However, when I try to access 
cassandra://twissandra/users based on 
http://www.datastax.com/docs/1.0/dml/using_cql I get :

2013-03-11 17:35:48,444 [main] INFO  org.apache.pig.Main - Apache Pig version 
0.11.0 (r1446324) compiled Feb 14 2013, 16:40:57
2013-03-11 17:35:48,445 [main] INFO  org.apache.pig.Main - Logging error 
messages to: /Users/cyril/pig_1363019748442.log
2013-03-11 17:35:48.583 java[13809:1203] Unable to load realm info from 
SCDynamicStore
2013-03-11 17:35:48,750 [main] INFO  org.apache.pig.impl.util.Utils - Default 
bootup file /Users/cyril/.pigbootup not found
2013-03-11 17:35:48,831 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: file:///
2013-03-11 17:35:49,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
2245: Cannot get schema from loadFunc 
org.apache.cassandra.hadoop.pig.CassandraStorage

with pig 0.11.0

any idea why the function loadFunc does not work correctly ?

thanks
--
Cyril SCETBON

On Jan 18, 2013, at 7:00 PM, aaron morton aa...@thelastpickle.com wrote:

Silly question -- but does hive/pig hadoop etc work with cassandra
1.1.8?  Or only with 1.2?
all versions.

We are using astyanax library, which seems
to fail horribly on 1.2,
How does it fail ?
If you think you have a bug post it at https://github.com/Netflix/astyanax

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/01/2013, at 7:48 AM, James Lyons james.ly...@gmail.com wrote:

Silly question -- but does hive/pig hadoop etc work with cassandra
1.1.8?  Or only with 1.2?  We are using astyanax library, which seems
to fail horribly on 1.2, so we're still on 1.1.8.  But we're just
starting out with this and i'm still debating between cassandra and
hbase.  So I just want to know if there is a limitation here or not,
as I have no idea when 1.2 support will exist in astyanax.

That said, are there other java (scala) libraries that people use to
connect to cassandra that support 1.2?

-James-

On Thu, Jan 17, 2013 at 8:30 AM, cscetbon@orange.com wrote:
Ok, I understand that I need to manage both cassandra and hadoop components
and that pig will use hadoop components to launch its tasks which will use
Cassandra as the Storage engine.

Thanks
--
Cyril SCETBON

On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote:

This really depends on how you design your Hadoop Cluster.  The testing I
have done, had Hadoop and Cassandra Nodes collocated on the same hosts.
Remember that Pig code runs inside of your hadoop cluster, and connects to
Cassandra as the Database engine.


I have not done any testing with Hive, so someone else will have to answer
that question.


From: cscetbon@orange.com
Reply-To: user@cassandra.apache.org
Date: Thursday, January 17, 2013 8:58 AM
To: user@cassandra.apache.org
Subject: Re: Pig / Map Reduce on Cassandra

Jimmy,

I understand that CFS can replace HDFS for those who use Hadoop. I just want
to use pig and hive on cassandra. I know that pig samples are provided and
work now with cassandra natively (they are part of the core). However, does
it mean that the process will be spread over nodes with
number_of_mapper=number_of_nodes or something like that ?
Can Hive connect to Cassandra 1.2 easily too ?

--
Cyril Scetbon

On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com wrote:

CFS is Cassandra File System:
http://www.datastax.com/dev/blog/cassandra-file-system-design


But you don't need CFS to connect from PIG to Cassandra.  The latest
versions of Cassandra Source ship with examples of connecting from pig to
cassandra.


apache-cassandra-1.2.0-src/examples/pig   --
http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz

--Jimmy


From: cscetbon@orange.com
Reply-To: user@cassandra.apache.org
Date: Thursday, January 17, 2013 6:35 AM
To: user@cassandra.apache.org
Subject: Re: Pig / Map Reduce on Cassandra

what do you mean ? it's not needed by Pig or Hive to access Cassandra data.

Regards

On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote:

You won't get CFS,
but it's not a hard requirement, either.



Re: Pig / Map Reduce on Cassandra

2013-03-11 Thread aaron morton
 any idea why the function loadFunc does not work correctly ?
No sorry. 
Not sure why you are linking to the CQL info or what Pig script / config you 
are running. 
Did you follow the example in the examples/pig in the source distribution ? 

Also please use at least cassandra 1.1. 

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/03/2013, at 9:39 AM, cscetbon@orange.com wrote:

 You said all versions. However, when I try to access 
 cassandra://twissandra/users based on 
 http://www.datastax.com/docs/1.0/dml/using_cql I get :
 
 2013-03-11 17:35:48,444 [main] INFO  org.apache.pig.Main - Apache Pig version 
 0.11.0 (r1446324) compiled Feb 14 2013, 16:40:57
 2013-03-11 17:35:48,445 [main] INFO  org.apache.pig.Main - Logging error 
 messages to: /Users/cyril/pig_1363019748442.log
 2013-03-11 17:35:48.583 java[13809:1203] Unable to load realm info from 
 SCDynamicStore
 2013-03-11 17:35:48,750 [main] INFO  org.apache.pig.impl.util.Utils - Default 
 bootup file /Users/cyril/.pigbootup not found
 2013-03-11 17:35:48,831 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting 
 to hadoop file system at: file:///
 2013-03-11 17:35:49,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2245: Cannot get schema from loadFunc 
 org.apache.cassandra.hadoop.pig.CassandraStorage
 
 with pig 0.11.0
 
 any idea why the function loadFunc does not work correctly ?
 
 thanks
 -- 
 Cyril SCETBON
 
 On Jan 18, 2013, at 7:00 PM, aaron morton aa...@thelastpickle.com wrote:
 
 Silly question -- but does hive/pig hadoop etc work with cassandra
 1.1.8?  Or only with 1.2?  
 all versions. 
 
 We are using astyanax library, which seems
 to fail horribly on 1.2, 
 How does it fail ? 
 If you think you have a bug post it at https://github.com/Netflix/astyanax
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 18/01/2013, at 7:48 AM, James Lyons james.ly...@gmail.com wrote:
 
 Silly question -- but does hive/pig hadoop etc work with cassandra
 1.1.8?  Or only with 1.2?  We are using astyanax library, which seems
 to fail horribly on 1.2, so we're still on 1.1.8.  But we're just
 starting out with this and i'm still debating between cassandra and
 hbase.  So I just want to know if there is a limitation here or not,
 as I have no idea when 1.2 support will exist in astyanax.
 
 That said, are there other java (scala) libraries that people use to
 connect to cassandra that support 1.2?
 
 -James-
 
 On Thu, Jan 17, 2013 at 8:30 AM,  cscetbon@orange.com wrote:
 Ok, I understand that I need to manage both cassandra and hadoop components
 and that pig will use hadoop components to launch its tasks which will use
 Cassandra as the Storage engine.
 
 Thanks
 --
 Cyril SCETBON
 
 On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote:
 
 This really depends on how you design your Hadoop Cluster.  The testing I
 have done, had Hadoop and Cassandra Nodes collocated on the same hosts.
 Remember that Pig code runs inside of your hadoop cluster, and connects to
 Cassandra as the Database engine.
 
 
 I have not done any testing with Hive, so someone else will have to answer
 that question.
 
 
 From: cscetbon@orange.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, January 17, 2013 8:58 AM
To: user@cassandra.apache.org
 Subject: Re: Pig / Map Reduce on Cassandra
 
 Jimmy,
 
 I understand that CFS can replace HDFS for those who use Hadoop. I just 
 want
 to use pig and hive on cassandra. I know that pig samples are provided and
 work now with cassandra natively (they are part of the core). However, does
 it mean that the process will be spread over nodes with
 number_of_mapper=number_of_nodes or something like that ?
 Can Hive connect to Cassandra 1.2 easily too ?
 
 --
 Cyril Scetbon
 
 On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com wrote:
 
 CFS is Cassandra File System:
 http://www.datastax.com/dev/blog/cassandra-file-system-design
 
 
 But you don't need CFS to connect from PIG to Cassandra.  The latest
 versions of Cassandra Source ship with examples of connecting from pig to
 cassandra.
 
 
 apache-cassandra-1.2.0-src/examples/pig   --
 http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz
 
 --Jimmy
 
 
 From: cscetbon@orange.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, January 17, 2013 6:35 AM
To: user@cassandra.apache.org
 Subject: Re: Pig / Map Reduce on Cassandra
 
 what do you mean ? it's not needed by Pig or Hive to access Cassandra data.
 
 Regards
 
 On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote:
 
 You won't get CFS,
 but it's not a hard requirement, either

Re: Pig / Map Reduce on Cassandra

2013-01-18 Thread aaron morton
 Silly question -- but does hive/pig hadoop etc work with cassandra
 1.1.8?  Or only with 1.2?  
all versions. 

 We are using astyanax library, which seems
 to fail horribly on 1.2, 
How does it fail ? 
If you think you have a bug post it at https://github.com/Netflix/astyanax

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/01/2013, at 7:48 AM, James Lyons james.ly...@gmail.com wrote:

 Silly question -- but does hive/pig hadoop etc work with cassandra
 1.1.8?  Or only with 1.2?  We are using astyanax library, which seems
 to fail horribly on 1.2, so we're still on 1.1.8.  But we're just
 starting out with this and i'm still debating between cassandra and
 hbase.  So I just want to know if there is a limitation here or not,
 as I have no idea when 1.2 support will exist in astyanax.
 
 That said, are there other java (scala) libraries that people use to
 connect to cassandra that support 1.2?
 
 -James-
 
 On Thu, Jan 17, 2013 at 8:30 AM,  cscetbon@orange.com wrote:
 Ok, I understand that I need to manage both cassandra and hadoop components
 and that pig will use hadoop components to launch its tasks which will use
 Cassandra as the Storage engine.
 
 Thanks
 --
 Cyril SCETBON
 
 On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote:
 
 This really depends on how you design your Hadoop Cluster.  The testing I
 have done, had Hadoop and Cassandra Nodes collocated on the same hosts.
 Remember that Pig code runs inside of your hadoop cluster, and connects to
 Cassandra as the Database engine.
 
 
 I have not done any testing with Hive, so someone else will have to answer
 that question.
 
 
 From: cscetbon@orange.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, January 17, 2013 8:58 AM
To: user@cassandra.apache.org
 Subject: Re: Pig / Map Reduce on Cassandra
 
 Jimmy,
 
 I understand that CFS can replace HDFS for those who use Hadoop. I just want
 to use pig and hive on cassandra. I know that pig samples are provided and
 work now with cassandra natively (they are part of the core). However, does
 it mean that the process will be spread over nodes with
 number_of_mapper=number_of_nodes or something like that ?
 Can Hive connect to Cassandra 1.2 easily too ?
 
 --
 Cyril Scetbon
 
 On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com wrote:
 
 CFS is Cassandra File System:
 http://www.datastax.com/dev/blog/cassandra-file-system-design
 
 
 But you don't need CFS to connect from PIG to Cassandra.  The latest
 versions of Cassandra Source ship with examples of connecting from pig to
 cassandra.
 
 
 apache-cassandra-1.2.0-src/examples/pig   --
 http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz
 
 --Jimmy
 
 
 From: cscetbon@orange.com
 Reply-To: user@cassandra.apache.org
 Date: Thursday, January 17, 2013 6:35 AM
To: user@cassandra.apache.org
 Subject: Re: Pig / Map Reduce on Cassandra
 
 what do you mean ? it's not needed by Pig or Hive to access Cassandra data.
 
 Regards
 
 On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote:
 
 You won't get CFS,
 but it's not a hard requirement, either.
 
 

Re: Pig / Map Reduce on Cassandra

2013-01-17 Thread cscetbon.ext
what do you mean ? it's not needed by Pig or Hive to access Cassandra data.

Regards

On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote:

You won't get CFS,
but it's not a hard requirement, either.





Re: Pig / Map Reduce on Cassandra

2013-01-17 Thread James Schappet
CFS is Cassandra File System:
http://www.datastax.com/dev/blog/cassandra-file-system-design


But you don't need CFS to connect from PIG to Cassandra.  The latest
versions of Cassandra Source ship with examples of connecting from pig to
cassandra.


apache-cassandra-1.2.0-src/examples/pig   --
http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz

--Jimmy


From:  cscetbon@orange.com
Reply-To:  user@cassandra.apache.org
Date:  Thursday, January 17, 2013 6:35 AM
To:  user@cassandra.apache.org
Subject:  Re: Pig / Map Reduce on Cassandra

what do you mean ? it's not needed by Pig or Hive to access Cassandra data.

Regards

On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote:

 You won't get CFS,
 but it's not a hard requirement, either.






Re: Pig / Map Reduce on Cassandra

2013-01-17 Thread cscetbon.ext
Jimmy,

I understand that CFS can replace HDFS for those who use Hadoop. I just want to 
use Pig and Hive on Cassandra. I know that Pig samples are provided and work 
now with Cassandra natively (they are part of the core). However, does that mean 
the processing will be spread over the nodes, with 
number_of_mappers = number_of_nodes or something like that?
Can Hive connect to Cassandra 1.2 easily too?

--
Cyril Scetbon

On Jan 17, 2013, at 2:42 PM, James Schappet jschap...@gmail.com wrote:

CFS is Cassandra File System:  
http://www.datastax.com/dev/blog/cassandra-file-system-design


But you don't need CFS to connect from PIG to Cassandra.  The latest versions 
of Cassandra Source ship with examples of connecting from pig to cassandra.


apache-cassandra-1.2.0-src/examples/pig   -- 
http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz

--Jimmy


From: cscetbon@orange.commailto:cscetbon@orange.com
Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org
Date: Thursday, January 17, 2013 6:35 AM
To: user@cassandra.apache.orgmailto:user@cassandra.apache.org 
user@cassandra.apache.orgmailto:user@cassandra.apache.org
Subject: Re: Pig / Map Reduce on Cassandra

what do you mean ? it's not needed by Pig or Hive to access Cassandra data.

Regards

On Jan 16, 2013, at 11:14 PM, Brandon Williams 
dri...@gmail.commailto:dri...@gmail.com wrote:

You won't get CFS,
but it's not a hard requirement, either.


_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages 
that have been modified, changed or falsified.
Thank you.



_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,
France Telecom - Orange decline toute responsabilite si ce message a ete 
altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, France Telecom - Orange is not liable for messages 
that have been modified, changed or falsified.
Thank you.



Re: Pig / Map Reduce on Cassandra

2013-01-17 Thread James Schappet
This really depends on how you design your Hadoop cluster.  In the testing I
have done, the Hadoop and Cassandra nodes were collocated on the same hosts.
Remember that Pig code runs inside your Hadoop cluster and connects to
Cassandra as the database engine.


I have not done any testing with Hive, so someone else will have to answer
that question.
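The spreading question above comes down to input splits: ColumnFamilyInputFormat asks the cluster for its token ranges and creates (at least) one input split per range, so each map task can be scheduled on a node that owns the data it reads, which is why collocation pays off. A stdlib-only sketch of the idea, dividing the RandomPartitioner token space evenly among nodes; the even spacing and the class/method names are illustrative assumptions, not Cassandra's actual split code:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RingSplitSketch {

    // Divide the RandomPartitioner token space [0, 2^127) into one
    // contiguous range per node, as perfectly balanced tokens would.
    static List<BigInteger[]> splitRing(int nodes) {
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        BigInteger step = ringSize.divide(BigInteger.valueOf(nodes));
        List<BigInteger[]> ranges = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            BigInteger start = step.multiply(BigInteger.valueOf(i));
            // The last range absorbs any rounding remainder.
            BigInteger end = (i == nodes - 1)
                    ? ringSize
                    : step.multiply(BigInteger.valueOf(i + 1));
            ranges.add(new BigInteger[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Four nodes -> four ranges -> at least four map tasks, each of
        // which the scheduler can place on the node owning that range.
        for (BigInteger[] r : splitRing(4)) {
            System.out.println("[" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

With real data the number of splits is usually larger than the node count, because each node's range is subdivided further according to the configured input split size; so number_of_mappers is at least, not exactly, the number of nodes.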


From:  cscetbon@orange.com
Reply-To:  user@cassandra.apache.org
Date:  Thursday, January 17, 2013 8:58 AM
To:  user@cassandra.apache.org
Subject:  Re: Pig / Map Reduce on Cassandra

Jimmy, 

I understand that CFS can replace HDFS for those who use Hadoop. I just want
to use pig and hive on cassandra. I know that pig samples are provided and
work now with cassandra natively (they are part of the core). However, does
it mean that the process will be spread over nodes with
number_of_mapper=number_of_nodes or something like that ?
Can Hive connect to Cassandra 1.2 easily too ?

--
Cyril Scetbon





Re: Pig / Map Reduce on Cassandra

2013-01-17 Thread cscetbon.ext
OK, I understand that I need to manage both Cassandra and Hadoop components, and 
that Pig will use the Hadoop components to launch its tasks, which will use 
Cassandra as the storage engine.

Thanks
--
Cyril SCETBON

On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote:

This really depends on how you design your Hadoop Cluster.  The testing I have 
done, had Hadoop and Cassandra Nodes collocated on the same hosts.
Remember that Pig code runs inside of your hadoop cluster, and connects to 
Cassandra as the Database engine.


I have not done any testing with Hive, so someone else will have to answer that 
question.



Re: Pig / Map Reduce on Cassandra

2013-01-17 Thread James Lyons
Silly question -- but do Hive/Pig/Hadoop etc. work with Cassandra
1.1.8, or only with 1.2?  We are using the Astyanax library, which seems
to fail horribly on 1.2, so we're still on 1.1.8.  But we're just
starting out with this and I'm still debating between Cassandra and
HBase, so I just want to know whether there is a limitation here,
as I have no idea when 1.2 support will land in Astyanax.

That said, are there other Java (or Scala) libraries that people use to
connect to Cassandra that support 1.2?

-James-

On Thu, Jan 17, 2013 at 8:30 AM,  cscetbon@orange.com wrote:
 Ok, I understand that I need to manage both cassandra and hadoop components
 and that pig will use hadoop components to launch its tasks which will use
 Cassandra as the Storage engine.

 Thanks
 --
 Cyril SCETBON


Pig / Map Reduce on Cassandra

2013-01-16 Thread cscetbon.ext
Hi,

I know that the DataStax Enterprise package provides Brisk, but is there a 
community version? Is it easy to interface Hadoop with Cassandra as the storage, 
or do we absolutely have to use Brisk for that?
I know CassandraFS is natively available in Cassandra 1.2, the version I use, 
so is there a way/procedure to interface Hadoop with Cassandra as the storage?

Thanks



Re: Pig / Map Reduce on Cassandra

2013-01-16 Thread James Schappet
Here are a few examples I have worked on, reading from xml.gz files and then
writing to Cassandra.


https://github.com/jschappet/medline

You will also need:

https://github.com/jschappet/medline-base



These examples are Hadoop Jobs using Cassandra as the Data Store.

This one is a good place to start.
https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/LoadMedline/StartJob.java

ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, outputPath);

job.setMapperClass(MapperToCassandra.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

LOG.info("Writing output to Cassandra");
//job.setReducerClass(ReducerToCassandra.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
//org.apache.cassandra.dht.LocalPartitioner
ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setPartitioner(job.getConfiguration(),
        "org.apache.cassandra.dht.RandomPartitioner");






On 1/16/13 7:37 AM, cscetbon@orange.com wrote:

Hi,

I know that DataStax Enterprise package provide Brisk, but is there a
community version ? Is it easy to interface Hadoop with Cassandra as the
storage or do we absolutely have to use Brisk for that ?
I know CassandraFS is natively available in cassandra 1.2, the version I
use, so is there a way/procedure to interface hadoop with Cassandra as
the storage ?

Thanks 





Re: Pig / Map Reduce on Cassandra

2013-01-16 Thread cscetbon.ext
I don't want to write to Cassandra, as it replicates data from another 
datacenter; I just want to use Hadoop jobs (Pig and Hive) to read data from 
it. I would like to use the same configuration as 
http://www.datastax.com/dev/blog/hadoop-mapreduce-in-the-cassandra-cluster but 
I want to know if there are alternatives to the DataStax Enterprise package.

Thanks



Re: Pig / Map Reduce on Cassandra

2013-01-16 Thread James Schappet
Try this one then; it reads from Cassandra, then writes back to Cassandra,
but you could change the write to wherever you would like.



getConf().set(IN_COLUMN_NAME, columnName);

Job job = new Job(getConf(), "ProcessRawXml");
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.setNumReduceTasks(0);

job.setJarByClass(StartJob.class);
job.setMapperClass(ParseMapper.class);
job.setOutputKeyClass(ByteBuffer.class);
//job.setOutputValueClass(Text.class);
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
//org.apache.cassandra.dht.LocalPartitioner
ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
ConfigHelper.setPartitioner(job.getConfiguration(),
        "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);

SlicePredicate predicate = new SlicePredicate()
        .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(columnName)));
//SliceRange slice_range = new SliceRange();
//slice_range.setStart(ByteBufferUtil.bytes(startPoint));
//slice_range.setFinish(ByteBufferUtil.bytes(endPoint));
//predicate.setSlice_range(slice_range);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

job.waitForCompletion(true);


https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/ProcessXml/StartJob.java
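ParseMapper itself isn't shown in the thread; the heart of such a mapper is just decoding the column's raw bytes and pulling fields out of the XML. A stdlib-only sketch of that step, with the Hadoop plumbing omitted; the class name, the PMID element, and the Medline-style payload are assumptions for illustration:

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseMapperSketch {

    // Decode a Cassandra column value (raw bytes) into a String,
    // without disturbing the buffer's position.
    static String decode(ByteBuffer value) {
        byte[] bytes = new byte[value.remaining()];
        value.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Extract the text of the first <PMID> element -- a stand-in for
    // whatever field the real job emits as its map output key.
    static String extractPmid(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("PMID").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer column = ByteBuffer.wrap(
                "<MedlineCitation><PMID>12345</PMID></MedlineCitation>"
                        .getBytes(StandardCharsets.UTF_8));
        System.out.println(extractPmid(decode(column)));
    }
}
```

In the real job the mapper would receive the ByteBuffer from ColumnFamilyInputFormat and write the extracted key/value to the output context instead of printing it.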








On 1/16/13 9:22 AM, cscetbon@orange.com wrote:

I don't want to write to Cassandra as it replicates data from another
datacenter, but I just want to use Hadoop Jobs (Pig and Hive) to read
data from it. I would like to use the same configuration as
http://www.datastax.com/dev/blog/hadoop-mapreduce-in-the-cassandra-cluster
 but I want to know if there are alternatives to DataStax Enterprise
package.

Thanks

Re: Pig / Map Reduce on Cassandra

2013-01-16 Thread Michael Kjellman
Brisk is pretty much stagnant. I think someone forked it to work with 1.0,
but I'm not sure how that is going. You'll need to pay for DSE to get CFS
(which is essentially Brisk) if you want to use any modern version of C*.

Best,
Michael

On 1/16/13 11:17 AM, cscetbon@orange.com wrote:

Thanks, I understand that your code uses the Hadoop interface of Cassandra
to read from it with a job. However, I would like to know how
to bring the pieces (Hive + Pig + Hadoop) together with Cassandra as the
storage layer, not to get code to test. I have found the repository
https://github.com/riptano/brisk which might be a good start for it

Regards 

On Jan 16, 2013, at 4:27 PM, James Schappet jschap...@gmail.com wrote:

 Try this one then, it reads from cassandra, then writes back to
cassandra,
 but you could change the write to where ever you would like.
 
 
 
  getConf().set(IN_COLUMN_NAME, columnName);

  Job job = new Job(getConf(), "ProcessRawXml");
  job.setInputFormatClass(ColumnFamilyInputFormat.class);
  job.setNumReduceTasks(0);

  job.setJarByClass(StartJob.class);
  job.setMapperClass(ParseMapper.class);
  job.setOutputKeyClass(ByteBuffer.class);
  //job.setOutputValueClass(Text.class);
  job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

  ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
  job.setInputFormatClass(ColumnFamilyInputFormat.class);
  ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
  //org.apache.cassandra.dht.LocalPartitioner
  ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
  ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
  ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);

  SlicePredicate predicate = new SlicePredicate()
      .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(columnName)));
  //SliceRange slice_range = new SliceRange();
  //slice_range.setStart(ByteBufferUtil.bytes(startPoint));
  //slice_range.setFinish(ByteBufferUtil.bytes(endPoint));
  //predicate.setSlice_range(slice_range);
  ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

  job.waitForCompletion(true);
 
 
 
https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/ProcessXml/StartJob.java
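For reference, a driver like the StartJob linked above would typically be submitted with the stock Hadoop launcher. A minimal sketch under assumptions: the jar name below is a placeholder, and the main class is inferred from the linked file path.

```shell
# Sketch only: submit the ProcessRawXml job with the standard Hadoop launcher.
# JOB_JAR is a placeholder name; MAIN_CLASS comes from the linked repository path.
JOB_JAR="medline.jar"
MAIN_CLASS="edu.uiowa.icts.jobs.ProcessXml.StartJob"
SUBMIT_CMD="hadoop jar $JOB_JAR $MAIN_CLASS"
echo "$SUBMIT_CMD"
```

The launcher puts the Hadoop and job jars on the classpath; any Cassandra jars the job needs must be bundled in or shipped via -libjars.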
 
 
 
 
 
 
 
 
 On 1/16/13 9:22 AM, cscetbon@orange.com wrote:
 
 I don't want to write to Cassandra, as it replicates data from another datacenter; I just want to use Hadoop jobs (Pig and Hive) to read data from it. I would like to use the same configuration as
 http://www.datastax.com/dev/blog/hadoop-mapreduce-in-the-cassandra-cluster
 but I want to know if there are alternatives to the DataStax Enterprise package.
 
 Thanks
 On Jan 16, 2013, at 3:59 PM, James Schappet jschap...@gmail.com
wrote:
 
 Here are a few examples I have worked on, reading from xml.gz files then writing to Cassandra.
 
 
 https://github.com/jschappet/medline
 
 You will also need:
 
 https://github.com/jschappet/medline-base
 
 
 
 These examples are Hadoop Jobs using Cassandra as the Data Store.
 
 This one is a good place to start.
 
 
 https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/LoadMedline/StartJob.java
 
  ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
  ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, outputPath);

  job.setMapperClass(MapperToCassandra.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);

  LOG.info("Writing output to Cassandra");
  //job.setReducerClass(ReducerToCassandra.class);
  job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

  ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
  //org.apache.cassandra.dht.LocalPartitioner
  ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
  ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
 
 
 
 
 
 
 On 1/16/13 7:37 AM, cscetbon@orange.com wrote:
 
 Hi,

 I know that the DataStax Enterprise package provides Brisk, but is there a community version? Is it easy to interface Hadoop with Cassandra as the storage, or do we absolutely have to use Brisk for that?
 I know CassandraFS is natively available in Cassandra 1.2, the version I use, so is there a way/procedure to interface Hadoop with Cassandra as the storage?
 
 Thanks 
 
 

Re: Pig / Map Reduce on Cassandra

2013-01-16 Thread cscetbon.ext
Here is the point. You're right, this GitHub repository has not been updated for a year and a half. I thought Brisk was just a bundle of some technologies and that it was possible to install the same components and make them work together without using this bundle :(


On Jan 16, 2013, at 8:22 PM, Michael Kjellman mkjell...@barracuda.com wrote:

 Brisk is pretty much stagnant. I think someone forked it to work with 1.0
 but not sure how that is going. You'll need to pay for DSE to get CFS
 (which is essentially Brisk) if you want to use any modern version of C*.
 
 Best,
 Michael
 

Re: Pig / Map Reduce on Cassandra

2013-01-16 Thread Brandon Williams
On Wed, Jan 16, 2013 at 2:37 PM,  cscetbon@orange.com wrote:
 Here is the point. You're right this github repository has not been updated 
 for a year and a half. I thought brisk was just a bundle of some technologies 
 and that it was possible to install the same components and make them work 
 together without using this bundle :(

You can install hadoop manually alongside Cassandra as well as pig.
Pig support is in C*'s tree in o.a.c.hadoop.pig.  You won't get CFS,
but it's not a hard requirement, either.

-Brandon
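Brandon's suggestion (stock Hadoop plus the Pig support shipped in Cassandra's own tree) can be sketched as environment setup for a standalone Pig install. This is a sketch under assumptions: the PIG_* variable names follow the Pig examples shipped in the Cassandra source tree, and the address/port values are placeholders matching the jobs quoted earlier in the thread.

```shell
# Sketch: environment for a stock Pig install reading Cassandra through
# o.a.c.hadoop.pig's CassandraStorage (no CFS/Brisk involved).
# Variable names follow Cassandra's bundled pig examples; values are placeholders.
export PIG_INITIAL_ADDRESS=localhost    # any live Cassandra node
export PIG_RPC_PORT=9160                # Thrift port, as in the jobs above
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
echo "$PIG_INITIAL_ADDRESS:$PIG_RPC_PORT $PIG_PARTITIONER"
```

The Cassandra jars also need to be on Pig's classpath (e.g. via PIG_CLASSPATH) so the storage handler class resolves at script parse time.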


Map Reduce and Cassandra with Trigger patch

2012-11-26 Thread Felipe Schmidt
I'm having some problems running a MapReduce program that uses Cassandra as input.
I already wrote some MapReduce programs against Cassandra 1.0.9, but now I'm trying an old version with a patch that supports triggers (this one:
https://issues.apache.org/jira/browse/CASSANDRA-1311)

When I try to run, it throws the following error:
12/11/26 16:59:06 ERROR config.DatabaseDescriptor: Fatal error: Cannot
locate cassandra.yaml on the classpath

I had this problem before, and the solution was just to add the path of cassandra.yaml to a system property, but now that's not working. I also saw somewhere that one solution would be adding the line set CLASSPATH=%CASSANDRA_HOME%\conf to /bin/cassandra-cli.bat, but that didn't work either.
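For what it's worth, DatabaseDescriptor resolves the config file through the cassandra.config system property (a URL or classpath resource name), so one hedged workaround is to pass it as a JVM flag to the client process. A sketch, with a placeholder path:

```shell
# Sketch: point the job client JVM at cassandra.yaml via the
# cassandra.config system property. The file path is a placeholder.
CASSANDRA_YAML_URL="file:///etc/cassandra/cassandra.yaml"
export HADOOP_OPTS="-Dcassandra.config=$CASSANDRA_YAML_URL"
echo "$HADOOP_OPTS"
```

HADOOP_OPTS only reaches the local client JVM; if the failure happens inside map tasks, the property (or the yaml itself) has to be shipped to the task JVMs instead.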

If someone has some idea of what to do I will be really thankful.

Regards,
Felipe Mathias Schmidt
*(Computer Science UFRGS, RS, Brazil)*

*Most people do not listen with the intent to understand; they listen with
the intent to reply - Stephen Covey*


Re: Map/Reduce over Cassandra

2010-08-18 Thread Drew Dahlke
Hey Bill,

A few months ago we did an experiment with 5 Hadoop nodes pulling from 4 Cassandra nodes. It was pulling down one column family with 8 small columns and just dumping the raw data to HDFS. It was cycling through around 17K map tasks per second. The machines weren't being taxed too hard, so I'm sure there's some concurrency tuning we could have done to speed that up. Unfortunately we don't have that same data on HDFS yet, so I can't really give a direct comparison.

Hope that helps. I'm curious what others have seen as well.

On Tue, Aug 17, 2010 at 6:59 PM, Bill Hastings bllhasti...@gmail.com wrote:
 Hi All
 How performant is M/R on Cassandra when compared to running it on HDFS?
 Anyone have any numbers they can share? Specifically, how much data was the M/R job run against, and what was the throughput? Any information would be very helpful.

 --
 Cheers
 Bill



Map/Reduce over Cassandra

2010-08-17 Thread Bill Hastings
Hi All

How performant is M/R on Cassandra when compared to running it on HDFS?
Anyone have any numbers they can share? Specifically, how much data was the M/R job run against, and what was the throughput? Any information would be very helpful.

-- 
Cheers
Bill