Re: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Cassa L
I tried all combinations of the spark-cassandra-connector; none worked.
Finally, I downgraded Spark to 1.5.1 and now it works.
LCassa

On Wed, May 18, 2016 at 11:11 AM, Mohammed Guller 
wrote:

> As Ben mentioned, Spark 1.5.2 does work with C*.  Make sure that you are
> using the correct version of the Spark Cassandra Connector.
>
>
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Tuesday, May 17, 2016 11:00 PM
> *To:* user@cassandra.apache.org; Mohammed Guller
> *Cc:* user
>
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> It definitely should be possible for 1.5.2 (I have used it with
> spark-shell and cassandra connector with 1.4.x). The main trick is in
> lining up all the versions and building an appropriate connector jar.
>
>
>
> Cheers
>
> Ben
>
>
>
> On Wed, 18 May 2016 at 15:40 Cassa L  wrote:
>
> Hi,
>
> I followed instructions to run SparkShell with Spark-1.6. It works fine.
> However, I need to use spark-1.5.2 version. With it, it does not work. I
> keep getting NoSuchMethod Errors. Is there any issue running Spark Shell
> for Cassandra using older version of Spark?
>
>
>
>
>
> Regards,
>
> LCassa
>
>
>
> On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
> wrote:
>
> Yes, it is very simple to access Cassandra data using Spark shell.
>
>
>
> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>
> $SPARK_HOME/bin/spark-shell --packages
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>
>
>
> Step 2: Create a DataFrame pointing to your Cassandra table
>
> val dfCassTable = sqlContext.read
>   .format("org.apache.spark.sql.cassandra")
>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>   .load()
>
>
>
> From this point onward, you have complete access to the DataFrame API. You
> can even register it as a temporary table, if you would prefer to use
> SQL/HiveQL.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
> *Sent:* Monday, May 9, 2016 9:28 PM
> *To:* user@cassandra.apache.org; user
> *Subject:* Re: Accessing Cassandra data from Spark Shell
>
>
>
> You can use SparkShell to access Cassandra via the Spark Cassandra
> connector. The getting started article on our support page will probably
> give you a good steer to get started even if you’re not using Instaclustr:
> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>
>
>
> Cheers
>
> Ben
>
>
>
> On Tue, 10 May 2016 at 14:08 Cassa L  wrote:
>
> Hi,
>
> Has anyone tried accessing Cassandra data using SparkShell? How do you do
> it? Can you use HiveContext for Cassandra data? I'm using community version
> of Cassandra-3.0
>
>
>
> Thanks,
>
> LCassa
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>
>
>
> --
>
> 
>
> Ben Slater
>
> Chief Product Officer, Instaclustr
>
> +61 437 929 798
>


Intermittent CAS error

2016-05-18 Thread Robert Wille
When executing bulk CAS queries, I intermittently get the following error: 

SERIAL is not supported as conditional update commit consistency. Use ANY if
you mean "make sure it is accepted but I don't care how many replicas commit it
for non-SERIAL reads"

This doesn’t make any sense. Obviously, it IS supported because it works most 
of the time. Is this just a result of not enough replicas, and the error 
message is jacked up?

I’m running 2.1.13.
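For reference, the Paxos (serial) consistency level and the commit consistency level are two separate settings on the statement, and the error above is complaining about the commit level. A minimal sketch with the DataStax Java driver (placeholder contact point, keyspace, table and values; not a diagnosis of the intermittent behaviour):

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

// Sketch only: placeholder contact point, keyspace, table and values.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

val cas = new SimpleStatement(
  "UPDATE users SET email = 'new@example.com' WHERE id = 42 IF email = 'old@example.com'")
cas.setSerialConsistencyLevel(ConsistencyLevel.SERIAL) // Paxos phase
cas.setConsistencyLevel(ConsistencyLevel.QUORUM)       // commit phase - must not be SERIAL

val rs = session.execute(cas)
println(s"applied: ${rs.wasApplied()}")

cluster.close()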

Thanks

Robert



Re: Setting bloom_filter_fp_chance < 0.01

2016-05-18 Thread Adarsh Kumar
Hi Sai,

We have a use case where we are designing a table that is going to have
around 50 billion rows, and we require very fast reads. Partitions are not
that complex/big; each holds some validation data for duplicate checks
(consisting of 4-5 int and varchar columns). So we are trying various options
to optimize read performance. Apart from tuning the bloom filter, we are
trying the following:

1). Better data modelling (making appropriate partition and clustering keys)
2). Trying Leveled compaction (changing data model for this one)

Jonathan,

I understand that tuning bloom_filter_fp_chance will not give a drastic
performance gain, but it is one of the many things we are trying.
Please let me know if you have any other suggestions to improve read
performance for this volume of data.

Also, please let me know of any performance benchmarking techniques (currently
we are planning to trigger massive reads from Spark and check cfstats).
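For example, a rough first pass from spark-shell could be a timed full scan through the connector (a sketch only; the keyspace/table names are placeholders):

// Sketch: time a full-table count through the spark-cassandra-connector.
// Placeholder keyspace/table names.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "dedup_check", "keyspace" -> "my_keyspace"))
  .load()

val start = System.nanoTime()
val rows = df.count()
val secs = (System.nanoTime() - start) / 1e9
println(f"read $rows%d rows in $secs%.1f s (~${rows / secs}%.0f rows/s)")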

NOTE: we will be deploying DSE on EC2, so please suggest if you have
anything specific to DSE and EC2.

Adarsh

On Wed, May 18, 2016 at 9:45 PM, Jonathan Haddad  wrote:

> The impact is it'll get massively bigger with very little performance
> benefit, if any.
>
> You can't get 0 because it's a probabilistic data structure.  It tells you
> either:
>
> your data is definitely not here
> your data has a pretty decent chance of being here
>
> but never "it's here for sure"
>
> https://en.wikipedia.org/wiki/Bloom_filter
>
> On Wed, May 18, 2016 at 11:04 AM sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>> hi Adarsh;
>> were there any drawbacks to setting the bloom_filter_fp_chance  to
>> the default value?
>>
>> thanks
>> Sai
>>
>> On Wed, May 18, 2016 at 2:21 AM, Adarsh Kumar 
>> wrote:
>>
>>> Hi,
>>>
>>> What is the impact of setting bloom_filter_fp_chance < 0.01.
>>>
>>> During performance tuning I was trying to tune bloom_filter_fp_chance
>>> and have following questions:
>>>
>>> 1). Why bloom_filter_fp_chance = 0 is not allowed. (
>>> https://issues.apache.org/jira/browse/CASSANDRA-5013)
>>> 2). What is the maximum/recommended value of bloom_filter_fp_chance (if
>>> we do not have any limitation for bloom filter size).
>>>
>>> NOTE: We are using default SizeTieredCompactionStrategy on
>>> cassandra  2.1.8.621
>>>
>>> Thanks in advance..:)
>>>
>>> Adarsh Kumar
>>>
>>
>>


Re: Replication lag between data center

2016-05-18 Thread Jeff Jirsa
Cassandra isn’t a traditional DB – it doesn’t “replicate” in the same way that
a relational DB replicates.

Cassandra clients send mutations (via native protocol or thrift). Those 
mutations include a minimum consistency level for the server to return a 
successful write.

If a write says “Consistency: ALL” - then as soon as the write returns, the 
mutation exists on all nodes (no replication delay – it’s done).
If a write is anything other than ALL, it’s possible that any individual node 
may not have the write when the client is told the write succeeds. At that 
point, the coordinator will make a best effort to deliver the write to all 
nodes in real time, but may fail or time out. As far as I know, there are no 
metrics on this delivery – I believe the writes prior to the coordinator 
returning may have some basic data in TRACE, but wouldn’t expect writes after 
the coordinator returned to have tracing data available.

If any individual node times out completely, the coordinator writes a hint. When the 
coordinator sees the node come back online, it will try to replay the writes by 
replaying the hints – this may happen minutes or hours later.

If it’s unable to replay hints, or if writes are missed for some other reason,
the data may never “replicate” to the other nodes/DCs on its own – you may need
to manually “replicate” it using `nodetool repair`.

Taken together, there’s no simple “replication lag” here – if you write with 
ALL, the lag is “none”. If you write with CL:QUORUM and read with CL:QUORUM, 
your effective lag is “probably none”, because missing replicas will 
read-repair the data on read. If you read or write with low consistency, your 
lag may be milliseconds, hours, weeks, or forever, depending on how long your 
link is down and how often you repair. 
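To make the knobs concrete, here is a minimal client-side sketch (DataStax Java driver; contact point, keyspace, table and values are placeholders) of pinning the write and read consistency levels discussed above:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}

// Sketch only: placeholder contact point, keyspace, table and values.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

// CL ALL: every replica in every DC must acknowledge before execute() returns,
// so there is no "replication lag" for this write once it succeeds.
val write = new SimpleStatement("INSERT INTO events (id, payload) VALUES (1, 'x')")
write.setConsistencyLevel(ConsistencyLevel.ALL)
session.execute(write)

// QUORUM write + QUORUM read overlap on at least one replica, so a subsequent
// read sees the write, and read repair fixes stale replicas it touches.
val read = new SimpleStatement("SELECT payload FROM events WHERE id = 1")
read.setConsistencyLevel(ConsistencyLevel.QUORUM)
val row = session.execute(read).one()

cluster.close()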



From:  cass savy
Reply-To:  "user@cassandra.apache.org"
Date:  Wednesday, May 18, 2016 at 8:03 PM
To:  "user@cassandra.apache.org"
Subject:  Replication lag between data center

How can we determine/measure the replication lag or latency between on premise 
data centers or cross region/Availability zones?





Replication lag between data center

2016-05-18 Thread cass savy
How can we determine/measure the replication lag or latency between on
premise data centers or cross region/Availability zones?


Re: Extending a partially upgraded cluster - supported

2016-05-18 Thread Erik Forsberg



On 2016-05-18 20:19, Jeff Jirsa wrote:

You can’t stream between versions, so in order to grow the cluster, you’ll need 
to be entirely on 2.0 or entirely on 2.1.


OK. I was sure you can't stream between a 2.0 node and a 2.1 node, but 
if I understand you correctly you can't stream between two 2.1 nodes 
unless the sstables on the source node have been upgraded to "ka", i.e. 
the 2.1 sstable version?


Looks like it's extend first, upgrade later, given that we're a bit 
close on disk capacity.


Thanks,
\EF


If you go to 2.1 first, be sure you run upgradesstables before you try to 
extend the cluster.





On 5/18/16, 11:17 AM, "Erik Forsberg"  wrote:


Hi!

I have a 2.0.13 cluster which I need to do two things with:

* Extend it
* Upgrade to 2.1.14

I'm pondering in what order to do things. Is it a supported operation to
extend a partially upgraded cluster, i.e. a cluster upgraded to 2.0
where not all sstables have been upgraded?

If I do that, will the sstables written on the new nodes be in the 2.1
format when I add them? Or will they be written in the 2.0 format so
I'll have to run upgradesstables anyway?

The cleanup I do on the existing nodes, will write the new 2.1 format,
right?

There might be other reasons not to do this, one being that it's seldom
wise to do many operations at once. So please enlighten me on how bad an
idea this is :-)

Thanks,
\EF




Re: Extending a partially upgraded cluster - supported

2016-05-18 Thread Jeff Jirsa
You can’t stream between versions, so in order to grow the cluster, you’ll need 
to be entirely on 2.0 or entirely on 2.1.

If you go to 2.1 first, be sure you run upgradesstables before you try to 
extend the cluster.





On 5/18/16, 11:17 AM, "Erik Forsberg"  wrote:

>Hi!
>
>I have a 2.0.13 cluster which I need to do two things with:
>
>* Extend it
>* Upgrade to 2.1.14
>
>I'm pondering in what order to do things. Is it a supported operation to 
>extend a partially upgraded cluster, i.e. a cluster upgraded to 2.0 
>where not all sstables have been upgraded?
>
>If I do that, will the sstables written on the new nodes be in the 2.1 
>format when I add them? Or will they be written in the 2.0 format so 
>I'll have to run upgradesstables anyway?
>
>The cleanup I do on the existing nodes, will write the new 2.1 format, 
>right?
>
>There might be other reasons not to do this, one being that it's seldom 
>wise to do many operations at once. So please enlighten me on how bad an 
>idea this is :-)
>
>Thanks,
>\EF



Extending a partially upgraded cluster - supported

2016-05-18 Thread Erik Forsberg

Hi!

I have a 2.0.13 cluster which I need to do two things with:

* Extend it
* Upgrade to 2.1.14

I'm pondering in what order to do things. Is it a supported operation to 
extend a partially upgraded cluster, i.e. a cluster upgraded to 2.0 
where not all sstables have been upgraded?


If I do that, will the sstables written on the new nodes be in the 2.1 
format when I add them? Or will they be written in the 2.0 format so 
I'll have to run upgradesstables anyway?


The cleanup I do on the existing nodes, will write the new 2.1 format, 
right?


There might be other reasons not to do this, one being that it's seldom 
wise to do many operations at once. So please enlighten me on how bad an 
idea this is :-)


Thanks,
\EF


RE: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Mohammed Guller
As Ben mentioned, Spark 1.5.2 does work with C*.  Make sure that you are using 
the correct version of the Spark Cassandra Connector.


Mohammed
Author: Big Data Analytics with 
Spark

From: Ben Slater [mailto:ben.sla...@instaclustr.com]
Sent: Tuesday, May 17, 2016 11:00 PM
To: user@cassandra.apache.org; Mohammed Guller
Cc: user
Subject: Re: Accessing Cassandra data from Spark Shell

It definitely should be possible for 1.5.2 (I have used it with spark-shell and 
cassandra connector with 1.4.x). The main trick is in lining up all the 
versions and building an appropriate connector jar.

Cheers
Ben

On Wed, 18 May 2016 at 15:40 Cassa L 
> wrote:
Hi,
I followed instructions to run SparkShell with Spark-1.6. It works fine. 
However, I need to use spark-1.5.2 version. With it, it does not work. I keep 
getting NoSuchMethod Errors. Is there any issue running Spark Shell for 
Cassandra using older version of Spark?


Regards,
LCassa

On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
> wrote:
Yes, it is very simple to access Cassandra data using Spark shell.

Step 1: Launch the spark-shell with the spark-cassandra-connector package
$SPARK_HOME/bin/spark-shell --packages 
com.datastax.spark:spark-cassandra-connector_2.10:1.5.0

Step 2: Create a DataFrame pointing to your Cassandra table
val dfCassTable = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
  .load()

From this point onward, you have complete access to the DataFrame API. You can 
even register it as a temporary table, if you would prefer to use SQL/HiveQL.
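For example, a minimal sketch (reusing the dfCassTable defined above; the column name is a placeholder):

// Register the DataFrame as a temporary table and query it with SQL/HiveQL.
dfCassTable.registerTempTable("cass_table")
val result = sqlContext.sql(
  "SELECT some_column, count(*) AS cnt FROM cass_table GROUP BY some_column LIMIT 10")
result.show()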

Mohammed
Author: Big Data Analytics with 
Spark

From: Ben Slater 
[mailto:ben.sla...@instaclustr.com]
Sent: Monday, May 9, 2016 9:28 PM
To: user@cassandra.apache.org; user
Subject: Re: Accessing Cassandra data from Spark Shell

You can use SparkShell to access Cassandra via the Spark Cassandra connector. 
The getting started article on our support page will probably give you a good 
steer to get started even if you’re not using Instaclustr: 
https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-

Cheers
Ben

On Tue, 10 May 2016 at 14:08 Cassa L 
> wrote:
Hi,
Has anyone tried accessing Cassandra data using SparkShell? How do you do it? 
Can you use HiveContext for Cassandra data? I'm using community version of 
Cassandra-3.0

Thanks,
LCassa
--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798

--

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798


Re: Setting bloom_filter_fp_chance < 0.01

2016-05-18 Thread Jonathan Haddad
The impact is it'll get massively bigger with very little performance
benefit, if any.

You can't get 0 because it's a probabilistic data structure.  It tells you
either:

your data is definitely not here
your data has a pretty decent chance of being here

but never "it's here for sure"

https://en.wikipedia.org/wiki/Bloom_filter
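For a rough sense of the size tradeoff, the textbook sizing formula is about -ln(p) / (ln 2)^2 bits per key; a quick back-of-the-envelope sketch (illustrative only — Cassandra's actual sizing logic differs somewhat):

// Back-of-the-envelope bloom filter sizing: bits per key for a target
// false-positive chance p, using the standard formula -ln(p) / (ln 2)^2.
for (p <- Seq(0.1, 0.01, 0.001, 0.0001)) {
  val bitsPerKey = -math.log(p) / math.pow(math.log(2), 2)
  val gibPerBillionKeys = 1e9 * bitsPerKey / 8 / math.pow(1024, 3)
  println(f"fp_chance = $p%6.4f -> $bitsPerKey%5.1f bits/key, ~$gibPerBillionKeys%4.1f GiB per billion keys")
}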

On Wed, May 18, 2016 at 11:04 AM sai krishnam raju potturi <
pskraj...@gmail.com> wrote:

> hi Adarsh;
> were there any drawbacks to setting the bloom_filter_fp_chance  to the
> default value?
>
> thanks
> Sai
>
> On Wed, May 18, 2016 at 2:21 AM, Adarsh Kumar 
> wrote:
>
>> Hi,
>>
>> What is the impact of setting bloom_filter_fp_chance < 0.01.
>>
>> During performance tuning I was trying to tune bloom_filter_fp_chance and
>> have following questions:
>>
>> 1). Why bloom_filter_fp_chance = 0 is not allowed. (
>> https://issues.apache.org/jira/browse/CASSANDRA-5013)
>> 2). What is the maximum/recommended value of bloom_filter_fp_chance (if
>> we do not have any limitation for bloom filter size).
>>
>> NOTE: We are using default SizeTieredCompactionStrategy on
>> cassandra  2.1.8.621
>>
>> Thanks in advance..:)
>>
>> Adarsh Kumar
>>
>
>


Re: Setting bloom_filter_fp_chance < 0.01

2016-05-18 Thread sai krishnam raju potturi
hi Adarsh;
were there any drawbacks to setting the bloom_filter_fp_chance  to the
default value?

thanks
Sai

On Wed, May 18, 2016 at 2:21 AM, Adarsh Kumar  wrote:

> Hi,
>
> What is the impact of setting bloom_filter_fp_chance < 0.01.
>
> During performance tuning I was trying to tune bloom_filter_fp_chance and
> have following questions:
>
> 1). Why bloom_filter_fp_chance = 0 is not allowed. (
> https://issues.apache.org/jira/browse/CASSANDRA-5013)
> 2). What is the maximum/recommended value of bloom_filter_fp_chance (if we
> do not have any limitation for bloom filter size).
>
> NOTE: We are using default SizeTieredCompactionStrategy on
> cassandra  2.1.8.621
>
> Thanks in advance..:)
>
> Adarsh Kumar
>


Re: Cassandra Debian repos (Apache vs DataStax)

2016-05-18 Thread Eric Evans
On Tue, May 17, 2016 at 2:16 PM, Drew Kutcharian  wrote:
> OK to make things even more confusing, the “Release” files in the Apache Repo 
> say "Origin: Unofficial Cassandra Packages”!!
>
> i.e. http://dl.bintray.com/apache/cassandra/dists/35x/:Release

Yes, as I remember, someone was concerned that the use of the word
"Debian" might be interpreted to mean that the packages were Official
in the Debian sense.

So, these packages-meant-to-be-installed-on-Debian-systems *are* the
official packages of the Apache™ Cassandra™ Project, even if they are
*not* officially a part of Debian™.

Make sense? :)

-- 
Eric Evans
john.eric.ev...@gmail.com


Re: Cassandra Debian repos (Apache vs DataStax)

2016-05-18 Thread Eric Evans
On Tue, May 17, 2016 at 2:11 PM, Drew Kutcharian  wrote:
> BTW, the language on this page should probably change since it currently 
> sounds like the official repo is the DataStax one and Apache is only an 
> "alternative"
>
> http://wiki.apache.org/cassandra/DebianPackaging

It does, doesn't it?  I'll fix this.

Thanks Drew!

-- 
Eric Evans
john.eric.ev...@gmail.com


Re: Low cardinality secondary index behaviour

2016-05-18 Thread DuyHai Doan
Cassandra 3.0.6 does not have SASI. SASI is available only from C* 3.4 but
I advise C* 3.5/3.6 because some critical bugs have been fixed in 3.5

On Wed, May 18, 2016 at 1:58 PM, Atul Saroha 
wrote:

> Thanks Tyler,
>
> SPARSE SASI index solves my use case. Planing to upgrade the cassandra to
> 3.0.6 now.
>
>
> -
> Atul Saroha
> *Lead Software Engineer*
> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs  wrote:
>
>>
>> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha 
>> wrote:
>>
>>> I have concern over using secondary index on field with low cardinality.
>>> Lets say I have few billion rows and each row can be classified in 1000
>>> category. Lets say we have 50 node cluster.
>>>
>>> Now we want to fetch data for a single category using secondary index
>>> over a category. And query is paginated too with fetch size property say
>>> 5000.
>>>
>>> Since query on secondary index works as scatter and gatherer approach by
>>> coordinator node. Would it lead to out of memory on coordinator or timeout
>>> errors too much.
>>>
>>
>> Paging will prevent the coordinator from using excessive memory.  With
>> the type of data that you described, timeouts shouldn't be huge problem
>> because it will only take a few token ranges (assuming you're using vnodes)
>> to get enough matching rows to hit the page size.
>>
>>
>>>
>>> How does pagination (token level data fetch) behave in scatter and
>>> gatherer approach?
>>>
>>
>> Secondary index queries fetch token ranges in sequential order [1],
>> starting with the minimum token.  When you fetch a new page, it resumes
>> from the last token (and primary key) that it returned in the previous page.
>>
>> [1] As an optimization, multiple token ranges will be fetched in parallel
>> based on estimates of how many token ranges it will take to fill the page.
>>
>>
>>>
>>> Secondly, What If we create an inverted table with partition key as
>>> category. Then this will led to lots of data on single node. Then it might
>>> led to hot shard issue and performance issue of data fetching from single
>>> node as a single partition has  millions of rows.
>>>
>>> How should we tackle such low cardinality index in Cassandra?
>>
>>
>> The data distribution that you described sounds like a reasonable fit for
>> secondary indexes.  However, I would also take into account how frequently
>> you run this query and how fast you need it to be.  Even ignoring the
>> scatter-gather aspects of a secondary index query, they are still expensive
>> because they fetch many non-contiguous rows from an SSTable.  If you need
>> to run this query very frequently, that may add too much load to your
>> cluster, and some sort of inverted table approach may be more appropriate.
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>
>


Re: Low cardinality secondary index behaviour

2016-05-18 Thread Atul Saroha
Thanks Tyler,

The SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
3.0.6 now.
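For reference, creating such an index looks roughly like this (a sketch executed through the Java driver; contact point, keyspace, table and column names are placeholders):

import com.datastax.driver.core.Cluster

// Sketch only: placeholder names; creates a SASI index in SPARSE mode.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()
session.execute(
  """CREATE CUSTOM INDEX IF NOT EXISTS events_created_at_idx
    |ON my_keyspace.events (created_at)
    |USING 'org.apache.cassandra.index.sasi.SASIIndex'
    |WITH OPTIONS = { 'mode': 'SPARSE' }""".stripMargin)
cluster.close()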

-
Atul Saroha
*Lead Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs  wrote:

>
> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha 
> wrote:
>
>> I have concern over using secondary index on field with low cardinality.
>> Lets say I have few billion rows and each row can be classified in 1000
>> category. Lets say we have 50 node cluster.
>>
>> Now we want to fetch data for a single category using secondary index
>> over a category. And query is paginated too with fetch size property say
>> 5000.
>>
>> Since query on secondary index works as scatter and gatherer approach by
>> coordinator node. Would it lead to out of memory on coordinator or timeout
>> errors too much.
>>
>
> Paging will prevent the coordinator from using excessive memory.  With the
> type of data that you described, timeouts shouldn't be huge problem because
> it will only take a few token ranges (assuming you're using vnodes) to get
> enough matching rows to hit the page size.
>
>
>>
>> How does pagination (token level data fetch) behave in scatter and
>> gatherer approach?
>>
>
> Secondary index queries fetch token ranges in sequential order [1],
> starting with the minimum token.  When you fetch a new page, it resumes
> from the last token (and primary key) that it returned in the previous page.
>
> [1] As an optimization, multiple token ranges will be fetched in parallel
> based on estimates of how many token ranges it will take to fill the page.
>
>
>>
>> Secondly, What If we create an inverted table with partition key as
>> category. Then this will led to lots of data on single node. Then it might
>> led to hot shard issue and performance issue of data fetching from single
>> node as a single partition has  millions of rows.
>>
>> How should we tackle such low cardinality index in Cassandra?
>
>
> The data distribution that you described sounds like a reasonable fit for
> secondary indexes.  However, I would also take into account how frequently
> you run this query and how fast you need it to be.  Even ignoring the
> scatter-gather aspects of a secondary index query, they are still expensive
> because they fetch many non-contiguous rows from an SSTable.  If you need
> to run this query very frequently, that may add too much load to your
> cluster, and some sort of inverted table approach may be more appropriate.
>
> --
> Tyler Hobbs
> DataStax 
>


Migrating from Cassandra-Lucene to SASI

2016-05-18 Thread Atul Saroha
From Duy Hai DOAN's blog http://www.doanduyhai.com/blog/?p=2058:

   Please note that SASI does not intercept DELETE for indexing. Indeed the
> resolution and reconciliation of deleted data is left to Cassandra at read
> time. SASI only indexes INSERT and UPDATE.
>

Given this, it feels that Lucene is better if you have a use case with frequent
deletes on a column family, as it marks documents as deleted and cleans them up
later.

Though I didn't find any doc mentioning how SASI index entries are cleaned up
after the grace period for the tombstone expires. Will they be removed or not?
Ideally they should be removed.

Also, the fetch behaviour of a Lucene 'filter' search and of SASI is
sequential, i.e. no scatter-gather.

Is it worth migrating to SASI if your column family faces heavy deletes too?
-
Atul Saroha
*Sr. Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA


Setting bloom_filter_fp_chance < 0.01

2016-05-18 Thread Adarsh Kumar
Hi,

What is the impact of setting bloom_filter_fp_chance < 0.01?

During performance tuning I was trying to tune bloom_filter_fp_chance, and I
have the following questions:

1). Why is bloom_filter_fp_chance = 0 not allowed? (
https://issues.apache.org/jira/browse/CASSANDRA-5013)
2). What is the maximum/recommended value of bloom_filter_fp_chance (if we
do not have any limitation on bloom filter size)?

NOTE: We are using default SizeTieredCompactionStrategy on
cassandra  2.1.8.621

Thanks in advance..:)

Adarsh Kumar


Re: Accessing Cassandra data from Spark Shell

2016-05-18 Thread Ben Slater
It definitely should be possible for 1.5.2 (I have used it with spark-shell
and cassandra connector with 1.4.x). The main trick is in lining up all the
versions and building an appropriate connector jar.

Cheers
Ben

On Wed, 18 May 2016 at 15:40 Cassa L  wrote:

> Hi,
> I followed instructions to run SparkShell with Spark-1.6. It works fine.
> However, I need to use spark-1.5.2 version. With it, it does not work. I
> keep getting NoSuchMethod Errors. Is there any issue running Spark Shell
> for Cassandra using older version of Spark?
>
>
> Regards,
> LCassa
>
> On Tue, May 10, 2016 at 6:48 PM, Mohammed Guller 
> wrote:
>
>> Yes, it is very simple to access Cassandra data using Spark shell.
>>
>>
>>
>> Step 1: Launch the spark-shell with the spark-cassandra-connector package
>>
>> $SPARK_HOME/bin/spark-shell --packages
>> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0
>>
>>
>>
>> Step 2: Create a DataFrame pointing to your Cassandra table
>>
>> val dfCassTable = sqlContext.read
>>   .format("org.apache.spark.sql.cassandra")
>>   .options(Map("table" -> "your_column_family", "keyspace" -> "your_keyspace"))
>>   .load()
>>
>>
>>
>> From this point onward, you have complete access to the DataFrame API.
>> You can even register it as a temporary table, if you would prefer to use
>> SQL/HiveQL.
>>
>>
>>
>> Mohammed
>>
>> Author: Big Data Analytics with Spark
>> 
>>
>>
>>
>> *From:* Ben Slater [mailto:ben.sla...@instaclustr.com]
>> *Sent:* Monday, May 9, 2016 9:28 PM
>> *To:* user@cassandra.apache.org; user
>> *Subject:* Re: Accessing Cassandra data from Spark Shell
>>
>>
>>
>> You can use SparkShell to access Cassandra via the Spark Cassandra
>> connector. The getting started article on our support page will probably
>> give you a good steer to get started even if you’re not using Instaclustr:
>> https://support.instaclustr.com/hc/en-us/articles/213097877-Getting-Started-with-Instaclustr-Spark-Cassandra-
>>
>>
>>
>> Cheers
>>
>> Ben
>>
>>
>>
>> On Tue, 10 May 2016 at 14:08 Cassa L  wrote:
>>
>> Hi,
>>
>> Has anyone tried accessing Cassandra data using SparkShell? How do you do
>> it? Can you use HiveContext for Cassandra data? I'm using community version
>> of Cassandra-3.0
>>
>>
>>
>> Thanks,
>>
>> LCassa
>>
>> --
>>
>> 
>>
>> Ben Slater
>>
>> Chief Product Officer, Instaclustr
>>
>> +61 437 929 798
>>
>
> --

Ben Slater
Chief Product Officer, Instaclustr
+61 437 929 798