Reason for Trace Message Drop

2016-06-15 Thread Varun Barala
Hi all,

Can anyone tell me all the possible reasons for the log message below?


*"INFO  [ScheduledTasks:1] 2016-06-14 06:27:39,498
MessagingService.java:929 - _TRACE messages were dropped in last 5000 ms:
928 for internal timeout and 0 for cross node timeout".*
I searched online for the same and found some reasons like:-

* The disk is not able to keep up with the ingest rate
* Resources are not able to support all the parallel running tasks
* A large hint replay after other nodes have been down
* Heavy workload

But in those cases other kinds of messages (mutation, read, write, etc.)
should be dropped by C* as well, and that doesn't happen here.
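
For reference, I can see the dropped counts in nodetool tpstats as well (a
sketch; output trimmed, and the exact format varies by version):

    $ nodetool tpstats
    ...
    Message type           Dropped
    READ                         0
    MUTATION                     0
    _TRACE                     928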

-
Cluster Specifications
--
number of nodes = 1
total number of CF = 2000

-
Machine Specifications
--
RAM: 30 GB
disk: SSD
OS: Ubuntu 14.04


Thanks in advance!!

Regards,
Varun Barala


Re: Data lost in Cassandra 3.5 single instance via Erlang driver

2016-06-15 Thread linbo liao
Thanks Ben, Paul, Alain. I debugged on the client side and found the reason:
pub_timestamp was duplicated. I will use timeuuid instead.

Thanks,
Linbo
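
For reference, the change amounts to something like this (a sketch; the
column names are simplified and payload is a placeholder):

    CREATE TABLE events (
        appkey text,
        pub_date text,
        pub_ts timeuuid,   -- was: pub_timestamp timestamp
        payload text,
        PRIMARY KEY ((appkey, pub_date), pub_ts)
    );
    -- write with now(); read the wall-clock time back with dateOf(pub_ts)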

2016-06-15 13:09 GMT+08:00 Alain Rastoul :

> On 15/06/2016 06:40, linbo liao wrote:
>
>> I am not sure, but it looks like it will cause an update rather than an
>> insert. If that is true, is the only way to include IF NOT EXISTS in the
>> request and inform the client that it failed?
>>
>> Thanks,
>> Linbo
>>
>
> Hi Linbo,
> +1 with what Ben said: a timestamp has millisecond precision and is a bad
> choice for ensuring PK uniqueness.
> If your client and server are on the same physical machine (both on the same
> computer, or different VMs on the same hypervisor), an insert can complete in
> just a few microseconds (2~3 on a recent computer), so your inserts will
> often become "updates".
> The reason is that updates do not really exist in Cassandra, and neither do
> deletes; they are just appends: an append with the same key for an update,
> or an append of a tombstone for a delete.
> You should try a timeuuid instead: it has a node part, a clock sequence and
> a counter in addition to the timestamp part (which you can extract with CQL
> functions), and it exists for exactly this use case.
> See here for the functions:
>
> https://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
>
>
> --
> best,
> Alain
>


Re: Cassandra monitoring

2016-06-15 Thread Kai Wang
I use graphite/jmxtrans/collectd to monitor not just Cassandra but also other
JVM applications, as well as the OS. I've found it more useful and flexible
than OpsCenter for monitoring.
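
For example, a single jmxtrans query shipping one Cassandra metric to Graphite
looks roughly like this (hosts and the chosen MBean are placeholders; check
the metric names against your Cassandra version):

    {
      "servers": [{
        "host": "cassandra-node1",
        "port": "7199",
        "queries": [{
          "obj": "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
          "attr": ["Count", "99thPercentile"],
          "outputWriters": [{
            "@class": "com.googlecode.jmxtrans.model.output.GraphiteWriter",
            "settings": { "host": "graphite-host", "port": 2003 }
          }]
        }]
      }]
    }
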
On Jun 14, 2016 3:10 PM, "Arun Ramakrishnan" 
wrote:

What are the options for a very small and nimble startup to keep a Cassandra
cluster running well oiled? We are on AWS. We are interested in a monitoring
tool, and potentially also cluster management tools.

We are currently on Apache Cassandra 3.7. We were hoping DataStax OpsCenter
would be it (it is free for startups our size), but it looks like it does not
support Cassandra versions greater than 2.1. That is pretty surprising
considering Cassandra 2.1 came out in 2014.

We would consider downgrading to DataStax Cassandra 2.1 just to have robust
monitoring tools, but I am not sure whether having OpsCenter offsets all the
improvements that have been added to Cassandra since 2.1.

Sematext has an integration for monitoring Cassandra. Does anyone have good
experience with it?

How much work would be involved in setting up Ganglia or some similar option
for Cassandra?

Thanks,
Arun


how to force cassandra-stress to actually generate enough data

2016-06-15 Thread Peter Kovgan
Hi,

cassandra-stress is not really helping to populate the disk sufficiently.

I tried several table structures, specifying

cluster: UNIFORM(1..100) on the clustering parts of the PK.

The partition part of the PK produces about 660,000 partitions.

The hope was to create enough cells in a row to make the row really WIDE.

No matter what I tried, and no matter how long it ran, I see at most 2-3
SSTables per node and at most 300 MB of data per node.

(I have 6 nodes and a very active stress run with 400 threads.)

It looks like it is impossible to make the row really wide and the disk
really full.

Is it intentional?

I mean, if there was an intention to avoid really wide rows, why is there no
hint about this in the docs?

Do you have similar experience, and do you know how to resolve it?

Thanks.



Re: Reason for Trace Message Drop

2016-06-15 Thread Eric Stevens
This is better kept to the User groups.

What are your JVM memory settings for Cassandra, and have you seen big GCs
in your logs?

The reason I ask is that 2,000 is a large number of column families, which
produces memory pressure, and at first blush that strikes me as a likely
cause.
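
A quick way to check is to grep for GCInspector (a sketch; the exact line
format varies between Cassandra versions):

    grep GCInspector /var/log/cassandra/system.log | tail
    # e.g. INFO  GCInspector.java - ConcurrentMarkSweep GC in 2345ms ...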

On Wed, Jun 15, 2016 at 3:23 AM Varun Barala 
wrote:

> Hi all,
>
> Can anyone tell me all the possible reasons for the log message below?
>
>
> *"INFO  [ScheduledTasks:1] 2016-06-14 06:27:39,498
> MessagingService.java:929 - _TRACE messages were dropped in last 5000 ms:
> 928 for internal timeout and 0 for cross node timeout".*
> I searched online for the same and found some reasons like:-
>
> * The disk is not able to keep up with the ingest rate
> * Resources are not able to support all the parallel running tasks
> * A large hint replay after other nodes have been down
> * Heavy workload
>
> But in those cases other kinds of messages (mutation, read, write, etc.)
> should be dropped by C* as well, and that doesn't happen here.
>
> -
> Cluster Specifications
> --
> number of nodes = 1
> total number of CF = 2000
>
> -
> Machine Specifications
> --
> RAM: 30 GB
> disk: SSD
> OS: Ubuntu 14.04
>
>
> Thanks in advance!!
>
> Regards,
> Varun Barala
>


Re: how to force cassandra-stress to actually generate enough data

2016-06-15 Thread Ben Slater
Are you running with n=[number ops] or duration=[xx]? I've found you need to
use n= when inserting data. When you use duration, cassandra-stress defaults
to 1,000,000 somethings (to be honest, I'm not entirely sure whether it's
rows, partitions or something else that the 1,000,000 relates to), and
running for a long time just results in overwriting a lot of data that gets
compacted away. Using n=[some number > 1M] will get you n somethings.
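
e.g. something like this (the count, replication and node list are
illustrative):

    cassandra-stress write n=500000000 cl=ONE -rate threads=400 \
      -node 10.0.0.1 -schema 'replication(factor=3)'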

Cheers
Ben

On Wed, 15 Jun 2016 at 22:25 Peter Kovgan 
wrote:

> Hi,
>
> cassandra-stress is not really helping to populate the disk sufficiently.
>
> I tried several table structures, specifying
>
> cluster: UNIFORM(1..100) on the clustering parts of the PK.
>
> The partition part of the PK produces about 660,000 partitions.
>
> The hope was to create enough cells in a row to make the row really WIDE.
>
> No matter what I tried, and no matter how long it ran, I see at most 2-3
> SSTables per node and at most 300 MB of data per node.
>
> (I have 6 nodes and a very active stress run with 400 threads.)
>
> It looks like it is impossible to make the row really wide and the disk
> really full.
>
> Is it intentional?
>
> I mean, if there was an intention to avoid really wide rows, why is there
> no hint about this in the docs?
>
> Do you have similar experience, and do you know how to resolve it?
>
> Thanks.
>
-- 

Ben Slater
Chief Product Officer
Instaclustr: Cassandra + Spark - Managed | Consulting | Support
+61 437 929 798


Cqlsh Questions

2016-06-15 Thread Steve Anderson
A couple of cqlsh questions:

1) Why, when I run the DESCRIBE CLUSTER command, is no snitch information
shown? Is this because my Cassandra cluster is a single node?

2) When I run the HELP CREATE_KEYSPACE command the following info is displayed:

*** No browser to display CQL help. URL for help topic CREATE_KEYSPACE :
https://cassandra.apache.org/doc/cql3/CQL-3.2.html#createKeyspaceStmt

I am connecting via SSH to my Amazon Linux image and hence have no 
browser available. I see the browser can be configured via a .cqlshrc file, but 
can I configure it not to use a browser?

By the way, there is no CQL-3.2.html file under 
https://cassandra.apache.org/doc/cql3/  

Thanks
Steve
—
"Surely, those who believe, those who are Jewish, the Christians, and the 
converts; anyone who (1) believes in God, (2) believes in the Last Day, and (3) 
leads a righteous life, will receive their recompense from their Lord; they 
have nothing to fear, nor will they grieve." (Quran 2:62, 5:69) …learn more at 
www.masjidtucson.org






Re: how to force cassandra-stress to actually generate enough data

2016-06-15 Thread Julien Anguenot
I usually do a write-only bench run first. Doing 1B write iterations will
produce 200 GB+ of data on disk. You can then do mixed tests.

For instance, a write bench that would produce such a volume on a 3-node cluster:

./tools/bin/cassandra-stress write cl=LOCAL_QUORUM n=1000000000 \
  -rate threads=10000 -node 1.2.3.1,1.2.3.2,1.2.3.4 \
  -schema 'replication(strategy=NetworkTopologyStrategy,dallas=3)' \
  -log file=raid5_ssd_1b_10kt_cl_quorum.log \
  -graph file=raid5_ssd_1B_10kt_cl_quorum.html title=raid5_ssd_1B_10kt_cl_quorum \
  revision=benchmark-0

After that you can do various mixed bench runs, with data, SSTables and
compactions kicking in.
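
For example, a later mixed run could look something like this (the ratio and
counts are illustrative):

    ./tools/bin/cassandra-stress mixed ratio\(write=1,read=3\) n=100000000 \
      cl=LOCAL_QUORUM -rate threads=500 -node 1.2.3.1,1.2.3.2,1.2.3.4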

Not sure this is the best or recommended way to achieve the goal when starting
with an empty disk and no dataset, though.

   J.


> On Jun 15, 2016, at 7:24 AM, Peter Kovgan wrote:
> 
> Hi,
>
> cassandra-stress is not really helping to populate the disk sufficiently.
>
> I tried several table structures, specifying
>
> cluster: UNIFORM(1..100) on the clustering parts of the PK.
>
> The partition part of the PK produces about 660,000 partitions.
>
> The hope was to create enough cells in a row to make the row really WIDE.
>
> No matter what I tried, and no matter how long it ran, I see at most 2-3
> SSTables per node and at most 300 MB of data per node.
>
> (I have 6 nodes and a very active stress run with 400 threads.)
>
> It looks like it is impossible to make the row really wide and the disk
> really full.
>
> Is it intentional?
>
> I mean, if there was an intention to avoid really wide rows, why is there
> no hint about this in the docs?
>
> Do you have similar experience, and do you know how to resolve it?
>
> Thanks.

--
Julien Anguenot (@anguenot)



Re: Data lost in Cassandra 3.5 single instance via Erlang driver

2016-06-15 Thread Eric Stevens
As a side note: if you're inserting records quickly enough that you're
potentially doing multiple in the same millisecond, it seems likely to me
that your partition size is going to be too large at the day level unless
your writes are super bursty: ((appkey, pub_date), pub_timestamp). You might
need to bucket by hour, or 15 minutes, or something, depending on what you
think your peak write rate will look like.

And another note, slightly bikeshed, but personally, when doing time-based
bucketing (the pub_date column), I prefer to use a timestamp and floor the
value I write. This makes it easier to convert to a smaller bucket size
without changing the format of the data in that column.
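
To make that concrete, here is a sketch combining both notes (the bucket size
and all names are illustrative, not prescriptive):

    CREATE TABLE events (
        appkey text,
        bucket timestamp,   -- event time floored to the bucket, e.g. the hour
        pub_ts timeuuid,    -- unique even within a single millisecond
        payload text,
        PRIMARY KEY ((appkey, bucket), pub_ts)
    );
    -- client side, on epoch milliseconds: bucket = ts - (ts % 3600000)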

On Wed, Jun 15, 2016 at 1:07 AM linbo liao  wrote:

> Thanks Ben, Paul, Alain. I debugged on the client side and found the reason:
> pub_timestamp was duplicated. I will use timeuuid instead.
>
> Thanks,
> Linbo
>
> 2016-06-15 13:09 GMT+08:00 Alain Rastoul :
>
>> On 15/06/2016 06:40, linbo liao wrote:
>>
>>> I am not sure, but it looks like it will cause an update rather than an
>>> insert. If that is true, is the only way to include IF NOT EXISTS in the
>>> request and inform the client that it failed?
>>>
>>> Thanks,
>>> Linbo
>>>
>>
>> Hi Linbo,
>> +1 with what Ben said: a timestamp has millisecond precision and is a bad
>> choice for ensuring PK uniqueness.
>> If your client and server are on the same physical machine (both on the
>> same computer, or different VMs on the same hypervisor), an insert can
>> complete in just a few microseconds (2~3 on a recent computer), so your
>> inserts will often become "updates".
>> The reason is that updates do not really exist in Cassandra, and neither
>> do deletes; they are just appends: an append with the same key for an
>> update, or an append of a tombstone for a delete.
>> You should try a timeuuid instead: it has a node part, a clock sequence
>> and a counter in addition to the timestamp part (which you can extract
>> with CQL functions), and it exists for exactly this use case.
>> See here for the functions:
>>
>> https://docs.datastax.com/en/cql/3.3/cql/cql_reference/timeuuid_functions_r.html
>>
>>
>> --
>> best,
>> Alain
>>
>
>


Re: Streaming from 1 node only when adding a new DC

2016-06-15 Thread Paulo Motta
For rebuild, replace, and -Dcassandra.consistent.rangemovement=false in
general, we currently pick the closest replica (as indicated by the snitch)
which has the range, which will often map to the same node due to the dynamic
snitch, especially when N=RF. This is good for picking a node in the same DC
or rack to transfer from, but we could probably improve it to distribute the
streaming load more evenly among the candidate source nodes in the same
rack/DC.

Would you mind opening a ticket for improving this?

2016-06-14 17:35 GMT-03:00 Fabien Rousseau :

> We've tested with C* version 2.1.14.
> Yes VNodes with 256 tokens
> Once all the nodes in dc2 are added, the schema is modified to have RF=3 in
> dc1 and RF=3 in dc2.
> Then on each node of dc2:
> nodetool rebuild dc1
> On 14 June 2016 at 10:39, "kurt Greaves" wrote:
>
>> What version of Cassandra are you using? Also what command are you using
>> to run the rebuilds? Are you using vnodes?
>>
>> On 13 June 2016 at 09:01, Fabien Rousseau  wrote:
>>
>>> Hello,
>>>
>>> We've tested adding a new DC from an existing DC having 3 nodes and RF=3
>>> (i.e. all nodes have all data).
>>> During the rebuild process, only one node of the first DC streamed data
>>> to the 3 nodes of the second DC.
>>>
>>> Our goal is to minimise the time it takes to rebuild a DC and would like
>>> to be able to stream from all nodes.
>>>
>>> Starting C* with debug logs, it appears that all nodes, when computing
>>> their "streaming plan", return the same source node for all ranges.
>>> This is probably because all nodes in DC2 have the same view of the ring.
>>>
>>> I understand that when bootstrapping a new node it's preferable to stream
>>> from the node being replaced, but when rebuilding a new DC it should
>>> probably select sources "randomly" (rather than always selecting the same
>>> source for a specific range).
>>> What do you think?
>>>
>>> Best Regards,
>>> Fabien
>>>
>>
>>
>>
>> --
>> Kurt Greaves
>> k...@instaclustr.com
>> www.instaclustr.com
>>
>


Re: Cassandra monitoring

2016-06-15 Thread Otis Gospodnetić
Hi,

On Tue, Jun 14, 2016 at 4:20 PM, Arun Ramakrishnan <
sinchronized.a...@gmail.com> wrote:

> Thanks Jonathan.
>
> Out of curiosity, does OpsCenter support some later version of Cassandra
> that is not OSS?
>
> Well, the most minimal requirement is that I want to be able to monitor
> cluster health and hook this info into some alerting platform. We are AWS
> heavy. We just rely really heavily on AWS CloudWatch for our metrics as of
> now. We prefer not to spend our time setting up additional tools if we can
> help it. So, if we needed a 3rd-party service, we would consider an APM or
> monitoring service that is on the cheaper side.
>

Right. Using N open-source tools can lead to an image like the one at
https://sematext.com/blog/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




>
>
>
>
> On Tue, Jun 14, 2016 at 12:20 PM, Jonathan Haddad 
> wrote:
>
>> Depends what you want to monitor. I wouldn't use a lesser version of
>> Cassandra just for OpsCenter: it doesn't give you a ton you can't get
>> elsewhere, and it's never going to support OSS > 2.1, so you would limit
>> yourself to a pretty old version of Cassandra for no good reason.
>>
>> What else do you use for monitoring in your infra? I've used a mix of OSS
>> tools (nagios, statsd, graphite, ELK) and hosted solutions. The nice part
>> about them is that you can monitor your whole stack in a single UI, not
>> just your database.
>>
>> On Tue, Jun 14, 2016 at 12:10 PM Arun Ramakrishnan <
>> sinchronized.a...@gmail.com> wrote:
>>
>>> What are the options for a very small and nimble startup to keep a
>>> Cassandra cluster running well oiled? We are on AWS. We are interested in
>>> a monitoring tool, and potentially also cluster management tools.
>>>
>>> We are currently on Apache Cassandra 3.7. We were hoping DataStax
>>> OpsCenter would be it (it is free for startups our size), but it looks
>>> like it does not support Cassandra versions greater than 2.1. That is
>>> pretty surprising considering Cassandra 2.1 came out in 2014.
>>>
>>> We would consider downgrading to DataStax Cassandra 2.1 just to have
>>> robust monitoring tools, but I am not sure whether having OpsCenter
>>> offsets all the improvements that have been added to Cassandra since 2.1.
>>>
>>> Sematext has an integration for monitoring Cassandra. Does anyone have
>>> good experience with it?
>>>
>>> How much work would be involved in setting up Ganglia or some similar
>>> option for Cassandra?
>>>
>>> Thanks,
>>> Arun
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>


Re: Cassandra monitoring

2016-06-15 Thread Julien Anguenot
Hey Otis, 

hehe :-) 

I do have the latest agents running, but these metrics are still empty on my
side. I will take that issue up on the Sematext side then.

Thanks.

   J.

> On Jun 15, 2016, at 8:26 AM, Otis Gospodnetić wrote:
> 
> Hi,
> 
> On Wed, Jun 15, 2016 at 8:58 AM, Julien Anguenot wrote:
> 
>> On Jun 14, 2016, at 2:10 PM, Arun Ramakrishnan wrote:
> 
> […]
> 
>> 
>> Sematext has an integration for monitoring Cassandra. Does anyone have
>> good experience with it?
> 
> We are using Sematext with Cassandra 3.0.x and it is mostly working fine
> for us, except a couple of metrics (pending writes and pending cluster ops)
> that never got supported after Cassandra 2.2.x. I did report the issue, but
> it never got fixed. Otis, are you listening? :-)
> 
> I am - almost always ;)
> We've added those metrics you mentioned a while back - see the various tabs 
> on the left in the attached screenshots.  Make sure you have the latest SPM 
> agent for Cassandra.
> 
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ 
> 
>  
> 
> --
> Julien Anguenot (@anguenot)
> 
> 

--
Julien Anguenot (@anguenot)



Re: Cassandra monitoring

2016-06-15 Thread Julien Anguenot

> On Jun 14, 2016, at 2:10 PM, Arun Ramakrishnan wrote:

[…]

> 
> Sematext has an integration for monitoring Cassandra. Does anyone have good
> experience with it?

We are using Sematext with Cassandra 3.0.x and it is mostly working fine for
us, except a couple of metrics (pending writes and pending cluster ops) that
never got supported after Cassandra 2.2.x. I did report the issue, but it
never got fixed. Otis, are you listening? :-)

[…]

   J.

--
Julien Anguenot (@anguenot)



Re: Cqlsh Questions

2016-06-15 Thread Eric Stevens
There's an effort to improve the docs, but while that's catching up, 3.0
has the latest version of the document you're looking for:
https://cassandra.apache.org/doc/cql3/CQL-3.0.html#createKeyspaceStmt

On Wed, Jun 15, 2016 at 5:28 AM Steve Anderson 
wrote:

> Couple of Cqlsh questions:
>
> 1) Why, when I run the DESCRIBE CLUSTER command, is no snitch information
> shown? Is this because my Cassandra cluster is a single node?
>
> 2) When I run the HELP CREATE_KEYSPACE command the following info is
> displayed:
>
> *** No browser to display CQL help. URL for help topic CREATE_KEYSPACE :
> https://cassandra.apache.org/doc/cql3/CQL-3.2.html#createKeyspaceStmt
>
> I am connecting via SSH to my Amazon Linux image and hence have no browser
> available. I see the browser can be configured via a .cqlshrc file, but can
> I configure it not to use a browser?
>
> By the way, there is no CQL-3.2.html file under
> https://cassandra.apache.org/doc/cql3/
>
> Thanks
> Steve
>


Re: how to force cassandra-stress to actually generate enough data

2016-06-15 Thread Benedict Elliott Smith
cassandra-stress has some (many) limitations that I had planned to address now
that it's seeing wider adoption, but since I no longer work on the project for
my day job, I am unlikely to... so, sorry, but you'll have to tolerate them :)

In particular, the problem you encounter here is that a given clustering
*tier* must be generated in its entirety before performing any operation
that touches any of its values (read or write), regardless of how many are
actually needed.  So, if you have a single clustering column in your
primary key, the client must generate the entire partition.  And if you
have a million of them, you may just be watching your cassandra-stress
instance enter a GC spiral and die slowly; in all likelihood the data you
see is just the partitions that get randomly assigned a modest size in your
range.

If you need to generate giant partitions, at the moment you need to have
multiple clustering columns, and preferably keep the cardinality of each to
at most a few hundred. The smaller they are, the faster queries that only
touch small portions of the partition will run (such as point or range
queries, or partial insertions).
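
In practice that means a user profile along these lines (a sketch only; the
schema, cardinalities and counts are placeholders):

    # profile.yaml - run with: cassandra-stress user profile=profile.yaml ops\(insert=1\) n=...
    keyspace: stresscql
    table: wide_rows
    table_definition: |
      CREATE TABLE wide_rows (
        pk bigint, c1 int, c2 int, c3 int, v blob,
        PRIMARY KEY (pk, c1, c2, c3)
      )
    columnspec:
      - name: c1
        cluster: uniform(1..100)
      - name: c2
        cluster: uniform(1..100)
      - name: c3
        cluster: uniform(1..100)  # ~1M rows per partition, but each tier stays small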

On 15 June 2016 at 13:24, Peter Kovgan 
wrote:

> Hi,
>
>
>
> The cassandra-stress is not helping really to populate the disk
> sufficiently.
>
>
>
> I tried several table structures, providing
>
> cluster: UNIFORM(1..100)  on clustering parts of the PK.
>
>
>
> Partition part of PK makes about 660 000 partitions.
>
>
>
> The hope was create enough cells in a row, make the row really WIDE.
>
>
>
> No matter what I tried, does no matter how long it runs, I see maximum 2-3
> SSTables per node and maximum 300Mb of data per node.
>
>
>
> (I have 6 nodes and very active 400 threads stress)
>
>
>
> It looks, like It is impossible to make the row really wide and disk
> really full.
>
>
>
> Is it intentional?
>
>
>
> I mean, if there was an intention to avoid really wide rows, why there is
> no hint on this in docs?
>
>
>
> Do you have similar experience and do you know how resolve that?
>
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
> 
> This communication and all or some of the information contained therein
> may be confidential and is subject to our Terms and Conditions. If you have
> received this communication in error, please destroy all electronic and
> paper copies and notify the sender immediately. Unless specifically
> indicated, this communication is not a confirmation, an offer to sell or
> solicitation of any offer to buy any financial product, or an official
> statement of ICAP or its affiliates. Non-Transactable Pricing Terms and
> Conditions apply to any non-transactable pricing provided. All terms and
> conditions referenced herein available at www.icapterms.com. Please
> notify us by reply message if this link does not work.
>
> 
>


Are you using Cassandra to record user analytics?

2016-06-15 Thread Richard L. Burton III
I'm interested in hearing how you may be using Cassandra to capture user
analytics and some design choices you've made.

At a high level, I'm considering the following:

web application -> kafka -> cassandra

I need to be able to show the user, and at a higher level the company:

   - When a user executed a search: date/time plus the search details
     - allowing the user to select a date range
   - What records the user viewed, and when
   - At a high level, an aggregated view


It seems like the most sensible solution would be to model separate tables to
address these different views (e.g. user vs. manager), along the lines of the
sketch below.
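
A first rough cut (all names and types are placeholders):

    -- per-user search history, newest first, range-queryable by date
    CREATE TABLE searches_by_user (
        user_id uuid,
        executed_at timeuuid,
        details text,
        PRIMARY KEY ((user_id), executed_at)
    ) WITH CLUSTERING ORDER BY (executed_at DESC);

    -- per-user view history
    CREATE TABLE views_by_user (
        user_id uuid,
        viewed_at timeuuid,
        record_id uuid,
        PRIMARY KEY ((user_id), viewed_at)
    ) WITH CLUSTERING ORDER BY (viewed_at DESC);

    -- company-wide rollup maintained by the ingest pipeline
    CREATE TABLE company_daily_stats (
        company_id uuid,
        day date,   -- 'date' needs C* 2.2+; a text bucket works on older versions
        searches counter,
        views counter,
        PRIMARY KEY ((company_id), day)
    );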

Any feedback on my initial thoughts?

-- 
-Richard L. Burton III


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-15 Thread Cassa L
Hi,
Upgrading Spark is not an option right now. I did set --driver-memory 4G, and
I still run into this issue after an hour of data load.

LCassa


On Tue, Jun 14, 2016 at 3:57 PM, Gaurav Bhatnagar 
wrote:

> try setting the option --driver-memory 4G
>
> On Tue, Jun 14, 2016 at 3:52 PM, Ben Slater 
> wrote:
>
>> A high-level shot in the dark, but in our testing we found Spark 1.6 a lot
>> more reliable in low-memory situations (presumably due to
>> https://issues.apache.org/jira/browse/SPARK-1). If it's an option, it's
>> probably worth a try.
>>
>> Cheers
>> Ben
>>
>> On Wed, 15 Jun 2016 at 08:48 Cassa L  wrote:
>>
>>> Hi,
>>> I would appreciate any clue on this. It has become a bottleneck for our
>>> spark job.
>>>
>>> On Mon, Jun 13, 2016 at 2:56 PM, Cassa L  wrote:
>>>
 Hi,

 I'm using Spark version 1.5.1. I am reading data from Kafka into Spark and
 writing it into Cassandra after processing. The Spark job starts fine and
 runs well for some time, until I start getting the errors below. Once these
 errors appear, the job starts to lag behind and I see scheduling and
 processing delays in the streaming UI.

 Worker memory is 6 GB and executor memory is 5 GB; I also tried tweaking the
 memoryFraction parameters. Nothing works.


 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with 
 curMem=565394, maxMem=2778495713
 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored as 
 bytes in memory (estimated size 3.9 KB, free 2.6 GB)
 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable 69652 
 took 2 ms
 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory 
 threshold of 1024.0 KB for computing block broadcast_69652 in memory.
 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache 
 broadcast_69652 in memory! (computed 496.0 B so far)
 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 
 GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 
 GB.
 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to 
 disk instead.
 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID 
 452316). 2043 bytes result sent to driver


 Thanks,

 L


>>> --
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>
>


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-15 Thread Cassa L
Hi,
I did set --driver-memory 4G, and I still run into this issue after an hour
of data load.

I also tried version 1.6 in a test environment; I hit this issue even faster
than in the 1.5.1 setup.
LCassa

On Tue, Jun 14, 2016 at 3:57 PM, Gaurav Bhatnagar 
wrote:

> try setting the option --driver-memory 4G
>
> On Tue, Jun 14, 2016 at 3:52 PM, Ben Slater 
> wrote:
>
>> A high-level shot in the dark, but in our testing we found Spark 1.6 a lot
>> more reliable in low-memory situations (presumably due to
>> https://issues.apache.org/jira/browse/SPARK-1). If it's an option, it's
>> probably worth a try.
>>
>> Cheers
>> Ben
>>
>> On Wed, 15 Jun 2016 at 08:48 Cassa L  wrote:
>>
>>> Hi,
>>> I would appreciate any clue on this. It has become a bottleneck for our
>>> spark job.
>>>
>>> On Mon, Jun 13, 2016 at 2:56 PM, Cassa L  wrote:
>>>
 Hi,

 I'm using Spark version 1.5.1. I am reading data from Kafka into Spark and
 writing it into Cassandra after processing. The Spark job starts fine and
 runs well for some time, until I start getting the errors below. Once these
 errors appear, the job starts to lag behind and I see scheduling and
 processing delays in the streaming UI.

 Worker memory is 6 GB and executor memory is 5 GB; I also tried tweaking the
 memoryFraction parameters. Nothing works.


 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with 
 curMem=565394, maxMem=2778495713
 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored as 
 bytes in memory (estimated size 3.9 KB, free 2.6 GB)
 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable 69652 
 took 2 ms
 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory 
 threshold of 1024.0 KB for computing block broadcast_69652 in memory.
 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache 
 broadcast_69652 in memory! (computed 496.0 B so far)
 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 
 GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 
 GB.
 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to 
 disk instead.
 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID 
 452316). 2043 bytes result sent to driver


 Thanks,

 L


>>> --
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>
>