Re: Using Cassandra as a BLOB store / web cache.

2016-01-19 Thread Richard L. Burton III
I would ask: why do this over, say, HDFS, S3, etc.? It seems like this problem
has already been solved by other solutions that are specifically designed for
blob storage.

On Tue, Jan 19, 2016 at 4:23 PM,  wrote:

> I recently started noodling with this concept and built a working blob
> storage service using node.js and C*.  I set up a basic web server using
> Express, where you could POST binary files to the server; they would get
> chunked and assigned to a user and bucket, in the spirit of S3.  Then when
> you retrieved the file with a GET, it would reassemble the chunks and push
> them out.  What was really nice about node.js is that I could retrieve all
> the chunks in parallel, asynchronously.  It was just for fun, but we have
> considered fleshing it out for use in our organization.  Here is the
> schema I developed:
>
> CREATE TABLE object.users (
> name text PRIMARY KEY,
> buckets set<text>,
> password text
> );
>
> CREATE TABLE object.objects (
> user text,
> bucket text,
> name text,
> chunks int,
> id uuid,
> size int,
> type text,
> PRIMARY KEY ((user, bucket), name)
> );
>
> CREATE TABLE object.chunks (
> file_id uuid,
> num int,
> data blob,
> PRIMARY KEY (file_id, num)
> );
>
> For your purposes you could modify the objects table to keep a revision id
> or timeuuid for looking at previous versions.  If you want some insight
> into how the code worked, give me a shout.
>



-- 
-Richard L. Burton III
@rburton


Re: Using Cassandra as a BLOB store / web cache.

2016-01-19 Thread list
I recently started noodling with this concept and built a working blob storage 
service using node.js and C*.  I set up a basic web server using Express, where 
you could POST binary files to the server; they would get chunked and assigned 
to a user and bucket, in the spirit of S3.  Then when you retrieved the file 
with a GET, it would reassemble the chunks and push them out.  What was really 
nice about node.js is that I could retrieve all the chunks in parallel, 
asynchronously.  It was just for fun, but we have considered fleshing it out 
for use in our organization.  Here is the schema I developed:

CREATE TABLE object.users (
name text PRIMARY KEY,
buckets set<text>,
password text
);

CREATE TABLE object.objects (
user text,
bucket text,
name text,
chunks int,
id uuid,
size int,
type text,
PRIMARY KEY ((user, bucket), name)
);

CREATE TABLE object.chunks (
file_id uuid,
num int,
data blob,
PRIMARY KEY (file_id, num)
);

For your purposes you could modify the objects table to keep a revision id or 
timeuuid for looking at previous versions.  If you want some insight into how 
the code worked, give me a shout.
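
As a rough illustration of the parallel retrieval described above, a minimal
sketch using the DataStax node.js driver might look like the following (the
connection details and the readObject/fileId/chunkCount names are assumptions,
not the original code):

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  keyspace: 'object'
});

// Fetch every chunk of a stored object concurrently and reassemble in order.
async function readObject(fileId, chunkCount) {
  const queries = [];
  for (let num = 0; num < chunkCount; num++) {
    queries.push(client.execute(
      'SELECT data FROM chunks WHERE file_id = ? AND num = ?',
      [fileId, num],
      { prepare: true }));
  }
  // Blob columns come back as node.js Buffers; concat preserves chunk order.
  const results = await Promise.all(queries);
  return Buffer.concat(results.map(rs => rs.first().data));
}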


Re: Using Cassandra as a BLOB store / web cache.

2016-01-19 Thread Eric Evans
On Mon, Jan 18, 2016 at 8:52 PM, Kevin Burton  wrote:

> Internally we have the need for a blob store for web content.  It's MOSTLY
> key/value based, but we'd like to have lookups by coarse-grained tags.
>
> This needs to store normal web content like HTML , CSS, JPEG, SVG, etc.
>
> Highly doubt that anything over 5MB would need to be stored.
>
> We also need the ability to store older versions of the same URL for
> features like "time travel" where we can see what the web looks like over
> time.
>
> I initially wrote this for Elasticsearch (and it works well for that) but
> it looks like binaries snuck into the set of requirements.
>
> I could Base64 encode/decode them in ES, I guess, but that seems ugly.
>
> I was thinking of porting this over to C*, but I'm not up to date on the
> current state of blobs in C*...
>
> Any advice?
>

We (Wikimedia Foundation) use Cassandra as a durable cache for HTML (with
history).  A simplified version of the schema we use would look something
like:

CREATE TABLE data (
    "_domain" text,  -- referenced by the PRIMARY KEY below
    key text,
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", key), rev, tid)
);

In our case, a 'rev' represents a normative change to the document (read:
someone made an edit), and the 'tid' attribute allows for some arbitrary
number of HTML representations of that revision (say, if some transclusion
would alter the final outcome).  You could simplify this further by removing
the 'tid' attribute if that doesn't apply to you.
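
As a hypothetical illustration (the domain and key values are placeholders,
not our actual data), fetching the most recent render then becomes a single
reversed-order query within one partition:

SELECT value FROM data
WHERE "_domain" = 'en.wikipedia.org' AND key = '/page/html/Example'
ORDER BY rev DESC, tid DESC
LIMIT 1;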

One concern here is the size of blobs.  Where exactly the threshold on size
should be is probably debatable, but if you are using G1GC I would be
careful about what large blobs do to humongous allocations.  G1 will
allocate anything over 1/2 the region size as humongous, and special-case
the handling of them, so humongous allocations should be the exception and
not the rule.  Depending on your heap size and the distribution of blob
sizes, you might be able to get by with overriding the GC's choice of
region size, but if 5MB values are at all common, you'll need 16MB region
sizes (which probably won't work well without a correspondingly large max
heap).
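
As a sketch of that override (a hypothetical cassandra-env.sh fragment, not a
recommendation):

# Force 16MB G1 regions so that 5MB blobs stay under the humongous
# threshold (1/2 region size = 8MB).
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=16m"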

Another concern is row width.  With a data model like this, rows will grow
relative to the number of versions stored.  If versions are added at a low
rate, that might not pose an issue in practice; if it does, though, you'll
need to consider a different partitioning strategy.
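
For example (a hypothetical variant, not something from our schema), folding a
coarse revision bucket into the partition key caps how wide any one partition
can grow, at the cost of readers needing to know which bucket to query:

CREATE TABLE data (
    "_domain" text,
    key text,
    rev_bucket int,   -- e.g. rev / 1000, computed by the application
    rev int,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", key, rev_bucket), rev, tid)
);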

TL;DR You need to understand what your data will look like.  Min and max
value sizes aren't enough; you should have some idea of the size distribution,
read/write rates, etc.  Understand the implications of your data model.
And then test, test, test.


-- 
Eric Evans
eev...@wikimedia.org


Re: Using Cassandra as a BLOB store / web cache.

2016-01-19 Thread Robert Coli
On Mon, Jan 18, 2016 at 6:52 PM, Kevin Burton  wrote:

> Internally we have the need for a blob store for web content.  It's MOSTLY
> key/value based, but we'd like to have lookups by coarse-grained tags.
>

I know you know how to operate and scale MySQL, so I suggest MogileFS for
the actual blob storage:

https://github.com/mogilefs

Then do some simple indexing in some search store. Done.

=Rob


Re: Using Cassandra as a BLOB store / web cache.

2016-01-19 Thread Kevin Burton
Lots of interesting feedback... I like the idea of chunking the IO into
pages.  It would require more thinking, but I could even do Cassandra async
IO and async HTTP to serve the data, and then use HTTP chunks for each
range.
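
A minimal sketch of that idea, assuming Express and the DataStax node.js
driver against the chunk schema posted earlier in the thread (the route shape
and passing the chunk count in the URL are illustrative assumptions):

const express = require('express');
const cassandra = require('cassandra-driver');

const app = express();
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  keyspace: 'object'
});

// Stream an object back chunk by chunk; writing without a Content-Length
// header makes node use HTTP chunked transfer encoding automatically.
app.get('/objects/:fileId/:chunks', async (req, res) => {
  for (let num = 0; num < Number(req.params.chunks); num++) {
    const rs = await client.execute(
      'SELECT data FROM chunks WHERE file_id = ? AND num = ?',
      [req.params.fileId, num],
      { prepare: true });
    res.write(rs.first().data);  // each Cassandra chunk becomes an HTTP chunk
  }
  res.end();
});

app.listen(3000);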

On Tue, Jan 19, 2016 at 10:47 AM, Robert Coli  wrote:

> On Mon, Jan 18, 2016 at 6:52 PM, Kevin Burton  wrote:
>
>> Internally we have the need for a blob store for web content.  It's
>> MOSTLY key/value based, but we'd like to have lookups by coarse-grained
>> tags.
>>
>
> I know you know how to operate and scale MySQL, so I suggest MogileFS for
> the actual blob storage:
>
> https://github.com/mogilefs
>
> Then do some simple indexing in some search store. Done.
>
> =Rob
>
>



-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Possible to adjust tokens on a vnode cluster?

2016-01-19 Thread John Sumsion
I have a 24-node cluster with vnodes (num_tokens) set to 256.


'nodetool status ' looks like this for our keyspace:


UN  588.23 GB  256 11.0%  0c8708a7-b962-4fc9-996c-617da642d9ee  1a
UN  601.33 GB  256 11.3%  5ef60730-0b01-4a8b-a578-d828cdf78a1f  1b
UN  613.02 GB  256 11.5%  dddc78b1-7dc2-4e9f-8e8a-1b52595aa0e3  1a
UN  620.76 GB  256 11.7%  87ac93ff-dc8e-4cd5-842c-0389ce016d70  1b
UN  631.81 GB  256 11.9%  8e1416aa-3e75-4ab5-a2a6-49d26f514115  1d
UN  634.65 GB  256 11.9%  3c97f722-16f5-455c-8f58-71c07ad93d25  1b
UN  634.79 GB  256 11.9%  3e3d41bd-d6e8-4a7e-aee2-7ea16b1dadb9  1d
UN  637.05 GB  256 12.0%  2f26f19a-c88f-4cbe-b865-155c0b66bff0  1b
UN  637.83 GB  256 12.0%  6385e073-5b48-49b3-a85b-e7511fa8b3a0  1a
UN  638.05 GB  256 12.1%  382681e5-c060-4594-ae2a-062a324c12d4  1d
UN  660.22 GB  256 12.4%  ea6aad23-7d93-4989-8898-7505df51298f  1d
UN  674.98 GB  256 12.6%  7d372371-c23f-4235-9e3c-cf030fb52ab3  1a
UN  676.22 GB  256 12.7%  41c4cb98-91ae-43a6-9bc4-11aa6106faad  1d
UN  680.15 GB  256 12.7%  65ac3aef-8a9b-423d-83fb-ed8e41f88ccc  1a
UN  681.35 GB  256 12.8%  e38efc6a-e7eb-4d8e-9069-a0b099bea96e  1d
UN  693.19 GB  256 13.0%  2b9a5d3e-8529-47fe-8d2c-13553a8df91f  1b
UN  696.92 GB  256 13.0%  46382cd1-402c-4200-858c-100dade03fc5  1d
UN  698.17 GB  256 13.1%  a68107e7-8e1a-469e-8dd1-e2d87445fd47  1b
UN  698.92 GB  256 13.1%  662338a7-1f5c-4eaa-926e-9e9fda926504  1a
UN  699.26 GB  256 13.1%  e7c15c56-80e6-4961-9cd9-c1302fbf2026  1a
UN  702.98 GB  256 13.2%  461baba0-60f3-423a-a5cf-e0c482da2dbf  1b
UN  710.27 GB  256 13.3%  ffa9700d-50ef-4b23-92d9-18f8029c8cd6  1d
UN  740.63 GB  256 13.8%  d9c6e2a1-2193-4f32-8426-3bd7ad8bf679  1a
UN  744.12 GB  256 13.9%  ff841094-7624-4dc5-b480-f39138b7f17c  1b


First, the difference in disk usage between 588 GB (lowest) and 744 GB (highest) 
is significant: 156 GB.  I'm sure it's probably a weird pattern in our partition 
keys, but we can't predict that until we get the data loaded.


Maybe someone will advise against using vnodes altogether, but we need to be 
able to add 3 nodes for extra capacity, and we would rather not have to rewrite 
the vnode token assignment code in order to figure out a rack-safe token 
reassignment.


Given the above, is there any way to manually adjust tokens (while still using 
vnodes) so that we can balance out the disk usage?  If so, is there an easy way 
to do that in a rack-safe manner?


Thanks,

John...


Re: Possible to adjust tokens on a vnode cluster?

2016-01-19 Thread Eric Evans
On Tue, Jan 19, 2016 at 12:21 PM, John Sumsion wrote:

> I have a 24-node cluster with vnodes (num_tokens) set to 256.
>
>
> 'nodetool status ' looks like this for our keyspace:
>
>
> UN  588.23 GB  256 11.0%  0c8708a7-b962-4fc9-996c-617da642d9ee  1a
> UN  601.33 GB  256 11.3%  5ef60730-0b01-4a8b-a578-d828cdf78a1f  1b
> UN  613.02 GB  256 11.5%  dddc78b1-7dc2-4e9f-8e8a-1b52595aa0e3  1a
> UN  620.76 GB  256 11.7%  87ac93ff-dc8e-4cd5-842c-0389ce016d70  1b
> UN  631.81 GB  256 11.9%  8e1416aa-3e75-4ab5-a2a6-49d26f514115  1d
> UN  634.65 GB  256 11.9%  3c97f722-16f5-455c-8f58-71c07ad93d25  1b
> UN  634.79 GB  256 11.9%  3e3d41bd-d6e8-4a7e-aee2-7ea16b1dadb9  1d
> UN  637.05 GB  256 12.0%  2f26f19a-c88f-4cbe-b865-155c0b66bff0  1b
> UN  637.83 GB  256 12.0%  6385e073-5b48-49b3-a85b-e7511fa8b3a0  1a
> UN  638.05 GB  256 12.1%  382681e5-c060-4594-ae2a-062a324c12d4  1d
> UN  660.22 GB  256 12.4%  ea6aad23-7d93-4989-8898-7505df51298f  1d
> UN  674.98 GB  256 12.6%  7d372371-c23f-4235-9e3c-cf030fb52ab3  1a
> UN  676.22 GB  256 12.7%  41c4cb98-91ae-43a6-9bc4-11aa6106faad  1d
> UN  680.15 GB  256 12.7%  65ac3aef-8a9b-423d-83fb-ed8e41f88ccc  1a
> UN  681.35 GB  256 12.8%  e38efc6a-e7eb-4d8e-9069-a0b099bea96e  1d
> UN  693.19 GB  256 13.0%  2b9a5d3e-8529-47fe-8d2c-13553a8df91f  1b
> UN  696.92 GB  256 13.0%  46382cd1-402c-4200-858c-100dade03fc5  1d
> UN  698.17 GB  256 13.1%  a68107e7-8e1a-469e-8dd1-e2d87445fd47  1b
> UN  698.92 GB  256 13.1%  662338a7-1f5c-4eaa-926e-9e9fda926504  1a
> UN  699.26 GB  256 13.1%  e7c15c56-80e6-4961-9cd9-c1302fbf2026  1a
> UN  702.98 GB  256 13.2%  461baba0-60f3-423a-a5cf-e0c482da2dbf  1b
> UN  710.27 GB  256 13.3%  ffa9700d-50ef-4b23-92d9-18f8029c8cd6  1d
> UN  740.63 GB  256 13.8%  d9c6e2a1-2193-4f32-8426-3bd7ad8bf679  1a
> UN  744.12 GB  256 13.9%  ff841094-7624-4dc5-b480-f39138b7f17c  1b
>

156 GB might seem like a lot, but those extremes are less than 3% apart; how
close do you really need it to be (and why)?



-- 
Eric Evans
eev...@wikimedia.org


Re: Using Cassandra as a BLOB store / web cache.

2016-01-19 Thread Robert Coli
On Tue, Jan 19, 2016 at 2:07 PM, Richard L. Burton III wrote:

> I would ask: why do this over, say, HDFS, S3, etc.?  It seems like this
> problem has already been solved by other solutions that are specifically
> designed for blob storage.
>

HDFS's default block size is 64 MB. If you are storing objects smaller than
this, that might be bad! It also doesn't have an HTTP transport, which other
options do.

Etc..

=Rob


Connection error 61 for Cassandra

2016-01-19 Thread ankita therese
Hello,

I set up a single node on localhost, and it was working fine.
I connected Cassandra with Apache Spark, and was able to access the
keyspaces.

After this, I connected to pyspark using the DataStax spark-cassandra
connector.

Ever since then, when I try to access Cassandra via cqlsh, all I get is:

> Connection error: ('Unable to connect to any servers', {'127.0.0.1':
> error(61, "Tried connecting to [('127.0.0.1', 9042)]. Last error:
> Connection refused")})



When I try to connect via the Spark shell, I get:

>
> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
> All host(s) tried for query failed (tried: /192.168.1.2:9042 
> (com.datastax.driver.core.exceptions.TransportException:
> [/192.168.1.2] Cannot connect))




Any idea what I should do?

OS: OSX Yosemite 10.10.5


[RELEASE] Apache Cassandra 3.2.1 released

2016-01-19 Thread Jake Luciani
The Cassandra team is pleased to announce the release of Apache Cassandra
version 3.2.1.

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 3.2 series. As always, please pay
attention to the release notes[2] and let us know[3] if you encounter any
problems.

Enjoy!

[1]: https://goo.gl/ySa5hr (CHANGES.txt)
[2]: https://goo.gl/tCBBPv (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: Unable to locate Solr Configuration file (Generated using dsetool)

2016-01-19 Thread Harikrishnan A
Thanks Sebastian, Jack... this really helps.

Sent from Yahoo Mail on Android

On Mon, Jan 18, 2016 at 11:03 AM, Sebastian Estevez wrote:

You can post it to the server using either curl or dsetool:
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchReldCore.html

use the solrconfig and schema options:

| Option | Settings | Default | Description |
| schema= | path | n/a | Path of the schema file used for reloading the core |
| solrconfig= | path | n/a | Path of the solrconfig file used for reloading the core |
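
A typical round-trip might look something like this (a sketch only; the
keyspace/table names are placeholders):

dsetool get_core_schema mykeyspace.mytable > schema.xml
# edit schema.xml locally, then push it back and reindex:
dsetool reload_core mykeyspace.mytable schema=schema.xml reindex=true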



All the best,

Sebastián Estévez

Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com

DataStax is the fastest, most scalable distributed database technology, 
delivering Apache Cassandra to the world's most innovative enterprises. 
DataStax is built to be agile, always-on, and predictably scalable to any size. 
With more than 500 customers in 45 countries, DataStax is the database 
technology and transactional backbone of choice for the world's most innovative 
companies such as Netflix, Adobe, Intuit, and eBay. 
On Mon, Jan 18, 2016 at 12:29 PM, Harikrishnan A  wrote:

Thanks Jack... So how do I customize these resource files? I mean, if I want 
to add some custom fields or to change the default text analyzer, etc.

Sent from Yahoo Mail on Android

On Mon, Jan 18, 2016 at 7:50 AM, Jack Krupansky wrote:

Also, you can (and probably should) use the Solr admin UI console if you 
simply wish to view the generated resource files.

-- Jack Krupansky

On Mon, Jan 18, 2016 at 9:46 AM, Jack Krupansky wrote:

As per the DSE Search doc: "Resource files are stored in Cassandra database, 
not in the file system. The schema.xml and solrconfig.xml resources are 
persisted in the solr_admin.solr_resources database table":
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchUpld.html

Use the dsetool get_core_schema and get_core_config commands to retrieve the 
generated Solr schema and solrconfig files:
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/tools/toolsDsetool.html

You can also use the dsetool read_resource command to read any of the Solr 
resource "files".

-- Jack Krupansky
On Mon, Jan 18, 2016 at 12:47 AM, Harikrishnan A  wrote:

Hello,

I have created a Solr core with automatic resource generation using the below 
command:

dsetool create_core <keyspace>.<table> generateResources=true reindex=true

However, I am unable to locate the schema.xml and the solrconfig.xml which got 
created for this core. What is the default location of these configuration 
files? Can I customize these configuration files once they are generated using 
the above command?

Thanks & Regards,
Hari


Re: broadcast_address in multi data center setups

2016-01-19 Thread Francisco Reyes

On 01/18/2016 09:44 AM, Paulo Motta wrote:
> broadcast_address is the address exposed for internal inter-node
> communication, while rpc_address is the address that will listen to
> clients.
>
> all nodes need to talk to each other via the broadcast_address, so if
> they are within the same network, you may use public or private IPs as
> broadcast_address, but if there's at least one node in a different
> network they all need to use the public IP, or you need to set up your
> own tunnelling/vpn to make sure nodes can reach each other.
>
> You need to set up your own firewall rules. See more about what ports
> are used here:
> https://docs.datastax.com/en/cassandra/2.1/cassandra/security/secureFireWall_r.html
> You may also be interested in setting up client authentication:
> https://docs.datastax.com/en/cassandra/2.1/cassandra/security/security_config_native_authenticate_t.html

Thanks for links/info.

For applications, do they use the CQL native client port (9042) or the Thrift 
client port (9160)? We will be using Python to connect to Cassandra.
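
For what it's worth, the DataStax Python driver speaks the CQL native protocol
on 9042 rather than Thrift; a minimal connection sketch (the contact points
are placeholders, not real addresses) might look like:

from cassandra.cluster import Cluster

# Contact points below stand in for your nodes' rpc_address values;
# the driver connects over the native protocol on port 9042.
cluster = Cluster(['10.0.0.1', '10.0.0.2'], port=9042)
session = cluster.connect()
for row in session.execute('SELECT release_version FROM system.local'):
    print(row.release_version)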


We plan to use rpc_address internally, so for this case the firewall is not an 
issue, but I would like to know for future reference. Although I think one 
would ideally always want applications in the same data center as the 
database.


Re: Connection error 61 for Cassandra

2016-01-19 Thread Carlos Alonso
I ran into those issues a while ago.

It was on Ubuntu rather than OS X, but it's probably the same.

I compiled my steps here:
http://mrcalonso.com/fitting-ipython-notebooks-spark-and-cassandra-all-together/

Cheers!

Carlos Alonso | Software Engineer | @calonso 

On 19 January 2016 at 14:01, ankita therese  wrote:

> Hello,
>
> I set up a single node on localhost, and it was working fine.
> I connected Cassandra with Apache Spark, and was able to access the
> keyspaces.
>
> After this, I connected to pyspark using the DataStax spark-cassandra
> connector.
>
> Ever since then, when I try to access Cassandra via cqlsh, all I get is:
>
> Connection error: ('Unable to connect to any servers', {'127.0.0.1':
> error(61, "Tried connecting to [('127.0.0.1', 9042)]. Last error:
> Connection refused")})
>
>
>
> When I try to connect via the Spark shell, I get:
>
>>
>> Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException:
>> All host(s) tried for query failed (tried: /192.168.1.2:9042 
>> (com.datastax.driver.core.exceptions.TransportException:
>> [/192.168.1.2] Cannot connect))
>
>
>
>
> Any idea what I should do?
>
> OS: OSX Yosemite 10.10.5
>