Re: Storing large files for later processing through hadoop

2015-01-03 Thread Wilm Schumacher
On 03.01.2015 at 07:07, Srinivasa T N wrote:
 Hi Wilm,
 The reason is that, for auditing purposes, I also want to store the
 original files.
Well, then I would use an HDFS cluster for storage, as it seems to be
exactly what you need. If you colocate the HDFS DataNodes with YARN's
ResourceManager, you could also save a lot of hardware or the cost of
external services. Colocating them is not generally recommended, but in
your special case it should work, since you would use HDFS only to
store the XML files for exactly that auditing purpose.

But I'm more familiar with Hadoop, HDFS and HBase than with Cassandra.
So perhaps I'm biased.

And what Jacob proposed could be a solution, too. It would save a lot
of nerves ;).

Best wishes,

Wilm



Re: Storing large files for later processing through hadoop

2015-01-02 Thread Wilm Schumacher
Hi,

Perhaps I totally misunderstood your problem, but why bother with
Cassandra for storage in the first place?

If your MapReduce (MR) job is run only once for each file (as you wrote
above), why not copy the data directly to HDFS, run your MR job and use
Cassandra as the sink?

As HDFS and YARN are more or less completely independent, you could
perhaps use the master as ResourceManager (YARN) AND as NameNode and
DataNode (HDFS), launch your MR job directly and, as mentioned, use
Cassandra as the sink for the reduced data. That way you won't need
dedicated hardware, as you need HDFS only once: process the files and
delete them afterwards.
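
The steps above (copy to HDFS, run the job, delete the input) can be
sketched as a small Python wrapper around the standard `hdfs dfs` and
`hadoop jar` commands. This is only a minimal sketch under assumptions:
the function names, the paths, the jar name and the job class are
hypothetical, and the MR job itself is assumed to write its reduced
output to Cassandra.

```python
import subprocess
from typing import List


def build_pipeline(local_file: str, hdfs_dir: str,
                   jar: str, main_class: str) -> List[List[str]]:
    """Build the commands for: copy to HDFS, run the MR job, clean up.

    All arguments are placeholders; the job class is assumed to take
    the HDFS input directory as its only argument and to use Cassandra
    as its sink.
    """
    hdfs_input = f"{hdfs_dir}/input"
    return [
        # Create the input directory and upload the file.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_input],
        ["hdfs", "dfs", "-put", local_file, hdfs_input],
        # Launch the MR job on the uploaded data.
        ["hadoop", "jar", jar, main_class, hdfs_input],
        # HDFS is only needed once, so remove the files afterwards.
        ["hdfs", "dfs", "-rm", "-r", hdfs_dir],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping on error."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline("/data/a.xml", "/tmp/job1",
                                "job.jar", "example.XmlJob"))
```

With `dry_run=True` nothing is executed, so the command sequence can be
checked before it is pointed at a real cluster.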

Best wishes,

Wilm


Re: 2014 nosql benchmark

2014-12-18 Thread Wilm Schumacher
Hi,

I'm always interested in such benchmark experiments, because the
databases evolve so fast that the race is always open and there is a
lot of motion in there.

And of course I asked myself the same question. I think that this
publication is unreliable, for four reasons (from reading it very
quickly; perhaps there are more):

1.) It is unclear what this is all about. The title is "NoSQL
Performance Testing". The subtitle is "In-Memory Performance Comparison
of SequoiaDB, Cassandra, and MongoDB". However, in the introduction
there is not one word about in-memory performance. The introduction
could be a general introduction for a general on-disk NoSQL benchmark.
So ... only the subtitle (and a short sentence in the "Result Summary")
says what this is actually about.

2.) There are very important databases missing, e.g. Redis for
in-memory use. If Redis is not a valid candidate in this race, why is
that so? MySQL is capable of distributed in-memory storage, too.

3.) The methodology is unclear. Perhaps I'm the only one, but what does
"Run workload for 30 minutes (workload file workload[1-5])" mean for
mixed read/write ops? Why 30 min? Okay, I can imagine that the authors
estimated the throughput, preset the number of 100 million rows and
designed it to be larger than the estimated throughput in x minutes.
However, all this information is missing. And why 45% and 22% of RAM?
My first idea would be a VERY low ratio, like 2% or so, and a VERY
large ratio, like 80-90%, and then everything in between. Is 22% or 45%
somehow a magic number? Furthermore, in the "Result Summary" 1/2 and 1/4
of RAM are discussed. Okay, 22% is near 1/4 ... but where does the
difference come from? And btw. ... 22% of what? Stuff to insert? Stuff
already inserted? It can all be deduced, but it's strange that the
description is so sloppy.

4.) There is no repetition of the loads (as I understand it). It's one
run, one result ... and it's done. I don't know a lot about Cassandra
in in-memory use. But either the experiment should be repeated for
quite a few runs OR it should be explained why this is not necessary.
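
To make points 3 and 4 a bit more concrete, here is a minimal Python
sketch of what I would have expected: a sweep over data-to-RAM ratios
from very low to very high, and a summary over repeated runs instead of
a single number. The function names, the RAM size, the row size and the
throughput figures are all made-up assumptions for illustration; none
of them come from the paper.

```python
import statistics
from typing import Dict, List


def rows_for_ratios(ram_bytes: int, row_bytes: int,
                    ratios: List[float]) -> Dict[float, int]:
    """Number of rows to load so that the data fills each RAM ratio."""
    return {r: int(ram_bytes * r) // row_bytes for r in ratios}


def summarize_runs(throughputs: List[float]) -> Dict[str, float]:
    """Mean and sample standard deviation over repeated runs.

    Needs at least two runs; a single run gives no idea of the spread.
    """
    return {
        "runs": len(throughputs),
        "mean": statistics.mean(throughputs),
        "stdev": statistics.stdev(throughputs),
    }


if __name__ == "__main__":
    # Hypothetical machine: 64 GiB of RAM, 1 KiB per row.
    ram = 64 * 1024 ** 3
    sweep = rows_for_ratios(ram, 1024, [0.02, 0.25, 0.5, 0.9])
    for ratio, rows in sweep.items():
        print(f"{ratio:.0%} of RAM -> {rows} rows")

    # Hypothetical throughputs (ops/s) from five repetitions of one workload.
    print(summarize_runs([10500.0, 9800.0, 10120.0, 9950.0, 10300.0]))
```

If the standard deviation over the repetitions is of the same order as
the difference between two databases, a single run cannot tell them
apart.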

Okay, perhaps 1 is a little picky, and 4 is a little fussy. But 3 is
strange and 2 stinks.

Well, just my first impression. And that Cassandra is very fast ;).

Best regards

Wilm


On 19.12.2014 at 06:41, diwayou wrote:
 I have just read this benchmark PDF; does anyone have an opinion
 about it?
 I think it's not fair to Cassandra.
 url: http://www.bankmark.de/wp-content/uploads/2014/12/bankmark-20141201-WP-NoSQLBenchmark.pdf
 http://msrg.utoronto.ca/papers/NoSQLBenchmark